Apache Flume: Data Collection System for Hadoop
Outline 
Overview of Flume 
Flume Sources 
Channels & Sinks 
Flume Topology 
Production Architecture 
Monitoring & Performance
Overview of Flume 
Collection and aggregation of streaming event data
Typically used for log data 
Significant advantages over ad-hoc solutions 
Reliable, Scalable, Manageable, Customizable, and High Performance
Declarative, Dynamic Configuration 
Contextual Routing 
Feature rich 
Fully extensible
Core Concepts: Event 
An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination.
An Event is a byte-array payload accompanied by optional headers.
The payload is opaque to Flume
Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection
Headers can be used for contextual routing
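The event model and header-based routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's API: the `Event` class, the `route` function, the `source_type` header and the channel names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: opaque bytes plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def route(event: Event, mapping: Dict[str, str], header: str, default: str) -> str:
    """Pick a channel name from one header value; the body is never inspected."""
    return mapping.get(event.headers.get(header, ""), default)


# Hypothetical header values and channel names, chosen only for this example.
mapping = {"web": "web_channel", "db": "db_channel"}
event = Event(b"GET /index.html 200", {"source_type": "web"})
print(route(event, mapping, "source_type", "default_channel"))  # prints web_channel
```

The routing decision depends on headers alone, which is what lets the payload stay opaque.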
Core Concepts: Client 
An entity that generates events and sends them to one or more Agents.
Examples:
Flume log4j Appender
Custom Client using Client SDK (org.apache.flume.api)
Decouples Flume from the system from which event data is consumed
Not needed in all cases
Core Concepts: Agent 
A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another.
Fundamental part of a Flume flow
Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components
Typical Aggregation Flow 
[Client]+ → Agent [→ Agent]* → Destination
Core Concepts: Source 
An active component that receives events from a specialized location or mechanism and places them on one or more Channels.
Different Source types:
Specialized sources for integrating with well-known systems. Examples: Syslog, Netcat
Auto-Generating Sources: Exec, SEQ
IPC sources for Agent-to-Agent communication: Avro
Requires at least one Channel to function
Source 
Reads data from the source system and passes it on to the next hop or to the final destination.
Flume Sources: 
Avro Source 
Exec Source 
JMS Source 
Spooling Directory Source
Core Concepts: Channel 
A passive component that buffers incoming events until they are drained by Sinks.
Different Channels offer different levels of persistence:
Memory Channel: volatile
File Channel: backed by a write-ahead log (WAL) implementation
JDBC Channel: backed by an embedded database
Channels are fully transactional
Provide weak ordering guarantees
Can work with any number of Sources and Sinks.
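What "fully transactional" means for a channel can be illustrated with a toy in-memory buffer: a batch is either committed as a whole, or the channel is left exactly as it was. The sketch below is an assumption-laden simplification in Python, not Flume's implementation; `SketchChannel`, `put_batch` and `take_batch` are hypothetical names.

```python
from collections import deque


class ChannelFullError(Exception):
    """Raised when a batch does not fit into the channel."""


class SketchChannel:
    """Toy in-memory channel with all-or-nothing put and take (not Flume's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        # Source side: commit the whole batch or nothing at all.
        if len(self.queue) + len(events) > self.capacity:
            raise ChannelFullError("batch would exceed capacity")
        self.queue.extend(events)

    def take_batch(self, max_events, deliver):
        # Sink side: remove up to max_events and hand them to deliver().
        count = min(max_events, len(self.queue))
        taken = [self.queue.popleft() for _ in range(count)]
        try:
            deliver(taken)
        except Exception:
            # Rollback: put the events back at the head in their original order.
            self.queue.extendleft(reversed(taken))
            return 0
        return len(taken)
```

If `deliver` raises, the events stay in the channel and a later take can retry them, which is the behaviour a Sink relies on when its next hop is unavailable.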
Core Concepts: Sink 
An active component that removes events from a Channel and transmits them to their next-hop destination.
Different types of Sinks:
Terminal sinks that deposit events to their final destination. For example: HDFS, HBase
IPC sink for Agent-to-Agent communication: Avro
Requires exactly one Channel to function
Sinks 
Writes data to the next hop or to the final destination. 
Flume Sinks: 
Avro Sink 
HDFS Sink 
HBase Sink
File Sink 
Null Sink 
Logger Sink
What is the source in Flume
Fanout
Flume Channels 
Memory Channel 
Recommended if data loss due to crashes is acceptable
File Channel 
Recommended channel. 
JDBC Channel 
Persistent store of data, but introduces a bottleneck and a single point of failure.
Memory Channel 
Events stored on heap 
Limited capacity 
No persistence after a system/process crash 
Very fast 
Three configuration parameters:
capacity: maximum number of events that can be in the channel
transactionCapacity: maximum number of events in one transaction
keep-alive: how long (in seconds) to wait to put or take an event
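How the three parameters interact can be shown with a toy bounded queue. This is a hedged Python sketch, not the Memory Channel's code: the `BoundedChannel` class and its method names are hypothetical, and the three constructor arguments simply mirror the parameters listed above.

```python
import queue


class BoundedChannel:
    """Toy channel illustrating the three parameters (not Flume's implementation)."""

    def __init__(self, capacity, transaction_capacity, keep_alive):
        self._events = queue.Queue(maxsize=capacity)  # at most `capacity` events
        self.transaction_capacity = transaction_capacity
        self.keep_alive = keep_alive  # seconds to wait on put/take

    def put(self, event):
        # Wait up to keep_alive seconds for free space, then give up.
        try:
            self._events.put(event, timeout=self.keep_alive)
            return True
        except queue.Full:
            return False

    def take_batch(self, size):
        # One transaction may not move more than transaction_capacity events.
        if size > self.transaction_capacity:
            raise ValueError("batch exceeds transaction_capacity")
        batch = []
        for _ in range(size):
            try:
                batch.append(self._events.get(timeout=self.keep_alive))
            except queue.Empty:
                break  # nothing arrived within keep_alive seconds
        return batch


channel = BoundedChannel(capacity=2, transaction_capacity=2, keep_alive=0.1)
print(channel.put("a"), channel.put("b"), channel.put("c"))  # third put times out
print(channel.take_batch(2))
```

A larger capacity absorbs longer downstream stalls at the cost of heap, while the timeout bounds how long a Source or Sink blocks when the channel is full or empty.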
File Channel
Current Flume Flow
Monitoring: protocol support 
Several monitoring protocols supported out of the box 
JMX 
Ganglia 
HTTP (JSON) 
Java options (JAVA_OPTS) must be set in flume-env.sh to configure monitoring
Ganglia and HTTP monitoring are mutually exclusive
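As a rough illustration of consuming the HTTP (JSON) output, the Python sketch below computes how full each channel is. It assumes the payload maps component names to string-valued counters with a `Type` field and, for channels, `ChannelSize` and `ChannelCapacity`. The sample payload and the component names `ch1` and `src1` are made up for this example, and no live agent is contacted.

```python
import json

# Made-up sample shaped like a component-to-counters mapping; values are strings.
SAMPLE = """
{
  "CHANNEL.ch1": {"Type": "CHANNEL", "ChannelSize": "150", "ChannelCapacity": "1000"},
  "SOURCE.src1": {"Type": "SOURCE", "EventAcceptedCount": "4200"}
}
"""


def channel_fill(metrics_json):
    """Return {channel name: fill ratio} for every CHANNEL entry in the payload."""
    metrics = json.loads(metrics_json)
    fill = {}
    for name, counters in metrics.items():
        if counters.get("Type") == "CHANNEL":
            # Counters arrive as strings, so convert before dividing.
            fill[name] = int(counters["ChannelSize"]) / int(counters["ChannelCapacity"])
    return fill


print(channel_fill(SAMPLE))
```

A fill ratio that keeps climbing suggests that Sinks are draining the channel more slowly than Sources are filling it.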
