Apache Flume: Data Collection System for Hadoop
Outline 
Overview of Flume 
Flume Sources 
Channels & Sinks 
Flume Topology 
Production Architecture 
Monitoring & Performance
Overview of Flume 
Collection, Aggregation of streaming Event Data 
Typically used for log data 
Significant advantages over ad-hoc solutions 
Reliable, Scalable, Manageable, Customizable 
and High Performance 
Declarative, Dynamic Configuration 
Contextual Routing 
Feature rich 
Fully extensible
Core Concepts: Event 
An Event is the fundamental unit of data transported by 
Flume from its point of origination to its final destination. 
Event is a byte array payload accompanied by optional 
headers. 
Payload is opaque to Flume 
Headers are specified as an unordered collection of string key-value 
pairs, with keys being unique across the collection 
Headers can be used for contextual routing
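
To make the event model concrete, here is a minimal Python sketch (not Flume code) of an event as an opaque byte payload plus string headers, with a header value selecting a channel. The names `Event`, `route`, the `datacenter` header and the channel names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy stand-in for a Flume event: opaque bytes plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def route(event: Event, mapping: Dict[str, str], header: str, default: str) -> str:
    """Pick a channel name from one header value, falling back to a default."""
    return mapping.get(event.headers.get(header, ""), default)


# Hypothetical header and channel names, for illustration only.
mapping = {"NYC": "channel_nyc", "LON": "channel_lon"}
event = Event(body=b"GET /index.html 200", headers={"datacenter": "NYC"})
print(route(event, mapping, "datacenter", "channel_default"))
```

The payload is never inspected; only the headers drive the routing decision, which is what "opaque to Flume" implies.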
Core Concepts: Client 
An entity that generates events and sends them to 
one or more Agents. 
Example 
Flume log4j Appender 
Custom Client using Client SDK (org.apache.flume.api) 
Decouples Flume from the system that produces the event data 
Not needed in all cases
Core Concepts: Agent 
A container for hosting Sources, Channels, Sinks and other 
components that enable the transportation of events from one 
place to another. 
Fundamental part of a Flume flow 
Provides Configuration, Life-Cycle Management, and Monitoring 
Support for hosted components
Typical Aggregation Flow 
[Client]+ → Agent [→ Agent]* → Destination
Core Concepts: Source 
An active component that receives events from a 
specialized location or mechanism and places them on one 
or more Channels. 
Different Source types: 
Specialized sources for integrating with well-known 
systems. Example: Syslog, Netcat 
Auto-Generating Sources: Exec, SEQ 
IPC sources for Agent-to-Agent communication: Avro 
Require at least one channel to function
Source 
Reads data from the source system and passes it on to the next hop or to the 
final destination. 
Flume Sources: 
Avro Source 
Exec Source 
JMS Source 
Spooling Directory Source
Core Concepts: Channel 
A passive component that buffers the incoming 
events until they are drained by Sinks. 
Different Channels offer different levels of persistence: 
Memory Channel: volatile 
File Channel: backed by WAL implementation 
JDBC Channel: backed by embedded Database 
Channels are fully transactional 
Provide weak ordering guarantees 
Can work with any number of Sources and Sinks.
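
The transactional behaviour can be sketched as follows. This is a toy Python model under simplifying assumptions (single thread, in-memory only), not the Flume channel API; `ToyChannel` and `ToyTransaction` are hypothetical names. Puts become visible only on commit, and a rollback returns taken events to the channel.

```python
from collections import deque


class ToyChannel:
    """Holds committed events that sinks may take."""

    def __init__(self):
        self._queue = deque()

    def transaction(self):
        return ToyTransaction(self._queue)


class ToyTransaction:
    """Stages puts and takes until commit() or rollback() is called."""

    def __init__(self, queue):
        self._queue = queue
        self._puts = []   # staged events, not yet visible to sinks
        self._takes = []  # events handed out but not yet confirmed

    def put(self, event):
        self._puts.append(event)

    def take(self):
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._takes.append(event)
        return event

    def commit(self):
        self._queue.extend(self._puts)
        self._puts.clear()
        self._takes.clear()

    def rollback(self):
        # Put taken events back at the head, preserving their original order.
        self._queue.extendleft(reversed(self._takes))
        self._puts.clear()
        self._takes.clear()


channel = ToyChannel()
tx = channel.transaction()
tx.put(b"e1")
tx.put(b"e2")
tx.commit()

tx = channel.transaction()
tx.take()      # a sink takes e1 ...
tx.rollback()  # ... but delivery fails, so e1 returns to the channel
```

After the rollback the next take sees `e1` again, which is why a failed delivery does not lose the event.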
Core Concepts: Sink 
An active component that removes events from 
a Channel and transmits them to their next hop 
destination. 
Different types of Sinks: 
Terminal sinks that deposit events to their final 
destination. For example: HDFS, HBase 
IPC sink for Agent-to-Agent communication: Avro 
Require exactly one channel to function
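
The wiring rules for sources and sinks can be sketched as a small Python helper that renders a Flume-style properties file and rejects invalid wiring. The helper `render_agent_config` is hypothetical; the agent and component names (`a1`, `r1`, `c1`, `k1`) and the netcat/memory/logger combination follow the common single-node example and are assumptions here, not part of the slides.

```python
def render_agent_config(agent, sources, channels, sinks):
    """Render Flume-style properties text and check the wiring rules."""
    lines = [
        f"{agent}.sources = {' '.join(sources)}",
        f"{agent}.channels = {' '.join(channels)}",
        f"{agent}.sinks = {' '.join(sinks)}",
    ]
    for name, props in sources.items():
        wired = props.get("channels", [])
        # A source must feed at least one declared channel.
        if not wired or any(c not in channels for c in wired):
            raise ValueError(f"source {name} needs at least one declared channel")
        for key, value in props.items():
            text = " ".join(value) if key == "channels" else value
            lines.append(f"{agent}.sources.{name}.{key} = {text}")
    for name, props in channels.items():
        for key, value in props.items():
            lines.append(f"{agent}.channels.{name}.{key} = {value}")
    for name, props in sinks.items():
        # A sink drains exactly one declared channel.
        if props.get("channel") not in channels:
            raise ValueError(f"sink {name} needs exactly one declared channel")
        for key, value in props.items():
            lines.append(f"{agent}.sinks.{name}.{key} = {value}")
    return "\n".join(lines)


config = render_agent_config(
    "a1",
    sources={"r1": {"type": "netcat", "bind": "localhost", "port": 44444,
                    "channels": ["c1"]}},
    channels={"c1": {"type": "memory"}},
    sinks={"k1": {"type": "logger", "channel": "c1"}},
)
print(config)
```

Note the plural `channels` key on the source and the singular `channel` key on the sink, which mirrors the "at least one" versus "exactly one" rule.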
Sinks 
Writes data to the next hop or to the final destination. 
Flume Sinks: 
Avro Sink 
HDFS Sink 
HBase Sink 
File Roll Sink 
Null Sink 
Logger Sink
What is the source in Flume
Fanout
Flume Channels 
Memory Channel 
Recommended if data loss due to crashes is 
acceptable 
File Channel 
Recommended channel. 
JDBC Channel 
Persistent store of data but introduces a 
bottleneck and a single point of failure.
Memory Channel 
Events stored on heap 
Limited capacity 
No persistence after a system/process crash 
Very fast 
Three main config parameters: 
capacity: maximum number of events that can be in the channel 
transactionCapacity: maximum number of events in one transaction 
keep-alive: how long, in seconds, to wait to put/take an event
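
How the three parameters interact can be sketched with a bounded in-memory queue. This is a toy Python model, not Flume's implementation; `ToyMemoryChannel` and `put_batch` are hypothetical names, and the keep-alive timeout is assumed to be in seconds. Unlike a real transaction, the batch put here is not atomic.

```python
import queue


class ToyMemoryChannel:
    """Bounded heap buffer governed by capacity, transactionCapacity and keep-alive."""

    def __init__(self, capacity, transaction_capacity, keep_alive):
        self._events = queue.Queue(maxsize=capacity)  # capacity: max events held
        self.transaction_capacity = transaction_capacity
        self.keep_alive = keep_alive  # seconds to wait before giving up

    def put_batch(self, events):
        # transactionCapacity caps how many events one batch may carry.
        if len(events) > self.transaction_capacity:
            raise ValueError("batch exceeds transactionCapacity")
        for event in events:
            # Raises queue.Full if no slot frees up within keep_alive seconds.
            self._events.put(event, timeout=self.keep_alive)

    def take(self):
        # Raises queue.Empty if nothing arrives within keep_alive seconds.
        return self._events.get(timeout=self.keep_alive)


channel = ToyMemoryChannel(capacity=2, transaction_capacity=2, keep_alive=0.01)
channel.put_batch([b"a", b"b"])
try:
    channel.put_batch([b"c"])  # channel is full, so this times out
except queue.Full:
    print("channel full: source must back off and retry")
```

Because everything lives on the heap, a process crash would discard whatever is still in the queue, which is the trade-off against the File Channel.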
File Channel
Current Flume Flow
Monitoring: protocol support 
Several monitoring protocols supported out of the box 
JMX 
Ganglia 
HTTP (JSON) 
Java opts must be set in flume-env.sh to configure monitoring 
Ganglia and HTTP monitoring are mutually exclusive
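
As a rough illustration of consuming the HTTP (JSON) output: assuming the agent was started with JVM options such as `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545` in `flume-env.sh`, metrics are served as a JSON document keyed by component. The Python sketch below parses a hand-written sample payload (the values and the component names `c1` and `k1` are hypothetical) and computes how full each channel is. It does not contact a running agent.

```python
import json

# Hand-written sample shaped like the HTTP (JSON) metrics; values are made up.
SAMPLE = """{
  "CHANNEL.c1": {"Type": "CHANNEL", "ChannelCapacity": "100", "ChannelSize": "25"},
  "SINK.k1": {"Type": "SINK", "EventDrainSuccessCount": "75"}
}"""


def channel_fill(metrics_json):
    """Return {component: percent full} for every channel in the payload."""
    metrics = json.loads(metrics_json)
    fill = {}
    for component, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue
        capacity = float(values["ChannelCapacity"])
        size = float(values["ChannelSize"])
        fill[component] = 100.0 * size / capacity if capacity else 0.0
    return fill


print(channel_fill(SAMPLE))
```

A channel that stays close to 100% full usually means the sinks cannot keep up with the sources.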
