© 2014, Conversant, Inc. All rights reserved.
PRESENTED BY
May 18, 2016
APACHE FLUME OR
APACHE KAFKA?
HOW ABOUT BOTH?
Apache North America Big Data Conference - May 10, 2016
Jayesh Thakrar (jthakrar@conversantmedia.com)
 Conversant (www.conversantmedia.com)
• Adserving - real-time bidding
• Intelligent messaging using online and offline activity without using
personally identifiable information (PII)
 Hadoop Engineering
• Designs, builds, and manages clusters running Hadoop, HBase, Spark,
Storm, Kafka, Cassandra, OpenTSDB, etc.
• Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
AGENDA
 History and Evolution of Conversant's Data Pipeline
 Flume Customization
 Compare Flume and Kafka
 Metrics and Monitoring
Conversant Data
Pipeline Overview
INTER-DATACENTER DATA PIPELINE
Internet
Ad Exchanges
Web Sites
(Publishers)
Users
U.S. East Coast
Data Center
European
Data Center
Chicago
Data Center (EDW)
U.S. West Coast
Data Center
 Home-grown log collection system in Perl, shell and Python
 15-20 billion log lines
 Comma- or tab-separated log format, implicit schema
DATA PIPELINE VERSION 1
(PRIOR TO SEPT 2013)
AdServer
Application
AdServer
Application
AdServer
Applications
AdServer
Application
AdServer
Application
Data Center
Local Log
Manager
AdServer
Application
Chicago Log
Aggregator
Data
Warehouse
 Non-trivial operational and recovery effort during
• Network/WAN outage
• Planned/unplanned server maintenance
 Difficult file format/schema evolution
 Delayed reporting and metrics (2-3 hours)
 Scaling and storage utilization on local log manager
DATA PIPELINE VERSION 1
DATA PIPELINE VERSION 2
(SEP 2013 - MAR 2015)
 Application logging in Avro format
 50-80+ billion daily log lines
 3-hop flume pipeline
 Flume event schema : event header, event payload
• Header key/value = log type, log version, server-id, UUID, timestamp, # of log lines
• Payload = byte array = Avro file
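The header/payload layout above can be sketched as a plain data structure. This is a conceptual stand-in for a Flume event (header map plus body byte array): the header key names are illustrative, and the payload here is a joined byte blob rather than a real Avro file.

```python
import time
import uuid

def build_pipeline_event(log_type, log_version, server_id, log_lines):
    """Wrap a batch of application log lines as one pipeline event:
    a header key/value map plus a byte-array payload."""
    # Stand-in for an Avro file containing the batched log lines
    payload = "\n".join(log_lines).encode("utf-8")
    headers = {
        "log.type": log_type,
        "log.version": str(log_version),
        "server.id": server_id,
        "uuid": str(uuid.uuid4()),                 # used downstream for deduping
        "timestamp": str(int(time.time() * 1000)),
        "num.lines": str(len(log_lines)),
    }
    return {"headers": headers, "body": payload}

event = build_pipeline_event("impression", 3, "adserver-017",
                             ["line1", "line2", "line3"])
```

One such event then travels the whole pipeline (and maps to exactly one Kafka message in version 4).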
AdServer
Application
AdServer
Application
AdServer
Applications
with Local
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Dedicated
Hadoop Cluster
DATA PIPELINE VERSION 2
 Explicit application log schema
 Version tagged payload = easier log file schema evolution
 No manual recovery during network outages and server maintenance
 Detailed, explicit metrics in real-time
DATA PIPELINE VERSION 3
(MAR 2015-JUN 2015)
 Switch from dedicated MapR cluster to CDH cluster (new EDW)
AdServer
Application
AdServer
Application
AdServer
Applications
with Local
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Dedicated
Hadoop Cluster
Enterprise
Hadoop Cluster
DATA PIPELINE VERSION 3
 About 4-5K file creations/sec by Flume - Namenode overwhelmed!!
 Manual intervention for data recovery - a painful reminder of version 1
DATA PIPELINE VERSION 4
(JUNE 2015+ )
 Embedded flume agents in applications
 Kafka to "buffer/self-regulate" data flow
 Camus MapReduce framework to land data
AdServer
Application
AdServer
Application
AdServer
Applications
with Embedded
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Enterprise Hadoop
Cluster + Camus
Mapreduce
AdServer
Application
AdServer
Application
Kafka
DATA PIPELINE VERSION 4
 Kafka + Flume = Hadoop decoupling and data redundancy
 Additional metrics and visibility from Kafka
 In the future, allows for data sniffing/sampling and real-time stream
processing of log data
Flume Customization
ADSTACK DATA CENTER BUILDING BLOCK
 Multi-threaded application flushes
batched Avro log lines through
embedded Flume agent based on
time and/or line count thresholds
 Compressor agent compresses
data and sends downstream to
Chicago
• Custom Flume interceptor =
compression and filtering
• Custom Flume selector = event
forwarding to specific channel
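A rough sketch of the compressor agent's intercept step, with events modeled as header-map-plus-body dicts. The header names, the gzip codec, and the drop-by-type filter are assumptions for illustration, not the actual custom interceptor:

```python
import gzip

def compressor_intercept(event, drop_types=frozenset()):
    """Filter unwanted event types, then compress the payload of
    everything that survives (returns None for filtered events)."""
    if event["headers"].get("log.type") in drop_types:
        return None  # filtered out of the flow
    compressed = dict(event)
    compressed["body"] = gzip.compress(event["body"])
    # Mark the event so downstream agents know how to decode the payload
    compressed["headers"] = {**event["headers"], "compressed": "gzip"}
    return compressed
```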
CHICAGO DEDUPING AND BIFURCATING AGENTS
 Landing Flume Agent
• Custom Interceptor = Check
HBase for UUID, forward if
absent (check-and-forward)
and insert into HBase
• Custom selector = forward
every Nth event to QA flow
(dedicated channel and sink)
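The check-and-forward dedup plus every-Nth QA sampling could look roughly like this, with an in-memory set standing in for the HBase UUID table (the real lookup is an HBase check-and-insert):

```python
class DedupingSelector:
    """Sketch of the Chicago landing agent: drop duplicate UUIDs,
    and copy every Nth surviving event to a dedicated QA channel."""

    def __init__(self, qa_every_n=100):
        self.seen = set()          # stand-in for the HBase UUID table
        self.qa_every_n = qa_every_n
        self.count = 0

    def route(self, event):
        """Return the list of channels this event should be forwarded to."""
        event_uuid = event["headers"]["uuid"]
        if event_uuid in self.seen:
            return []              # duplicate: drop it
        self.seen.add(event_uuid)  # "insert into HBase"
        self.count += 1
        channels = ["main"]
        if self.count % self.qa_every_n == 0:
            channels.append("qa")  # every Nth event also goes to QA
        return channels
```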
INTO THE DATA WAREHOUSE
KEY POINTS
 Batch of application log lines = "logical log file"
= 1 Flume event = Kafka message
 Application-created custom header key/value pairs in Flume events -
log type, server-id, UUID, log version, # of log lines, timestamp, etc.
 Events compressed at remote data center
 Events deduped using HBase lookup (check-and-forward) in Chicago
 Data pipeline resilient to server and network outages and system
maintenance
Flume and Kafka
or
Flume v/s Kafka
FLUME IN A NUTSHELL: ARCHITECTURE
Source
Channel 1
Channel 2
Sink 2
Sink 1
Flume Agent
interceptor
selector
Source or Sink
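A minimal (hypothetical) Flume agent configuration wiring up the pieces in the diagram: one source fanning out to two channels via a multiplexing selector, each channel drained by its own sink. All names, hosts and paths below are placeholders.

```properties
# Agent "a1": one source, two channels, two sinks
a1.sources  = r1
a1.channels = c1 c2
a1.sinks    = k1 k2

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1 c2

# Multiplexing selector: route on a header value (the "selector" box above)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = log.type
a1.sources.r1.selector.mapping.impression = c1
a1.sources.r1.selector.default = c2

a1.channels.c1.type = file
a1.channels.c2.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events

# Avro sink for daisy-chaining to a downstream agent's Avro source
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = downstream-agent.example.com
a1.sinks.k2.port = 4141
```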
FLUME IN A NUTSHELL: ECOSYSTEM
Pre-canned Flume Sources
 Avro (pairs with the Avro sink for daisy-chaining agents)
 Thrift
 Exec (Unix pipe/stdout)
 Kafka
 Netcat
 HTTP
 Spooling Directory
 Custom Code
Pre-canned Flume Sinks
 HDFS
 Hive
 Avro (pairs with the Avro source for daisy-chaining agents)
 Thrift
 Kafka
 File Roll (output spooling directory)
 HBase
 Solr
 Elasticsearch
 Custom Code
Pre-canned Channels
 Memory Channel
 File Channel
 Kafka Channel
 JDBC Channel
KAFKA IN A NUTSHELL: ARCHITECTURE
oldest data latest data
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka Broker
Producer
Consumer A Consumer B
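The broker picture above (a producer appends at the "latest data" end while each consumer reads at its own pace between oldest and latest data) can be mimicked with a toy log. This is purely conceptual, not Kafka's API:

```python
class MiniLog:
    """Toy append-only log partition with independent consumer offsets."""

    def __init__(self):
        self.messages = []    # the partition's log, oldest first
        self.offsets = {}     # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)      # appended at the "latest data" end

    def consume(self, consumer):
        pos = self.offsets.get(consumer, 0)
        if pos >= len(self.messages):
            return None                    # this consumer is caught up
        self.offsets[consumer] = pos + 1   # each consumer tracks its own offset
        return self.messages[pos]
```

Because the broker remembers messages (short-term) rather than forgetting them on delivery, consumers A and B read the same data independently, which is the "transfer and remember" property contrasted with Flume later in the deck.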
KAFKA IN A NUTSHELL: SCALABILITY
Producer 1 Producer 2
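Kafka scales writes by spreading a topic across partitions, and a keyed producer typically hashes the message key to pick one, so events from the same key always land on the same partition. A sketch of the idea (Kafka's default partitioner uses murmur2; MD5 is just a stable stdlib stand-in here):

```python
import hashlib

def choose_partition(key, num_partitions):
    """Map a message key to a partition number deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an integer, then wrap into range
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Adding brokers and partitions spreads this key space over more machines, which is the scalability model compared against Flume's re-wiring later in the deck.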
FLUME AND KAFKA: DATA PIPELINE BLOCKS
Data
Source
Flume or
Kafka
Data
Destination
Flume or
Kafka
Data
Source
Data
Source
Data
Destination
Data
Destination
Data
Destination
Data
Destination
Flume or
Kafka
FLUME V/S KAFKA: DATA AND ROUTING
 Data pipeline block philosophy
• Flume = buffered pipeline => transfer and forget
• Kafka = buffered temporal log => transfer and remember (short-term)
 Data introspection, manipulation and conditional routing/multiplexing
• Flume = Can intercept and manipulate events (source/sink interceptor)
• Flume = Conditional routing, multiplexing of events (source/sink selector)
• Kafka = Pass-through only
 Changes in data destination or data source
• Flume = requires re-configuration
• Kafka = N/A, source (producer) and destination (consumer) agnostic
FLUME V/S KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
 Server Outage
• Flume = Flume-to-flume or incoming flow = failover to backup flume agent
Outgoing flow = buffered in channel in flume agent
• Kafka = Producer/consumer failover to another broker (replica partition)
 Scalability
• Flume = add agents, re-configure (re-wire) data flows
• Kafka = add brokers, increase topic partitions and (re)distribute partitions
 Ecosystem
• Flume = Pre-canned sources, sinks and channels
• Kafka = Kafka Connect and Kafka Streams
Administration,
Metrics and Monitoring
ADMINISTRATION
 No built-in UI in either
 Flume: agent stop/start shell script
 Kafka
• Stop/start brokers
• Create/delete/view/manage topics, partitions and topic configuration
• Other utilities - e.g. view log data, stress testing, etc.
FLUME METRICS: JMX AND HTTP/JSON ENDPOINT
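Each Flume agent can serve its counters as one JSON object per component over HTTP. A small sketch of scraping channel fill from such a document; the component names and counter values below are made up for illustration, and the field names (e.g. ChannelFillPercentage) follow Flume's counter naming but should be verified against your agent's actual output:

```python
import json

# Sample of the kind of JSON a Flume agent serves at http://<host>:<port>/metrics
sample = """{
  "CHANNEL.c1": {"Type": "CHANNEL", "ChannelCapacity": "10000",
                 "ChannelSize": "1500", "ChannelFillPercentage": "15.0"},
  "SOURCE.r1":  {"Type": "SOURCE", "EventReceivedCount": "52000",
                 "EventAcceptedCount": "52000"}
}"""

def channel_fill(metrics_json):
    """Extract the fill percentage of every channel from the metrics JSON
    (values arrive as strings, so convert them to floats)."""
    metrics = json.loads(metrics_json)
    return {name: float(stats["ChannelFillPercentage"])
            for name, stats in metrics.items()
            if name.startswith("CHANNEL.")}
```

A scraper like this feeding TSDB is one way to drive the channel-utilization alerting described on the monitoring slide.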
METRICS
 Kafka - JMX and API
• Broker network traffic
• Topic and partition traffic
• Replication and consumer lag
MONITORING AND ALERTING
 Flume Key Health Indicators
• Flume listener port
• Incoming traffic rate and errors
• Outgoing traffic rate and errors
• Channel capacity utilization
 Kafka Key Health Indicators
• Broker listener port
• Under-replicated partitions (in-sync replica count and replica lag)
• Consumer lag
MONITORING & METRICS @ CONVERSANT
TSDB Graph of
Flume Events
Across Data Centers
The blip is a rolling
restart of servers for a
software deploy
Legend
Chicago
East Coast
West Coast
Europe
MONITORING & METRICS IN GRAFANA DASHBOARDS
MORE INFO ON CONVERSANT DATA PIPELINE
 Conversant Blog
http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
 Sample GitHub Project
https://github.com/mbkeane/BigDataTechCon
 Chicago Area Kafka Enthusiasts (CAKE)
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233
Questions?
