© 2014, Conversant, Inc. All rights reserved.
PRESENTED BY
May 18, 2016
APACHE FLUME OR
APACHE KAFKA?
HOW ABOUT BOTH?
Apache North America Big Data Conference - May 10, 2016
Jayesh Thakrar (jthakrar@conversantmedia.com)
 Conversant (www.conversantmedia.com)
• Adserving - real-time bidding
• Intelligent messaging using online and offline activity without using
personally identifiable information (PII)
 Hadoop Engineering
• Designs, builds, and manages clusters running Hadoop, HBase, Spark,
Storm, Kafka, Cassandra, OpenTSDB, etc.
• Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
AGENDA
 History and Evolution of Conversant's Data Pipeline
 Flume Customization
 Compare Flume and Kafka
 Metrics and Monitoring
Conversant Data
Pipeline Overview
INTER-DATACENTER DATA PIPELINE
Internet
Ad Exchanges
Web Sites
(Publishers)
Users
U.S. East Coast
Data Center
European
Data Center
Chicago
Data Center (EDW)
U.S. West Coast
Data Center
 Home-grown log collection system in Perl, shell and Python
 15-20 billion log lines
 Comma- or tab-separated log format, implicit schema
DATA PIPELINE VERSION 1
(PRIOR TO SEPT 2013)
AdServer
Application
AdServer
Application
AdServer
Applications
AdServer
Application
AdServer
Application
Data Center
Local Log
Manager
AdServer
Application
Chicago Log
Aggregator
Data
Warehouse
 Non-trivial operational and recovery effort during
• Network/WAN outage
• Planned/unplanned server maintenance
 Difficult file format/schema evolution
 Delayed reporting and metrics (2-3 hours)
 Scaling and storage utilization on local log manager
DATA PIPELINE VERSION 1
DATA PIPELINE VERSION 2
(SEP 2013 - MAR 2015)
 Application logging in Avro format
 50-80+ billion daily log lines
 3-hop flume pipeline
 Flume event schema : event header, event payload
• Header key/value = log type, log version, server-id, UUID, timestamp, # of log lines
• Payload = byte array = Avro file
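The header/payload layout above can be sketched as a plain data structure. This is a conceptual stand-in for a Flume event (header map plus body byte array): the header key names are illustrative, and the payload here is a joined byte blob rather than a real Avro file.

```python
import time
import uuid

def build_pipeline_event(log_type, log_version, server_id, log_lines):
    """Wrap a batch of application log lines as one pipeline event:
    a header key/value map plus a byte-array payload."""
    # Stand-in for an Avro file containing the batched log lines
    payload = "\n".join(log_lines).encode("utf-8")
    headers = {
        "log.type": log_type,
        "log.version": str(log_version),
        "server.id": server_id,
        "uuid": str(uuid.uuid4()),                 # used downstream for deduping
        "timestamp": str(int(time.time() * 1000)),
        "num.lines": str(len(log_lines)),
    }
    return {"headers": headers, "body": payload}

event = build_pipeline_event("impression", 3, "adserver-017",
                             ["line1", "line2", "line3"])
```

One such event then travels the whole pipeline (and maps to exactly one Kafka message in version 4).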
AdServer
Application
AdServer
Application
AdServer
Applications
with Local
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Dedicated
Hadoop Cluster
DATA PIPELINE VERSION 2
 Explicit application log schema
 Version tagged payload = easier log file schema evolution
 No manual recovery during network outages and server maintenance
 Detailed, explicit metrics in real-time
DATA PIPELINE VERSION 3
(MAR 2015-JUN 2015)
 Switch from dedicated MapR cluster to CDH cluster (new EDW)
AdServer
Application
AdServer
Application
AdServer
Applications
with Local
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Dedicated
Hadoop Cluster
Enterprise
Hadoop Cluster
DATA PIPELINE VERSION 3
 About 4-5K file creations/sec by Flume - Namenode overwhelmed!!
 Manual intervention for data recovery - a painful reminder of version 1
DATA PIPELINE VERSION 4
(JUNE 2015+ )
 Embedded flume agents in applications
 Kafka to "buffer/self-regulate" data flow
 Camus MapReduce framework to land data
AdServer
Application
AdServer
Application
AdServer
Applications
with Embedded
Flume Agents
AdServer
Application
AdServer
Application
Data Center
Local
Compressor
Flume Agents
AdServer
Application
Chicago Deduping
and Bifurcating
Flume Agents
Enterprise Hadoop
Cluster + Camus
Mapreduce
AdServer
Application
AdServer
Application
Kafka
DATA PIPELINE VERSION 4
 Kafka + Flume = Hadoop decoupling and data redundancy
 Additional metrics and visibility from Kafka
 In the future, allows for data sniffing/sampling and real-time stream
processing of log data
Flume Customization
ADSTACK DATA CENTER BUILDING BLOCK
 Multi-threaded application flushes
batched Avro log lines through
embedded Flume agent based on
time and/or line count thresholds
 Compressor agent compresses
data and sends downstream to
Chicago
• Custom Flume interceptor =
compression and filtering
• Custom Flume selector = event
forwarding to specific channel
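A rough sketch of the compressor agent's intercept step, with events modeled as header-map-plus-body dicts. The header names, the gzip codec, and the drop-by-type filter are assumptions for illustration, not the actual custom interceptor:

```python
import gzip

def compressor_intercept(event, drop_types=frozenset()):
    """Filter unwanted event types, then compress the payload of
    everything that survives (returns None for filtered events)."""
    if event["headers"].get("log.type") in drop_types:
        return None  # filtered out of the flow
    compressed = dict(event)
    compressed["body"] = gzip.compress(event["body"])
    # Mark the event so downstream agents know how to decode the payload
    compressed["headers"] = {**event["headers"], "compressed": "gzip"}
    return compressed
```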
CHICAGO DEDUPING AND BIFURCATING AGENTS
 Landing Flume Agent
• Custom Interceptor = Check
HBase for UUID, forward if
absent (check-and-forward)
and insert into HBase
• Custom selector = forward
every Nth event to QA flow
(dedicated channel and sink)
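The check-and-forward dedup plus every-Nth QA sampling could look roughly like this, with an in-memory set standing in for the HBase UUID table (the real lookup is an HBase check-and-insert):

```python
class DedupingSelector:
    """Sketch of the Chicago landing agent: drop duplicate UUIDs,
    and copy every Nth surviving event to a dedicated QA channel."""

    def __init__(self, qa_every_n=100):
        self.seen = set()          # stand-in for the HBase UUID table
        self.qa_every_n = qa_every_n
        self.count = 0

    def route(self, event):
        """Return the list of channels this event should be forwarded to."""
        event_uuid = event["headers"]["uuid"]
        if event_uuid in self.seen:
            return []              # duplicate: drop it
        self.seen.add(event_uuid)  # "insert into HBase"
        self.count += 1
        channels = ["main"]
        if self.count % self.qa_every_n == 0:
            channels.append("qa")  # every Nth event also goes to QA
        return channels
```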
INTO THE DATA WAREHOUSE
KEY POINTS
 Batch of application log lines = "logical log file"
= 1 Flume event = Kafka message
 Application-created custom header key/value pairs in Flume events -
log type, server-id, UUID, log version, # of log lines, timestamp, etc.
 Events compressed at remote data center
 Events deduped using HBase lookup (check-and-forward) in Chicago
 Data pipeline resilient to server and network outages and system
maintenance
Flume and Kafka
or
Flume v/s Kafka
FLUME IN A NUTSHELL: ARCHITECTURE
Source
Channel 1
Channel 2
Sink 2
Sink 1
Flume Agent
interceptor
selector
Source or Sink
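A minimal (hypothetical) Flume agent configuration wiring up the pieces in the diagram: one source fanning out to two channels via a multiplexing selector, each channel drained by its own sink. All names, hosts and paths below are placeholders.

```properties
# Agent "a1": one source, two channels, two sinks
a1.sources  = r1
a1.channels = c1 c2
a1.sinks    = k1 k2

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1 c2

# Multiplexing selector: route on a header value (the "selector" box above)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = log.type
a1.sources.r1.selector.mapping.impression = c1
a1.sources.r1.selector.default = c2

a1.channels.c1.type = file
a1.channels.c2.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events

# Avro sink for daisy-chaining to a downstream agent's Avro source
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = downstream-agent.example.com
a1.sinks.k2.port = 4141
```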
FLUME IN A NUTSHELL: ECOSYSTEM
Pre-canned Flume Sources
 Avro (pairs with the Avro sink for daisy-chaining agents)
 Thrift
 Exec (Unix pipe/stdout)
 Kafka
 Netcat
 HTTP
 Spooling Directory
 Custom Code
Pre-canned Flume Sinks
 HDFS
 Hive
 Avro (pairs with the Avro source for daisy-chaining agents)
 Thrift
 Kafka
 File Roll (output spooling directory)
 HBase
 Solr
 Elasticsearch
 Custom Code
Pre-canned Channels
 Memory Channel
 File Channel
 Kafka Channel
 JDBC Channel
KAFKA IN A NUTSHELL: ARCHITECTURE
oldest data latest data
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka Broker
Producer
Consumer A Consumer B
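The broker picture above (a producer appends at the "latest data" end while each consumer reads at its own pace between oldest and latest data) can be mimicked with a toy log. This is purely conceptual, not Kafka's API:

```python
class MiniLog:
    """Toy append-only log partition with independent consumer offsets."""

    def __init__(self):
        self.messages = []    # the partition's log, oldest first
        self.offsets = {}     # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)      # appended at the "latest data" end

    def consume(self, consumer):
        pos = self.offsets.get(consumer, 0)
        if pos >= len(self.messages):
            return None                    # this consumer is caught up
        self.offsets[consumer] = pos + 1   # each consumer tracks its own offset
        return self.messages[pos]
```

Because the broker remembers messages (short-term) rather than forgetting them on delivery, consumers A and B read the same data independently, which is the "transfer and remember" property contrasted with Flume later in the deck.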
KAFKA IN A NUTSHELL: SCALABILITY
Producer 1 Producer 2
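Kafka scales writes by spreading a topic across partitions, and a keyed producer typically hashes the message key to pick one, so events from the same key always land on the same partition. A sketch of the idea (Kafka's default partitioner uses murmur2; MD5 is just a stable stdlib stand-in here):

```python
import hashlib

def choose_partition(key, num_partitions):
    """Map a message key to a partition number deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an integer, then wrap into range
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Adding brokers and partitions spreads this key space over more machines, which is the scalability model compared against Flume's re-wiring later in the deck.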
FLUME AND KAFKA: DATA PIPELINE BLOCKS
Data
Source
Flume or
Kafka
Data
Destination
Flume or
Kafka
Data
Source
Data
Source
Data
Destination
Data
Destination
Data
Destination
Data
Destination
Flume or
Kafka
FLUME V/S KAFKA: DATA AND ROUTING
 Data pipeline block philosophy
• Flume = buffered pipeline => transfer and forget
• Kafka = buffered temporal log => transfer and remember (short-term)
 Data introspection, manipulation and conditional routing/multiplexing
• Flume = Can intercept and manipulate events (source/sink interceptor)
• Flume = Conditional routing, multiplexing of events (source/sink selector)
• Kafka = Pass-through only
 Changes in data destination or data source
• Flume = requires re-configuration
• Kafka = N/A, source (producer) and destination (consumer) agnostic
FLUME V/S KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
 Server Outage
• Flume = Flume-to-flume or incoming flow = failover to backup flume agent
Outgoing flow = buffered in channel in flume agent
• Kafka = Producer/consumer failover to another broker (replica partition)
 Scalability
• Flume = add agents, re-configure (re-wire) data flows
• Kafka = add brokers, increase topic partitions and (re)distribute partitions
 Ecosystem
• Flume = Pre-canned sources, sinks and channels
• Kafka = Kafka Connect and Kafka Streams
Administration,
Metrics and Monitoring
ADMINISTRATION
 No built-in UI in either
 Flume: agent stop/start shell script
 Kafka
• Stop/start brokers
• Create/delete/view/manage topics, partitions and topic configuration
• Other utilities - e.g. view log data, stress testing, etc.
FLUME METRICS: JMX AND HTTP/JSON ENDPOINT
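Each Flume agent can serve its counters as one JSON object per component over HTTP. A small sketch of scraping channel fill from such a document; the component names and counter values below are made up for illustration, and the field names (e.g. ChannelFillPercentage) follow Flume's counter naming but should be verified against your agent's actual output:

```python
import json

# Sample of the kind of JSON a Flume agent serves at http://<host>:<port>/metrics
sample = """{
  "CHANNEL.c1": {"Type": "CHANNEL", "ChannelCapacity": "10000",
                 "ChannelSize": "1500", "ChannelFillPercentage": "15.0"},
  "SOURCE.r1":  {"Type": "SOURCE", "EventReceivedCount": "52000",
                 "EventAcceptedCount": "52000"}
}"""

def channel_fill(metrics_json):
    """Extract the fill percentage of every channel from the metrics JSON
    (values arrive as strings, so convert them to floats)."""
    metrics = json.loads(metrics_json)
    return {name: float(stats["ChannelFillPercentage"])
            for name, stats in metrics.items()
            if name.startswith("CHANNEL.")}
```

A scraper like this feeding TSDB is one way to drive the channel-utilization alerting described on the monitoring slide.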
METRICS
 Kafka - JMX and API
• Broker network traffic
• Topic and partition traffic
• Replication and consumer lag
MONITORING AND ALERTING
 Flume Key Health Indicators
• Flume listener port
• Incoming traffic rate and errors
• Outgoing traffic rate and errors
• Channel capacity utilization
 Kafka Key Health Indicators
• Broker listener port
• Under-replicated partitions (in-sync replica count and replica lag)
• Consumer lag
MONITORING & METRICS @ CONVERSANT
TSDB Graph of
Flume Events
Across Data Centers
The blip is a rolling
restart of servers for a
software deploy
Legend
Chicago
East Coast
West Coast
Europe
MONITORING & METRICS IN GRAFANA DASHBOARDS
MORE INFO ON CONVERSANT DATA PIPELINE
 Conversant Blog
http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
 Sample GitHub Project
https://github.com/mbkeane/BigDataTechCon
 Chicago Area Kafka Enthusiasts (CAKE)
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233
Questions?
