Apache Kafka Best Practices
Manikumar Reddy
@omkreddy
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Kafka
 Core APIs
– The Producer API
– The Consumer API
– The Connector API
– The Streams API
 Broad classes of applications
– Building real-time streaming data pipelines
– Building real-time streaming applications
– Serving as a core building block in other data systems
Key Concepts and Terminology
Component Layout
Hardware Guidance
 Kafka Brokers
– Cluster size: 3+
– Memory: 24GB+ (for small), 64GB+ (for large)
– CPU: multi-core processors (12+ CPU cores), hyper-threading enabled
– Storage: 6+ x 1TB dedicated disks (RAID or JBOD)
 ZooKeeper
– Cluster size: 3 (for small), 5 (for large)
– Memory: 8GB+ (for small), 24GB+ (for large)
– CPU: 2+ cores
– Storage: SSD for transaction logs
OS Tuning
 OS Page Cache
– Ex: Leave enough memory for the page cache to hold all the active segments of the log
 File descriptor limit: >100k
 Minimize swapping
 TCP tuning
 JVM Configs
– Java 8 with G1 Collector
– 6-8 GB heap
Kafka Disk Storage
 Use multiple disk spindles, dedicated to Kafka
 JBOD vs RAID10
 JBOD
– Makes the I/O capacity of every disk available
 JBOD limitations
– Any disk failure causes an unclean shutdown and requires a lengthy recovery
– Data is not distributed evenly across disks
– Multiple log directories to manage
 KIP-112/113
– Provide the tools users need to manage JBOD
– Intelligent partition assignment
– On disk failure, the broker can keep serving replicas on the good disks
– Replicas can be reassigned between disks of the same broker
RAID
 RAID10
– Can survive single disk failure
– Performance and protection
– balance load across disks
– Single mount point
– Costs some performance and reduces usable space
 File System
– EXT4 or XFS
– SSD
– Issues on NFS.
– SAN, NAS
Basic Monitoring
 CPU Load
 Network Metrics
 File Handle Usage
 Disk Space
 Disk I/O Performance
 Garbage Collection
 ZooKeeper Monitoring
Kafka Replication
 Each partition has replicas: one leader replica and follower replicas
 The leader maintains the set of in-sync replicas (ISR)
– replica.lag.time.max.ms, num.replica.fetchers
– min.insync.replicas – used with producer acks=all to ensure greater durability
https://www.slideshare.net/junrao/kafka-replication-apachecon2013
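To make the durability setting concrete, here is a minimal sketch of how min.insync.replicas interacts with the producer's acks setting: when the producer asks for acks=all, the leader rejects the write if the ISR has shrunk below the configured minimum. This is a simplified model; the function and argument names are illustrative and are not Kafka APIs.

```python
def write_accepted(isr_size: int, min_insync_replicas: int, acks: str) -> bool:
    """Return True if the leader would accept a produce request.

    Simplified model: min.insync.replicas is only enforced when the
    producer asks for acks=all (also written acks=-1).
    """
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    # acks=0 and acks=1 do not wait for the rest of the ISR;
    # the leader itself (ISR size of at least 1) is enough.
    return isr_size >= 1


# Replication factor 3, min.insync.replicas=2, one follower out of sync.
print(write_accepted(isr_size=2, min_insync_replicas=2, acks="all"))
# Two followers out of sync: acks=all writes are rejected.
print(write_accepted(isr_size=1, min_insync_replicas=2, acks="all"))
```

With a replication factor of 3 and a minimum of 2, the topic tolerates one replica being out of sync without rejecting acks=all writes.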
Under Replicated Partitions
 Number of partitions which are not fully replicated within the cluster
 Mbean - kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
 ISR Shrink/Expand Rate
 Possible causes of under-replicated partitions
– Lost broker
– Controller issues
– ZooKeeper issues
– Network issues
 Solutions
– Tune the ISR settings
– Add brokers
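What the UnderReplicatedPartitions metric counts can be sketched as follows: a partition is under-replicated when its ISR is smaller than its assigned replica set. The dictionary layout and the sample broker ids are assumptions made for illustration, not a Kafka data structure.

```python
from typing import Dict, List, Tuple

# partition id -> (assigned replica broker ids, in-sync replica broker ids)
PartitionState = Dict[int, Tuple[List[int], List[int]]]


def under_replicated(partitions: PartitionState) -> List[int]:
    """Return the ids of partitions whose ISR is smaller than the replica set."""
    return sorted(
        pid
        for pid, (replicas, isr) in partitions.items()
        if len(isr) < len(replicas)
    )


state = {
    0: ([1, 2, 3], [1, 2, 3]),  # fully replicated
    1: ([2, 3, 1], [2]),        # two followers fell out of the ISR
    2: ([3, 1, 2], [3, 1]),     # one follower is lagging
}
print(under_replicated(state))
```

If the same broker id is missing from the ISR of every affected partition, a lost or struggling broker is the first thing to check.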
Controller
 Manages the partition life cycle
 Avoid expiry of the controller's ZK session
– Soft failures – ISR Churn/Under replicated partitions
– ZK Server performance
– Long GC pauses on Broker
– Bad network configuration
 Monitoring
– Mbean : kafka.controller:type=KafkaController,name=ActiveControllerCount
– Exactly one broker in the cluster should report 1
– LeaderElectionRate
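The ActiveControllerCount rule can be turned into a simple check: sum the value reported by every broker and alert unless the total is exactly 1. The sketch below assumes the per-broker readings have already been collected (for example over JMX) into a dictionary; the broker names are hypothetical.

```python
from typing import Dict


def controller_health(active_counts: Dict[str, int]) -> str:
    """Classify a cluster from per-broker ActiveControllerCount readings."""
    total = sum(active_counts.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"


readings = {"broker-1": 0, "broker-2": 1, "broker-3": 0}
print(controller_health(readings))
```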
Unclean leader election
 Allows replicas that are not in the ISR to be elected as leader
 Availability vs correctness
– By default, Kafka chooses availability
 Monitoring
– Mbean : kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
 The default will be changed in the next release
Broker Configs
 log.retention.{ms, minutes, hours} , log.retention.bytes
 message.max.bytes, replica.fetch.max.bytes
 delete.topic.enable
 unclean.leader.election.enable = false
 min.insync.replicas = 2
 replica.lag.time.max.ms, num.replica.fetchers
 replica.fetch.response.max.bytes
 zookeeper.session.timeout.ms = 30000 (30s)
 num.io.threads
Cluster Sizing
 Broker Sizing
– Partition count on each broker (<2K)
– Keep partition size on disk manageable (under 25GB per partition )
 Cluster Size (no. of brokers)
– How much retention is needed
– How much traffic the cluster receives
 Cluster Expansion
– Disk usage on the log segments partition should stay under 60%
– Network usage on each broker should stay under 75%
 Cluster Monitoring
– Keep cluster balanced
– Ensure that partitions of a topic are fairly distributed across brokers
– Ensure that nodes in the cluster are not running out of disk space or network capacity
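The sizing rules of thumb on this slide (fewer than 2K partitions per broker, partitions under 25GB, disk usage under 60%, network usage under 75%) can be checked mechanically. The sketch below is one possible encoding; the BrokerStats fields and the sample numbers are hypothetical, and the thresholds are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerStats:
    partitions: int              # partitions hosted on this broker
    largest_partition_gb: float  # on-disk size of the largest partition
    disk_used_pct: float         # usage of the log volume, 0-100
    network_used_pct: float      # network utilisation, 0-100


def sizing_warnings(b: BrokerStats) -> List[str]:
    """Return one warning per rule of thumb that the broker violates."""
    warnings = []
    if b.partitions >= 2000:
        warnings.append("partition count is not under 2K")
    if b.largest_partition_gb >= 25:
        warnings.append("a partition is not under 25GB")
    if b.disk_used_pct >= 60:
        warnings.append("disk usage is not under 60%: consider expanding")
    if b.network_used_pct >= 75:
        warnings.append("network usage is not under 75%: consider expanding")
    return warnings


print(sizing_warnings(BrokerStats(1200, 18.0, 72.5, 40.0)))
```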
Broker Monitoring
 Partition Counts
– Mbean: kafka.server:type=ReplicaManager,name=PartitionCount
 Leader replica counts
– Mbean: kafka.server:type=ReplicaManager,name=LeaderCount
 ISR shrink rate/ISR expansion rate
– Mbean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
– Mbean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
 Message in rate/Byte in rate/Byte out rate
 NetworkProcessorAvgIdlePercent
 RequestHandlerAvgIdlePercent
Topic Sizing
 No. of partitions
– Have at least as many partitions as there are consumers in the largest group
– Use more partitions when the topic is very busy
– Keep partition size on disk manageable (under 25GB per partition )
– Take into account any other application requirements
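– Special use cases may call for a single partition
 Keyed messages
– Create enough partitions up front to handle future growth, since adding partitions later changes which partition a key maps to
 Expanding partitions
– Add partitions whenever the on-disk size of a partition exceeds the threshold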
Choosing Partitions
 A rough number of partitions can be picked from the throughput requirements
– Let P be the throughput a producer can achieve on a single partition
– Let C be the throughput a consumer can achieve from a single partition
– Let T be the target throughput
– Use at least max(T/P, T/C) partitions
 Costs of more partitions
– More open file handles
– May increase unavailability
– May increase end-to-end latency
– More memory for clients
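The max(T/P, T/C) rule is easy to work through in code. The sketch below rounds each ratio up, since a fractional partition is not possible; the throughput figures are hypothetical and would come from your own benchmarks.

```python
import math


def min_partitions(target: float, per_partition_produce: float,
                   per_partition_consume: float) -> int:
    """Lower bound on the partition count: max(T/P, T/C), rounded up."""
    return max(
        math.ceil(target / per_partition_produce),
        math.ceil(target / per_partition_consume),
    )


# Hypothetical numbers in MB/s: T = 1000, P = 20, C = 50.
print(min_partitions(1000, 20, 50))  # prints 50
```

Here the producer side is the bottleneck (1000/20 = 50 versus 1000/50 = 20), so at least 50 partitions are needed before weighing the costs listed above.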
Quotas
 Protect the cluster from misbehaving clients and maintain SLAs
 Byte-rate thresholds on produce and fetch requests
 Can be applied to (user, client-id) pairs, users, or client-id groups
 The broker delays responses to clients that exceed their quota
 Broker Metrics for monitoring – throttle-rate, byte-rate
 replica.fetch.response.max.bytes
– Limit memory usage of replica fetch response
 Limiting bandwidth usage during data migration
– kafka-reassign-partitions.sh --throttle option
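One way to reason about delayed responses: if a client's observed rate O over a window W exceeds its quota T, delaying the response by X such that O·W/(W+X) = T brings the average rate back down to the quota. Solving for X gives X = (O − T)/T · W. The sketch below implements that model only; it is an illustration that is not taken from the slides, and the broker's actual computation may differ in detail.

```python
def throttle_delay_ms(observed_rate: float, quota: float, window_ms: float) -> float:
    """Delay that brings the average rate over the window back to the quota.

    Derived from observed_rate * window / (window + delay) = quota.
    """
    if observed_rate <= quota:
        return 0.0  # within quota: no throttling
    return (observed_rate - quota) / quota * window_ms


# Hypothetical client producing 15 MB/s against a 10 MB/s quota, 1 s window.
print(throttle_delay_ms(15_000_000, 10_000_000, 1000))  # prints 500.0
```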
Kafka Producer
 Use the new Java-based clients
 Test in your Environment
– kafka-producer-perf-test.sh
 Memory
 CPU
 Batch Compression
 Avoid large messages
– They create more memory pressure
– They slow down the brokers
Critical Configs
 batch.size
– Size-based batching
– Larger value -> higher throughput, higher latency
 linger.ms
– Time-based batching
– Larger value -> higher throughput, higher latency
 max.in.flight.requests.per.connection
– Higher values improve throughput but can affect ordering when requests are retried
 compression.type
– Compression runs in the user thread, so adding more user threads can help throughput
 acks
– Affects message durability
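The interplay of batch.size and linger.ms can be approximated with a small model: a batch is sent when it fills up or when the linger time expires, whichever comes first. The function below is a rough sketch under that assumption; it ignores compression and the case where the sender is free to send earlier, and every name in it is illustrative.

```python
def batch_send_delay_ms(batch_size_bytes: int, linger_ms: float,
                        record_bytes: int, records_per_sec: float) -> float:
    """Approximate wait before a batch for one partition is sent."""
    bytes_per_ms = record_bytes * records_per_sec / 1000.0
    if bytes_per_ms == 0:
        return linger_ms  # nothing arrives, so only the timer can fire
    fill_ms = batch_size_bytes / bytes_per_ms
    return min(fill_ms, linger_ms)


# 16 KB batches, 1 KB records arriving at 1000 records/s on one partition.
print(batch_send_delay_ms(16384, 5, 1024, 1000))    # timer fires first
print(batch_send_delay_ms(16384, 100, 1024, 1000))  # batch fills first
```

In the first call the batch goes out partly empty after the linger time; in the second it fills before the timer fires, which is where the extra throughput comes from at the price of added latency.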
Performance tuning
 If throughput < network capacity
– Add more user threads
– Increase batch size
– Add more producer instances
– Add more partitions
 High latency when acks = -1 (all)
– Increase num.replica.fetchers
 Cross datacenter data transfer
– Tune socket buffer settings, OS tcp buffer settings
Producer Monitoring
 batch-size-avg
 compression-rate-avg
 waiting-threads
 buffer-available-bytes
 record-queue-time-max
 record-send-rate
 records-per-request-avg
Kafka Consumer
 Test in your Environment
– kafka-consumer-perf-test.sh
 Throughput Issues
– Not enough partitions
– OS page cache: allocate enough to hold all the messages your consumers read over, say, 30s
– Application/Processing logic
 Offsets topic
– __consumer_offsets
– offsets.topic.replication.factor
– offsets.retention.minutes
– Monitor ISR, topic size
 Slow offset commits
– Commit asynchronously and use manual commits
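The page cache guidance for consumers reduces to simple arithmetic: the traffic rate multiplied by the window the consumers should be served from memory. The sketch below assumes a hypothetical broker ingest rate of 100 MB/s and the 30 s window mentioned above.

```python
def page_cache_bytes(read_bytes_per_sec: float, window_sec: float = 30.0) -> float:
    """Memory needed to keep the last window_sec of traffic in the page cache."""
    return read_bytes_per_sec * window_sec


# Hypothetical broker taking 100 MB/s that consumers read back promptly.
needed = page_cache_bytes(100 * 1024 * 1024)
print(f"{needed / 1024**3:.2f} GiB")
```

Consumers that fall further behind than this window are served from disk rather than memory, which is one reason lagging consumers hurt throughput.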
Consumer Configs
 fetch.min.bytes and fetch.max.wait.ms
 max.poll.interval.ms
 max.poll.records
 session.timeout.ms
 Consumer Rebalance
– check timeouts
– check processing times/logic
– GC Issues
 Tune network settings
Consumer Monitoring
 Check whether the consumer is keeping up with the messages being produced
 Consumer Lag: Difference between the end of the log and the consumer offset
 Monitoring
– Metrics Monitoring - records-lag-max
– bin/kafka-consumer-groups.sh
– LinkedIn’s Burrow for consumer monitoring
 Decreasing Lag
– Analyze the consumer: GC issues, hung instances
– Add more consumer instances
– Increase the number of partitions and consumers
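Consumer lag as defined above (end of the log minus the consumer offset) can be computed per partition once both offsets are known. The sketch below assumes the offsets have already been fetched into dictionaries keyed by partition id; the sample values are made up.

```python
from typing import Dict


def consumer_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log end offset - committed consumer offset.

    Assumption: a partition with no committed offset is treated as if the
    consumer were at offset 0.
    """
    return {
        pid: max(end - committed.get(pid, 0), 0)
        for pid, end in log_end.items()
    }


end_offsets = {0: 1500, 1: 980, 2: 2000}
group_offsets = {0: 1500, 1: 900, 2: 1250}
lag = consumer_lag(end_offsets, group_offsets)
print(lag, "max lag:", max(lag.values()))
```

A lag that grows on one partition only points at a single slow or hung instance, while lag that grows everywhere suggests the group needs more consumers or partitions.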
No data loss settings
 Producer
– block.on.buffer.full=true (deprecated in newer clients in favor of max.block.ms)
– retries=Integer.MAX_VALUE
– acks=all
– max.in.flight.requests.per.connection=1
– close producer
 Broker
– replication factor >= 3
– min.insync.replicas=2
– disable unclean leader election
 Consumer
– Disable auto commit (enable.auto.commit=false)
– Commit offsets only after the messages are processed
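Why committing only after processing avoids loss can be shown with a small simulation that uses no Kafka client at all: the log is a list, the committed offset is an integer, and a simulated crash happens while processing one record. All names here are illustrative.

```python
from typing import Callable, List


def consume(log: List[str], committed: int,
            process: Callable[[str], None]) -> int:
    """Process records from the committed offset, committing after each success.

    Returns the new committed offset. If process() raises, the failed record
    is not committed, so it is read again on the next run.
    """
    offset = committed
    try:
        while offset < len(log):
            process(log[offset])
            offset += 1  # "commit" only after processing succeeded
    except RuntimeError:
        pass  # simulated crash: stop without committing the failed record
    return offset


log = ["a", "b", "c"]
processed: List[str] = []
pending_failures = {"b"}  # records that fail the first time they are seen


def handler(record: str) -> None:
    if record in pending_failures:
        pending_failures.discard(record)
        raise RuntimeError("simulated crash")
    processed.append(record)


offset = consume(log, 0, handler)       # stops at "b"; committed offset is 1
offset = consume(log, offset, handler)  # resumes at "b"
print(offset, processed)
```

The record that failed is delivered again instead of being skipped, so this pattern gives at-least-once delivery: processing must tolerate duplicates.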
Authorizer - Ranger Auditing
Kafka Mirror Maker
 Tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster
Kafka Mirror Maker
 Run multiple mirroring processes
– high fault-tolerance
– high throughput
 --num.streams option to specify the number of consumer threads
– Set num.streams to the number of consumer threads to run per process
Kafka Mirror Maker
 Consumer and source cluster socket buffer sizes
– Use a high value for the socket buffer size
– Tune the consumer's fetch size
– OS networking tuning
 Source and target clusters are independent entities
– They can have different numbers of partitions
– Offsets will not be the same
– Messages are partitioned by key, so order is preserved on a per-key basis
 Create topics in target cluster
 Monitor whether a mirror is keeping up
– Consumer Lag
 Running In Secure Clusters
– We recommend using SSL
– MirrorMaker can be run on the source cluster
Open source Operational Tools
 Ambari Metrics
– https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/grafana_kafka_dashboards.html
 Removing brokers and rebalancing partitions in a cluster
– https://github.com/linkedin/kafka-tools
 Consumer Lag Monitoring
– Burrow (https://github.com/linkedin/Burrow)
 Kafka Manager - https://github.com/yahoo/kafka-manager
Apache Kafka 0.10.2 release
 Includes 15 KIPs, over 200 bug fixes and improvements
 The newest Java Clients now support older brokers (0.10.0 and higher)
 Separation of Internal and External traffic
 Create Topic Policy
 Security Improvements
– Support for SASL/SCRAM mechanisms
– Dynamic JAAS configuration for Kafka clients
– Support for authentication of multiple Kafka clients in single JVM
 Producer and Consumer Improvements
 Connect API & Streams API improvements
Thank You
References
 http://kafka.apache.org/documentation.html
 https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
 https://www.slideshare.net/JiangjieQin/producer-performance-tuning-for-apache-kafka-63147600
 https://www.slideshare.net/ToddPalino/tuning-kafka-for-fun-and-profit
 https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
 https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
 https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/

Apache Kafka Best Practices

  • 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Kafka  Core APIs – The Producer API – The Consumer API – The Connector API – The Streams API  Broad classes of applications – Building real-time streaming data pipelines – Building real-time streaming applications – core building block in other data systems
  • 3. 3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Key Concepts and Terminology
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Component Layout
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hardware Guidance Cluster Size Memory CPU Storage Kafka Brokers 3+ 24G+ (for small) 64GB+ (for large) Multi- core processors( 12 CPU+ core), Hyper threading enabled 6+ x 1TB dedicated disks( RAID or JBOD) Zookeeper 3 (for small) 5 (for large) 8GB+ (for small) 24GB+ (for large) 2 core + SSD for Transaction logs
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved OS Tuning  OS Page Cache – Ex: Allocate to hold all the active segments of the log.  File descriptor limits : >100k  less swapping  Tcp tuning  JVM Configs – Java 8 with G1 Collector – 6-8 GB heap
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Kafka Disk Storage  Use multiple disk spindles, dedicated to kafka  JBOD vs RAID10  JBOD – Gives all the disk I/O  JBOD Limitations – any disk failure causes an unclean shutdown and requires lengthy recovery – data is not distributed consistently across disks – Multiple directories  KIP-112/113 – necessary tools for users to manage JBOD – Intelligent partition assignment – On disk failure, broker can serve replicas on the good disks – re-assign replicas between disks of the same broker
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved RAID  RAID10 – Can survive single disk failure – Performance and protection – balance load across disks – Single mount point – Performance hit and reduces the space  File System – EXT or XFS – SSD – Issues on NFS. – SAN, NAS
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Basic Monitoring  CPU Load  Network Metrics  File Handle Usage  Disk Space  Disk I/O Performance  Garbage Collection  ZooKeeper Monitoring
  • 10. 10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Kafka Replication  Partition has replicas – Leader replica, Follower replicas  Leader maintains in-sync-replicas (ISR) – replica.lag.time.max.ms, num.replica.fetchers – min.insync.replica – used by producer to ensure greater durability https://www.slideshare.net/junrao/kafka-replication-apachecon2013
  • 11. 11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Under Replicated Partitions  Number of partitions which are not fully replicated within the cluster  Mbean - kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions  ISR Shrink/Expand Rate  Under Replicated Partitions – Lost Broker? – Controller Issues – Zookeeper Issues – Network Issues  Solutions – Tune the ISR settings – Expand brokers
  • 12. 12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Controller  Manages Partitions Life cycle  Avoid controller's ZK session expires – Soft failures – ISR Churn/Under replicated partitions – ZK Server performance – Long GC pauses on Broker – Bad network configuration  Monitoring – Mbean : kafka.controller:type=KafkaController,name=ActiveControllerCount – only one broker in the cluster should have 1 – LeaderElectionRate
  • 13. 13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Unclean leader election  Enable replicas not in the ISR set to be elected as leader  Availability vs correctness – By-default kafka chooses availability  Monitoring – Mbean : kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec  Default will be changed in next release
  • 14. 14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Broker Configs  log.retention.{ms, minutes, hours} , log.retention.bytes  message.max.bytes, replica.fetch.max.bytes  delete.topic.enable  unclean.leader.election.enable = false  min.insync.replicas = 2  replica.lag.time.max.ms, num.replica.fetchers  replica.fetch.response.max.bytes  zookeeper.session.timeout.ms = 30s  num.io.threads
  • 15. 15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Cluster Sizing  Broker Sizing – Partition count on each broker (<2K) – Keep partition size on disk manageable (under 25GB per partition )  Cluster Size (no. of brokers) – how much retention we need – how much traffic cluster is getting  Cluster Expansion – Disk usage on the log segments partition should stay under 60% – Network usage on each broker should stay under 75%  Cluster Monitoring – Keep cluster balanced – Ensure that partitions of a topic are fairly distributed across brokers – Ensure that nodes in a cluster are not running out of disk and network
  • 16. Broker Monitoring
      – Partition counts
          – MBean: kafka.server:type=ReplicaManager,name=PartitionCount
      – Leader replica counts
          – MBean: kafka.server:type=ReplicaManager,name=LeaderCount
      – ISR shrink rate / ISR expansion rate
          – MBean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
          – MBean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
      – Message in rate / byte in rate / byte out rate
      – NetworkProcessorAvgIdlePercent
      – RequestHandlerAvgIdlePercent
  • 17. Topic Sizing
      – No. of partitions
          – Have at least as many partitions as there are consumers in the largest group
          – If the topic is very busy, use more partitions
          – Keep partition size on disk manageable (under 25GB per partition)
          – Take into account any other application requirements
          – Special use cases: single partition
      – Keyed messages
          – Create enough partitions to deal with future growth
      – Expanding partitions
          – Whenever the size of a partition on disk is larger than the threshold
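
One reason to plan keyed topics for future growth is that a key's partition is derived from a hash of the key modulo the partition count, so adding partitions later changes where existing keys land. The sketch below illustrates the effect under stated assumptions: `zlib.crc32` is a stand-in hash chosen for the example, not the hash Kafka's partitioner uses, and the keys are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with hash(key) mod N.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is NOT the
    hash function Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def count_moved_keys(keys, old_count, new_count):
    """Count keys whose partition changes when the partition count changes."""
    return sum(
        1 for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    )


keys = [f"user-{i}" for i in range(1000)]  # hypothetical message keys
moved = count_moved_keys(keys, old_count=8, new_count=12)
print(f"{moved} of {len(keys)} keys map to a different partition")
```

Keys that move lose per-key ordering across the change, which is why over-provisioning partitions up front is usually cheaper than expanding a keyed topic later.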
  • 18. Choosing Partitions
      – Based on throughput requirements, one can pick a rough number of partitions
          – Let P be the throughput from a producer to a single partition
          – Let C be the throughput from a single partition to a consumer
          – Let T be the target throughput
          – Use at least max(T/P, T/C) partitions
      – More partitions
          – More open file handles
          – May increase unavailability
          – May increase end-to-end latency
          – More memory for clients
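
The max(T/P, T/C) rule can be worked through in a few lines. This is a minimal sketch; the function name and the throughput figures are hypothetical, and rounding up to a whole partition is an assumption added here.

```python
import math


def min_partitions(target, per_partition_produce, per_partition_consume):
    """Lower bound on partition count: max(T/P, T/C), rounded up.

    All three throughputs must use the same unit (for example MB/s).
    """
    if per_partition_produce <= 0 or per_partition_consume <= 0:
        raise ValueError("per-partition throughput must be positive")
    return max(
        math.ceil(target / per_partition_produce),
        math.ceil(target / per_partition_consume),
    )


# Hypothetical measurements: T = 200 MB/s, P = 10 MB/s, C = 25 MB/s.
print(min_partitions(200, 10, 25))  # max(20, 8) = 20
```

P and C should be measured in your own environment, since they depend on batching, compression, replication and the consumer's processing logic.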
  • 19. Quotas
      – Protect the cluster from bad clients and maintain SLAs
      – Byte-rate thresholds on produce and fetch requests
      – Can be applied to (user, client-id), user, or client-id groups
      – The server delays the responses
      – Broker metrics for monitoring
          – throttle-rate, byte-rate
      – replica.fetch.response.max.bytes
          – Limits memory usage of the replica fetch response
      – Limiting bandwidth usage during data migration
          – kafka-reassign-partitions.sh --throttle option
  • 20. Kafka Producer
      – Use the new Java-based clients
      – Test in your environment
          – kafka-producer-perf-test.sh
      – Memory
      – CPU
      – Batch compression
      – Avoid large messages
          – They create more memory pressure
          – They slow down the brokers
  • 21. Critical Configs
      – batch.size
          – Size-based batching
          – Larger size -> higher throughput, higher latency
      – linger.ms
          – Time-based batching
          – Larger value -> higher throughput, higher latency
      – max.in.flight.requests.per.connection
          – Better throughput, but affects ordering
      – compression.type
          – When compression is enabled, adding more user threads can help throughput
      – acks
          – Affects message durability
  • 22. Performance tuning
      – If throughput < network capacity
          – Add more user threads
          – Increase the batch size
          – Add more producer instances
          – Add more partitions
      – Latency when acks = -1
          – Increase num.replica.fetchers
      – Cross-datacenter data transfer
          – Tune socket buffer settings and OS TCP buffer settings
  • 23. Producer Monitoring
      – batch-size-avg
      – compression-rate-avg
      – waiting-threads
      – buffer-available-bytes
      – record-queue-time-max
      – record-send-rate
      – records-per-request-avg
  • 24. Kafka Consumer
      – Test in your environment
          – kafka-consumer-perf-test.sh
      – Throughput issues
          – Not enough partitions
          – OS page cache: allocate enough to hold all the messages your consumers need for, say, 30s
          – Application/processing logic
      – Offsets topic
          – __consumer_offsets
          – offsets.topic.replication.factor
          – offsets.retention.minutes
          – Monitor ISR and topic size
      – Slow offset commits
          – Commit asynchronously, use manual commits
  • 25. Consumer Configs
      – fetch.min.bytes and fetch.max.wait.ms
      – max.poll.interval.ms
      – max.poll.records
      – session.timeout.ms
      – Consumer rebalance
          – Check timeouts
          – Check processing times/logic
          – GC issues
      – Tune network settings
  • 26. Consumer Monitoring
      – Is the consumer keeping up with the messages that are being produced?
      – Consumer lag: the difference between the end of the log and the consumer offset
      – Monitoring
          – Metrics monitoring: records-lag-max
          – bin/kafka-consumer-groups.sh
          – LinkedIn's Burrow for consumer monitoring
      – Decreasing lag
          – Analyze the consumer: GC issues, hung instance
          – Add more consumer instances
          – Increase the number of partitions and consumers
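
The lag definition above is a per-partition subtraction. The following is a minimal sketch, assuming the log end offsets and the consumer's offsets have already been fetched (for example from the output of kafka-consumer-groups.sh); the function name and the offsets are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the consumer's offset.

    Both arguments map a partition number to an offset. A partition with
    no committed offset is treated as offset 0, an assumption of this sketch.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}
lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 280, 2: 5}
print(max(lag.values()))  # 280
```

Tracking the worst partition over time, rather than a single reading, shows whether the consumer is falling behind or merely catching up after a burst.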
  • 27. No data loss settings
      – Producer
          – block.on.buffer.full=true
          – retries=Integer.MAX_VALUE
          – acks=all
          – max.in.flight.requests.per.connection=1
          – Close the producer
      – Broker
          – Replication factor >= 3
          – min.insync.replicas=2
          – Disable unclean leader election
      – Consumer
          – Disable automatic offset commits (enable.auto.commit=false)
          – Commit offsets only after the messages are processed
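
A configuration audit is one way to keep these settings from drifting. The sketch below is a hypothetical helper, not part of Kafka: it compares a config dictionary against a subset of the recommendations on this slide. The keys are real Kafka setting names; `enable.auto.commit` is the consumer setting that controls automatic offset commits.

```python
# Recommended values taken from the slide. The checker itself is a
# hypothetical helper, not part of Kafka.
RECOMMENDED = {
    "producer": {"acks": "all", "max.in.flight.requests.per.connection": "1"},
    "broker": {"min.insync.replicas": "2", "unclean.leader.election.enable": "false"},
    "consumer": {"enable.auto.commit": "false"},
}


def audit(role, config):
    """Return {key: (actual, recommended)} for settings that deviate."""
    deviations = {}
    for key, wanted in RECOMMENDED[role].items():
        actual = str(config.get(key, "<unset>")).lower()
        if key == "acks" and actual == "-1":
            actual = "all"  # acks=-1 is equivalent to acks=all
        if actual != wanted:
            deviations[key] = (actual, wanted)
    return deviations


# Hypothetical producer configuration that favours throughput over safety.
producer_config = {"acks": "1", "max.in.flight.requests.per.connection": 5}
print(audit("producer", producer_config))
```

Settings the slide lists without a single fixed value, such as the replication factor, are left out of the table and would need their own checks.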
  • 28. Authorizer - Ranger Auditing
  • 29. Kafka Mirror Maker
      – Tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster
  • 30. Kafka Mirror Maker
      – Run multiple mirroring processes
          – High fault tolerance
          – High throughput
      – Use the --num.streams option to specify the number of consumer threads
          – No. of threads in num.streams
  • 31. Kafka Mirror Maker
      – Consumer and source cluster socket buffer sizes
          – Use a high value for the socket buffer size
          – Consumer's fetch size
          – OS networking tuning
      – Source and target clusters are independent entities
          – They can have different numbers of partitions
          – Offsets will not be the same
          – Partitioning order is preserved on a per-key basis
      – Create topics in the target cluster
      – Monitor whether a mirror is keeping up
          – Consumer lag
      – Running in secure clusters
          – We recommend using SSL
          – We can run MM on the source cluster
  • 32. Open source Operational Tools
      – Ambari Metrics
          – https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/grafana_kafka_dashboards.html
      – Removing brokers and rebalancing partitions in a cluster
          – https://github.com/linkedin/kafka-tools
      – Consumer lag monitoring
          – Burrow (https://github.com/linkedin/Burrow)
      – Kafka Manager
          – https://github.com/yahoo/kafka-manager
  • 33. Apache Kafka 0.10.2 release
      – Includes 15 KIPs, over 200 bug fixes and improvements
      – The newest Java clients now support older brokers (0.10.0 and higher)
      – Separation of internal and external traffic
      – Create Topic Policy
      – Security improvements
          – Support for SASL/SCRAM mechanisms
          – Dynamic JAAS configuration for Kafka clients
          – Support for authentication of multiple Kafka clients in a single JVM
      – Producer and Consumer improvements
      – Connect API & Streams API improvements
  • 34. Thank You