Don’t Be Scared: Multi-Tenant Cluster Support at Scale (Kelly Attaway, Pandora Media and Cabe Waldrop, Pandora Media) Kafka Summit London 2019
We’re going to:
- Describe how thousands of unsupervised clients can wreak havoc on your
shared Kafka cluster, other clients, your mind, your body, and your soul
- Share how we have lived with and grown from this experience (broker, file
system, and client tuning)
- Encourage you to learn from our mistakes (spoiler: trust no one to make a
well-behaved client)
Kafka at Pandora
- 8 clusters at Pandora of varying size, all multi-tenant
- 7 on-prem KC clusters
- 4 GCP KC clusters
- We have clusters secured with Kerberos and Ranger for secure
applications as well as more permissive clusters for general use
Kafka at Pandora
Our primary production cluster:
- 15 brokers, v 0.9.0.0 through 1.1.0
- 256 GB memory, 10 Gb/s NIC, 24 cores
- Disk: 26 1TB disks formatted as 2x raidz2 ZFS
- 500 topics/1500 consumer groups/?? producers
- 200k msgs/sec, 0.5 TB/day
- 2.3M requests/sec
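A rough back-of-envelope reading of those figures may help explain the next slide's question. This sketch assumes the message rate and request rate are averages over the same period and that load is spread evenly across the 15 brokers; the talk states neither, so treat the results as illustrative only.

```python
# Back-of-envelope arithmetic on the figures quoted on the slide.
# Assumptions (not stated in the talk): both rates are averages over the
# same period, and requests are spread evenly across all brokers.
MSGS_PER_SEC = 200_000
REQUESTS_PER_SEC = 2_300_000
BROKERS = 15

requests_per_message = REQUESTS_PER_SEC / MSGS_PER_SEC
requests_per_broker = REQUESTS_PER_SEC / BROKERS

print(f"requests per message: {requests_per_message:.1f}")
print(f"requests per broker per second: {requests_per_broker:,.0f}")
```

Under those assumptions the cluster handles many more requests than messages, which points at how chatty the clients are rather than at data volume.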
Sorry How Many Clients?
Kafka at Pandora
- Primary data hub
- You got a message at point A? We’ll get it to point B
- Analytics pipeline (every play, skip, ad interaction, engagement, etc)
sinking to HDFS and GCS
- Countless other applications
The Dream of Multi-Tenancy
Low barrier to entry!
Speed to production!
Low response times for requests (especially produce, fetch, heartbeat)!
Ease of monitoring clients!
Everyone living in peace and harmony!
The Thing
Noun
An occurrence on a Kafka cluster wherein performance drastically degrades on
one or more brokers, typically characterized by slow response times, full
request queues, and weeping cluster admins.
“Why are Kelly and Cabe sobbing?”
“Oh, the cluster is doing The Thing again.”
…
Ah, so it’s a volume issue?
LOL! Nope!
High Volume Cluster - No Degradation
Our higher-volume prod cluster, which carries more than double the volume,
consistently outperforms our lower-volume one
A few hundred well-written producers, batching appropriately
A couple dozen well-written consumers, with lag of 0
Talkative Clients!!!
The Thing - Revised Definition
- Fetch requests for data that is no longer in the ZFS cache (ARC)
- Network threads blocking while reading from disk
- Request queues filling
- Response times rising
- Too many requests!!!
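The chain of events in the revised definition can be sketched with a toy fluid model. This is not how the brokers were measured or modelled in the talk; the function name, the latencies, and the arrival rate are all hypothetical, and the only real-world value used is 500, the default for Kafka's `queued.max.requests`.

```python
def seconds_until_queue_full(arrival_rate, threads, cache_hit_ratio,
                             cache_read_s, disk_read_s, queue_capacity):
    """Return seconds until the request queue fills, or None if it never does.

    Toy model: each thread serves one request at a time, and a request
    costs cache_read_s on a cache hit or disk_read_s on a cache miss.
    """
    mean_service_s = (cache_hit_ratio * cache_read_s
                      + (1.0 - cache_hit_ratio) * disk_read_s)
    capacity = threads / mean_service_s  # requests/sec the threads can absorb
    backlog_growth = arrival_rate - capacity
    if backlog_growth <= 0:
        return None  # threads keep up, so the queue never fills
    return queue_capacity / backlog_growth


# Hypothetical inputs: 150k requests/sec on one broker, 24 threads,
# 100 microseconds per cached read, 10 milliseconds per disk read.
for hit_ratio in (1.0, 0.99):
    result = seconds_until_queue_full(
        arrival_rate=150_000, threads=24, cache_hit_ratio=hit_ratio,
        cache_read_s=0.0001, disk_read_s=0.01, queue_capacity=500)
    print(hit_ratio, result)
```

With these made-up numbers, a 100% cache hit ratio never fills the queue, while a 99% hit ratio fills it in well under a second. The point is only that a small share of disk-bound fetches can tip threads from keeping up to falling behind.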
Request Handling -- Deep Dive
The Reality of Multi-Tenancy
- Rampant under-replicated partitions (URP)
- Seemingly inexplicable in-sync replica (ISR) thrashing
- Stifled produce and fetch requests for clients
- High response times from the brokers
- Death, Destruction, Etc
Well, what did you do about it?
Make ZFS Work For You
ZFS can make Kafka incredibly fast
- Clever about invalidating the cache
- Smart about pre-fetching sequential reads
- One-off reads don’t pollute your cache
But you’ve got to know how to use it...
Make ZFS Work For You
Current Configs (good, yes)
- atime=off
- vdevs=100
- zfs_vdev_async_read_max_active=30
- zfs_vdev_async_write_max_active=100
- zfs_vdev_sync_read_max_active=100
- zfs_vdev_sync_write_max_active=100
- zfs_vdev_max_active=4000
- zfs_arc_max=214748364800
Original Configs (bad, no)
- atime=on
- vdevs=10
- zfs_vdev_async_read_max_active=3
- zfs_vdev_async_write_max_active=10
- zfs_vdev_sync_read_max_active=10
- zfs_vdev_sync_write_max_active=10
- zfs_vdev_max_active=1000
- zfs_arc_max=0
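As a minimal sketch of how the "current" values might be applied, assuming ZFS on Linux (the talk does not name the operating system), the `zfs_*` entries are kernel module parameters that can be written as a single `options zfs ...` line in a file such as `/etc/modprobe.d/zfs.conf`. `atime` is a dataset property (set with `zfs set atime=off <dataset>`), and `vdevs` is left out because the slide does not say which knob it refers to. The helper name below is hypothetical.

```python
# "Current" zfs_* values copied from the slide above.
CURRENT_TUNABLES = {
    "zfs_vdev_async_read_max_active": 30,
    "zfs_vdev_async_write_max_active": 100,
    "zfs_vdev_sync_read_max_active": 100,
    "zfs_vdev_sync_write_max_active": 100,
    "zfs_vdev_max_active": 4000,
    "zfs_arc_max": 214748364800,
}


def modprobe_line(tunables):
    """Render ZFS module parameters as one 'options zfs ...' line."""
    pairs = " ".join(f"{name}={value}" for name, value in tunables.items())
    return f"options zfs {pairs}"


# zfs_arc_max is expressed in bytes; convert to GiB for readability.
arc_gib = CURRENT_TUNABLES["zfs_arc_max"] / 2**30

print(modprobe_line(CURRENT_TUNABLES))
print(f"zfs_arc_max = {arc_gib:.0f} GiB")
```

The conversion shows that the ARC ceiling is 200 GiB, which is most of the 256 GB of memory on each broker.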
Broker Tuning
- Network threads (num.network.threads) == number of cores
- IO threads (num.io.threads) == number of vdevs
- Request queue (queued.max.requests) = bigger (but not too big)
- ARC (zfs_arc_max) = as big as you can
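The broker rules of thumb above could be turned into `server.properties` lines along these lines. This is a sketch, not the speakers' tooling: the function names are hypothetical, the vdev count of 2 is a guess based on the "2x raidz2" layout mentioned earlier, and 1000 is an arbitrary stand-in for "bigger" than Kafka's default `queued.max.requests` of 500.

```python
def broker_thread_settings(cores, vdevs, queued_max_requests):
    """Map the slide's rules of thumb onto Kafka broker property names."""
    return {
        "num.network.threads": cores,   # network threads == number of cores
        "num.io.threads": vdevs,        # IO threads == number of vdevs
        "queued.max.requests": queued_max_requests,
    }


def render_properties(settings):
    """Render settings in server.properties 'key=value' form."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# cores=24 comes from the hardware slide; the other two values are assumptions.
settings = broker_thread_settings(cores=24, vdevs=2, queued_max_requests=1000)
print(render_properties(settings))
```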
And you lived happily ever after?
There’s hope!
Reform clients, especially consumers
Producer and consumer configs generally follow the same pattern - make
wrappers! Save yourselves! It’s too late for us, but you have your whole life
before you!
Up next in our queue: quotas! And an L2 cache to ameliorate worst-case scenarios
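One way to read "make wrappers" is a small shared helper that every team calls instead of building client configs by hand. The sketch below is an assumption about what such a wrapper might look like, not the one used at Pandora: the function name, the 30-minute ceiling, and the fetch values are hypothetical, while the config keys are standard Kafka consumer settings.

```python
# Illustrative defaults only; the talk does not recommend specific values.
SAFE_CONSUMER_DEFAULTS = {
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 5_000,
    "max.poll.interval.ms": 300_000,
    "fetch.min.bytes": 65_536,   # hypothetical: wait for data, fetch less often
    "fetch.max.wait.ms": 500,
}

# Hypothetical ceiling chosen for this sketch.
MAX_POLL_INTERVAL_CEILING_MS = 30 * 60 * 1000


def safe_consumer_config(group_id, overrides=None):
    """Merge caller overrides into shared defaults and reject unbounded values."""
    config = dict(SAFE_CONSUMER_DEFAULTS)
    config.update(overrides or {})
    config["group.id"] = group_id

    requested = int(config["max.poll.interval.ms"])
    if requested > MAX_POLL_INTERVAL_CEILING_MS:
        raise ValueError(
            f"max.poll.interval.ms={requested} exceeds the allowed "
            f"{MAX_POLL_INTERVAL_CEILING_MS} ms")
    return config


print(safe_consumer_config("demo-group"))
```

The resulting dictionary could be handed to whichever Kafka client library a team uses. The design choice is that bad values fail loudly at startup instead of surfacing later as broker load.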
Tales From The Crypt
What configs help and what patterns absolutely do not
Stories from real life:
- Producing one message at a time
- max.poll.interval.ms = Long.MAX_VALUE
- Committing consumer offsets for every record
- Consuming from the beginning at every group rebalance
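To put a rough number on why the first and third stories hurt, the toy count below compares requests sent when every record gets its own produce or commit request against requests sent when records are grouped. It ignores partitions, compression, and timing, and the record count and group size are made up for illustration.

```python
import math


def requests_needed(records, records_per_request):
    """Count requests needed to cover `records` at a given grouping size."""
    return math.ceil(records / records_per_request)


RECORDS = 1_000_000  # hypothetical workload

one_at_a_time = requests_needed(RECORDS, 1)
grouped = requests_needed(RECORDS, 500)  # hypothetical batch/commit size

print(f"one request per record: {one_at_a_time:,} requests")
print(f"500 records per request: {grouped:,} requests")
```

The same arithmetic applies to produce requests (batching) and to offset commits (committing per batch rather than per record).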
Kafka Is Excellent At High Volume. Clients Are Disrespectful!
● KBrowse - a UI to deserialize and view messages in your topic
● Algorithm and Blues - the Pandora Engineering blog
● KAFKA-7504 - performance degradation and patch for ext (not ZFS)
