instaclustr.com | Twitter @instaclustr | info@instaclustr.com
Lessons Learned from Building an Apache Kafka Managed Service

Introduction
# Who am I?
/bin/whoami
● Ben Bromhead, CTO, Instaclustr
# Who is Instaclustr?
/bin/id -g -n
● Experts in reliability at scale
● Manage/support 3k+ Cassandra, Spark and Elassandra nodes
● Platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka released in public preview on 21 May 2018

Agenda
● What is Kafka
● A quick intro to how it works
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore

What Is Apache Kafka?
Key Characteristics
● Horizontally scalable, distributed system
● Performance
○ Low latency, high throughput
● Scalability
○ Linear broker scalability via partitioned topics
○ Linear consumer scalability via consumer groups
● Fault tolerance
○ Data is replicated across multiple brokers
○ Automatic broker failover when the primary replica goes offline
○ Automatic consumer failover when a consumer in a consumer group goes offline
● Apache Software Foundation open source
● Production proven
• Publish and subscribe to streams of data (reliable message transport)
• Transform and/or aggregate data streams using distributed processing applications (stream processing)

Why use Apache Kafka?
● Provide a buffering mechanism in front of a processing application (i.e. absorb a temporary incoming message rate greater than the processing app can handle)
● A special case of buffering is to allow producers to publish messages with guaranteed delivery even if the consumers are down when the message is published
● As an event store for event sourcing or a Kappa architecture
● Facilitate flexible, configurable architectures with many producers -> many consumers by decoupling the details of who or what is consuming messages from the apps that produce them (and vice versa)
● Perform stream analytics (with Kafka Streams)

How does it work: producing records
● Each topic has a fixed number of partitions
● Records published to a topic by a producer are divided amongst the topic’s partitions
● Partitions are ordered, immutable lists
● Each new record is appended to the end of a partition
● Each partition is stored on a single leader broker, and may optionally be replicated to one or more follower brokers
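
The producing path above can be sketched as a toy in-memory model. This is an illustration only, not Kafka client code: the `Topic` class and its method names are hypothetical, and CRC32 stands in for the key hash a real producer would use. It shows records with the same key landing in the same partition and being appended at increasing offsets.

```python
import zlib


class Topic:
    """Toy in-memory topic: a fixed list of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash, so the same key always maps to the same partition.
        # CRC32 is an assumption made for this sketch.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        # Append the record to the end of its partition and return
        # (partition index, offset of the new record).
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = Topic("orders", num_partitions=3)
first = topic.produce("customer-1", "order created")
second = topic.produce("customer-1", "order paid")
print(first, second)  # same partition, offsets 0 and 1
```

Keying by something like a customer ID is what preserves per-key ordering, since ordering only holds within a single partition.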

How does it work: consuming records
● A consumer reads from one or more partitions
● Each consumer maintains an offset marking the last record it has read from the partition
● The consumer requests a micro-batch of records from Kafka. The broker uses the offset to return the next records to the consumer
● Once the consumer has finished processing a record, it must commit the new offset
● Because Kafka does not delete records immediately after they are read, consumers may reset the offset to a previous value to replay records
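
A minimal sketch of the offset mechanics described above, using a plain Python list as the partition. The class `PartitionConsumer` and its methods are hypothetical names for illustration; they loosely mirror the poll / commit / seek steps but are not the Kafka consumer API.

```python
class PartitionConsumer:
    """Toy consumer for a single partition held in a Python list."""

    def __init__(self, log):
        self.log = log        # the partition: an append-only list of records
        self.position = 0     # offset of the next record to fetch
        self.committed = 0    # offset to resume from after a restart

    def poll(self, max_records=2):
        # Fetch a micro-batch starting at the current position.
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        # Record that everything before `position` has been processed.
        self.committed = self.position

    def seek(self, offset):
        # Records are not deleted on read, so rewinding replays them.
        self.position = offset


log = ["r0", "r1", "r2", "r3", "r4"]
consumer = PartitionConsumer(log)
batch_1 = consumer.poll()   # ['r0', 'r1']
consumer.commit()
batch_2 = consumer.poll()   # ['r2', 'r3']
consumer.seek(0)
replayed = consumer.poll()  # ['r0', 'r1'] again
```

If the consumer crashed after `batch_2` but before committing, it would resume from the committed offset (2) and see `r2` and `r3` again, which is why processing should tolerate redelivery.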

How does it work: consumer groups
● Multiple consumers reading from a topic may be arranged into Consumer Groups
● A Consumer Group load-balances partitions amongst consumers
● If a consumer goes offline, the consumer group will automatically redistribute its partitions amongst the remaining consumers
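
The rebalancing behaviour above can be illustrated with a simplified round-robin assignment. The function `assign` is a hypothetical helper for this sketch; Kafka's real assignment strategies are more involved and are negotiated through the group coordinator.

```python
def assign(partitions, consumers):
    """Spread partitions over the live consumers in round-robin order."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = range(6)
before = assign(partitions, ["c1", "c2", "c3"])
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# c2 goes offline: its partitions are redistributed to c1 and c3.
after = assign(partitions, ["c1", "c3"])
# {'c1': [0, 2, 4], 'c3': [1, 3, 5]}
```

Because each partition is owned by exactly one consumer in the group, adding consumers beyond the number of partitions leaves the extra consumers idle.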

How it works: Easier Abstractions
Kafka Connect
● High-level API
● Drop-in source (import) & sink (export) connectors exist for many popular technologies, including Amazon S3, Amazon Kinesis, Apache Cassandra, HDFS and JDBC
Kafka Streams
● Provides functionality to aggregate data, join multiple topics and perform complex transformations on live data as it arrives
● The API abstracts away most of the difficult scalability, fault-tolerance and consistency problems associated with performing live aggregations on a distributed system
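
To give a feel for the kind of live aggregation Kafka Streams handles, here is a conceptual sketch of a tumbling-window count in plain Python. It is not the Kafka Streams API (which is a Java library); the function name and record shape are assumptions, and it ignores everything Streams does for you, such as state stores, repartitioning and fault tolerance.

```python
from collections import defaultdict


def tumbling_window_counts(records, window_ms):
    """Count records per (key, window start) for fixed, non-overlapping windows.

    `records` is an iterable of (key, timestamp_ms) pairs.
    """
    counts = defaultdict(int)
    for key, timestamp_ms in records:
        # Align the timestamp down to the start of its window.
        window_start = timestamp_ms - (timestamp_ms % window_ms)
        counts[(key, window_start)] += 1
    return dict(counts)


page_views = [("page-a", 100), ("page-a", 900), ("page-b", 950), ("page-a", 1200)]
counts = tumbling_window_counts(page_views, window_ms=1000)
# {('page-a', 0): 2, ('page-b', 0): 1, ('page-a', 1000): 1}
```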

Instaclustr Managed Kafka - Key Features
● Available Now:
○ Open source Apache Kafka (Brokers) and Zookeeper automatically provisioned in AWS, GCP and Azure
○ Broker Monitoring
○ Instaclustr monitoring and provisioning API support
○ Private network clusters (AWS only)
○ Run in your cloud provider account or ours
● For GA (end June):
○ SOC2 compliant
○ User & credential management
○ More cluster config options
○ Topic Level and Synthetic transaction monitoring
○ Infrastructure config tuning
● Likely future release scope:
○ Topic Management UI
○ Cluster “copy”
○ Managed:
■ Kafka Connect
■ Schema Registry
■ MirrorMaker
○ Dynamic scaling

Instaclustr Managed Kafka - Development Process
● First customer requests 2016
● Internal infrastructure deployment and usage of Kafka mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018

Hardware Choice and Benchmarking - GP2 vs ST1
● AWS benchmark: r4.large with 500 GB disks
● On average, 10% higher throughput with ST1 than with GP2 EBS volumes
● ST1 is 45% of the cost of GP2
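
Combining the two benchmark figures above gives a rough price/performance comparison. The arithmetic below is a sketch that assumes the 10% throughput gain and the 45% cost ratio apply to the same workload and volume size; the function name is illustrative.

```python
def relative_throughput_per_dollar(throughput_gain, cost_ratio):
    """Throughput per unit cost of ST1 relative to GP2 (GP2 = 1.0)."""
    return (1.0 + throughput_gain) / cost_ratio


ratio = relative_throughput_per_dollar(throughput_gain=0.10, cost_ratio=0.45)
print(f"ST1 delivers about {ratio:.2f}x the throughput per dollar of GP2")
```

Under those assumptions, 1.10 / 0.45 is roughly 2.4, so ST1 comes out at about 2.4 times the throughput per storage dollar.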

Hardware Choice and Benchmarking - SSL vs non-SSL
● AWS benchmark: r4.large with 1500 GB ST1 disks
● 512-byte messages
● ~30% decrease in throughput with broker and client SSL enabled

Hardware Choice and Benchmarking - Number of Topics
● Increasing the number of topics causes only a small reduction in performance
● However, more topics = more partitions, which significantly slows recovery from node failure
[Figure: benchmark charts for 10, 100, 1000 and 5000 topics]

Hardware Choice and Benchmarking - Colocated Zookeeper
● It is often recommended to host Zookeeper separately from Kafka.
● However, recent changes have significantly reduced the load Kafka places on Zookeeper.
○ Consumer offsets are no longer stored in Zookeeper.
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters.
[Figure: consumer rate with colocated vs separate Zookeeper, 6-broker test with node restart]

Topic and User Configuration Management
● Existing Kafka utilities for managing topic and user configuration required direct access to Zookeeper
● However, Zookeeper does not have a robust external security model (it lacks TLS support, node-to-node auth, etc.)
● Providing Zookeeper access to customers introduces a whole class of very strange ways to break a cluster by corrupting Zookeeper
● Solutions:
○ Developed a command-line tool that uses the Kafka API for topic configuration (https://github.com/instaclustr/ic-kafka-tools)
■ We may add this to the Instaclustr console later, although we think maintaining topic config as a version-controlled file in your repo is a better approach
○ Adding user management to the Instaclustr console
■ We do not want to keep cluster passwords in our central management system, so this feature will require users to enter existing Kafka credentials to be temporarily used by our system
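
One way to picture the "topic config as a version-controlled file" approach is a small reconciliation step that compares the desired state from the repo with what the cluster reports. The sketch below is hypothetical and is not how ic-kafka-tools works; the dictionary layout, the function `plan_topic_changes` and the action names are all assumptions. It only computes a plan and leaves applying it through the Kafka API to other code.

```python
def plan_topic_changes(desired, actual):
    """Return a list of (action, topic, detail) tuples needed to reach `desired`.

    Both arguments map topic name -> {"partitions": int, "config": {str: str}}.
    """
    plan = []
    for name, want in sorted(desired.items()):
        have = actual.get(name)
        if have is None:
            plan.append(("create", name, want))
            continue
        if want["partitions"] != have["partitions"]:
            # Flag for a human rather than changing partition counts blindly.
            plan.append(("review_partitions", name,
                         (have["partitions"], want["partitions"])))
        changed = {k: v for k, v in want.get("config", {}).items()
                   if have.get("config", {}).get(k) != v}
        if changed:
            plan.append(("alter_config", name, changed))
    return plan


# Desired state, as it might be loaded from a file kept in the repo.
desired = {
    "orders": {"partitions": 6, "config": {"retention.ms": "604800000"}},
    "audit": {"partitions": 3, "config": {}},
}
# Actual state, as it might be read back from the cluster.
actual = {
    "orders": {"partitions": 6, "config": {"retention.ms": "86400000"}},
}
plan = plan_topic_changes(desired, actual)
# [('create', 'audit', {...}), ('alter_config', 'orders', {'retention.ms': '604800000'})]
```

Keeping the plan step separate from the apply step makes it easy to review changes in a pull request before anything touches the cluster.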

Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) for authentication
○ More secure
○ Allows easier rotation of credentials
○ Initial release used SCRAM for client->broker only, with plain text for broker->broker
○ Decided to also use it for broker->broker to allow rapid rotation of credentials as part of SOC2 security measures
● TLS built on existing Cassandra infrastructure
○ New CA created per cluster
○ CA used to generate certificates for each node
○ CA public certificate available for clients to download for full validation of certificates
● Access to managed clusters also follows the same model as Cassandra
○ Public IPs and whitelisting in firewall (security group or equivalent)
○ Private IPs with VPC peering (or equivalent in other cloud providers)
○ Private Network Clusters, where nodes are not allocated public IPs and a gateway box is used for admin access
○ Zookeeper is not exposed through the firewall due to its weak security model
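
For readers unfamiliar with SCRAM, the core idea is that the server stores only a derived key, and the client proves knowledge of the password without sending it. The sketch below follows the key derivation in RFC 5802 using SHA-256, but it is a simplified illustration: the function names are made up, password normalization is omitted, and `auth_message` is a stand-in for the real concatenation of the exchanged messages and nonces.

```python
import hashlib
import hmac
import os


def _hmac(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _client_key(password, salt, iterations):
    # SaltedPassword = PBKDF2-HMAC-SHA256(password, salt, iterations)
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return _hmac(salted, b"Client Key")


def make_stored_key(password, salt, iterations):
    """What the server keeps instead of the password: H(ClientKey)."""
    return hashlib.sha256(_client_key(password, salt, iterations)).digest()


def client_proof(password, salt, iterations, auth_message):
    """ClientProof = ClientKey XOR HMAC(StoredKey, AuthMessage)."""
    key = _client_key(password, salt, iterations)
    stored = hashlib.sha256(key).digest()
    return _xor(key, _hmac(stored, auth_message))


def server_verify(stored_key, proof, auth_message):
    """Recover ClientKey from the proof and check that it hashes to StoredKey."""
    recovered = _xor(proof, _hmac(stored_key, auth_message))
    return hmac.compare_digest(hashlib.sha256(recovered).digest(), stored_key)


salt = os.urandom(16)
iterations = 4096
auth_message = b"client-nonce,server-nonce"  # placeholder for the real exchange
stored_key = make_stored_key("s3cret", salt, iterations)

good = server_verify(stored_key, client_proof("s3cret", salt, iterations, auth_message), auth_message)
bad = server_verify(stored_key, client_proof("wrong", salt, iterations, auth_message), auth_message)
print(good, bad)  # True False
```

Because the proof is bound to the per-session `auth_message`, a captured proof cannot simply be replayed in a later session, and rotating a credential only requires replacing the stored salt, iteration count and keys.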

Monitoring
● Metrics are exposed via JMX, allowing us to use our existing Cassandra monitoring
○ Custom agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra + Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting?
○ The basics: service state, disk usage / free space, server still exists
○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
○ Synthetic transactions: publish and consume a message on a controlled topic, measure success and latency
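
The Kafka-specific alert conditions listed above can be expressed as a small rule check over per-broker metrics. This is a sketch, not Instaclustr's alerting code: the function `evaluate_alerts` and the input layout are assumptions, and the metric names follow Kafka's JMX metrics (`OfflinePartitionsCount`, `ActiveControllerCount`, `UnderReplicatedPartitions`) with the values assumed to have been collected already.

```python
def evaluate_alerts(broker_metrics):
    """Return alert messages for a cluster given {broker_id: {metric: value}}."""
    offline = sum(m["OfflinePartitionsCount"] for m in broker_metrics.values())
    controllers = sum(m["ActiveControllerCount"] for m in broker_metrics.values())
    under_replicated = sum(m["UnderReplicatedPartitions"] for m in broker_metrics.values())

    alerts = []
    if offline > 0:
        alerts.append(f"{offline} offline partition(s)")
    if controllers != 1:
        # Exactly one broker in the cluster should be the active controller.
        alerts.append(f"active controllers = {controllers}, expected 1")
    if under_replicated > 0:
        alerts.append(f"{under_replicated} under-replicated partition(s)")
    return alerts


healthy = {
    "broker-1": {"OfflinePartitionsCount": 0, "ActiveControllerCount": 1, "UnderReplicatedPartitions": 0},
    "broker-2": {"OfflinePartitionsCount": 0, "ActiveControllerCount": 0, "UnderReplicatedPartitions": 0},
}
degraded = {
    "broker-1": {"OfflinePartitionsCount": 0, "ActiveControllerCount": 1, "UnderReplicatedPartitions": 4},
    "broker-2": {"OfflinePartitionsCount": 0, "ActiveControllerCount": 1, "UnderReplicatedPartitions": 0},
}
print(evaluate_alerts(healthy))   # []
print(evaluate_alerts(degraded))
```

The controller check is summed across brokers because each broker reports 0 or 1, so a total of 0 or 2 indicates a cluster-level problem that no single broker's value reveals.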

Backup and Restore
● Internet wisdom = Kafka backups are not a thing
○ Rely on replication within the cluster or MirrorMaker replication to another cluster
● Hmm - we rarely use backups for Cassandra, but there have been a few times we’ve been very glad to have them
○ Hardware failure is not an issue, but corruption due to app bugs or user error can occur and be spread by replication
● Working on regular automated backup and restore of topic and security configuration
● Consider using Kafka Connect to write important messages to an offline backup

Give it a try!
14-day free trial at instaclustr.com
