Disaster Recovery Options Running Apache Kafka on Kubernetes
Rema Subramanian
Customer Success Technical Architect
Jennifer Snipes
Staff Customer Success Technical Architect
Contents
1. Resilient Kafka Architectures
2. Kafka & Kubernetes
3. Getting Started
4. Kubernetes Operator
5. Stretch Cluster on Kubernetes
6. Putting it to the Test
7. Wrapping it Up
8. Demo / Q & A
Resilient Kafka Architectures
Know your RTO & RPO
RPO
RPO (Recovery Point Objective) is how much data you can afford to lose before the loss impacts business operations. For example, for a banking system that processes live transactions, one hour of data loss can be catastrophic.
RTO
RTO (Recovery Time Objective) is the timeframe within which applications and systems must be restored after an outage.
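To make the two objectives concrete, here is a minimal Python sketch that checks one incident against an RPO and an RTO target. The `Incident` fields, the function name, and the sample numbers are hypothetical and not from the talk. It assumes the data lost can be approximated by the replication lag at the moment of failure.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    replication_lag_s: float  # assumed: window of un-replicated data at failure time
    downtime_s: float         # time from outage start until service was restored


def meets_objectives(incident: Incident, rpo_s: float, rto_s: float) -> dict:
    """Compare one incident against the RPO and RTO targets (all in seconds)."""
    return {
        "rpo_met": incident.replication_lag_s <= rpo_s,
        "rto_met": incident.downtime_s <= rto_s,
    }


if __name__ == "__main__":
    # Hypothetical incident: 45 s of lag was lost and failover took 10 minutes.
    print(meets_objectives(Incident(45, 600), rpo_s=60, rto_s=300))
```

In this hypothetical case the RPO target is met but the RTO target is not, which is the typical trade-off for the asynchronously replicated designs that follow.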
Active DC-1 / Passive DC-2
● Two independent clusters in different Data Centers / Regions
● Producers only in one Data Center
● Consumers in both Data Centers
● Multi-cloud / Multi-region
● One way Replication
● Asynchronous Replication
● RPO >0, RTO >0
Active DC-1 / Active DC-2
● Two independent clusters in different Data Centers / Regions
● Producers in both Data Centers
● Consumers in both Data Centers
● Multi-region / Multi-cloud
● Bi-Directional Replication with Provenance Headers
● Asynchronous Replication
● RPO >0, RTO >0
Stretch Cluster
● Single Cluster stretched across different Data Centers
● Producers write transparently across Data Centers
● Consumers in all Data Centers
● RPO = 0, RTO near 0
● Synchronous Replication native to cluster
● Asynchronous Replication with Observers*
● Replica & Observer placement defines Active-Active vs Active-Passive*
● Auto Observer Promotion*
● Multi-region*
Kafka & Kubernetes
Kafka Deployment Arenas
Traditional vs K8s
• Broker
• Hostnames /IPs
• Placement across DCs
• Communication across DCs
• Failure
• Broker and ZK co-location
• Multi-tenancy
Stretch Cluster on Kubernetes
Built-in Disaster Recovery on Kubernetes
● Kubernetes Operator
● Kafka Stretch Cluster
● Chaos Testing / Monitoring
Getting Started
Building the K8s Cluster
GCP VPC Native Cluster
● Alias IP address range for nodes, pods and services
● Requires non-overlapping CIDR ranges
GKE Cluster
● Separately Managed Node Pool
● Node machine type
● Configurable number of nodes distributed across AZs
● Distinct namespace per cluster
Networking
● VPC Native cluster installs routing
● Firewall rules
○ allow tcp between k8s clusters
○ allow access to 2181, 2888, 3888, 9092, 7778, 3000
StorageClass
● provisioner: kubernetes.io/gce-pd
● allowVolumeExpansion: true
● type: pd-ssd
● fstype: ext4
● reclaimPolicy: Retain
● volumeBindingMode: WaitForFirstConsumer
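The StorageClass settings above can be assembled into a manifest. The following Python sketch builds it as a dictionary and prints JSON, which `kubectl apply -f` accepts alongside YAML. The class name `kafka-ssd` is a hypothetical placeholder, and placing `type` and `fstype` under `parameters` is an assumption based on the standard layout for the `kubernetes.io/gce-pd` provisioner.

```python
import json


def gce_ssd_storage_class(name: str) -> dict:
    """Build a StorageClass manifest using the settings listed on the slide."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/gce-pd",
        # Provisioner-specific options: SSD persistent disk formatted as ext4.
        "parameters": {"type": "pd-ssd", "fstype": "ext4"},
        # Retain keeps the disk (and the broker's log data) if the PVC is deleted.
        "reclaimPolicy": "Retain",
        "allowVolumeExpansion": True,
        # Delay binding until a pod is scheduled, so the disk lands in the pod's zone.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


if __name__ == "__main__":
    # "kafka-ssd" is a hypothetical name chosen for this example.
    print(json.dumps(gce_ssd_storage_class("kafka-ssd"), indent=2))
```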
Networking between Kubernetes Clusters
stubDomains:
{ "west.svc.cluster.local":
["34.83.255.165"],
"central.svc.cluster.local":
["34.69.152.240"] }
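The stub domains above let one cluster's DNS forward lookups for another cluster's namespace zone to that cluster's DNS endpoint. As a sketch, the Python below generates the corresponding kube-dns ConfigMap for one region from a map of remote regions to DNS IPs. The function name is hypothetical. It assumes each cluster uses its region name as its namespace, as the zone names on the slide suggest, and that the target is the `kube-dns` ConfigMap in `kube-system`, which stores `stubDomains` as a JSON string.

```python
import json


def kube_dns_stub_domains(remote_dns: dict) -> dict:
    """Build a kube-dns ConfigMap that forwards each remote region's
    '<region>.svc.cluster.local' zone to that region's DNS endpoint."""
    stub_domains = {
        f"{region}.svc.cluster.local": [ip] for region, ip in remote_dns.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "kube-dns", "namespace": "kube-system"},
        # kube-dns reads the stubDomains value as a JSON-encoded string.
        "data": {"stubDomains": json.dumps(stub_domains)},
    }


if __name__ == "__main__":
    # Config for the east cluster, using the IPs shown on the slide.
    config = kube_dns_stub_domains(
        {"west": "34.83.255.165", "central": "34.69.152.240"}
    )
    print(json.dumps(config, indent=2))
```

Each cluster needs its own variant that lists the other two regions.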
Kubernetes Operator
Operator
CRDs
● Define various application components
● Medium to tie to Kafka server.properties
Controller
● CRD’s behavior
● Reconciliation loop
Services
● Headless Service
● Expose individual pods as external services
● Bootstrap LB service to get metadata
StatefulSets
● PVC claim
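The reconciliation loop mentioned under Controller can be illustrated with a toy example. This is a generic sketch of the pattern, not the Confluent operator's code: the function names and the in-memory dictionaries standing in for cluster state are hypothetical.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` toward `desired`.
    Keys are resource names; values are their specs."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions: list, desired: dict, actual: dict) -> None:
    """Apply the actions to the in-memory stand-in for cluster state."""
    for verb, name in actions:
        if verb == "delete":
            del actual[name]
        else:
            actual[name] = desired[name]


if __name__ == "__main__":
    desired = {"kafka-0": {"storage": "100Gi"}, "kafka-1": {"storage": "100Gi"}}
    actual = {"kafka-0": {"storage": "50Gi"}, "stale-service": {}}
    # Loop until observed state matches declared state.
    while actions := reconcile(desired, actual):
        print(actions)
        apply_actions(actions, desired, actual)
```

A real controller runs this comparison continuously against the Kubernetes API, which is why a custom resource is the place to change Kafka settings rather than the pods themselves.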
Pod Accessibility
Identifying the Kafka pods
• Unique broker IDs
• Internally, each pod resolves kafka-{n}.kafka.east.svc.cluster.local
• Externally, broker prefixes map to pod ordinals: {region}-b{n}.{domain}
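The two naming patterns above can be expressed as small helpers. This is an illustrative Python sketch: the function names and the `example.com` domain are hypothetical, and it assumes the headless service is named `kafka` and the namespace matches the region, as in the `east` example on the slide.

```python
def internal_dns(ordinal: int, namespace: str, service: str = "kafka") -> str:
    """In-cluster name of a broker pod: <pod>.<headless-service>.<namespace>.svc.cluster.local"""
    return f"{service}-{ordinal}.{service}.{namespace}.svc.cluster.local"


def external_dns(ordinal: int, region: str, domain: str) -> str:
    """External name of a broker: the prefix '<region>-b<ordinal>' maps to the pod ordinal."""
    return f"{region}-b{ordinal}.{domain}"


if __name__ == "__main__":
    for n in range(3):
        print(internal_dns(n, "east"), external_dns(n, "east", "example.com"))
```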
Kafka CRD Services Galore
Stretch Cluster on Kubernetes
East Cluster
● Single Kubernetes Cluster (us-east1-cluster1-gke)
● 3+ Brokers
● Single Zookeeper
Central Cluster
● Single Kubernetes Cluster (us-central1-cluster1-gke)
● Single Zookeeper
West Cluster
● Single Kubernetes Cluster (us-west1-cluster1-gke)
● 3+ Brokers
● Single Zookeeper
Multi Region (Stretch) Cluster
ZooKeeper Configuration
Kafka Configuration
Topic Replica Placement*
• Broker Rack Awareness
• Synchronous Replicas
• Asynchronous Observers*
• Observer Promotion Policy*
Putting it to the Test
Testing
Pod Kill
● Deleted a pod
Node Fault
● Node VM down
● Auto-scaler off & Node VM down
Pod Failure
● Introduced pod failures with chaos-mesh
Network Fault
● Edited kube-dns stub-domain
ZK Quorum Failure
● 2 ZK nodes down - disrupt quorum
Region Down
● Producers don’t stop
● Controller broker and topic leader broker move to west
● ZK west is accessible for write
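The pod failure test with chaos-mesh could be declared along the following lines. This Python sketch builds a Chaos Mesh `PodChaos` manifest as a dictionary and prints it as JSON. The experiment name, the `app: kafka` label, and the 60-second duration are hypothetical; the field layout follows the Chaos Mesh `v1alpha1` API as best understood here and should be checked against the Chaos Mesh documentation for the installed version.

```python
import json


def pod_failure_experiment(namespace: str, labels: dict, duration: str) -> dict:
    """Build a PodChaos manifest that makes one matching pod unavailable for `duration`."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": "kafka-pod-failure", "namespace": namespace},
        "spec": {
            "action": "pod-failure",  # keep the pod failing rather than just killing it once
            "mode": "one",            # pick a single pod among those selected
            "duration": duration,
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
        },
    }


if __name__ == "__main__":
    # The label selector is an assumption; use whatever labels the broker pods carry.
    print(json.dumps(pod_failure_experiment("east", {"app": "kafka"}, "60s"), indent=2))
```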
Pod Failure/Node Fault
ZK Quorum Failure
Region Down
Network Fault
Watching the Metrics
Wrapping it Up
Best Practices - Node to Pod Ratio
Pod Quantity
● Use eventsizer.io to derive the number of broker pods
● Adjust the count to balance across AZs
● Each AZ is a rack
Pod Capacity
● Use the eventsizer.io output to derive CPU and memory per pod
● Set CPU and memory limits and requests
Nodes
● Memory-optimized node type
● Evaluate node capacity based on average load
● Enable auto scaling with average and peak range
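Adjusting the broker count to balance across AZs amounts to rounding up to a multiple of the zone count. A minimal Python sketch follows; the function name, the zone names, and the sample count of 7 are hypothetical, with 7 standing in for whatever number the sizing tool suggests.

```python
import math


def balance_across_zones(min_brokers: int, zones: list) -> dict:
    """Round the broker count up to a multiple of the zone count and
    assign the same number of brokers to every zone (each zone is a rack)."""
    if not zones:
        raise ValueError("at least one zone is required")
    per_zone = math.ceil(min_brokers / len(zones))
    return {zone: per_zone for zone in zones}


if __name__ == "__main__":
    layout = balance_across_zones(7, ["us-east1-b", "us-east1-c", "us-east1-d"])
    print(layout, "total brokers:", sum(layout.values()))
```

With 7 required brokers and 3 zones, the layout rounds up to 3 brokers per zone, 9 in total.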
Best Practices
● Use Confluent for Kubernetes
● Choose the right storage that guarantees reliability, efficiency and speed e.g. SSD
○ Refer to ‘Building the K8s Cluster’ slide for other storage best practices
● Run health/liveness checks on:
○ Individual pod and bootstrap LBs
○ kube-system LBs
○ kube-dns service
● Use node affinity/pod anti-affinity to strategically place broker and ZK pods across AZs
● Use automation so infrastructure and CRDs can be deployed multi-region across all of your environments
● Follow best practices for tuning tcp socket buffers, replica fetcher, and clients for optimal performance with stretch clusters
● Minimum Durability Configuration
○ 2 Replicas and 2 Observers in each region
○ min.insync.replicas=3 (min ISR)
● Monitor everything!
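The minimum durability configuration can be checked with simple arithmetic. The Python sketch below is a simplified model, not Kafka code: it assumes two regions named `east` and `west`, producers using acks=all, and that promoted observers count toward the in-sync replica set. The function names are hypothetical.

```python
def write_spans_regions(sync_replicas: dict, min_isr: int) -> bool:
    """True if no single region holds enough sync replicas to satisfy min ISR
    alone, so every acknowledged write exists in at least two regions."""
    return min_isr > max(sync_replicas.values())


def isr_after_region_loss(sync_replicas: dict, observers: dict,
                          lost_region: str, promote_observers: bool) -> int:
    """Replicas available to the ISR after losing one region, with or
    without promoting the surviving observers."""
    remaining = sum(n for r, n in sync_replicas.items() if r != lost_region)
    if promote_observers:
        remaining += sum(n for r, n in observers.items() if r != lost_region)
    return remaining


if __name__ == "__main__":
    sync = {"east": 2, "west": 2}
    obs = {"east": 2, "west": 2}
    print(write_spans_regions(sync, min_isr=3))             # durability across regions
    print(isr_after_region_loss(sync, obs, "east", False))  # below min ISR of 3
    print(isr_after_region_loss(sync, obs, "east", True))   # observers restore min ISR
```

With 2 replicas per region and a min ISR of 3, every acknowledged write reaches both regions. After a region is lost only 2 sync replicas remain, which is below the min ISR, and promoting the 2 surviving observers brings the count to 4 so producers can continue.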
● Requires separate IP CIDR ranges
● CoreDNS exposed externally
● Restricted to single K8s implementation
● CRDs may get stuck if the finalizer logic in Operator is not finishing
● Manually restart stateful set if pods are erroring - known issue
● GCP VPCs are global, subnets are regional
Achieving your Desired Resilience
Active-Passive
● RTO > 0, RPO > 0
● Replicas in one Region
● Observers in another Region
● Under-replicated AOP
Active-Active
● RTO ~ 0, RPO = 0
● 2 Replicas in each region
● 2 Observers in each region
● Under-replicated AOP
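Both patterns come down to a topic replica placement constraint. The Python sketch below generates placement JSON in the shape used by Confluent's multi-region clusters feature. The field names (`version`, `replicas`, `observers`, `observerPromotionPolicy`) and the `under-replicated` policy value follow that feature's documentation as best recalled here, so verify them against reference 6. The rack names `east` and `west` and the active-passive counts of 3 are hypothetical.

```python
import json


def placement(replicas: dict, observers: dict, policy: str = "under-replicated") -> str:
    """Build replica placement JSON from {rack: count} maps for
    synchronous replicas and asynchronous observers."""
    def constraints(counts: dict) -> list:
        return [{"count": n, "constraints": {"rack": rack}} for rack, n in counts.items()]

    return json.dumps(
        {
            "version": 2,
            "replicas": constraints(replicas),
            "observers": constraints(observers),
            "observerPromotionPolicy": policy,
        },
        indent=2,
    )


if __name__ == "__main__":
    # Active-Active: 2 replicas and 2 observers in each region (from the slide).
    print(placement({"east": 2, "west": 2}, {"east": 2, "west": 2}))
    # Active-Passive: replicas in one region, observers in the other (counts assumed).
    print(placement({"east": 3}, {"west": 3}))
```

The rack values must match the `broker.rack` setting of the brokers in each region.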
Demo and Q&A
References
1. https://assets.confluent.io/m/69c5ce7aff462f44/original/20180619-WP-Recommendations_for_Deploying_Apache_Kafka_on_Kubernetes.pdf
2. https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
3. https://learn.hashicorp.com/tutorials/terraform/gke
4. https://chaos-mesh.org/docs/
5. https://docs.confluent.io/operator/current/overview.html
6. https://docs.confluent.io/platform/current/multi-dc-deployments/multi-region.html
7. https://www.confluent.io/en-gb/events/kafka-summit-americas-2021/a-tale-of-2-n-data-centers-tuning-apache-kafka-clusters-to-combat-latency/
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subramanian and Jennifer Snipes | Kafka Summit London 2022
