Mastering Kafka Consumer Distribution: A Guide to Efficient Scaling and Resource Optimization

Olena Kutsenko, Sr. Developer Advocate, Aiven
Olena Babenko, Staff Software Engineer, Aiven
Contact: olena@aiven.io, @OlenaKutsenko, aiven.io

➔ why scaling consumers is not always desirable
➔ why consumer lag isn’t the metric you want to rely on
➔ how not to scale stateful consumers
➔ what the most anticipated change in the rebalancing protocol is
➔ how to find the right balance between latency, durability and costs

1 Definition
● What is rebalancing?
● Why do we need it?

[Diagram: producers write to a topic with Partitions 1-4, spread over Broker 1 and Broker 2. A consumer group of Consumer 1, Consumer 2 and Consumer 3 reads the partitions, and then Consumer 4 joins the group.]
Side effects of rebalancing:
● Increased consumer lag, latency and reduced throughput
● Increased resource utilization
● Potential data duplication or data loss
● Increased complexity
We need efficient rebalancing for:
● Scalability
● Elasticity
● Fault tolerance
Moving ownership from one consumer to another is called a rebalance

Rebalancing has a lot in common with
cooking

We want to scale efficiently and effectively:
1. Know when to scale (and when not)
2. Minimize unnecessary data movement
3. Avoid unnecessary rebalancing

Rebalancing is teamwork

[Diagram: the same consumer group (Consumers 1-4) reading Partitions 1-4 from Broker 1 and Broker 2, with two roles highlighted.]
Group coordinator: a broker
Group leader: a consumer

Status quo
3 Incremental
cooperative
rebalance

[Diagram sequence: a new consumer joins the group]
Starting state: the group coordinator tracks Consumer 1 (member.id=c1, partitions 1 & 2) and Consumer 2 (member.id=c2, partition 3).
1. Consumer 3 (new) sends JoinGroupRequest {member.id=unknown} to the group coordinator.
2. The coordinator answers the heartbeats of Consumer 1 and Consumer 2 with HeartbeatResponse {REBALANCE_IN_PROGRESS}.
3. Consumer 1 and Consumer 2 rejoin with JoinGroupRequest {member.id=c1} and JoinGroupRequest {member.id=c2}.
4. The coordinator sends a JoinGroupResponse to every consumer. The response to the group leader carries {memberId, member list & subscriptions}.
5. Every consumer sends a SyncGroupRequest. The group leader's request carries {assignment plan}.
6. The coordinator answers every consumer with SyncGroupResponse {assignment plan}.

You probably already see some bottlenecks...

Drama points
1. Group coordinator is too slow
2. Group leader is too slow
3. Some of the consumers are too slow

# of consumers    Probability of success per instance    Overall probability
6                 99%                                    0.99^6 = 0.94 = 94%
100               99%                                    0.99^100 = 0.366 = 37%
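
The arithmetic in the table can be reproduced with a short sketch. It assumes, as the table does, that every instance succeeds independently with the same probability; the function name is made up for illustration.

```python
def overall_success_probability(per_instance: float, consumers: int) -> float:
    """Probability that ALL consumers succeed, assuming independent instances."""
    return per_instance ** consumers


# The two rows from the table: 6 and 100 consumers at 99% each.
for n in (6, 100):
    p = overall_success_probability(0.99, n)
    print(f"{n} consumers: {p:.3f}")
```

The more consumers take part in a rebalance, the more likely it is that at least one of them holds it up.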

4. A new node is stuck in rebalancing

5. onPartitionsRevoked dark hole
Consumers apply the new assignment plan:
1. Work out which partitions are newly assigned and which are now revoked
2. Start reading from the newly assigned partitions
3. If any existing partitions are revoked, trigger a new rebalance
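
The three steps above can be sketched with plain set arithmetic. This is only an illustration of the bookkeeping, not the Kafka client's code; the function name and the example assignment are hypothetical.

```python
def diff_assignment(current: set[int], new: set[int]) -> tuple[set[int], set[int]]:
    """Step 1: split a new assignment into newly assigned and revoked partitions."""
    newly_assigned = new - current
    revoked = current - new
    return newly_assigned, revoked


# Hypothetical example: Consumer 1 owned partitions 1 & 2 and now keeps only 1.
newly_assigned, revoked = diff_assignment(current={1, 2}, new={1})

# Step 2 would start fetching from `newly_assigned` (empty here).
# Step 3: any revoked partition means another rebalance round is needed
# before that partition can be handed to its new owner.
needs_followup_rebalance = bool(revoked)
print(newly_assigned, revoked, needs_followup_rebalance)
```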

Scale horizontally vs vertically
Vertical scaling pros:
- Less rebalancing, if static members are used (group.instance.id config).
- More flexible when you run out of resources (CPU, RAM, disk, etc.).
Vertical scaling cons:
- Lots of partitions on one node is not always good either: one hot partition could hog all resources.
- Bigger machines are not always possible.
- State might be lost.
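
A minimal sketch of what static membership looks like in consumer configuration. The property names are standard Kafka consumer settings; the helper function, the broker address, the IDs and the timeout value are placeholders chosen for illustration.

```python
def static_member_config(group_id: str, instance_id: str) -> dict[str, str]:
    """Build consumer properties for a static group member (illustrative values)."""
    return {
        "bootstrap.servers": "localhost:9092",  # assumption: local test broker
        "group.id": group_id,
        # A stable, unique ID per instance makes the consumer a static member.
        "group.instance.id": instance_id,
        # A static member that restarts and rejoins within this window
        # keeps its partitions without triggering a rebalance.
        "session.timeout.ms": "45000",  # assumption: tune to your restart time
    }


config = static_member_config("orders", "orders-consumer-0")
print(config)
```

The instance ID has to survive restarts, so it is usually derived from something stable such as a pod ordinal or host name.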

Horizontal scaling is time-consuming and risky.
How can we make it more efficient?

Built-in consumer metrics
records-lag-max
Lag is a very common metric for spotting that too much is going on, especially if the lag shows up across ALL or most partitions.
records-consumed-rate
The average number of records consumed per second.
join-rate
If only one partition is lagging, that might be an error or a problem with group joins. join-rate helps to monitor whether something is wrong with rebalancing.

Pros:
- It is the simplest option to start with
Cons:
- They are collected on the consumer side and depend on the consumer's health and state

Generic Kafka cluster metrics
- Also have lag info
- Less biased
- More info about producers and event production
- Additional important info about group coordinator health

Autoscale

KEDA (Apache Kafka scaler)
lagThreshold
Can be tuned to scale instances based on lag.
activationLagThreshold
The activating (or deactivating) phase is the moment when KEDA (operator) has to decide whether the workload should be scaled from/to zero.
+ many more
+ a lot more if you choose the Prometheus trigger (custom metrics)

Knative
- Scales up and down faster, based on the number of events
- You can scale a Kafka Source using KEDA
- Great at handling spikes
- Reusability of resources
- Keeps the same pod identity while replacing nodes (reduces the amount of rebalancing during failures)
- More complicated

Lag only grows after autoscale

Autoscaling on lag is not always optimal
- Lots of joins and rebalances can make event consumption slower, so even more nodes get requested as a result
- Too much pressure on one leader node
- The lag metric doesn't answer the question of WHAT CAUSED THE LAG (it is not always a lack of resources)
As a result:
- Fast autoscaling might be problematic

Lag == Money 💰 ?

Lag == Money 💰 !!
Time
⌛

Time is a more universal unit
- Lag depends on message sizes, batch.size and linger.ms
- Time is a more universal unit for many businesses: you probably know how much it costs to delay an order for 2 hours, or to pay for 5 minutes of website downtime
- AWS, Confluent, Aiven, etc. usually provide server-side time-related metrics such as Estimated Time Lag or Latency

Simplest way to calculate time lag
- Take the latest offset from a consumer group
- Read the timestamp of the committed/consumed message from the topic
- Compare it with the current time
Pros:
- Accurate
Cons:
- You need to fetch the whole message (which might be big)
- You need to do this quite often
- It does not scale well across multiple producers, consumer groups, topics and partitions
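
The comparison in the last step can be sketched as below. The sketch assumes the record timestamp has already been read from the topic and is in epoch milliseconds, which is how Kafka stores record timestamps; fetching the record is left out, and the function name is made up.

```python
from datetime import datetime, timezone


def time_lag_seconds(record_timestamp_ms: int, now: datetime) -> float:
    """Seconds between the last consumed record's timestamp and `now`."""
    record_time = datetime.fromtimestamp(record_timestamp_ms / 1000, tz=timezone.utc)
    return (now - record_time).total_seconds()


# Hypothetical values: the last committed record was produced at 09:00:00 UTC
# and we check at 09:00:30 UTC.
produced_at = datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 1, 9, 0, 30, tzinfo=timezone.utc)
lag = time_lag_seconds(int(produced_at.timestamp() * 1000), checked_at)
print(lag)
```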

Serglo
- Builds an interpolation table to eliminate the disadvantages of the simple method
- The latest committed/consumed message gets an approximated (predicted) timestamp
- The predicted timestamp is compared with the current time to return the time lag
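
The interpolation idea can be illustrated with a generic linear interpolation over sampled (offset, timestamp) pairs. This is a sketch of the technique only, not the tool's actual implementation; the table contents and the function name are made up.

```python
from bisect import bisect_right


def predict_timestamp(table: list[tuple[int, float]], offset: int) -> float:
    """Estimate a record's timestamp (seconds) from sorted (offset, timestamp) samples."""
    offsets = [o for o, _ in table]
    i = bisect_right(offsets, offset)
    if i == 0:  # before the first sample: clamp to it
        return table[0][1]
    if i == len(table):  # at or after the last sample: clamp to it
        return table[-1][1]
    (o0, t0), (o1, t1) = table[i - 1], table[i]
    # Linear interpolation between the two surrounding samples.
    return t0 + (t1 - t0) * (offset - o0) / (o1 - o0)


# Hypothetical samples: offset 100 was produced at t=1000s, offset 200 at t=1010s.
samples = [(100, 1000.0), (200, 1010.0)]
predicted = predict_timestamp(samples, 150)
time_lag = 1020.0 - predicted  # "now" is t=1020s in this example
print(predicted, time_lag)
```

Only offsets and sampled timestamps are needed, so no message bodies have to be fetched.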


Aiven time lag predictor

Aiven lag predictor
Checkpoint 1: 09:00
Checkpoint 2: 09:05
Consumption speed:
100 - 70 = 30 records per 5 seconds
= 30 / 5 = 6 records per second
Left to consume:
180 - 100 = 80 records
= 80 / 6 = 13.333 seconds to catch up
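
The same calculation as a small sketch, using the numbers from the slide. Like the slide, it only extrapolates the consumption speed between two checkpoints and ignores records produced while catching up; the function name is hypothetical and this is not the predictor's actual code.

```python
def seconds_to_catch_up(consumed_before: int, consumed_now: int,
                        produced_now: int, interval_seconds: float) -> float:
    """Estimate catch-up time from two consumer checkpoints."""
    rate = (consumed_now - consumed_before) / interval_seconds  # records per second
    if rate <= 0:
        return float("inf")  # nothing is being consumed, so we never catch up
    return (produced_now - consumed_now) / rate


# Consumed offset went from 70 to 100 in 5 seconds; 180 records were produced.
estimate = seconds_to_catch_up(70, 100, 180, 5)
print(f"{estimate:.3f} seconds")
```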

Aiven lag predictor
kafka_lag_predictor_group_lag_predicted_seconds estimates how much time you need to catch up at the current producing and consuming speed.
OR
It estimates WHEN an event that was published right NOW will be consumed.
More data points give more precise results.

Compare
- Serglo works at the level of an individual message (it deduces a timestamp for the single latest message), while the Aiven lag predictor looks at overall speed. Both can be useful in the right context
- Either option usually works well
- They can give slightly different results, and the expectations might differ

More metrics for better conclusion

Aiven lag predictor
Server-side metrics, defined at the same time:
`kafka_lag_predictor_topic_produced_records_total` represents the total count of records produced (per partition).
`kafka_lag_predictor_group_consumed_records_total` represents the total count of records consumed (per partition).

3. Avoid unnecessary rebalancing -
Scale multiple instances at once

Scaling ratio
Aiven lag predictor (per partition):
- Change over time of
AVG(kafka_lag_predictor_topic_produced_records_total /
kafka_lag_predictor_group_consumed_records_total)
Client-side alternative (per topic):
- Change over time of AVG(record-send-total / records-consumed-total)
- OR, per second, record-send-rate / records-consumed-rate
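
One possible way to turn such a ratio into a single scaling step is sketched below. It assumes that throughput grows roughly linearly with the number of consumers and that the partition count caps the useful group size; both function names and the numbers are illustrative, not a recommended policy.

```python
import math


def scaling_ratio(produced_delta: float, consumed_delta: float) -> float:
    """Records produced per record consumed over the same time window."""
    if consumed_delta <= 0:
        return float("inf")
    return produced_delta / consumed_delta


def instances_needed(current: int, ratio: float, max_instances: int) -> int:
    """Target group size for ONE scaling step (assumes linear throughput)."""
    if ratio <= 1:
        return current  # consumers keep up, no scale-up needed
    if math.isinf(ratio):
        return max_instances
    # Consumers beyond the partition count would sit idle, hence the cap.
    return min(max_instances, math.ceil(current * ratio))


# Hypothetical window: 900 records produced, 300 consumed, 2 consumers, 12 partitions.
ratio = scaling_ratio(900, 300)
target = instances_needed(current=2, ratio=ratio, max_instances=12)
print(ratio, target)
```

Jumping straight to the target size means one rebalance instead of one per added instance.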

Identify other issues that caused lag
3. Avoid unnecessary rebalancing

Rebalancing issues
- Max over time, for all consumers of the topic, of
kafka_lag_predictor_group_consumed_records_total == 0
New server-side metrics (KIP-714):
- consumer.coordinator.assigned.partitions != 0
- consumer.coordinator.rebalance.latency.max
Client-side alternative:
- join-rate (Consumer)

Consumption issues
- Max over time per partition, per topic
kafka_lag_predictor_group_consumed_records_total == 0
- consumer.fetch.manager.fetch.latency.max
- consumer.node.request.latency.max
- consumer.connection.creation.total

Production issues (Scale down)
- Max over time of
kafka_lag_predictor_topic_produced_records_total == 0
- Hot partition: compare
MAX(kafka_lag_predictor_topic_produced_records_total) with
AVG(kafka_lag_predictor_topic_produced_records_total)
- producer.record.queue.time.max
- producer.node.request.latency.max

What is important for your business?
- Define relevant business rules to predict problems
- Rules can be a combination of different metrics:
- Time lag estimation
- Producer health/speed
- Consumer health/speed
- Group coordinator metrics
Example: Burrow. It takes offsets and other metrics and transforms them into a status.

Conclusion
- Use static groups when possible
- Include/try server-side metrics + new broker metrics
- Reactive scaling is not always good (it is better to predict lag than to act when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once and do not overload your group leader, to reduce rebalancing <- Could we do better?

Yes!

A new consumer protocol! Yay!

KIP-848: The Next Generation of the Consumer Rebalance Protocol

Should we redo everything right now?

New consumer group protocol
- Consumers are not responsible for keeping state
- The leader is not responsible for calculating the assignment
- Simpler
- Not all problems are gone

- Pay more attention to broker health. This might be another important dimension for your metrics
- Life for consumers should become easier, and some metrics become obsolete
- Life for Kafka providers like Confluent, AWS and Aiven becomes harder, but that is not your problem ;)

Conclusion
- Use static groups when possible
- Include/try server-side metrics + new broker metrics
- Reactive scaling is not always good (it is better to predict lag than to act when the consumer group is already lagging)
- Scale based on your business needs
- Scale X instances at once and do not overload your group leader <- Not a problem anymore
- You can try the new protocol version soon (3.7 preview)

Stateful
consumers
6 Assignors

Assignors
➔ RangeAssignor
➔ RoundRobinAssignor
➔ CooperativeStickyAssignor

olena@aiven.io
RangeAssignor

olena@aiven.io
RangeAssignor
Topic 2
Partition 1
Partition 2
Partition 1
Partition 2
Consumer 1
Consumer 2
Consumer 3
Topic 1
Consumer group
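
A simplified sketch of range assignment for the setup in the diagram (two topics with two partitions each, three consumers). It assumes every consumer subscribes to both topics and uses the slide's 1-based partition labels; it illustrates the idea and is not the Kafka implementation.

```python
def range_assign(topics: dict[str, int], consumers: list[str]) -> dict[str, list[tuple[str, int]]]:
    """Assign each topic's partitions to sorted consumers in contiguous ranges."""
    members = sorted(consumers)
    assignment: dict[str, list[tuple[str, int]]] = {m: [] for m in members}
    for topic in sorted(topics):
        per_member, extra = divmod(topics[topic], len(members))
        start = 1  # partitions are labelled from 1, as on the slide
        for i, member in enumerate(members):
            # The first `extra` members get one additional partition.
            count = per_member + (1 if i < extra else 0)
            assignment[member].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


result = range_assign({"topic-1": 2, "topic-2": 2}, ["consumer-1", "consumer-2", "consumer-3"])
for member, partitions in result.items():
    print(member, partitions)
```

Because the ranges are computed per topic, the first consumers collect a partition from every topic, while Consumer 3 stays idle in this setup.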

RoundRobinAssignor
[Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to a consumer group of Consumer 1, Consumer 2 and Consumer 3]
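
A simplified sketch of round-robin assignment for the same setup: all partitions of all topics are laid out in one sorted list and dealt to the consumers in turn. It assumes every consumer subscribes to both topics and is not the Kafka implementation.

```python
from itertools import cycle


def round_robin_assign(topics: dict[str, int], consumers: list[str]) -> dict[str, list[tuple[str, int]]]:
    """Deal all topic partitions to sorted consumers one at a time."""
    members = sorted(consumers)
    assignment: dict[str, list[tuple[str, int]]] = {m: [] for m in members}
    # One flat, sorted list of every partition (1-based labels, as on the slide).
    all_partitions = [(t, p) for t in sorted(topics) for p in range(1, topics[t] + 1)]
    for partition, member in zip(all_partitions, cycle(members)):
        assignment[member].append(partition)
    return assignment


result = round_robin_assign({"topic-1": 2, "topic-2": 2}, ["consumer-1", "consumer-2", "consumer-3"])
for member, partitions in result.items():
    print(member, partitions)
```

Here every consumer gets work: the four partitions are spread 2/1/1 across the three consumers.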

CooperativeStickyAssignor
[Diagram: Topic 1 and Topic 2, each with Partition 1 and Partition 2, assigned to a consumer group of Consumer 1, Consumer 2 and Consumer 3]

Remember
● Scaling infinitely is not possible
● Use static groups and CooperativeStickyAssignor
● Pay attention to broker and consumer health
● Predict lag rather than act when the consumer group is already lagging
● Define business rules and control them with metrics

Olena Kutsenko
twitter.com/OlenaKutsenko
linkedin.com/in/olenakutsenko
Olena Babenko
linkedin.com/in/melhelen/
https://github.com/anelook/mastering-kafka-consumer-distribution
Find us at #108
Register for Aiven for Apache Kafka and get extra credits:
