Transactions in Action
The Story of Exactly Once in Apache Kafka®
What are transactions?
● All or nothing
● Well-known in databases
● Similar notion in Kafka
○ Read then write using streams
● We expect failures!
What are “EXACTLY ONCE” semantics?
● At most once semantics
● At least once semantics
● If you can fulfill both, you get exactly once!
[Figure: "At most once" and "At least once", with EOS where both hold]
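To make the two weaker guarantees concrete, here is a minimal sketch of a toy producer. It is not Kafka code: the function `send_all` and its parameters are hypothetical, and failures are scripted (a lost request or a lost acknowledgement on the first attempt only) so the outcome is deterministic.

```python
def send_all(messages, lost_requests, lost_acks, retry):
    """Toy send loop. A failure only ever hits the first attempt of a message."""
    log = []
    for msg in messages:
        attempt = 0
        while True:
            attempt += 1
            first = attempt == 1
            # The request reaches the log unless it was lost on the first attempt.
            written = not (first and msg in lost_requests)
            if written:
                log.append(msg)
            # The producer learns about the write unless the ack was lost.
            acked = written and not (first and msg in lost_acks)
            if acked or not retry:
                break
    return log


messages = ["a", "b", "c"]
at_most_once = send_all(messages, lost_requests={"b"}, lost_acks={"c"}, retry=False)
at_least_once = send_all(messages, lost_requests={"b"}, lost_acks={"c"}, retry=True)
```

Without retries, "b" is lost (at most once). With retries, "b" arrives but "c" is written twice (at least once). Exactly once needs retries plus a way to discard the duplicate.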
What is IDEMPOTENCY?
An operation can be performed multiple times and will always result in the same outcome.
Example: x=20 is idempotent; x++ is not.
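The two operations on the slide can be compared directly. This is a small illustration with hypothetical helper names (`set_to_20`, `increment`, `apply_n`), not anything from Kafka.

```python
def set_to_20(x):
    # Assignment: the result does not depend on how often it runs.
    return 20


def increment(x):
    # Increment: every extra run changes the result.
    return x + 1


def apply_n(op, x, n):
    """Apply op to x a total of n times and return the final value."""
    for _ in range(n):
        x = op(x)
    return x


once = apply_n(set_to_20, 0, 1)
thrice = apply_n(set_to_20, 0, 3)
inc_once = apply_n(increment, 0, 1)
inc_thrice = apply_n(increment, 0, 3)
```

Retrying `set_to_20` is harmless, while retrying `increment` changes the outcome. This is why a retrying producer needs extra help to avoid duplicates.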
Can we have transactions and EOS in Kafka?
Enter KIP-98!
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging)
[Figure: KIP-98 message flow between Producer, Broker, Transaction Coordinator, and Group Coordinator, with steps numbered 1 to 5.3. Log entries shown: Txn Id -> PID, Insert tp (PID), Prepare (PID), Commit (PID); tp0-offset-x (PID), tp0-offset-y (PID), Commit (PID); M0 (PID), M1 (PID), Commit (PID).]
How do transactions work?
● Producer initiates transaction and sends data
○ Add data
○ Commit offsets
○ Abort or commit the transaction
● Transaction Coordinator tracks transactional metadata
○ Stores + persists ongoing transactions, transaction metadata
○ Sends markers to appropriate data partitions
○ Calls off the transaction if it is ongoing too long
How do transactions work?
● Brokers host data partitions
○ Contain transactional record data
○ Records contain producer metadata
● Group Coordinators host group metadata and offsets
○ Optional transaction participant
○ Supports streams/read + write EOS applications
How do transactions work?
1. Producer initializes transactions
2. Producer produces data
a. Add partition
b. Produce data
3. Producer may add offsets
4. Producer requests to abort or commit
5. Transaction coordinator sends markers to partitions
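The numbered steps can be traced with a toy in-memory model. This is only a sketch: `ToyTransactionCoordinator`, `committed_values`, and the tuple layout of log entries are hypothetical, step 3 (offsets) is left out, and nothing here talks to a real broker.

```python
class ToyTransactionCoordinator:
    def __init__(self):
        self.pids = {}        # transactional id -> producer id (PID)
        self.partitions = {}  # PID -> partitions added to the open transaction

    def init_producer_id(self, txn_id):
        # Step 1: map the transactional id to a PID.
        pid = self.pids.setdefault(txn_id, len(self.pids))
        self.partitions[pid] = set()
        return pid

    def add_partition(self, pid, partition):
        # Step 2a: register the partition before data is produced to it.
        self.partitions[pid].add(partition)

    def end_txn(self, pid, logs, commit):
        # Steps 4 and 5: write a marker to every partition that was added.
        marker = "COMMIT" if commit else "ABORT"
        for partition in sorted(self.partitions[pid]):
            logs[partition].append((pid, "marker", marker))
        self.partitions[pid] = set()


def committed_values(log):
    """Values whose transaction ended with a COMMIT marker (toy read_committed view)."""
    pending, visible = {}, []
    for pid, kind, value in log:
        if kind == "data":
            pending.setdefault(pid, []).append(value)
        elif value == "COMMIT":
            visible.extend(pending.pop(pid, []))
        else:
            pending.pop(pid, None)  # ABORT: drop the pending records
    return visible


logs = {"tp0": [], "tp1": []}
coordinator = ToyTransactionCoordinator()
pid = coordinator.init_producer_id("demo-txn")       # step 1
for partition, value in [("tp0", "M0"), ("tp1", "M1")]:
    coordinator.add_partition(pid, partition)         # step 2a
    logs[partition].append((pid, "data", value))      # step 2b
coordinator.end_txn(pid, logs, commit=True)           # steps 4 and 5
```

In this model the marker is what makes records visible: data written under a PID stays pending until a COMMIT marker for that PID lands in the same partition.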
[Figure: the message flow between Producer, Broker, Transaction Coordinator, and Group Coordinator, repeated with the numbered steps 1 to 5.3.]
What’s in the logs?
[Figure: contents of the three logs]
● Transaction Coordinator: Txn Id -> PID, Insert tp (PID), Prepare (PID), Commit (PID)
● Group Coordinator: tp0-offset-x (PID), tp0-offset-y (PID), Commit (PID)
● Data Partition: M0 (PID), M1 (PID), Commit (PID)
How to ensure ordering and no duplicates?
● Each message is assigned a sequence number
● Broker keeps track of the last 5 batches to deduplicate
○ If an incoming message matches a stored sequence, it is rejected as a duplicate
● If the sequence is not the next expected one, the broker returns OutOfOrderSequence
● Do idempotent producers guarantee idempotency?
○ KIP-98 promises EOS, but there are some scenarios that were missed.
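A minimal sketch of the sequence check described above, assuming a single producer and one sequence number per batch. `ToyPartition` and its return values are hypothetical; the real broker tracks this state per producer ID and partition.

```python
from collections import deque


class OutOfOrderSequence(Exception):
    """Raised when a batch is neither a known duplicate nor the next in sequence."""


class ToyPartition:
    def __init__(self, cache_size=5):
        self.log = []
        self.recent = deque(maxlen=cache_size)  # sequences of the last few batches
        self.next_seq = 0

    def append(self, seq, batch):
        if seq in self.recent:
            # Retry of a batch that is already in the log: do not append again.
            return "duplicate"
        if seq != self.next_seq:
            raise OutOfOrderSequence(f"expected {self.next_seq}, got {seq}")
        self.log.append(batch)
        self.recent.append(seq)
        self.next_seq += 1
        return "appended"


partition = ToyPartition()
results = [partition.append(0, "a"), partition.append(1, "b"), partition.append(1, "b")]
```

The bounded cache is the interesting design point: a retry of a batch older than the last five is no longer recognized as a duplicate and surfaces as a sequence error instead.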
What happened to KIP-185?
● Can we make idempotency the default?
● OutOfOrderSequence error was too ambiguous
○ real data loss, retriable, or unclear – all seen as fatal!
● KAFKA-5793 tried to address known scenarios
○ OutOfOrderSequence should represent real data loss
○ UnknownProducerId to cover retention errors, trigger epoch bump
(https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Exactly+Once+-+Solving+the+problem+of+spurious+OutOfOrderSequence+errors)
Idempotent Producer problems continue...
● UnknownProducerId error could lead to more confusing scenarios
○ Sequence would be reset in ambiguous cases – was the record written?
○ Still some fatal cases
● Leader failovers can result in OutOfOrderSequence errors
How does KIP-360 change the EOS producer?
● Safe epoch bumping (KAFKA-8710)
● UnknownProducerId becomes rare and abortable, never fatal
○ Removed from the server code entirely!
○ We accept nonzero sequences in most cases
○ Can get stuck in retriable loop (KAFKA-14359)
● Store producer state for longer
(https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89068820)
KIP-484 and Slow Loading
● Transaction coordinator partition persists metadata
● Reassign partition or restart → load from disk
● More data on disk → slower loading time
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration)
KIP-691 and error handling
● Error handling not consistent
● Compatibility vs consistency
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling)
KIP-679, KIP-854, OOM, and you
● KIP-679 established idempotency as the default
● Some users spin up one-time use producers
● Storing too much producer state causes OOM
● Transactional and Producer state use same config
○ Transactional metadata deletion can cause InvalidPidMapping errors
● KIP-854 separated these configs
○ 1 day default
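The memory problem and its fix can be illustrated with a toy expiry pass over producer state. This is a sketch only: `expire_producer_ids` and the shape of the state map are hypothetical, and the one-day window simply mirrors the default named on the slide.

```python
ONE_DAY_MS = 24 * 60 * 60 * 1000


def expire_producer_ids(last_seen_ms, now_ms, expiration_ms=ONE_DAY_MS):
    """Keep only the producer ids that were active within the expiration window."""
    return {
        pid: seen
        for pid, seen in last_seen_ms.items()
        if now_ms - seen < expiration_ms
    }


# pid 1 belongs to a one-time producer that went quiet; pid 2 is still active.
state = {1: 0, 2: ONE_DAY_MS}
remaining = expire_producer_ids(state, now_ms=ONE_DAY_MS + 1)
```

A shorter window bounds memory when many short-lived producers come and go, at the cost of forgetting sequence state for producers that return after the window.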
References to previous KIPs
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default)
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry)
What are hanging transactions?
● Txn Coordinator sends markers to all added partitions
● What if partition wasn’t added?
● How could it not be added?!
○ Buggy client
○ Race conditions
Why are hanging transactions an issue?
● Last Stable Offset (LSO) gets stuck
● READ_COMMITTED consumers can’t read past
● Log cleaner can’t clean past, large partitions
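[Figure: partition log, in order: 0 (PID-2), 1 (PID-2), Commit (PID-2), 0 (PID-7), 0 (PID-5), 2 (PID-2), Commit (PID-2), Commit (PID-5), 3 (PID-2), with the LSO position marked as the log grows. PID-7 never receives a marker.]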
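A minimal sketch of why one unfinished transaction pins the LSO, using the log from the figure. The function `last_stable_offset` is hypothetical, and list positions stand in for offsets, which is an assumption about the figure rather than something the slide states.

```python
def last_stable_offset(log):
    """First offset of the earliest open transaction, or the log end if none is open."""
    open_txns = {}  # pid -> offset of the first record in its open transaction
    for offset, (pid, kind) in enumerate(log):
        if kind == "data":
            open_txns.setdefault(pid, offset)
        else:
            # A commit or abort marker closes the transaction for this pid.
            open_txns.pop(pid, None)
    return min(open_txns.values(), default=len(log))


log = [
    ("PID-2", "data"), ("PID-2", "data"), ("PID-2", "commit"),
    ("PID-7", "data"),                      # no marker ever follows: hanging
    ("PID-5", "data"), ("PID-2", "data"),
    ("PID-2", "commit"), ("PID-5", "commit"),
    ("PID-2", "data"),
]
stuck_lso = last_stable_offset(log)
healthy_lso = last_stable_offset([entry for entry in log if entry[0] != "PID-7"])
```

With the PID-7 record present, the LSO stays at its offset no matter how many later transactions commit. Without it, the LSO advances to the start of the newest open transaction.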
KIP-890
● Part 1 eliminates hanging transactions on all clients
○ Extra interbroker hop
○ Can still add records to the wrong transaction
● Part 2 includes new client changes to strengthen EOS
○ Implicitly adds partition, eliminates extra hop
○ Epoch + transaction id will uniquely identify txn
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense)
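The Part 1 idea, checking with the coordinator before accepting a transactional write, can be sketched as follows. `ToyVerifyingCoordinator` and `transactional_append` are hypothetical names, and the check is reduced to a set lookup.

```python
class ToyVerifyingCoordinator:
    def __init__(self):
        self.ongoing = set()  # (pid, partition) pairs added to an open transaction

    def add_partition(self, pid, partition):
        self.ongoing.add((pid, partition))

    def verify(self, pid, partition):
        return (pid, partition) in self.ongoing


def transactional_append(log, coordinator, pid, partition, record):
    """Append only if the coordinator knows this partition is part of the transaction."""
    # This lookup stands in for the extra interbroker hop.
    if not coordinator.verify(pid, partition):
        return False
    log.append((pid, record))
    return True


coordinator = ToyVerifyingCoordinator()
tp0_log = []
rejected = transactional_append(tp0_log, coordinator, pid=7, partition="tp0", record="late")
coordinator.add_partition(7, "tp0")
accepted = transactional_append(tp0_log, coordinator, pid=7, partition="tp0", record="ok")
```

A write that the coordinator does not know about never reaches the log, so no transaction is left open without a marker to close it.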
KIP-936
● Does producer ID expiration cause loss of idempotency?
● Can we throttle the creation of new producer IDs?
● Availability vs correctness
● How can we identify “misbehaving” clients?
○ KIP-936 suggests throttling by user
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-936%3A+Throttle+number+of+active+PIDs)
Idempotent Producers V2
● Clear semantics
○ Ordering
○ No duplicates
● Bounded memory usage
● Contract between client and server
● Use across sessions?
KIP-939 and the future of 2PC in Kafka
● What if an external system (e.g., a database) and Kafka need to stay in sync?
● Two-phase commit (2PC)
○ Prepare, then commit
○ Kafka does this implicitly internally
○ Want to extend to external systems/coordinators
● FLIP-319 for Flink to use Kafka 2PC transactions
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC)
(https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710)
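For readers new to two-phase commit, here is a generic textbook sketch of "prepare, then commit" across two participants. It does not use or imitate the KIP-939 API; `ToyParticipant` and `two_phase_commit` are hypothetical.

```python
class ToyParticipant:
    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        # Phase 1: promise to commit if asked, or vote no.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every participant prepared successfully."""
    votes = [p.prepare() for p in participants]  # phase 1: ask everyone
    if all(votes):
        for p in participants:                   # phase 2: commit
            p.commit()
        return True
    for p in participants:                       # phase 2: roll back
        p.abort()
    return False


db, kafka = ToyParticipant("db"), ToyParticipant("kafka")
both_ok = two_phase_commit([db, kafka])

bad_db, kafka2 = ToyParticipant("db", can_prepare=False), ToyParticipant("kafka")
one_failed = two_phase_commit([bad_db, kafka2])
```

The outcome is all or nothing across both systems: either both end up committed or both end up aborted.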
baseOffset: 123456 lastOffset: 123456 count: 1 baseSequence: -1 lastSequence: -1
producerId: 123456 producerEpoch: 0 partitionLeaderEpoch: 0 isTransactional: true
isControl: true deleteHorizonMs: OptionalLong.empty position: 123456789 CreateTime:
1695839400 size: 78 magic: 2 compresscodec: none crc: 1234567890 isvalid: true
| offset: 123456 CreateTime: 1691433940165 keySize: 4 valueSize: 6 sequence: -1
headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 12
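When scanning dump output like the above for markers, a small parser helps. This is a sketch under the assumption that every field is a `key: value` pair with no spaces inside the value; `parse_dump_line` and `is_commit_marker` are hypothetical helpers, not part of any Kafka tool.

```python
import re


def parse_dump_line(line):
    """Split 'key: value' pairs from one line of dump output into a dict of strings."""
    return dict(re.findall(r"(\w+): (\S+)", line))


def is_commit_marker(fields):
    """True if the parsed record is an end-of-transaction COMMIT marker."""
    return fields.get("endTxnMarker") == "COMMIT"


record_line = (
    "| offset: 123456 CreateTime: 1691433940165 keySize: 4 valueSize: 6 "
    "sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 12"
)
fields = parse_dump_line(record_line)
```

Values stay strings here; a header list with spaces inside the brackets would need a more careful pattern.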
