Matteo Merli
Guaranteed “effectively-once” messaging semantics
What is Apache Pulsar?
• Distributed pub/sub messaging
• Backed by a scalable log store — Apache BookKeeper
• Streaming & Queuing
• Low latency
• Multi-tenant
• Geo-Replication
2
Architecture view
• Separate layers between brokers and bookies
• Brokers and bookies can be added independently
• Traffic can be shifted very quickly across brokers
• New bookies will ramp up on traffic quickly
3
Pulsar Broker 1 Pulsar Broker 2 Pulsar Broker 3
Bookie 1 Bookie 2 Bookie 3 Bookie 4 Bookie 5
Apache BookKeeper
Apache Pulsar
Producer Consumer
Messaging model
4
Messaging semantics
At most once
At least once
Exactly once
5
“Exactly once”
• There is no agreement in the industry on what it really means
• Every vendor has claimed exactly once at some point
• Many caveats… “only if there are no crashes…”
• No formal definition of exactly once — unlike “consensus” or “atomic broadcast”
6
“Effectively once”
• Identify and discard duplicated messages with 100% accuracy
• In the presence of any kind of failure
• Messages can be received and processed more than once
• …but effects on the resulting state will be observed only once
7
What can fail?
8
What can fail?
9
What can fail?
10
What can fail?
11
What can fail? — Geo-Replication
12
Breaking down the problem
1. Store the message once — “producer idempotency”
2. Allow applications to “process data only once”
13
Idempotent producer
• The Pulsar broker detects and discards messages that are being retransmitted
• It works when a broker crashes and the topic is reassigned
• It works when a producer application crashes
14
Identifying producers
• Use “sequence ids” to detect retransmissions
• Each producer on a topic has its own sequence of messages
• Use the “producer-name” to identify producers
15
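The check described above can be sketched as an in-memory map from producer-name to the highest sequence id seen so far; anything at or below that watermark is a retransmission. This is a minimal, hypothetical illustration of the idea — the class and method names are invented here and are not Pulsar's actual broker internals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-producer dedup by highest sequence id seen.
class SequenceIdDeduplicator {
    private final Map<String, Long> highestSequenceIds = new HashMap<>();

    /** Returns true if the message is new, false if it is a retransmission. */
    boolean shouldPersist(String producerName, long sequenceId) {
        Long highest = highestSequenceIds.get(producerName);
        if (highest != null && sequenceId <= highest) {
            return false; // duplicate: already stored
        }
        highestSequenceIds.put(producerName, sequenceId);
        return true;
    }
}
```

Because each producer has its own monotonic sequence, two longs per producer are enough to decide “new or duplicate” in O(1).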
Detecting duplicates
16
Detecting duplicates
17
Detecting duplicates
18
Detecting duplicates
19
Sequence Id snapshot
20
Sequence Id snapshot
21
Sequence Id snapshot
• Snapshots are taken every N entries to limit recovery time
• Snapshot & cursor updates are atomic
• Cursor updates are stored in BookKeeper — durable & replicated
• On recovery
• Load the snapshot from the cursor
• Replay the entries from the cursor position
22
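The recovery steps above can be sketched as: start from the sequence-id map stored with the cursor, then replay the entries published after the cursor position, keeping the maximum sequence id per producer. This is a simplified stand-in with invented names, not Pulsar's actual recovery code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of snapshot-based recovery of the dedup state.
class DedupSnapshotRecovery {
    /** One entry in the topic's log: (producer-name, sequence id). */
    record LogEntry(String producerName, long sequenceId) {}

    static Map<String, Long> recover(Map<String, Long> snapshot,
                                     List<LogEntry> entriesAfterCursor) {
        Map<String, Long> state = new HashMap<>(snapshot);
        for (LogEntry e : entriesAfterCursor) {
            // Keep the highest sequence id seen for each producer
            state.merge(e.producerName(), e.sequenceId(), Math::max);
        }
        return state;
    }
}
```

Taking snapshots every N entries bounds the replay work at recovery time to at most N entries per topic.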
What if the producer application crashes?
• Pulsar needs to identify the new producer as being the same “logical” producer as before
• In practice, this is only useful if you have a “replayable” source (e.g. a file, a stream, …)
23
Resuming a producer session
ProducerConfiguration conf = new ProducerConfiguration();
conf.setProducerName("my-producer-name");

// Disable the send timeout so pending messages are retried until acknowledged
conf.setSendTimeout(0, TimeUnit.SECONDS);

Producer producer = client.createProducer(MY_TOPIC, conf);

// Get the last committed sequence id before the crash
long lastSequenceId = producer.getLastSequenceId();
24
Using sequence Ids
// Fictitious record reader class
RecordReader source = new RecordReader("/my/file/path");

// Resume from the last offset the broker has durably committed
long fileOffset = producer.getLastSequenceId();
source.seekToOffset(fileOffset);

while (source.hasNext()) {
    long currentOffset = source.currentOffset();
    Message msg = MessageBuilder.create()
            .setSequenceId(currentOffset)
            .setContent(source.next())
            .build();
    producer.send(msg);
}
25
Consuming messages only once
• The Pulsar Consumer API is very convenient
• Managed subscription — tracking of individual messages
Consumer consumer = client.subscribe(MY_TOPIC, MY_SUBSCRIPTION_NAME);

while (true) {
    Message msg = consumer.receive();
    // Process the message...
    consumer.acknowledge(msg);
}
26
Effectively-once with Consumer
• The Consumer is very simple but doesn’t allow a large degree of control
• Processing and acknowledgment are not atomic
• To achieve “effectively once” we need to rely on an external system to deduplicate the processing results, e.g.:
• RDBMS — keep the message id as a column with a “unique” index
• Critical write to update the state — compareAndSet() or similar
27
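The compareAndSet() pattern mentioned above can be sketched as: store the last processed message id together with the state, and only apply an update if the stored id differs from the incoming one. The class below is a hypothetical in-process stand-in — a real deployment would use a database row or a key/value store with conditional writes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: state + last message id updated atomically via CAS.
class DedupStateStore {
    record State(String lastMessageId, long counter) {}

    private final AtomicReference<State> ref =
            new AtomicReference<>(new State("", 0L));

    /** Applies the message's effect once; duplicate deliveries are skipped. */
    boolean applyOnce(String messageId) {
        while (true) {
            State current = ref.get();
            if (current.lastMessageId().equals(messageId)) {
                return false; // duplicate delivery: effect already applied
            }
            State next = new State(messageId, current.counter() + 1);
            if (ref.compareAndSet(current, next)) {
                return true;
            }
            // Lost a race with another updater: re-read and retry
        }
    }

    long counter() { return ref.get().counter(); }
}
```

The message id and the state change land in one atomic write, which is what makes the effect observable only once even if the message is redelivered.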
Pulsar Reader
• The Reader is a low-level API to receive data from a Pulsar topic
• There is no managed subscription
• The application always specifies the message id from which it wants to start reading
28
Reader example
MessageId lastMessageId = recoverLastMessageIdFromDB();
Reader reader = client.createReader(MY_TOPIC, lastMessageId,
        new ReaderConfiguration());

while (true) {
    Message msg = reader.readNext();
    byte[] msgId = msg.getMessageId().toByteArray();
    // Process the message and store msgId atomically
}
29
Example — Pulsar Functions
30
Pulsar Functions
• A function gets messages from one or more topics
• An instance of the function is invoked to process each event
• The output of the function is published to one or more topics
• Super simple to use — no SDK required — Python example:
def process(input):
    return input + '!'
31
Pulsar Functions
32
Effectively once with functions
• Use the message id from the source topic as the sequence id for the sink topic
• Works with the “Consumer” API
• When consuming from multiple topics or partitions, create one producer per source topic/partition to ensure monotonic sequence ids
33
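The per-partition rule above can be sketched as a map from source topic/partition to a dedicated sink producer, so each producer sees its own monotonically increasing sequence ids. Producer creation is abstracted behind a factory here because this sketch has no live Pulsar cluster behind it; the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one sink producer per source topic/partition.
class PerPartitionProducers<P> {
    private final Map<String, P> producers = new HashMap<>();
    private final Function<String, P> factory;

    PerPartitionProducers(Function<String, P> factory) {
        this.factory = factory;
    }

    /** Returns the dedicated producer for the given source partition. */
    P producerFor(String sourcePartition) {
        return producers.computeIfAbsent(sourcePartition, factory);
    }

    int size() { return producers.size(); }
}
```

Interleaving message ids from different source partitions through a single producer would break monotonicity, which is why each partition gets its own producer.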
Performance
• The Pulsar approach guarantees deduplication in all failure scenarios
• Overhead is minimal: 2 in-memory hashmap updates
• No reduction in throughput — no increased latency
• Controllable increase in recovery time
34
Performance — Benchmark
OpenMessaging Benchmark
1 Topic / 1 Partition
1 Partition / 1 Consumer
1 KB messages
35
Difference with Kafka approach
36
• Producer idempotency — Kafka: best-effort (in memory only); Pulsar: guaranteed after crash
• Transactions — Kafka: 2-phase commit; Pulsar: no transactions
• Dedup across producer sessions — Kafka: no; Pulsar: yes
• Dedup with geo-replication — Kafka: no; Pulsar: yes
• Throughput — Kafka: lower (1 in-flight message/batch for ordering); Pulsar: equal
Curious to Learn More?
• Apache Pulsar — https://pulsar.incubator.apache.org
• Follow Us — @apache_pulsar
• Streamlio blog — https://streaml.io/blog
37