Matteo Merli & Sijie Guo
fast, durable, flexible pub/sub messaging
Overview
What is Apache Pulsar?
• Ordering: guaranteed ordering
• Multi-tenancy: a single cluster can support many tenants and use cases
• High throughput: can reach 1.8 M messages/s in a single partition
• Durability: data is replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both topic and queue semantics in a single model
• Delivery guarantees: at least once, at most once and effectively once
• Low latency: publish latency of 5 ms at the 99th percentile
• Highly scalable: can support millions of topics
Usage of Pulsar
• In production for 3+ years at Yahoo
• Powering critical products such as Yahoo Mail, Yahoo Finance, Gemini Ads, Flickr and Sherpa (NoSQL database)
• 80+ tenants
• 2.3 million topics
• 100 billion messages / day
• Full-mesh replication across 8 data centers
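As a rough sanity check, the daily volume quoted above can be converted into average rates. This is back-of-the-envelope arithmetic only: it assumes traffic is spread evenly over the day and across topics, which real workloads are not.

```python
# Back-of-the-envelope rates derived from the figures on the slide.
MESSAGES_PER_DAY = 100e9        # 100 billion messages / day
TOPICS = 2.3e6                  # 2.3 million topics
SECONDS_PER_DAY = 24 * 60 * 60

# Average over the whole deployment, assuming perfectly even traffic.
avg_msgs_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_msgs_per_topic_per_day = MESSAGES_PER_DAY / TOPICS

print(f"average rate: {avg_msgs_per_second:,.0f} messages/s")
print(f"average per topic: {avg_msgs_per_topic_per_day:,.0f} messages/day")
```

Under those assumptions the deployment averages a little over one million messages per second, while a typical topic is lightly loaded. That fits the stated requirement of supporting very many topics rather than a few hot ones.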
Why build a new system?
• No existing solution satisfied the requirements
• Multi-tenancy, 1M topics, low latency, durability, geo-replication
• Kafka doesn't scale well with many topics:
• Storage model based on an individual directory per topic partition
• Enabling durability kills performance
• Many other choke points: getting stats, access to metadata, flow control
• Operations are not very convenient
• e.g. replacing a server requires manual commands to copy the data and involves clients
• Client access to ZooKeeper clusters is not desirable
• Ability to manage large backlogs
• No scalable support for keeping consumer positions
Architecture view
Separate layers for brokers and bookies
• Brokers and bookies can be added independently
• Traffic can be shifted very quickly across brokers
• New bookies ramp up on traffic quickly
Diagram: a producer and a consumer connect to three Apache Pulsar brokers, which store data in Apache BookKeeper (Bookie 1 to Bookie 5)
Apache BookKeeper
• A replicated log storage
• Low-latency durable writes
• Simple repeatable-read consistency
• Highly available
• Store many logs per node
• I/O isolation
Kafka vs. Pulsar Segment Distribution
BookKeeper - Storage
• A single bookie can serve and store thousands of ledgers
• Write and read paths are separated:
• This keeps read activity from impacting write latency
• Writes are added to an in-memory write cache and committed to the journal
• The write cache is flushed in the background to a separate device
• Entries are sorted to allow for mostly sequential reads
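The write path above can be pictured with a small toy model. This is a sketch for illustration only, not BookKeeper code: the class `ToyBookie` and its methods are hypothetical, and plain Python lists stand in for the journal device and the separate storage device.

```python
from collections import defaultdict


class ToyBookie:
    """Toy model of the write path described above (not BookKeeper code)."""

    def __init__(self):
        self.journal = []                    # append-only; stands in for the journal device
        self.write_cache = []                # in-memory entries not yet flushed
        self.entry_log = defaultdict(list)   # stands in for the separate storage device

    def add_entry(self, ledger_id, entry_id, payload):
        # In this toy model a write goes to both the journal and the write cache.
        self.journal.append((ledger_id, entry_id, payload))
        self.write_cache.append((ledger_id, entry_id, payload))

    def flush(self):
        # Background flush: sort by (ledger, entry) so each ledger's entries
        # end up contiguous, which keeps later reads mostly sequential.
        for ledger_id, entry_id, payload in sorted(self.write_cache):
            self.entry_log[ledger_id].append((entry_id, payload))
        self.write_cache.clear()

    def read_ledger(self, ledger_id):
        # In this toy model reads only touch flushed storage, never the journal.
        return [payload for _, payload in self.entry_log[ledger_id]]


bookie = ToyBookie()
# Entries from two ledgers arrive interleaved.
bookie.add_entry(2, 0, "b0")
bookie.add_entry(1, 0, "a0")
bookie.add_entry(2, 1, "b1")
bookie.add_entry(1, 1, "a1")
bookie.flush()
print(bookie.read_ledger(1))  # ['a0', 'a1']
print(bookie.read_ledger(2))  # ['b0', 'b1']
```

The journal preserves arrival order, while the flushed storage groups entries per ledger. A reader draining one ledger therefore scans a contiguous run instead of competing with the sequential journal writes.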
Concepts
Pulsar Concepts
Supports both topic and queue semantics in a unified topic concept
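One way to picture the unified model: every subscription on a topic sees every message (topic semantics), while the consumers inside one subscription can split the messages between them (queue semantics). The sketch below is a toy model under that assumption. `ToyTopic` and `ToySubscription` are hypothetical names, not the Pulsar client API, and round-robin dispatch is just one simple choice.

```python
from itertools import cycle


class ToySubscription:
    """One named subscription; its consumers split the messages between them."""

    def __init__(self, consumers):
        self.received = {name: [] for name in consumers}
        self._next_consumer = cycle(consumers)   # round-robin dispatch

    def deliver(self, message):
        self.received[next(self._next_consumer)].append(message)


class ToyTopic:
    """Every subscription on the topic receives every published message."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name, consumers):
        self.subscriptions[name] = ToySubscription(consumers)
        return self.subscriptions[name]

    def publish(self, message):
        for subscription in self.subscriptions.values():
            subscription.deliver(message)


topic = ToyTopic()
audit = topic.subscribe("audit", ["a1"])              # one consumer: sees everything
workers = topic.subscribe("workers", ["w1", "w2"])    # two consumers: share the load

for i in range(4):
    topic.publish(f"m{i}")

print(audit.received)    # {'a1': ['m0', 'm1', 'm2', 'm3']}
print(workers.received)  # {'w1': ['m0', 'm2'], 'w2': ['m1', 'm3']}
```

Both behaviours come from the same topic; only the way consumers attach to it differs.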
Pulsar Namespace
Pulsar Subscription
Partitioned Topic
Pulsar client library
• Java — C++ — Python — WebSocket APIs
• Partitioned topics
• Apache Kafka compatibility wrapper API
• Transparent batching of messages
• Compression
• TLS encryption and authentication
• End-to-end encryption
Demo time
Lunch Break
Use cases - Message Queue
• Decouple online / background
• Provide high-availability
• Reliable data transport
Diagram: Online events → Pulsar topic 1 (low-latency publish) → Workers 1, 2, 3 (long-running task) → Pulsar topic 2 (notification)
Use cases - Message Queue
• Asynchronous processing of time-intensive background tasks
• Publish the task and complete the HTTP request
• Examples: image / video transcoding, bulk operations (large folder deletions)
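The pattern above can be sketched with the standard library alone. This is a minimal illustration, assuming an in-process `queue.Queue` as a stand-in for a Pulsar topic; `handle_http_request`, the folder path and the `"202 Accepted"` string are hypothetical placeholders.

```python
import queue
import threading

tasks = queue.Queue()   # stands in for a Pulsar topic
results = []


def handle_http_request(folder):
    # Publish the task and return immediately; the slow work happens later.
    tasks.put(folder)
    return "202 Accepted"


def worker():
    while True:
        folder = tasks.get()
        if folder is None:                     # sentinel: no more work
            break
        results.append(f"deleted {folder}")    # placeholder for the long-running task


response = handle_http_request("/photos/2016")
background = threading.Thread(target=worker)
background.start()
tasks.put(None)      # tell the worker to stop once the queue is drained
background.join()

print(response)  # 202 Accepted
print(results)   # ['deleted /photos/2016']
```

The request handler never waits for the deletion itself, so its latency depends only on how quickly the task can be published.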
Use cases - Message Queue
• Delegate the processing to multiple compute jobs which interact with micro-services
• Wait until the response is pushed back to the HTTP server
• Examples: breaking monolithic applications into micro-services
Use cases - Notifications
• Change data capture
• Listeners are frequently different tenants
• Quotas are needed to ensure the producer is not affected
Diagram: Event → Pulsar topic → listeners (Component 1, Component 2, Component 3)
Use cases - Notifications
• Replicating DB transactions across regions
• Notification to multiple interested tenants (example: new user account, …)
Use cases - Feedback system
• Coordinate a large number of machines
• Propagate state
Diagram labels: External inputs, Pulsar topic 1, Serving system (×3), Pulsar topic 2, Controller, Updates, Feedback
Multi-Tenancy
• Authentication / Authorization / Namespaces / Admin APIs
• I/O isolation between writes and reads
• Provided by BookKeeper: ensures readers draining a backlog won't affect publishers
• Soft isolation
• Storage quotas, flow control, back-pressure, rate limiting
• Hardware isolation
• Constrain some tenants to a subset of brokers or bookies
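To make the soft-isolation idea concrete, here is a per-tenant rate limiter sketched as a token bucket. The token bucket is a common, generic technique chosen for illustration; the slides do not say how Pulsar implements rate limiting, and the class, tenant names and limits below are all hypothetical.

```python
class TokenBucket:
    """Allow up to `rate` messages per second, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def try_publish(self, now):
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # the caller should back off (back-pressure)


# One bucket per tenant, so a noisy tenant only throttles itself.
limits = {
    "tenant-a": TokenBucket(rate=2, burst=2),
    "tenant-b": TokenBucket(rate=2, burst=2),
}

# tenant-a attempts 5 publishes at t=0; tenant-b attempts 1.
a_results = [limits["tenant-a"].try_publish(now=0.0) for _ in range(5)]
b_result = limits["tenant-b"].try_publish(now=0.0)

print(a_results)  # [True, True, False, False, False]
print(b_result)   # True
```

Time is passed in explicitly so the example is deterministic. Because each tenant has its own bucket, tenant-a exhausting its quota leaves tenant-b unaffected.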
Demo time
Geo-Replication
• Scalable asynchronous replication
• Integrated in the broker message flow
• Simple configuration to add/remove regions
Diagram: Topic (T1) is present in Data Centers A, B and C, with producers P1, P2 and P3, consumers C1 and C2, and Subscription (S1) shown in two of the data centers
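Asynchronous full-mesh replication can be illustrated with a toy model. This is only a sketch under simplifying assumptions (in-memory lists, an explicit `replicate()` step standing in for the background replicator); `ToyCluster` is a hypothetical name and nothing here reflects Pulsar's actual replication code.

```python
class ToyCluster:
    """One data center holding a local copy of the topic."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.topic = []    # messages stored locally, tagged with their origin
        self.outbox = []   # locally produced messages waiting to be replicated

    def publish(self, message):
        # A local publish completes without waiting for remote clusters.
        self.topic.append((self.name, message))
        self.outbox.append((self.name, message))

    def replicate(self):
        # Asynchronous step: push locally produced messages to every peer.
        # Received messages are never forwarded again, so there are no loops.
        for entry in self.outbox:
            for peer in self.peers:
                peer.topic.append(entry)
        self.outbox.clear()


a, b, c = ToyCluster("A"), ToyCluster("B"), ToyCluster("C")
for cluster in (a, b, c):
    cluster.peers = [p for p in (a, b, c) if p is not cluster]   # full mesh

a.publish("hello from A")
b.publish("hello from B")
before = len(c.topic)          # nothing has reached C yet
for cluster in (a, b, c):
    cluster.replicate()

print(before)            # 0
print(sorted(c.topic))   # [('A', 'hello from A'), ('B', 'hello from B')]
print(sorted(a.topic) == sorted(b.topic) == sorted(c.topic))  # True
```

After the replication step every data center holds the same set of messages, although each may have stored them in a different order.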
Group Exercise
Pulsar — Conclusion
• A fast, durable, distributed pub/sub messaging system
• Flexible traditional messaging: queuing and pub/sub
• Focus on message dispatch and consumption
• Remove data as soon as it is no longer needed
• Backed by a scalable log store
• Durable message store, zero data loss
• Allows rewinding / reprocessing messages for backfill, bootstrapping systems and stream computing
• Carries over all the fantastic features of the log store
Real-Time Solution
Curious to Learn More?
• Apache Pulsar: http://pulsar.incubator.apache.org
• Apache DistributedLog: http://bookkeeper.apache.org/distributedlog
• Apache BookKeeper: http://bookkeeper.apache.org
• Follow us: @apache_pulsar @asfbookkeeper @distributedlog
Curious to Learn More?
• Messaging, Storage, or Both: https://streaml.io/blog/messaging-storage-or-both/
• Why BookKeeper: Consistency, Durability and Availability: https://streaml.io/blog/why-apache-bookkeeper/
• Introduction to Apache Pulsar: https://streaml.io/blog/intro-to-pulsar/
• Kafka vs DistributedLog: https://bookkeeper.apache.org/distributedlog/technical-review/2016/09/19/kafka-vs-distributedlog.html
Curious to learn more about Streamlio?
• Streamlio: https://streaml.io
• Sandbox Preview: https://streaml.io/docs/getting-started
• Learn Slack channel: https://learn-streamlio.slack.com

Hands-on Workshop: Apache Pulsar
