Matteo Merli
fast, durable, flexible pub/sub messaging
Introduction
One-sentence definition of Apache Pulsar:
“Flexible pub-sub system backed by durable log storage”
• Easy-to-use API: supports both queuing and streaming
• Strong storage guarantees: durability, latency, scalability
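To make the queuing/streaming distinction concrete, here is a minimal sketch in plain Python, not the Pulsar client API. It assumes that streaming corresponds to a single consumer receiving every message in order (as with an exclusive subscription) and that queuing corresponds to several consumers splitting the messages (as with a shared subscription). The function names and the strict round-robin policy are illustrative assumptions; a real broker's dispatch also depends on which consumers are ready to receive.

```python
from itertools import cycle


def dispatch_exclusive(messages, consumer):
    """Streaming-style: one consumer sees every message, in order."""
    return {consumer: list(messages)}


def dispatch_shared(messages, consumers):
    """Queue-style: messages are spread across consumers (round-robin here)."""
    assignment = {c: [] for c in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        assignment[consumer].append(message)
    return assignment


if __name__ == "__main__":
    msgs = ["m0", "m1", "m2", "m3"]
    print(dispatch_exclusive(msgs, "reader"))
    print(dispatch_shared(msgs, ["worker-a", "worker-b"]))
```

With the shared policy, per-consumer ordering holds only within each consumer's subset of messages, which is the usual trade-off of queue semantics.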
Pulsar architecture basics
• Brokers: serving nodes
• Bookies (Apache BookKeeper): storage nodes
• Each layer can be scaled independently
• No data locality: data for a single topic/partition is not tied to any particular node
Considerations
• Stateful systems can become unbalanced when traffic changes
• The system must be designed to react quickly, redistributing the load across all nodes
Pulsar broker
• The broker is the only point of interaction for clients
• Brokers acquire ownership of groups of topics and “serve” them
• Brokers have no durable state
• A service discovery mechanism lets clients connect to the right broker
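The ownership and discovery bullets can be sketched as a toy lookup table. This is an illustration under stated assumptions, not Pulsar's implementation: the class `ToyCluster`, the hash-based grouping of topics, and the "least loaded broker wins" rule are all hypothetical choices made for the example.

```python
import hashlib


class ToyCluster:
    """Maps topics to groups, and groups to the broker that currently owns them."""

    def __init__(self, brokers, num_groups=4):
        self.brokers = list(brokers)
        self.num_groups = num_groups
        self.owner = {}  # group id -> broker name; the only state kept here

    def group_of(self, topic):
        # Stable hash so every node computes the same group for a topic.
        digest = hashlib.sha256(topic.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_groups

    def lookup(self, topic):
        """Service discovery: return the broker a client should connect to."""
        group = self.group_of(topic)
        if group not in self.owner:
            # Unowned group: hand it to the broker owning the fewest groups.
            load = {b: 0 for b in self.brokers}
            for broker in self.owner.values():
                load[broker] += 1
            self.owner[group] = min(self.brokers, key=lambda b: load[b])
        return self.owner[group]


if __name__ == "__main__":
    cluster = ToyCluster(["broker-1", "broker-2", "broker-3"])
    for name in ("orders", "payments", "clicks"):
        topic = f"persistent://tenant/ns/{name}"
        print(topic, "->", cluster.lookup(topic))
```

Because the broker holds no durable state, the ownership map is the only thing that has to change when a group of topics moves.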
Segment centric storage
• Storage for a topic is an infinite “stream” of messages
• Implemented as a sequence of segments
• Each segment is a replicated log: a BookKeeper “ledger”
• Segments are rolled over based on time and size, and after crashes
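The rollover rules above can be sketched as a small in-memory log. This is a simplified model, assuming a byte-size limit and an age limit per segment; the names `SegmentedLog`, `max_bytes`, and `max_age_seconds` are hypothetical, and replication is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    segment_id: int
    created_at: float
    entries: list = field(default_factory=list)
    closed: bool = False

    def size(self):
        return sum(len(entry) for entry in self.entries)


class SegmentedLog:
    """A topic's storage: an append-only sequence of bounded segments."""

    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.segments = []

    def _roll(self, now):
        # Seal the current segment (if any) and open a fresh one.
        if self.segments:
            self.segments[-1].closed = True
        self.segments.append(Segment(len(self.segments), now))

    def append(self, payload, now):
        """Append bytes; returns the position as (segment_id, entry_index)."""
        current = self.segments[-1] if self.segments else None
        if (current is None
                or current.closed
                or current.size() + len(payload) > self.max_bytes
                or now - current.created_at >= self.max_age_seconds):
            self._roll(now)
        segment = self.segments[-1]
        segment.entries.append(payload)
        return segment.segment_id, len(segment.entries) - 1

    def recover_after_crash(self):
        # After a crash the open segment is sealed; the next append rolls over.
        if self.segments:
            self.segments[-1].closed = True


if __name__ == "__main__":
    log = SegmentedLog(max_bytes=10, max_age_seconds=60)
    print(log.append(b"aaaa", now=0))    # first segment
    print(log.append(b"cccccccc", now=1))  # size limit forces a rollover
```

Sealed segments are never written again, so each one can be placed and re-replicated independently of the rest of the topic.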
Broker failure recovery
• The topic is reassigned to an available broker based on load
• The new broker can reconstruct the previous state consistently
• No data needs to be copied
• Failover is handled transparently by the client library
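A minimal sketch of load-based reassignment follows, assuming that "load" is simply the number of topics a broker owns. The function `reassign_after_failure` is hypothetical and stands in for whatever load metric the real system uses. Only the ownership map changes; no message data is touched, since the data lives on the storage nodes.

```python
def reassign_after_failure(ownership, failed, live_brokers):
    """Return a new topic -> broker map with the failed broker's topics moved.

    `ownership` maps topic name to broker name; `live_brokers` must list
    every surviving broker.
    """
    load = {broker: 0 for broker in live_brokers}
    for broker in ownership.values():
        if broker != failed:
            load[broker] += 1

    new_ownership = dict(ownership)
    orphaned = sorted(t for t, b in ownership.items() if b == failed)
    for topic in orphaned:
        # Pick the least loaded survivor; ties go to the first one listed.
        target = min(live_brokers, key=lambda b: load[b])
        new_ownership[topic] = target
        load[target] += 1
    return new_ownership


if __name__ == "__main__":
    before = {"t1": "b1", "t2": "b1", "t3": "b2", "t4": "b3"}
    print(reassign_after_failure(before, failed="b1", live_brokers=["b2", "b3"]))
```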
Bookie failure recovery
• After a write failure, BookKeeper immediately switches writes to a new bookie, within the same segment
• As long as any 3 bookies in the cluster are available, writes can continue
• In the background, BookKeeper starts a many-to-many recovery process to restore the configured replication factor
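Both steps of bookie failure handling can be sketched with a toy ledger, assuming a segment is recorded as a list of fragments, each with the entry id at which it starts and the set of bookies (the ensemble) storing it. `ToyLedger` and `plan_recovery` are hypothetical names, and the source/target selection is a simple rotation chosen for illustration rather than the real scheduling logic.

```python
class ToyLedger:
    """One segment: a list of fragments, each (first_entry_id, ensemble)."""

    def __init__(self, ensemble):
        self.fragments = [(0, list(ensemble))]
        self.next_entry_id = 0

    def write(self, live_bookies):
        """Append one entry, swapping out dead bookies first if needed."""
        _, ensemble = self.fragments[-1]
        dead = [b for b in ensemble if b not in live_bookies]
        if dead:
            spares = [b for b in sorted(live_bookies) if b not in ensemble]
            if len(spares) < len(dead):
                raise RuntimeError("not enough live bookies to keep writing")
            swap = dict(zip(dead, spares))
            ensemble = [swap.get(b, b) for b in ensemble]
            # Same ledger, new fragment starting at the next entry id.
            self.fragments.append((self.next_entry_id, ensemble))
        entry_id = self.next_entry_id
        self.next_entry_id += 1
        return entry_id, ensemble


def plan_recovery(fragments, failed, live_bookies):
    """List (fragment_index, source, target) copies that restore replication."""
    plan = []
    for index, (_, ensemble) in enumerate(fragments):
        if failed not in ensemble:
            continue
        sources = [b for b in ensemble if b in live_bookies]
        targets = [b for b in sorted(live_bookies) if b not in ensemble]
        if not sources or not targets:
            continue  # nothing to copy from, or nowhere to copy to
        n = len(plan)
        # Rotate sources and targets so the copy work is spread out.
        plan.append((index, sources[n % len(sources)], targets[n % len(targets)]))
    return plan


if __name__ == "__main__":
    bookies = {"b1", "b2", "b3", "b4", "b5"}
    ledger = ToyLedger(["b1", "b2", "b3"])
    ledger.write(bookies)
    live = bookies - {"b2"}            # b2 fails
    print(ledger.write(live))          # write continues in the same ledger
    print(plan_recovery(ledger.fragments, "b2", live))
```

Writes continue as soon as the ensemble is swapped, while the older fragment is repaired separately. Because different fragments can use different sources and targets, the repair load is spread across many nodes.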
Seamless cluster expansion
Why should I care?
“Segment centric” vs “Partition centric”
Comparison with Apache Kafka
• In Kafka, partitions are assigned to brokers “permanently”
• A single partition is stored entirely on a single node
• Retention is limited by the storage capacity of a single node
• Failure recovery and capacity expansion require “rebalancing”
• Rebalancing has a big impact on the system, affecting regular traffic
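The retention point can be illustrated with back-of-the-envelope arithmetic. This is an idealized sketch that assumes one topic has the cluster to itself and ignores per-segment overhead. Both function names are hypothetical.

```python
def max_bytes_partition_centric(replica_node_capacities):
    """Each replica holds the whole partition, so the smallest node bounds it."""
    return min(replica_node_capacities)


def max_bytes_segment_centric(all_node_capacities, replication_factor):
    """Segments spread over every node; only total capacity / copies bounds it."""
    return sum(all_node_capacities) // replication_factor


if __name__ == "__main__":
    terabyte = 10**12
    nodes = [4 * terabyte] * 4
    # Partition replicated on two of the four nodes.
    print(max_bytes_partition_centric(nodes[:2]) // terabyte, "TB")
    # Segments replicated twice across all four nodes.
    print(max_bytes_segment_centric(nodes, replication_factor=2) // terabyte, "TB")
```

In this model, adding a node raises the segment-centric bound immediately, while the partition-centric bound stays at the size of a single node.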
Recap
Advantages of segment-centric architecture:
• Unbounded log storage
• Instant scaling without data rebalancing
• Fast replica repair
• High write and read availability via maximized data placement options
Q & A
Thank You
http://pulsar.incubator.apache.org

Apache pulsar - storage architecture
