© 2019 SPLUNK INC.
Apache Pulsar: The Next Generation Messaging and Queuing System
Intro
Matteo Merli: Senior Principal Engineer, Splunk; co-creator of Apache Pulsar
Karthik Ramasamy: Senior Director of Engineering, Splunk
Messaging and Streaming
Messaging
Message passing between components, applications, and services
Streaming
Analyze events that just happened
Messaging vs Streaming
2 worlds, 1 infra
Use cases
Messaging
● OLTP, Integration
● Main challenges:
○ Latency
○ Availability
○ Data durability
○ High level features
■ Routing, DLQ, delays, individual acks
Streaming
● Real-time analytics
● Main challenges:
○ Throughput
○ Ordering
○ Stateful processing
○ Batch + Real-Time
Storage
Messaging
Compute
Apache Pulsar
● Durability: Data replicated and synced to disk
● Low Latency: Low publish latency of 5 ms at the 99th percentile
● High Throughput: Can reach 1.8 M messages/s in a single partition
● High Availability: System is available if any 2 nodes are up
● Cloud Native: Take advantage of dynamic cluster scaling in cloud environments
Flexible Pub-Sub and Compute backed by durable log storage
Apache Pulsar
● Unified messaging model: Supports both topic and queue semantics in a single model
● Highly Scalable: Can support millions of topics
● Native Compute: Lightweight compute framework based on functions
● Multi Tenant: Supports multiple users and workloads in a single cluster
● Geo Replication: Out-of-the-box support for geographically distributed applications
Flexible Pub-Sub and Compute backed by durable log storage
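The unified model can be pictured with a small sketch in plain Python (not the Pulsar client): every subscription on a topic receives each message, which gives topic semantics, while consumers attached to the same subscription split the messages between them, which gives queue semantics. The `Topic` and `Subscription` classes and the round-robin dispatch policy are illustrative assumptions, not Pulsar's implementation.

```python
from collections import defaultdict
from itertools import count


class Subscription:
    """A named cursor on a topic; its consumers share the messages."""

    def __init__(self, name):
        self.name = name
        self.consumers = []                 # consumer names attached here
        self.delivered = defaultdict(list)  # consumer name -> messages received
        self._next = count()                # round-robin counter (assumed policy)

    def dispatch(self, message):
        if not self.consumers:
            return
        target = self.consumers[next(self._next) % len(self.consumers)]
        self.delivered[target].append(message)


class Topic:
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, subscription_name, consumer_name):
        sub = self.subscriptions.setdefault(
            subscription_name, Subscription(subscription_name))
        sub.consumers.append(consumer_name)
        return sub

    def publish(self, message):
        # Every subscription sees every message.
        for sub in self.subscriptions.values():
            sub.dispatch(message)


topic = Topic()
audit = topic.subscribe("audit", "a1")       # one consumer: topic-style fan-out
workers = topic.subscribe("workers", "w1")   # two consumers: queue-style sharing
topic.subscribe("workers", "w2")

for i in range(4):
    topic.publish(f"m{i}")
```

Here the `audit` subscription receives all four messages, while `w1` and `w2` each receive half of them through the shared `workers` subscription.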
Apache Pulsar project in numbers
● 192 Contributors
● 30 Committers
● 100s of Adopters
● 4.6K GitHub Stars
Sample of Pulsar users and contributors
Messaging Model
Pulsar Client libraries
● Java — C++ — C — Python — Go — NodeJS — WebSocket APIs
● Partitioned topics
● Apache Kafka compatibility wrapper API
● Transparent batching and compression
● TLS encryption and authentication
● End-to-end encryption
Architectural view
Separate layers for brokers and bookies
● Brokers and bookies can be added independently
● Traffic can be shifted very quickly across brokers
● New bookies will ramp up on traffic quickly
Apache BookKeeper
Replicated log storage
● Low-latency durable writes
● Simple repeatable read consistency
● Highly available
● Store many logs per node
● I/O Isolation
Inside BookKeeper
Storage optimized for sequential & immutable data
● IO isolation between write and read operations
● Does not rely on OS page cache
● Slow consumers won’t impact latency
● Very effective IO patterns:
○ Journal: append only and no reads
○ Storage device: bulk write and sequential reads
● Number of files is independent from number of topics
Segment Centric Storage
In addition to partitioning, messages are stored in segments (based on time and size)
Segments are independent of each other and spread across all storage nodes
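A toy sketch of the idea, assuming size-based rollover only and a simple rotating placement: the helpers `roll_segments` and `place_segments`, the node names, and the replica count are hypothetical, and BookKeeper's actual placement policy is more involved.

```python
def roll_segments(messages, max_per_segment):
    """Cut an ordered message list into segments of at most max_per_segment entries."""
    return [messages[i:i + max_per_segment]
            for i in range(0, len(messages), max_per_segment)]


def place_segments(num_segments, nodes, replicas):
    """Assign each segment to `replicas` nodes, rotating the start node per segment."""
    placement = []
    for seg in range(num_segments):
        placement.append([nodes[(seg + r) % len(nodes)] for r in range(replicas)])
    return placement


messages = [f"m{i}" for i in range(10)]
segments = roll_segments(messages, max_per_segment=4)
placement = place_segments(
    len(segments), ["bookie-1", "bookie-2", "bookie-3", "bookie-4"], replicas=2)
```

Because each segment is placed independently, a newly added storage node can start receiving new segments right away without any existing data being moved.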
Segments vs Partitions
Tiered Storage
Unlimited topic storage capacity
Achieves the true “stream-storage”: keep the raw data forever in stream form
Extremely cost effective
Schema Registry
Store information on the data structure — Stored in BookKeeper
Enforce data types on topic
Allow for compatible schema evolutions
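Schema Registry
Type Safe API
● Integrated schema in API
● End-to-end type safety, enforced in the Pulsar broker
    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.Schema;

    // Assumes an existing PulsarClient named `client` and a POJO class `MyClass`.
    Producer<MyClass> producer = client
            .newProducer(Schema.JSON(MyClass.class))
            .topic("my-topic")
            .create();
    // The payload is serialized as JSON according to the MyClass schema.
    producer.send(new MyClass(1, 2));

    Consumer<MyClass> consumer = client
            .newConsumer(Schema.JSON(MyClass.class))
            .topic("my-topic")
            .subscriptionName("my-subscription")
            .subscribe();
    // receive() blocks until a message arrives; msg.getValue() returns a typed MyClass.
    Message<MyClass> msg = consumer.receive();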
Geo Replication
Scalable asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
Replicated Subscriptions
Migrate subscriptions across geo-replicated clusters
● Consumption restarts close to where a consumer left off, with a small number of duplicates
● Implementation
○ Use markers injected into the data flow
○ Create a consistent snapshot of message IDs across clusters
○ Establish a relationship: if a consumer has consumed MA-1 in Cluster-A, it must have consumed MB-2 in Cluster-B
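The snapshot relationship can be sketched as a lookup table, assuming positions are plain integers and snapshots are kept sorted by their Cluster-A position. The `snapshots` data and the `position_in_b` helper are hypothetical; the real mechanism works on Pulsar message IDs and in-stream markers.

```python
from bisect import bisect_right

# Hypothetical snapshots: each pair says "a consumer that has consumed position
# `a` in Cluster-A has also consumed position `b` in Cluster-B".
snapshots = [(10, 7), (20, 18), (30, 29)]   # sorted by Cluster-A position


def position_in_b(acked_in_a, snapshots):
    """Return the Cluster-B position implied by the newest snapshot at or before acked_in_a."""
    a_positions = [a for a, _ in snapshots]
    idx = bisect_right(a_positions, acked_in_a) - 1
    if idx < 0:
        return None            # no snapshot covers this position yet
    return snapshots[idx][1]
```

A consumer that acknowledged position 25 in Cluster-A maps back to the snapshot taken at 20, so after failover it resumes from position 18 in Cluster-B. Anything consumed between that snapshot and the failover is delivered again, which is where the small number of duplicates comes from.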
Multi-Tenancy
A single Pulsar cluster supports multiple users and mixed workloads
● Authentication / Authorization / Namespaces / Admin APIs
● I/O isolation between writes and reads
○ Provided by BookKeeper
○ Ensures readers draining backlog won’t affect publishers
● Soft isolation
○ Storage quotas, flow-control, back-pressure, rate limiting
● Hardware isolation
○ Constrain some tenants on a subset of brokers or bookies
Lightweight Compute with Pulsar Functions
Pulsar Functions
Pulsar Functions
● User supplied compute against a consumed message
○ ETL, data enrichment, filtering, routing
● Simplest possible API
○ Use language specific “function” notation
○ No SDK required
○ SDK available for more advanced features (state, metrics, logging, …)
● Language agnostic
○ Java, Python and Go
○ Easy to support more languages
● Pluggable runtime
○ Managed or manual deployment
○ Run as threads, processes or containers in Kubernetes
Pulsar Functions
Examples
Python
    def process(input):
        return input + '!'
Java
    import java.util.function.Function;

    public class ExclamationFunction
            implements Function<String, String> {
        @Override
        public String apply(String input) {
            return input + "!";
        }
    }
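Because a function written in native “function” notation has no SDK dependency, it can be exercised locally without a broker. A minimal sketch, using a plain Python list as a stand-in for the input topic (the sample messages are made up for illustration):

```python
def process(input):
    # Same logic as the Python example on the slide.
    return input + '!'


# Apply the function to a batch of messages as a local stand-in for a topic.
outputs = [process(m) for m in ["hello", "pulsar"]]
```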
Pulsar Functions
State management
● Functions can store state in stream storage
● State is global and replicated
● Multiple instances of the same function can access the same state
● Functions framework provides simple abstraction over state
Pulsar Functions
State management
● Implemented on top of Apache BookKeeper “Table Service”
● BookKeeper provides a sharded key/value store based on:
○ Log & Snapshot, stored as BookKeeper ledgers
○ Warm replicas that can be quickly promoted to leader
● In case of leader failure there is no downtime or huge log to replay
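Pulsar Functions
State example
    import org.apache.pulsar.functions.api.Context;
    import org.apache.pulsar.functions.api.PulsarFunction;

    // Counts occurrences of each dot-separated word in the input.
    // Current Pulsar releases expose this interface as org.apache.pulsar.functions.api.Function.
    public class CounterFunction
            implements PulsarFunction<String, Void> {
        @Override
        public Void process(String input, Context context) {
            // String.split takes a regex, so the dot must be escaped;
            // an unescaped "." matches every character and yields no words.
            for (String word : input.split("\\.")) {
                context.incrCounter(word, 1);
            }
            return null;
        }
    }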
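The same counting logic can be tried out locally with an in-memory stand-in for the context. This is a sketch under stated assumptions: `InMemoryContext` and its `incr_counter` / `get_counter` methods are hypothetical helpers written for this example, whereas in a deployed function the counters would live in the BookKeeper-backed state store.

```python
from collections import Counter


class InMemoryContext:
    """Hypothetical stand-in for the Functions context; keeps counters in memory."""

    def __init__(self):
        self._counters = Counter()

    def incr_counter(self, key, amount):
        self._counters[key] += amount

    def get_counter(self, key):
        return self._counters[key]


def count_words(input, context):
    # Split on literal dots, as the Java example intends, skipping empty tokens.
    for word in input.split("."):
        if word:
            context.incr_counter(word, 1)


ctx = InMemoryContext()
count_words("a.b.a", ctx)
count_words("b.a", ctx)
```

Both calls update the same context, mirroring how multiple invocations (or instances) of a function share one global state.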
Pulsar IO
Connectors Framework based on Pulsar Functions
Built-in Pulsar IO connectors
Querying data stored in Pulsar
Pulsar SQL
● Uses Presto for interactive SQL queries over data stored in Pulsar
● Query historic and real-time data
● Integrated with schema registry
● Can join with data from other sources
Pulsar SQL
● Read data directly from BookKeeper into Presto, bypassing the Pulsar broker
● Many-to-many data reads
○ Data is split even on a single partition: multiple workers can read data in parallel from a single Pulsar partition
● Time based indexing: use “publishTime” in predicates to reduce data being read from disk
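The effect of a publish-time predicate can be sketched as segment pruning: only segments whose publish-time range overlaps the queried window need to be read. The `Segment` record, its fields, and the `segments_to_read` helper are illustrative assumptions, not the Presto connector's actual data structures.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    first_publish_time: int   # epoch millis of the first entry (illustrative)
    last_publish_time: int    # epoch millis of the last entry


def segments_to_read(segments, start, end):
    """Keep only segments whose publish-time range overlaps [start, end]."""
    return [s.name for s in segments
            if s.last_publish_time >= start and s.first_publish_time <= end]


segments = [
    Segment("seg-0", 0, 999),
    Segment("seg-1", 1000, 1999),
    Segment("seg-2", 2000, 2999),
]
selected = segments_to_read(segments, start=1500, end=2100)
```

With this window, `seg-0` is skipped entirely, and the two remaining segments could be handed to different workers to be read in parallel.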
Pulsar Storage API
● Work in progress to allow direct access to data stored in Pulsar
● Generalization of the work done for Presto connector
● Most efficient way to retrieve and process data from “batch” execution engines
Thank You
