Scalable Stream Processing with Apache Samza
Prateek Maheshwari
Apache Samza PMC
Agenda
● Stream Processing at LinkedIn
○ Scale at LinkedIn
○ Scenarios at LinkedIn
● Apache Samza
○ Processing Model
○ Stateful Processing
○ Processing APIs
○ Deployment Model
Scale at LinkedIn
Apache Kafka
● 5 Trillion+ messages ingested per day
● 1.5+ PB data per day
● 100k+ topics, 5M+ partitions
Brooklin
● 2 Trillion+ messages moved per day
● 10k+ topics mirrored
● 2k+ change capture streams
Apache Samza
● 1.5 Trillion+ messages processed per day
● 3k+ jobs in production
● 500 TB+ local state
Scenarios at LinkedIn
● Security: DDoS prevention, bot detection, access monitoring
● Notifications: Email and push notifications
● Classification: Topic tagging, NER in news articles, image classification
● Site Speed: Site speed and health monitoring
● Call Graphs: Monitoring inter-service dependencies and SLAs
Scenarios at LinkedIn
● Ad CTR Tracking: Tracking ad views and clicks
● Business Metrics: Pre-aggregated real-time counts by dimensions
● Profile Standardization: Standardizing titles, companies, education
● Index Updates: Updating search indices with new data
● Activity Tracking: Tracking member page views, dwell-time, sessions
Apache Samza
● Hardened at Scale
○ In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMware, Redfin, etc.
○ Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc.
● Best In Class Stateful Processing
○ Incremental checkpoints for large local state and instant recovery.
○ Local state that works seamlessly across upgrades and failures.
○ APIs for simple and efficient remote I/O.
● Universal Processing APIs
○ Stream and batch processing without changing code.
○ Convenient high-level DSLs and a powerful low-level API.
● Flexible Deployment Model
○ Write once, run anywhere.
○ Run on a multi-tenant cluster or as an embedded library.
Processing Model
[Diagram: a Samza application with Task-1, Task-2, and Task-3 running in Container-1 and Container-2, with heartbeats to the Job Coordinator. Sources shown: Brooklin, Hadoop, Kafka. Sinks shown: Kafka, Hadoop, Serving Stores (e.g. Espresso, Venice, Pinot), Elasticsearch.]
Scaling a Samza Application
● Parallelism across tasks by increasing the number of containers.
○ Up to 1 container per task.
● Parallelism across partitions by increasing the number of tasks.
○ Up to 1 task per partition.
● Parallelism within a partition for out-of-order processing.
○ Any number of threads.
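As a rough illustration of the scaling limits on this slide, the sketch below creates one task per partition and spreads tasks over containers. It is not Samza's actual grouping logic: the function names and the round-robin policy are assumptions made for this example.

```python
from collections import defaultdict


def assign_tasks(num_partitions):
    """One task per input partition (the upper bound on task parallelism)."""
    return {p: f"Task-{p}" for p in range(num_partitions)}


def assign_containers(tasks, num_containers):
    """Spread tasks round-robin over containers (hypothetical policy)."""
    # Extra containers beyond one per task would sit idle, so cap the count.
    effective = min(num_containers, len(tasks))
    containers = defaultdict(list)
    for i, task in enumerate(tasks):
        containers[f"Container-{i % effective}"].append(task)
    return dict(containers)


tasks = list(assign_tasks(num_partitions=4).values())
layout = assign_containers(tasks, num_containers=2)
```

With 4 partitions and 2 containers, each container runs two tasks. Asking for 10 containers still yields only 4, one per task.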
Why State Matters
• State is used for performing lookups and joins, caching data, buffering/batching data, and writing computed results.
• State can be local (in-memory or on disk) or remote.
[Diagram: Samza with local store I/O and Samza with remote DB I/O]
Why Local State Matters: Throughput
• On disk with caching is comparable with in-memory.
• Changelog adds minimal overhead.
• Remote state is 30-150x worse than local state.
Terminology
• Disk Type: SSD
• Max-Net: Max network bandwidth
• CLog: Kafka changelog
• ReadOnly: read-only workloads (lookups)
• ReadWrite: read-write workloads (counts)
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Why Local State Matters: Latency
• On disk with caching is comparable with in-memory.
• Changelog adds minimal overhead.
• Remote state is more than 2 orders of magnitude slower than local state.
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Optimizations for Local State
1. Log state changes to a Kafka compacted topic for durability.
2. Catch up on only the delta from the changelog topic on restart.
[Diagram: Task-1 in Container-1 and Task-2 in Container-2; the Samza Application Master keeps a durable Container ID – host mapping]
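The two steps above can be sketched with an in-memory stand-in for the changelog. This is only an illustration under simplifying assumptions: `Changelog` and `LocalStore` are hypothetical classes, not Samza APIs, and a Python list plays the role of the compacted Kafka topic.

```python
class Changelog:
    """Stand-in for a compacted Kafka topic: an append-only list of (key, value)."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self.entries[offset:]


class LocalStore:
    def __init__(self, changelog, data=None, next_offset=0):
        self.changelog = changelog
        self.data = dict(data or {})
        self.next_offset = next_offset  # first changelog offset not yet in `data`

    def put(self, key, value):
        # Step 1: every state change is also logged for durability.
        self.data[key] = value
        self.next_offset = self.changelog.append(key, value) + 1

    def restore(self):
        # Step 2: replay only the entries missing from the local copy.
        delta = self.changelog.read_from(self.next_offset)
        for key, value in delta:
            self.data[key] = value
        self.next_offset += len(delta)
        return len(delta)


log = Changelog()
store = LocalStore(log)
store.put("member-1", 3)
store.put("member-2", 1)
snapshot, offset = dict(store.data), store.next_offset  # local copy left on disk
store.put("member-1", 4)  # logged, but the local copy above is one write behind

# Restart on the same host: reuse the surviving copy and replay the delta.
warm = LocalStore(log, snapshot, offset)
replayed_warm = warm.restore()

# Restart on a new host: nothing local, so replay the whole changelog.
cold = LocalStore(log)
replayed_cold = cold.restore()
```

Both restarts end with the same state, but the warm restart replays one entry while the cold restart replays three. That gap is why keeping a container on its previous host matters for large state.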
Optimizations for Local State
1. Host Affinity
2. Parallel Recovery
3. Bulk Load Mode
4. Standby Containers
5. Log Compaction
Why Remote I/O Matters
• Data is only available in the remote store (no change capture).
• Need strong consistency or transactions.
• Data cannot be partitioned but is too large to copy to every container.
• Writing processed results for online serving.
• Calling other services to handle complex business logic.
Optimizations for Remote I/O: Table API
• Async Requests
• Rate Limiting
• Batching
• Caching
• Retries
• Stream Table Joins
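To make a few of these optimizations concrete, here is a minimal sketch of a read-through cache with retries wrapped around a remote lookup, plus a simple stream-table join. It does not use the Samza Table API: `CachingRetryingTable`, `fetch`, and `flaky_fetch` are hypothetical names introduced for illustration.

```python
import time


class CachingRetryingTable:
    """Wraps a remote lookup function with a read-through cache and retries."""

    def __init__(self, fetch, max_retries=3, backoff_seconds=0.0):
        self.fetch = fetch  # hypothetical remote call: key -> value
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.cache = {}

    def get(self, key):
        if key in self.cache:  # cache hit: no remote I/O
            return self.cache[key]
        for attempt in range(self.max_retries + 1):
            try:
                value = self.fetch(key)
                break
            except ConnectionError:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the failure
                time.sleep(self.backoff_seconds * (2 ** attempt))
        self.cache[key] = value
        return value

    def join(self, events, key_fn):
        """Stream-table join: pair each event with its looked-up record."""
        return [(event, self.get(key_fn(event))) for event in events]


calls = []


def flaky_fetch(member_id):
    """Simulated remote store that fails on the very first request."""
    calls.append(member_id)
    if len(calls) == 1:
        raise ConnectionError("transient failure")
    return {"memberId": member_id, "country": "US"}


table = CachingRetryingTable(flaky_fetch)
joined = table.join([{"memberId": 7}, {"memberId": 7}], key_fn=lambda e: e["memberId"])
```

The first event costs two remote calls (one failure, one retry) and the second is served from the cache. Async requests, rate limiting, and batching are left out to keep the sketch short.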
Example Application
Count the number of "Page Views" for each member in a 5-minute window.
[Diagram: Page View → Repartition by member id → Intermediate Stream → Window → Map → SendTo → Page View Per Member]
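Before looking at the framework APIs, the computation itself can be sketched in plain Python. This is only an illustration of tumbling-window counting. The event layout of `(timestamp_seconds, member_id)` pairs is an assumption, and the sketch runs in a single process with no repartitioning or state store.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60


def count_page_views(events):
    """Count events per (window start, member id).

    Each event is a (timestamp_seconds, member_id) pair; this layout is an
    assumption for the sketch.
    """
    counts = Counter()
    for timestamp, member_id in events:
        # Tumbling windows: each timestamp falls in exactly one 5-minute bucket.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(window_start, member_id)] += 1
    return dict(counts)


events = [(0, "a"), (10, "b"), (299, "a"), (300, "a"), (620, "b")]
result = count_page_views(events)
```

Member "a" has two views in the window starting at 0 and one in the window starting at 300, since the event at second 300 opens a new window.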
High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
High Level API
public class PageViewCountApplication implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor appDescriptor) {
    // serde and initialValue are not defined on the slide; they are assumed to be declared elsewhere.
    KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("tracking");
    KafkaInputDescriptor<PageViewEvent> pageViews = ksd.getInputDescriptor("PageView", serde);
    KafkaOutputDescriptor<PageViewCount> pageViewCounts = ksd.getOutputDescriptor("PageViewCount", serde);
    appDescriptor.getInputStream(pageViews)
        // Repartition by member id so each member's events reach the same task.
        .partitionBy(m -> m.memberId, serde)
        // Count events per member in 5-minute tumbling windows.
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))
        .map(PageViewCount::new)
        .sendTo(appDescriptor.getOutputStream(pageViewCounts));
  }
}
Apache Beam
● Event Time Processing
● Multi-lingual APIs (Java, Python, Go*)
● Advanced Windows and Joins
Apache Beam
public class PageViewCount {
  public static void main(String[] args) {
    ...
    pipeline
        .apply(LiKafkaIO.<PageViewEvent>read()
            .withTopic("PageView")
            .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
            // The watermark trails event time by 60 seconds.
            .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
        .apply(Values.create())
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey())
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Counter.class)))
            .via(newCounter()))
        .apply(LiKafkaIO.<Counter>write().withTopic("PageViewCount"));
    pipeline.run();
  }
}
Apache Beam: Python
p = Pipeline(options=pipeline_options)
(p
 | 'read' >> ReadFromKafka(cluster="tracking",
                           topic="PageViewEvent", config=config)
 | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
 | 'windowing' >> beam.WindowInto(window.FixedWindows(60 * 5))
 | 'compute' >> beam.CombinePerKey(beam.combiners.CountCombineFn())
 | 'write' >> WriteToKafka(cluster="queuing",
                           topic="PageViewCount", config=config))
p.run().wait_until_finish()
Samza SQL
● Declarative streaming SQL API
● Managed service at LinkedIn
● Create and deploy applications in minutes using SQL Shell
Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
Samza APIs
[Diagram: Low Level, High Level, Samza SQL, Apache Beam (Java, Python)]
Samza on a Multi-Tenant Cluster
• Uses a cluster manager (e.g. YARN) for resource management,
coordination, liveness monitoring, etc.
• Better resource utilization in a multi-tenant environment.
• Works well for a large number of applications.
Samza as an Embedded Library
• Embed Samza as a library in an application. No cluster manager dependency.
• Dynamically scale out applications by increasing or decreasing the number of
processors at run-time.
• Supports rolling upgrades and canaries.
Samza as a Library: ZooKeeper Based Coordination
● Uses ZooKeeper for leader election and liveness monitoring for processors.
● Leader JobCoordinator performs work assignments among processors.
● Leader redistributes partitions when processors join or leave the group.
[Diagram: ZooKeeper coordinating several StreamProcessors, each with a Samza Container and a Job Coordinator; one Job Coordinator is the Leader]
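The coordination flow on this slide can be sketched without ZooKeeper. This is an illustration only: membership is a plain Python set instead of ZooKeeper nodes, and picking the smallest id as leader and assigning partitions round-robin are assumptions, not Samza's documented behavior.

```python
def elect_leader(processors):
    """Pick the processor with the smallest id (assumed election rule)."""
    return min(processors)


def assign_partitions(processors, num_partitions):
    """Leader's job: spread partitions round-robin over live processors."""
    members = sorted(processors)
    assignment = {p: [] for p in members}
    for partition in range(num_partitions):
        assignment[members[partition % len(members)]].append(partition)
    return assignment


group = {"processor-1", "processor-2"}
leader = elect_leader(group)
before = assign_partitions(group, num_partitions=4)

group.add("processor-3")  # a processor joins: the leader rebalances
after_join = assign_partitions(group, num_partitions=4)

group.discard("processor-1")  # the leader leaves: elect again, then rebalance
new_leader = elect_leader(group)
after_leave = assign_partitions(group, num_partitions=4)
```

Every partition stays assigned to exactly one processor after each membership change, which is the property the leader has to maintain.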
Apache Samza
• Mature, versatile, and scalable processing framework
• Best-in-class support for local and remote state
• Powerful and flexible APIs
• Can be operated as a platform or used as an embedded library
Contact Us
http://samza.apache.org
dev@samza.apache.org
