Scalable Real-Time Complex Event Processing @Uber
Shuyi Chen
Uber Technologies Inc.
6 continents, 70+ countries and 400+ cities
Transportation as reliable as running water, everywhere,
for everyone
Uber
Outline
● Motivation
● Architecture
● Limitations
● Challenges
Uber is a data-driven company
Thousands of Kafka topics from micro-services
We can extract a lot of useful information from this
rich set of logs in real-time!
Multiple logins from the same IP in the last 10
minutes
Window Aggregation
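A minimal Python sketch of the window aggregation idea, counting logins per IP over the last 10 minutes. The class and method names are hypothetical and this is not Siddhi code; it also assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # 10 minutes


class LoginWindowCounter:
    """Counts logins per IP over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # ip -> timestamps still inside the window

    def on_login(self, ip, timestamp):
        """Record a login and return how many logins this IP has in the window."""
        window = self.events[ip]
        window.append(timestamp)
        # Evict timestamps that are a full window old or older.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window)


counter = LoginWindowCounter()
# Timestamps are in seconds; the login at t=700 pushes t=0 and t=100 out of the window.
counts = [counter.on_login("10.0.0.1", t) for t in (0, 100, 200, 700)]
```

A rule such as "alert when the count exceeds a threshold" would then be a simple comparison on the returned value.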
Partner accepted a trip
→ partner calls rider through the Uber app
→ rider cancels the trip
Pattern detection
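The pattern above can be sketched as a small per-trip state machine. This is an illustrative Python sketch, not Siddhi's pattern engine; the event type names are assumptions, and unrelated events between steps are simply ignored.

```python
# Hypothetical event type names for the three steps of the pattern.
PATTERN = ("trip_accepted", "partner_called_rider", "rider_cancelled")


class TripPatternDetector:
    """Tracks, per trip, how far along the pattern the events have progressed."""

    def __init__(self, pattern=PATTERN):
        self.pattern = pattern
        self.progress = {}  # trip_id -> index of the next expected step

    def on_event(self, trip_id, event_type):
        """Return True when this event completes the whole pattern for the trip."""
        step = self.progress.get(trip_id, 0)
        if event_type == self.pattern[step]:
            step += 1
        if step == len(self.pattern):
            self.progress.pop(trip_id, None)  # pattern matched; forget the trip
            return True
        self.progress[trip_id] = step
        return False


detector = TripPatternDetector()
events = [
    ("t1", "trip_accepted"),
    ("t2", "trip_accepted"),
    ("t1", "partner_called_rider"),
    ("t2", "rider_cancelled"),  # t2 skipped the call step, so no match
    ("t1", "rider_cancelled"),
]
matched = [trip for trip, event_type in events if detector.on_event(trip, event_type)]
```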
Partners reject the second pickup of an UberPOOL
trip
Filter
Can we use a declarative language to specify this
stream processing logic?
Complex event processing
● Combines data from multiple sources to infer events or patterns that suggest
more complicated circumstances
● CEP is used across many industries for various use cases, including:
○ Finance: Trade analysis, fraud detection
○ Airlines: Operations monitoring
○ Healthcare: Claims processing, patient monitoring
○ Energy and Telecommunications: Outage detection
● CEP uses a declarative rule/query language to specify event processing logic
WSO2/Siddhi: Complex event processing engine
● Lightweight, extensible, open source, released as a Java library
● Features supported
○ Filter
○ Join
○ Aggregation
○ Group by
○ Window
○ Pattern processing
○ Sequence processing
○ Event tables
○ Event-time processing
○ UDF
○ Extensions
○ Declarative query language: SiddhiQL
How Siddhi works
● Specify processing logic declaratively with SiddhiQL
● The query is parsed at runtime into an execution plan runtime
● As events flow in, the execution plan runtime processes them inside the CEP
engine according to the query logic
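To make the "parse once, then process events" idea concrete, here is a toy Python sketch. The dictionary-based query format and the `compile_query` function are invented for illustration; they are not SiddhiQL and not the Siddhi API.

```python
import operator

OPERATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}


def compile_query(query):
    """Turn a declarative query description into a callable plan, once, at load time."""
    field, op_symbol, value = query["where"]
    predicate = OPERATORS[op_symbol]
    selected = query["select"]

    def plan(event):
        # Return the projected event if it passes the filter, otherwise None.
        if predicate(event[field], value):
            return {name: event[name] for name in selected}
        return None

    return plan


# Hypothetical query: keep events for the second pickup and project two fields.
query = {"where": ("pickup_index", "==", 2), "select": ["trip_id", "partner_id"]}
plan = compile_query(query)

events = [
    {"trip_id": "t1", "partner_id": "p1", "pickup_index": 1},
    {"trip_id": "t2", "partner_id": "p9", "pickup_index": 2},
]
outputs = [out for out in map(plan, events) if out is not None]
```

The point of the sketch is that the processing code stays generic: changing the query changes the behavior without changing the engine.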
How can we make it scalable at Uber scale?
Apache Samza
● A distributed stream processing framework
○ Distributed and scalable
○ Built-in state management
○ Built-in fault tolerance
○ At-least-once message processing
○ Infrastructure support at Uber
How can we make the stream processing output
useful?
Actions
● Generalize a set of common action templates to make it easy for
micro-services and humans to harness the power of real-time stream
processing
● Currently we support
○ Make an RPC call
○ Invoke a Webhook endpoint
○ Index to Elasticsearch
○ Index to Cassandra
○ Kafka
○ StatsD
○ Chat service
○ Email
○ Push notification
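One way to picture action templates is a registry that maps an action type to a handler, so a use case only supplies configuration. The Python sketch below is an assumption about the shape of such a design, not the actual implementation; the handlers record what they would send instead of doing real I/O, and the URL and address are placeholders.

```python
sent = []  # stand-in for real side effects such as HTTP calls or emails


def webhook_action(config, event):
    sent.append(("webhook", config["url"], event))


def email_action(config, event):
    sent.append(("email", config["to"], event))


# Hypothetical registry: action type -> reusable template.
ACTION_TEMPLATES = {"webhook": webhook_action, "email": email_action}


def run_actions(actions, event):
    """Run every configured action against one query output event."""
    for action in actions:
        handler = ACTION_TEMPLATES[action["type"]]
        handler(action["config"], event)


actions = [
    {"type": "webhook", "config": {"url": "https://example.invalid/hook"}},
    {"type": "email", "config": {"to": "oncall@example.invalid"}},
]
run_actions(actions, {"ip": "10.0.0.1", "logins": 5})
```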
Real-time Scalable Complex Event Processing
Outline
● Motivation
● Architecture
● Limitations
● Challenges
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Partitioner
● Re-shuffle events based on key
● Support predicate pushdown through query analysis
● Support column pruning through query analysis (WIP)
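A rough Python sketch of key-based re-shuffling with predicate pushdown. The function names and the event fields are hypothetical, and in-memory lists stand in for Kafka partitions.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable hash so the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reshuffle(events, key_field, num_partitions, predicate=None):
    """Route events to partitions by key, dropping filtered events up front."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        # Predicate pushdown: events the query would discard anyway are dropped
        # here, before they are written to the re-shuffled stream.
        if predicate is not None and not predicate(event):
            continue
        partitions[partition_for(event[key_field], num_partitions)].append(event)
    return partitions


events = [
    {"rider_id": "r1", "type": "login"},
    {"rider_id": "r2", "type": "trip_request"},
    {"rider_id": "r1", "type": "login"},
]
partitions = reshuffle(events, "rider_id", 4, predicate=lambda e: e["type"] == "login")
```

Because all events for one key land in one partition, a downstream stateful operator can keep that key's state locally.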
Query processor
● Parse Siddhi queries into an execution plan runtime
● Process events in the Siddhi execution plan runtime
● Checkpoint state regularly using RocksDB to ensure recovery upon
crash/restart
Action processor
● Execute actions upon the query processing output
● Support various kinds of actions for easy integration
● Implement an action retry mechanism using RocksDB to provide at-least-once
delivery
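The retry idea can be sketched as "persist first, delete only after success". In this Python sketch a plain dictionary stands in for the RocksDB store, and the class and the flaky sender are hypothetical.

```python
class RetryingActionRunner:
    """Keeps undelivered actions in a store until delivery succeeds."""

    def __init__(self, send):
        self.send = send   # callable that raises on failure
        self.pending = {}  # action_id -> payload; stand-in for a persistent store

    def submit(self, action_id, payload):
        # Persist first, so a crash before delivery still leaves a record to retry.
        self.pending[action_id] = payload
        self.flush()

    def flush(self):
        for action_id, payload in list(self.pending.items()):
            try:
                self.send(payload)
            except Exception:
                continue  # keep it pending; a later flush retries
            # Delete only after success. A crash between send and delete would
            # cause a duplicate delivery, which is why this is at-least-once.
            del self.pending[action_id]


attempts = []


def flaky_send(payload):
    attempts.append(payload)
    if len(attempts) == 1:
        raise ConnectionError("simulated outage")


runner = RetryingActionRunner(flaky_send)
runner.submit("a1", {"alert": "multiple logins"})
still_pending = len(runner.pending)  # first attempt failed
runner.flush()                       # retry succeeds
```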
How do we translate a query into a physical plan that
runs?
DAG (Directed Acyclic Graph) generation
● Analyze the Siddhi query to automatically generate the stream processing DAG
in Samza using the processors
[Figures: example DAGs for filter/transformation queries, for join/window/pattern queries, and for more complicated queries]
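A toy Python sketch of DAG generation from query analysis. The rule it uses is an assumption made for illustration: stateless queries (filter, transformation) skip the partitioner, while keyed stateful queries (join, window, pattern) get a partitioner stage first. The real generated DAGs may differ, and the DAG is simplified here to a linear chain of stage names.

```python
# Assumption for this sketch: these operations need events grouped by key.
STATEFUL_OPERATIONS = {"join", "window", "pattern"}


def generate_dag(operations, has_actions=True):
    """Return an ordered list of processor stages for a query."""
    stages = []
    if STATEFUL_OPERATIONS & set(operations):
        stages.append("partitioner")  # re-shuffle by key before stateful work
    stages.append("query_processor")
    if has_actions:
        stages.append("action_processor")
    return stages


stateless_dag = generate_dag(["filter"])
stateful_dag = generate_dag(["filter", "window"])
```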
No stream processing logic is hard-coded in any of
the processors
REST API backend
● All queries and actions are stored externally in a database
● RESTful API for CRUD operations
● If query/action logic changes
○ Redeploy the Samza DAG if needed
○ Otherwise, the updated queries/actions are loaded at runtime without interruption
Unified management and monitoring
● Every use case
○ Shares the same set of processors
○ Uses queries and actions to describe its processing logic
● A single monitoring template can be reused across different use cases
Production status
● In production for >1.5 years
● 120+ production use cases
● 30+ billion messages processed per day
Applications
● Real-time fraud detection
● Real-time anomaly detection
● Real-time marketing campaign
● Real-time promotion
● Real-time monitoring
● Real-time feedback system
● Real-time analytics
● Real-time visualizations
● And more
Outline
● Motivation
● Architecture
● Limitations
● Challenges
Out-of-order event handling
● Not a big concern
○ Events of the same rider/partner are usually seconds apart
● K-slack extension in Siddhi for out-of-order event processing
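The general K-slack idea is to buffer events and release an event only once a later event has been seen whose timestamp is at least K ahead of it. The Python sketch below is a simplified illustration of that idea with a fixed K; it is not the Siddhi extension's code, and the class name is hypothetical.

```python
import heapq


class KSlackBuffer:
    """Reorders events that arrive up to k time units late."""

    def __init__(self, k):
        self.k = k
        self.max_seen = float("-inf")
        self.heap = []     # min-heap of (timestamp, sequence, event)
        self.sequence = 0  # tie-breaker so events themselves are never compared

    def add(self, timestamp, event):
        """Buffer one event; return the events now safe to emit, oldest first."""
        heapq.heappush(self.heap, (timestamp, self.sequence, event))
        self.sequence += 1
        self.max_seen = max(self.max_seen, timestamp)
        ready = []
        while self.heap and self.heap[0][0] <= self.max_seen - self.k:
            ready.append(heapq.heappop(self.heap)[2])
        return ready

    def flush(self):
        """Emit everything still buffered, oldest first."""
        ready = []
        while self.heap:
            ready.append(heapq.heappop(self.heap)[2])
        return ready


buffer = KSlackBuffer(k=2)
emitted = []
for ts in [1, 3, 2, 5, 4, 8]:  # arrival order; 2 and 4 arrive late
    emitted.extend(buffer.add(ts, ts))
emitted.extend(buffer.flush())
```

The trade-off is latency: every event waits until the stream has advanced by K before it is emitted.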
Auto-scaling
● Manually re-partition Kafka topics to increase parallelism
● Manually tune container memory if needed
● Future
○ Use CPU/memory/IO stats to automate the process
Outline
● Motivation
● Architecture
● Limitations
● Challenges
Large checkpointing state
● Samza uses Kafka to log state changes
● Siddhi engine snapshots can be large
● Kafka limits message size to 1 MB by default
● Solution: we built logic to slice the state into smaller pieces and
checkpoint them
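A minimal Python sketch of slicing a large snapshot into pieces that each fit under the message size limit, then reassembling them on recovery. The chunk size, the tuple layout, and the function names are assumptions for illustration, not the actual implementation.

```python
MAX_CHUNK_BYTES = 900_000  # hypothetical budget that stays under a 1 MB limit


def slice_snapshot(snapshot, max_chunk_bytes=MAX_CHUNK_BYTES):
    """Split a snapshot into (index, total, payload) chunks."""
    total = max(1, -(-len(snapshot) // max_chunk_bytes))  # ceiling division
    return [
        (index, total, snapshot[index * max_chunk_bytes:(index + 1) * max_chunk_bytes])
        for index in range(total)
    ]


def reassemble(chunks):
    """Rebuild the snapshot; refuse to restore from an incomplete set of chunks."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    total = ordered[0][1]
    if [chunk[0] for chunk in ordered] != list(range(total)):
        raise ValueError("incomplete checkpoint")
    return b"".join(chunk[2] for chunk in ordered)


snapshot = bytes(range(256)) * 10_000  # 2.56 MB of stand-in state
chunks = slice_snapshot(snapshot)
restored = reassemble(reversed(chunks))  # arrival order should not matter
```

Carrying the index and total in every chunk lets the reader detect a checkpoint that was only partially written.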
Synchronous checkpointing
● Samza checkpointing is synchronous with message processing
● If the state is large, checkpointing can take a long time and cause processing lag
● Incremental state checkpointing
Exactly-once state processing?
● Cannot commit state and offset atomically
● No exactly-once state processing
Custom business logic
● Common logic implemented as Siddhi extensions
● Ad-hoc logic implemented as UDFs in JavaScript or Scala script inline with the
query
Intermediate Kafka messages
● Samza uses Kafka as the message queue for intermediate processing output
○ Stages are independent of each other
○ This can create a large load on Kafka if a heavy topic is re-shuffled multiple times
■ Encode the intermediate messages to reduce footprint
Upgrading Samza jobs
● Upgrading Samza jobs requires a full restart, which can take minutes due to
○ Offset checkpointing topic too large → set retention to hours or enable compaction
○ Changelog topic too large → set retention or enable compaction in Kafka, or use host affinity
● To minimize the interruption during upgrade, it would be nice to have
○ Rolling restart
○ Per container restart
Our solution: non-interrupted handoff
● For critical jobs, we use replication during upgrade
○ Start a shadow job
○ Upgrade shadow
○ Switch primary and shadow
○ Upgrade primary
○ Switch back
● Downside: requires 2x capacity during upgrade
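The handoff steps above can be sketched as a short simulation. This Python sketch only models which job is serving at each step; the `Job` class, the step labels, and the version strings are hypothetical, and real job control is out of scope.

```python
class Job:
    def __init__(self, name, version):
        self.name = name
        self.version = version


def upgrade_with_handoff(primary, shadow_name, new_version):
    """Walk through the handoff and log which job is serving at each step."""
    log = []
    shadow = Job(shadow_name, primary.version)  # start a shadow job
    active = primary
    log.append(("start shadow", active.name))
    shadow.version = new_version                # upgrade shadow while primary serves
    log.append(("upgrade shadow", active.name))
    active = shadow                             # switch primary and shadow
    log.append(("switch to shadow", active.name))
    primary.version = new_version               # upgrade primary while shadow serves
    log.append(("upgrade primary", active.name))
    active = primary                            # switch back
    log.append(("switch back", active.name))
    return log


primary = Job("primary", "v1")
log = upgrade_with_handoff(primary, "shadow", "v2")
```

The property to notice is that the job being upgraded is never the one serving traffic, which is also why two jobs' worth of capacity is needed for the duration.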
Thank You!
