Stream Processing @ Uber
Danny Yuan @ Uber
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/uber-stream-processing
Presented at QCon San Francisco
www.qconsf.com
What is Uber
Transportation at your fingertips
Stream Processing in Uber
Stream Data Allows Us To Feel The Pulse Of Cities
Marketplace Health
What’s Going on Now
What’s Happened?
Status Tracking
Stream Processing in Uber
A Little Background
Uber’s Platform Is a Distributed State Machine
Rider States
Driver States
Applications can’t do everything
Instead, Applications Emit Events
Events Should Be Available In Seconds
Events Should Rarely Get Lost
Events Should Be Cheap And Scalable
Stream Processing in Uber
Where are the challenges?
Many Dimensions
Dozens of fields per event
Granular Data
• Over 10,000 hexagons in the city
• 7 vehicle types
• 1440 minutes in a day
• 13 driver states
• 300 cities
• 1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
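The arithmetic behind the 393 billion figure can be checked with a few lines. This is only a sketch: the dictionary keys are illustrative names, not field names from the talk, and the numbers are the ones quoted on the slides.

```python
from math import prod

# Cardinality of each dimension, as quoted on the slides.
# The key names are hypothetical labels chosen for this example.
CARDINALITIES = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}


def possible_combinations(cardinalities):
    """Return the size of the full cross product of all dimensions."""
    return prod(cardinalities.values())


total = possible_combinations(CARDINALITIES)
print(f"{total:,}")  # 393,120,000,000
```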
Unknown Query Patterns
Any combination of dimensions
Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
Different Geo Aggregation
Large Data Volume
• Hundreds of thousands of events per
second, or billions of events per day

• At least dozens of fields in each event
Tight Schedule
Key: Generalization
Data Type
• Dimensional Temporal Spatial Data
Dimension      Value
state          driver_arrived
vehicle type   uber X
timestamp      13244323342
latitude       12.23
longitude      30.00
Data Query
• OLAP on single-table temporal-spatial data


SELECT <agg functions>, <dimensions>
FROM <data_source>
WHERE <boolean filter>
GROUP BY <dimensions>
HAVING <boolean filter>
ORDER BY <sorting criteria>
LIMIT <n>
DO <post aggregation>
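To make the shape of this query template concrete, here is a minimal in-memory sketch of the same clauses applied to a list of events. It is not how the production system evaluates queries; the function name `run_query`, its parameters, and the sample events are all assumptions made for illustration, and the DO (post aggregation) step is reduced to an optional callback.

```python
from collections import defaultdict


def run_query(events, where, group_by, agg,
              having=None, limit=None, post=None):
    """Evaluate a single-table OLAP-style query over a list of dict events.

    where    -> WHERE    (predicate on one event)
    group_by -> GROUP BY (list of dimension names)
    agg      -> SELECT   (aggregation over the events in one group)
    having   -> HAVING   (predicate on the aggregated value)
    limit    -> LIMIT    (rows are ordered by aggregated value, descending)
    post     -> DO       (optional transformation of the final rows)
    """
    groups = defaultdict(list)
    for event in events:
        if where(event):
            key = tuple(event[dim] for dim in group_by)
            groups[key].append(event)

    rows = [(key, agg(members)) for key, members in groups.items()]
    if having is not None:
        rows = [row for row in rows if having(row[1])]
    rows.sort(key=lambda row: row[1], reverse=True)
    rows = rows[:limit]
    return post(rows) if post is not None else rows


# Hypothetical sample events using the dimensions from the table above.
events = [
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h1"},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h1"},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h2"},
    {"state": "open", "vehicle_type": "uberX", "hexagon": "h1"},
]

top_hexagons = run_query(
    events,
    where=lambda e: e["state"] == "driver_arrived",
    group_by=["hexagon"],
    agg=len,                      # count()
    having=lambda count: count >= 1,
    limit=1,
)
print(top_hexagons)  # [(('h1',), 2)]
```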
Finding the Right Storage System
Minimum Requirements
• OLAP with geospatial and time series support

• Support large amount of data

• Sub-second response time

• Query of raw data


It can’t be a KV store
Challenges to KV Store
Pre-computing all keys is O(2^n) for both space and time
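The slide does not define n. Assuming n is the number of dimensions and that a query may group by any subset of them, the exponential growth can be sketched by enumerating those subsets. The dimension names below are placeholders.

```python
from itertools import combinations


def dimension_subsets(dimensions):
    """Yield every subset of the given dimensions, including the empty one."""
    for size in range(len(dimensions) + 1):
        yield from combinations(dimensions, size)


# Hypothetical dimension names; a real event carries dozens of fields.
dims = ["state", "vehicle_type", "hexagon", "minute", "city"]
subsets = list(dimension_subsets(dims))
print(len(subsets))  # 32, i.e. 2 ** 5
```

Each subset would need its own set of pre-computed keys, and that is before multiplying by the number of distinct values per dimension.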

It can’t be a relational database
Challenges to Relational DB
• Managing multiple indices is painful

• Scanning is not fast enough


A System That Supports
• Fast scan

• Arbitrary boolean queries

• Raw data

• Wide range of aggregations


Elasticsearch
Highly Efficient Inverted-Index For Boolean Query
Built-in Distributed Query
Fast Scan with Flexible Aggregations
Storage
Are We Done?
Transformation
e.g. (Lat, Long) -> (zipcode, hexagon)
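As an illustration of such a transformation, the sketch below maps a point to a cell of a pointy-top hexagonal grid using axial coordinates and cube rounding. The talk does not describe the actual hexagon scheme, so everything here is an assumption: the function names, the `size` parameter, and in particular the simplification of treating longitude and latitude as planar x and y (a real system would first apply a map projection).

```python
import math


def cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the component with the largest rounding error so that
    # the cube-coordinate constraint q + r + s == 0 still holds.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def point_to_hexagon(x, y, size):
    """Map a planar point to the axial (q, r) id of its hexagon.

    size is the distance from a hexagon's center to a corner.
    """
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return cube_round(q, r)


def hexagon_center(q, r, size):
    """Return the planar center of hexagon (q, r); inverse of the above."""
    x = size * math.sqrt(3) * (q + r / 2)
    y = size * 1.5 * r
    return x, y


# Assumption: longitude is used as x and latitude as y, unprojected.
cell = point_to_hexagon(30.00, 12.23, size=0.01)
```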
Dynamic Pricing
Trend Prediction
Supply and Demand Distribution
Technically Speaking: Clustering & Pr(D, S, E)
New Use Cases -> New Requirements
Pre-aggregation
Joining Multiple Streams
Sessionization
Multi-Staged Processing
State Management
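Of these requirements, sessionization is perhaps the easiest to show in isolation. The following is a minimal batch sketch, assuming a session ends when the gap between two consecutive events of the same key exceeds a timeout. The function name, the (key, timestamp) event shape, and the timeout value are illustrative; a streaming job would keep the open sessions in managed state instead of sorting a complete list.

```python
def sessionize(events, gap_seconds):
    """Group (key, timestamp) events into per-key sessions.

    A new session starts whenever the time since the previous event
    of the same key is larger than gap_seconds.
    """
    sessions = {}
    for key, timestamp in sorted(events, key=lambda e: e[1]):
        key_sessions = sessions.setdefault(key, [])
        if key_sessions and timestamp - key_sessions[-1][-1] <= gap_seconds:
            key_sessions[-1].append(timestamp)   # continue current session
        else:
            key_sessions.append([timestamp])     # open a new session
    return sessions


# Hypothetical rider events: (rider id, timestamp in seconds).
rider_events = [("r1", 0), ("r1", 10), ("r2", 5), ("r1", 100)]
print(sessionize(rider_events, gap_seconds=30))
# {'r1': [[0, 10], [100]], 'r2': [[5]]}
```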
Apache Samza
Why Apache Samza?
DAG on Kafka
Excellent Integration with Kafka
Built-in Checkpointing
Built-in State Management
Processing Storage
What If Storage Is Down?
What If Processing Takes Long?
Processing Storage
Are We Done?
Stream Processing in Uber
Post Processing
Results Transformation and Smoothing
Scale of Post Processing
• 10,000 hexagons in a city
• 331 neighboring hexagons to look at
• 331 x 10,000 = 3.3 Million Hexagons to Process for a Single Query
• 99%-ile Processing Time: 70ms
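The slides do not say where 331 comes from. It happens to equal the number of cells within 10 rings of a hexagon on a hexagonal grid (1 + 3 x 10 x 11 = 331), so the sketch below assumes that interpretation. The axial (q, r) coordinates and the function name are choices made for this example.

```python
def hexagons_within(center, radius):
    """Return all axial (q, r) cells within `radius` rings of `center`."""
    cq, cr = center
    cells = []
    for dq in range(-radius, radius + 1):
        lo = max(-radius, -dq - radius)
        hi = min(radius, -dq + radius)
        for dr in range(lo, hi + 1):
            cells.append((cq + dq, cr + dr))
    return cells


neighborhood = hexagons_within((0, 0), radius=10)
print(len(neighborhood))            # 331
print(len(neighborhood) * 10_000)   # 3310000 cells visited per query
```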
Post Processing
• Each processor is a pure function
• Processors can be composed by combinators
• Highly parallelized execution
• Pipelining
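A rough sketch of what pure-function processors and combinators can look like follows. None of these names come from the talk: `compose`, `parallel`, `scale`, and `clamp` are hypothetical, and a thread pool stands in for whatever parallel execution the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def compose(*processors):
    """Combinator: chain processors left to right into one processor."""
    return lambda value: reduce(lambda acc, p: p(acc), processors, value)


def parallel(processor, max_workers=4):
    """Combinator: lift a per-item processor to run over a list in parallel."""
    def run(items):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves the order of the input items.
            return list(pool.map(processor, items))
    return run


# Two hypothetical pure processors: no shared state, no side effects.
def scale(factor):
    return lambda x: x * factor


def clamp(low, high):
    return lambda x: max(low, min(high, x))


pipeline = parallel(compose(scale(2.0), clamp(0.0, 5.0)))
result = pipeline([0.5, 1.0, 4.0])
print(result)  # [1.0, 2.0, 5.0]
```

Because each processor is pure, the same composed function can be applied to every hexagon independently, which is what makes the parallel step safe.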
Practical Considerations
Data Discovery
Elasticsearch Query Can Be Complex
/driverAcceptanceRate?
geo_dist(10, [37, 22])&
time_range(2015-02-04,2015-03-06)&
aggregate(timeseries(7d))&
eq(msg.driverId,1)
  
Elasticsearch Query Can Be Optimized
• Pipelining

• Validation

• Throttling
(Chart; axis: time in seconds)
Elasticsearch Can Be Replaced
Storage Query Processing
There’s one more thing
There are always patterns in streams
There is always need for quick exploration
How many drivers cancel a request 10 times in a
row within a 5-minute window?
Which riders request a pickup from 100 miles
apart within a half hour window?
Stream Processing in Uber
Complex Event Processing
FROM driver_canceled#window.time(10 min)
SELECT clientUUID, count(clientUUID) as cancelCount
GROUP BY clientUUID HAVING cancelCount > 10
INSERT INTO hipchat(room);
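For comparison, the logic this query expresses can be approximated by hand with a sliding window per client. This sketch only shows what the declarative query saves you from writing. The class and method names are made up, timestamps are assumed to arrive in order per client, and the alerting step (the INSERT INTO part) is reduced to a boolean return value.

```python
from collections import defaultdict, deque


class CancelMonitor:
    """Count cancellations per client inside a sliding time window."""

    def __init__(self, window_seconds=600, threshold=10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # client UUID -> timestamps

    def on_cancel(self, client_uuid, timestamp):
        """Record one cancellation; return True if the count exceeds
        the threshold within the window ending at `timestamp`."""
        window = self._events[client_uuid]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.threshold


monitor = CancelMonitor(window_seconds=600, threshold=10)
alerts = [monitor.on_cancel("client-1", t) for t in range(0, 110, 10)]
print(alerts[-1])  # True: the 11th cancellation within 10 minutes
```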
Implementation Becomes Easy
Thank You!
