Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Mayank Bansal, Data Infra, Uber
Bo Yang, Data Infra, Uber
Igniting opportunity by setting the world in motion
15 billion trips
18M trips per day
6 continents, 69 countries and 10,000 cities
103M active monthly users
5M active drivers
22,000 employees worldwide
3,700 developers worldwide
Data and ML Use Cases at Uber
○ Uber Eats
○ ETAs
○ Self-Driving Vehicles
○ Customer Support
○ Driver/Rider Match
○ Personalization
○ Demand Modeling
○ Dynamic Pricing
○ Forecasting
○ Maps
○ Fraud
○ Anomaly Detection
○ Capacity Planning
○ And many more...
Data and ML at Uber - ETAs
○ ETAs are core to the Uber customer experience
○ ETAs used by myriad internal systems
○ ETAs are generated by route-based algorithms
○ ML models predict the route-based ETA error
○ Uber uses the predicted error to correct the ETA
○ ETAs now dramatically more accurate
Data and ML at Uber - Driver/Rider Match
○ Optimize matching of riders and drivers on the Uber platform
○ Predict whether a rider with the app open will make a trip request
Data and ML at Uber - Eats
○ Models used for
○ Ranking of restaurants and dishes
○ Delivery times
○ Search ranking
○ 100s of ML models called to render Eats homepage
Data and ML at Uber - Self-Driving Vehicles
Uber’s Data Stack
○ Mobile App Events, Device Telemetry, Micro-Service Events, Database Events, 3rd Party Feeds, Bulk Uploads
○ Incremental Ingestion, Kafka
○ Query Engines: Realtime, Pre-Aggregated (AthenaX); Ad hoc, Interactive (Presto, Vertica); Complex, Batch (Hive)
○ Data Analytics Tools: Dashboards (Summary, Dashbuilder); Ad hoc Query (QueryBuilder); Data Preparation (Piper, uWorc); BI Tools (Tableau, DSW)
○ Data Processing Engines: Stream Processing (Flink); Batch Processing (Spark, Tez, MapReduce)
○ Compute Fabric (YARN / Mesos + Peloton)
○ Tiered Data Lake: In-memory (Pinot, AresDB); Hot (HDFS); Warm (HDFS); Archival (Cloud)
Uber’s ML Stack - Michelangelo
○ Kafka
○ Compute Fabric (YARN / Peloton+Mesos)
○ Data Analytics Tools
○ Query Engines
○ Stream Processing (Flink)
○ Batch Processing (Hive, Spark, Tez)
○ Data Preparation
○ Prototype: Jupyter Notebook, Spark Magic
○ Training: TensorFlow, PyTorch, XGBoost, SparkML
○ Feature Store, Model Store, Metrics Store
○ DataLake (HDFS)
○ Inference: Realtime Prediction Service, Batch Prediction Jobs
Apache Spark
@Uber
Image Source: www.mindproject.io
Apache Spark @ Uber
○ Apache Spark is the primary analytics execution engine teams at Uber use
○ At Uber, 95% of batch and ML jobs run on Spark
○ We run Spark on YARN and Peloton/Mesos
○ We use the external shuffle service for shuffle data
* Apache Hadoop, Spark, and Mesos logos are either registered trademarks or trademarks of the Apache
Software Foundation in the United States and/or other countries. No endorsement by The Apache
Software Foundation is implied by the use of these marks. TensorFlow and the TensorFlow logo are
trademarks of Google Inc. Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis
Labs Ltd. Any use by Uber Technologies is for referential purposes only and does not indicate any
sponsorship, endorsement or affiliation between Redis and Uber Technologies.
How Does Apache Spark Shuffle Service Work?
Limitations of Apache Spark Shuffle Service
● SSD wear-out issues
● Reliability
● Kubernetes dynamic allocation
● Collocation
Different Approaches
● Shuffle manager to external storage
○ Synchronous writes
■ NFS
● 2x slower
■ HDFS
● 5x slower
Different Approaches
● Shuffle manager to external storage
○ Semi-asynchronous writes
■ HDFS
● 4x slower
Different Approaches
● Remote Shuffle Service
○ Streaming writes to HDFS
■ 1.5x slower than writing to local storage
○ Streaming writes to local storage
■ Roughly the same performance as the external shuffle service
Remote Shuffle Service
● Remote Shuffle Service
○ Streaming Writes to Local Storage
■ Changed the MapReduce paradigm
■ Record Stream -> Shuffle Server -> Disk
■ No temporary spill files on the executor side
Architecture - Remote Spark Shuffle Service
Deep Dive
Image Source: www.mindproject.io
Design Principles
● Scale out horizontally
○ Each server instance works independently
○ Avoid centralized state/storage
● Tackle network latency
○ Reduce waiting times for server response
○ Stream data
● Performance optimization
○ Most Spark Apps optimized for similar performance
○ Rely on YARN/Apache Spark retry for failure recovery
Scale Out
Tackle Network Latency
Performance Optimization
Horizontally Scalable
● Spark applications share/use different shuffle servers
● No shared state among shuffle servers
● More shuffle servers to scale out
Shuffle Server Distribution
● Mappers: m=4
● Reducers: r=5
● Shuffle Servers: s=3
Shuffle Server Distribution in General
● Mappers: m
● Reducers: r
● Shuffle Servers: s
● Network Connections
○ Mappers: m*s connections
○ Reducers: r connections
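
The connection counts above can be checked with a small sketch. It assumes each reducer partition lives on exactly one shuffle server, chosen by a modulo rule; that placement rule and the function names are hypothetical, since the slides do not say how partitions are assigned.

```python
def server_for_partition(partition_id: int, num_servers: int) -> int:
    # Hypothetical placement rule: partition p lives on server p mod s.
    return partition_id % num_servers


def connection_counts(m: int, r: int, s: int) -> tuple[int, int]:
    # Each mapper writes records for every partition, so it connects to
    # every server that owns at least one partition.
    servers_in_use = {server_for_partition(p, s) for p in range(r)}
    mapper_connections = m * len(servers_in_use)
    # Each reducer reads one partition, which lives on a single server.
    reducer_connections = r
    return mapper_connections, reducer_connections


print(connection_counts(m=4, r=5, s=3))  # (12, 5)
```

With the numbers from the previous slide (m=4, r=5, s=3) this gives 12 mapper connections and 5 reducer connections.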
Scale Out
Tackle Network Latency
Performance Optimization
Server Implementation
● Use Netty
○ High performance asynchronous server framework
● Two thread groups
○ Group 1: Accept new socket connection
○ Group 2: Read socket data
○ Thread groups do not block each other
● Binary network protocol
○ Efficient encoding/compression
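
To illustrate what a compact binary protocol can look like, here is a minimal length-prefixed framing sketch. The frame layout (4-byte partition id, 4-byte payload length, then the payload) is an assumption made for illustration and is not the actual wire format of the service.

```python
import struct

# Hypothetical frame layout (not the actual wire format):
# 4-byte partition id, 4-byte payload length, then the payload bytes.
HEADER = struct.Struct(">II")


def encode_frame(partition_id: int, payload: bytes) -> bytes:
    return HEADER.pack(partition_id, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Yield (partition_id, payload) for each complete frame in buffer."""
    offset = 0
    while offset + HEADER.size <= len(buffer):
        partition_id, length = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if start + length > len(buffer):
            break  # incomplete frame; a server would wait for more bytes
        yield partition_id, buffer[start:start + length]
        offset = start + length


stream = encode_frame(3, b"key1=value1") + encode_frame(7, b"key2=value2")
print(list(decode_frames(stream)))
```

A fixed-size header lets the reader know exactly how many bytes to expect without scanning for delimiters.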
Direct Write/Read on Disk File
● Write to OS file directly
○ No application level buffering
● Zero copy
○ Transfer data from disk file to shuffle reader without user space memory
● Sequential write/read
○ No random disk IO
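
The ideas on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the file name and helper names are hypothetical, and a local socket pair stands in for the connection between shuffle server and shuffle reader.

```python
import os
import socket
import tempfile


def append_sequentially(path: str, chunks) -> None:
    # buffering=0 disables Python's user-space buffer, so each write goes
    # straight to the OS; append mode keeps the writes sequential.
    with open(path, "ab", buffering=0) as f:
        for chunk in chunks:
            f.write(chunk)


def send_file_zero_copy(path: str, conn: socket.socket) -> int:
    # socket.sendfile uses os.sendfile where the platform supports it, so
    # the kernel moves bytes from the file to the socket without copying
    # them into this process; elsewhere it falls back to plain send().
    with open(path, "rb") as f:
        return conn.sendfile(f)


# Demo over a local socket pair standing in for server and shuffle reader.
server_side, reader_side = socket.socketpair()
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "partition-0.data")
    append_sequentially(path, [b"record-1;", b"record-2;"])
    sent = send_file_zero_copy(path, server_side)
server_side.close()

received = b""
while chunk := reader_side.recv(1024):
    received += chunk
reader_side.close()
print(sent, received)
```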
Client Side Compression
● Shuffle client compresses/decompresses data
● Reduces network transfer size
● Reduces CPU usage on shuffle server
● Supports client-side encryption
○ Encryption key inside each application
○ Encryption key not distributed to shuffle server
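
A minimal sketch of client-side compression follows. It assumes zlib as the codec (the slides do not name one) and uses a hypothetical in-memory stub in place of a shuffle server; the point is that the server only ever handles opaque bytes.

```python
import zlib


class ShuffleServerStub:
    """Stand-in for a shuffle server: stores opaque bytes per partition."""

    def __init__(self):
        self.blocks = {}

    def append(self, partition_id: int, block: bytes) -> None:
        # The server never compresses or decompresses; it only stores bytes.
        self.blocks.setdefault(partition_id, []).append(block)

    def fetch(self, partition_id: int):
        return list(self.blocks.get(partition_id, []))


def write_block(server, partition_id: int, records: bytes) -> int:
    compressed = zlib.compress(records)  # CPU cost is paid on the client
    server.append(partition_id, compressed)
    return len(compressed)


def read_partition(server, partition_id: int) -> bytes:
    return b"".join(zlib.decompress(b) for b in server.fetch(partition_id))


server = ShuffleServerStub()
records = b"user_id=42,city=SF;" * 1000
sent = write_block(server, 0, records)
print(len(records), sent)  # far fewer bytes cross the network
assert read_partition(server, 0) == records
```

Client-side encryption would fit the same shape: the client transforms bytes before sending and reverses the transform after fetching, so the key never has to reach the server.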
Parallel Serialization and Network IO
● Shuffle data serialization takes time
● Serialization in executor thread
● Network IO in another thread
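
The split between serialization and network IO can be sketched as a producer/consumer pair. This is an illustrative sketch, not the client's actual code: pickle stands in for the serializer, and the `send` callback is a hypothetical placeholder for a socket write.

```python
import pickle
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def run_map_task(records, send):
    """Serialize in the calling thread while a second thread does the IO."""
    outbox = queue.Queue(maxsize=64)  # bounded, so serialization cannot run far ahead

    def network_loop():
        while True:
            item = outbox.get()
            if item is _DONE:
                break
            send(item)  # in a real client this would be a socket write

    sender = threading.Thread(target=network_loop)
    sender.start()
    for record in records:
        outbox.put(pickle.dumps(record))  # serialization in the task thread
    outbox.put(_DONE)
    sender.join()


sent_blocks = []
run_map_task([("k1", 1), ("k2", 2)], sent_blocks.append)
print([pickle.loads(b) for b in sent_blocks])  # [('k1', 1), ('k2', 2)]
```

While one block is on the wire, the task thread is already serializing the next one, so neither step waits idle on the other.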
Connection Pool
● Socket connect latency is not trivial
● Reuse client/server connections
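
A connection pool in its simplest form looks like the sketch below. The class, host name, and port are hypothetical; the pool is not thread-safe and does no health checking, which a production client would need.

```python
import socket
from collections import defaultdict


class ConnectionPool:
    """Keeps idle connections per (host, port) so later tasks can reuse them."""

    def __init__(self, connect=socket.create_connection):
        self._connect = connect
        self._idle = defaultdict(list)
        self.connects = 0  # how many new connections were opened

    def acquire(self, host: str, port: int):
        idle = self._idle[(host, port)]
        if idle:
            return idle.pop()  # reuse: no connect latency
        self.connects += 1
        return self._connect((host, port))

    def release(self, host: str, port: int, conn) -> None:
        self._idle[(host, port)].append(conn)


# Demo with a fake connect function so no real server is needed.
pool = ConnectionPool(connect=lambda address: object())
first = pool.acquire("shuffle-server-1", 9338)
pool.release("shuffle-server-1", 9338, first)
second = pool.acquire("shuffle-server-1", 9338)
print(second is first, pool.connects)  # True 1
```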
Scale Out
Tackle Network Latency
Performance Optimization
Asynchronous Shuffle Data Commit
● Map task
○ Stream data to server
○ Does not wait for a response
● Server flushes (commits) data asynchronously
● Reduce task queries data availability when fetching data
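
The commit flow on this slide can be sketched with an in-memory stand-in for the server. All names are hypothetical, and `flush` is called explicitly here for clarity; a real server would presumably run it in the background and write to disk.

```python
import threading


class ShuffleServerSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}    # map_id -> bytes received but not yet flushed
        self._committed = {}  # map_id -> bytes that have been flushed

    def write(self, map_id: int, data: bytes) -> None:
        # Returns as soon as the bytes are queued; the map task does not
        # wait for a flush or a response.
        with self._lock:
            self._pending[map_id] = self._pending.get(map_id, b"") + data

    def flush(self) -> None:
        # Commit everything received so far.
        with self._lock:
            for map_id, data in self._pending.items():
                self._committed[map_id] = self._committed.get(map_id, b"") + data
            self._pending.clear()

    def is_available(self, map_ids) -> bool:
        # Reduce tasks call this before fetching.
        with self._lock:
            return all(map_id in self._committed for map_id in map_ids)


server = ShuffleServerSketch()
server.write(0, b"rows-from-map-0")
server.write(1, b"rows-from-map-1")
print(server.is_available([0, 1]))  # False: nothing flushed yet
server.flush()
print(server.is_available([0, 1]))  # True
```

The cost of waiting moves from every map task to the reduce side, which only checks availability at the moment it needs the data.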
Fault Tolerance
Shuffle Server Discovery/Health Check
● ZooKeeper as Server Registry
Data Replica
● Server Replication Group
● Duplicate Write in Parallel
● Read from Single Server, switch to another server on failure
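
The replication behaviour can be sketched with in-memory stubs. The classes and server names below are hypothetical; the sketch only shows the pattern of writing to every server in the group in parallel and reading from one server with failover.

```python
from concurrent.futures import ThreadPoolExecutor


class ReplicaStub:
    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, partition_id: int, block: bytes) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data.setdefault(partition_id, []).append(block)

    def read(self, partition_id: int) -> bytes:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return b"".join(self.data.get(partition_id, []))


def replicated_write(group, partition_id: int, block: bytes) -> None:
    # Send the same block to every server in the group in parallel.
    with ThreadPoolExecutor(max_workers=len(group)) as pool:
        futures = [pool.submit(s.write, partition_id, block) for s in group]
        for future in futures:
            future.result()  # surface any write failure


def read_with_failover(group, partition_id: int) -> bytes:
    last_error = None
    for server in group:  # read from one server, try the next on failure
        try:
            return server.read(partition_id)
        except ConnectionError as error:
            last_error = error
    raise RuntimeError("all replicas failed") from last_error


group = [ReplicaStub("server-a"), ReplicaStub("server-b")]
replicated_write(group, 0, b"block-1")
group[0].alive = False
print(read_with_failover(group, 0))  # b'block-1'
```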
Local State Flush
● Local state persistence in batch
○ Avoid flushing state for each map task
○ Flush when shuffle stage finishes
● Client does not wait for server-side state flush
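
Batched state persistence can be sketched as follows. The state file format (JSON) and all names are assumptions for illustration; the point is one disk write per stage rather than one per map task.

```python
import json
import os
import tempfile


class StageState:
    """Tracks finished map tasks in memory and persists them once per stage."""

    def __init__(self, state_path: str):
        self.state_path = state_path
        self.finished_maps = []
        self.flush_count = 0

    def map_task_finished(self, map_id: int) -> None:
        # Only an in-memory update; no disk write per map task.
        self.finished_maps.append(map_id)

    def stage_finished(self) -> None:
        # One write for the whole stage.
        with open(self.state_path, "w") as f:
            json.dump({"finished_maps": self.finished_maps}, f)
        self.flush_count += 1


with tempfile.TemporaryDirectory() as tmp:
    state = StageState(os.path.join(tmp, "stage-0.json"))
    for map_id in range(1000):
        state.map_task_finished(map_id)
    state.stage_finished()
    with open(state.state_path) as f:
        persisted = json.load(f)
print(state.flush_count, len(persisted["finished_maps"]))  # 1 1000
```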
Production Status
Compatible with Open Source Apache Spark
● Shuffle Manager Plugin
○ spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
● MapStatus / MapOutputTracker
○ Embed remote shuffle service related data inside MapStatus
○ Query MapOutputTracker to retrieve needed information
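
As a small illustration of how an application might opt in, the setting from this slide can be rendered as `--conf` flags for `spark-submit`. The helper name is hypothetical, and the plugin jar would also have to be on the application's class path, which is not shown here.

```python
def spark_submit_conf_args(conf: dict) -> list:
    # Render each setting as a `--conf key=value` pair for spark-submit.
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


rss_conf = {
    # Setting taken from the slide above.
    "spark.shuffle.manager": "org.apache.spark.shuffle.RssShuffleManager",
}
print(" ".join(["spark-submit", *spark_submit_conf_args(rss_conf)]))
```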
Metrics/Monitoring
● Uber’s open source M3 metrics library
● Important metrics
○ Network connections
○ File descriptors
○ Disk utilization
Test Strategy
● Unit Test
● Stress/Random Test
● Production Query Sampling
Remote Spark Shuffle Service - Production Status
● In production for the last 8+ months on YARN
● Thousands of applications running every day
● Job latencies are on par with the external shuffle service
● Open sourcing it soon!
Roadmap
● Support all Spark workloads including HiveOnSpark
● Multi-tenancy (quota)
● Load balancing
● Integrate with incoming Spark shuffle metadata APIs
Proprietary and confidential © 2020 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Thank you !!!
