Scaling Spark on Kubernetes
Li Gao
Introduction
2
#UnifiedAnalytics #SparkAISummit
Li Gao
Works in the Data Platform team at Lyft, currently leading
multiple Data Infra initiatives within Data Platform, including
the Spark on Kubernetes project.
Previously held tech leadership roles at Salesforce, Fitbit,
Groupon, and other startups.
Agenda
3
● Introduction to the Data Landscape at Lyft
● The challenges we face
● How Apache Spark on Kubernetes can help
● Remaining work
Data Landscape
4
● Batch data Ingestion and ETL
● Data Streaming
● ML platforms
● Notebooks and BI tools
● Query and Visualization
● Operational Analytics
● Data Discovery & Lineage
● ML workflow orchestration
● Cloud Platforms
Data Infrastructure @ Lyft
5
[Diagram labels: Business Analysts, Data Scientists, AML; Cloud Infra Services; External and Internal Products and Services; Data Portal (Discovery, WF, SLx dashboard, etc.); Data Infra]
Batch Compute @ Lyft
6
[Diagram labels: Events, Ext Data, RDB/KV, Sys Events; Ingest Pipelines; AWS S3; Batch Compute Clusters; AWS S3; HMS; Presto, Hive Client, and BI Tools; Analysts, Engineers, Scientists, Services]
Evolving Batch Architecture
7
● 2016-2017: Vendor-based Hadoop
● 2017-2018: Hive on MR, Vendor Presto
● Mid 2018: Hive on Tez + Spark Adhoc
● Late 2018: Spark on Vendor GA
● Early-Mid 2019: Spark on K8s Alpha
● Future: Spark on K8s Beta & Preprod
Initial Batch Architecture
8
Batch Compute Challenges
9
● Third-party vendor dependency limitations
● Data ETL expressed solely in SQL, sometimes in complex, hard-to-maintain SQL
● Complex logic expressed in Python that is hard to port to SQL
● Different dependencies and versions
● Resource load balancing for heterogeneous workloads
Is SQL the complete solution?
10
How can Spark help?
11
[Diagram labels: RDB/KV; Applications APIs; Environments; Data Sources and Data Sinks]
What challenges remain?
12
● Per-job custom dependencies and security context isolation
● Multi-version runtime requirements (Py3 vs. Py2, Spark versions)
● Still need to run on shared clusters for cost efficiency
● Mixed ML and ETL workloads
How can Kubernetes help?
13
[Diagram labels: Operators & Controllers; Pods; Ingress; Services; Namespaces; Pods; Immutability; Event driven & Declarative; Vibrant CNCF Community; Service Mesh; Multi-Tenancy Support; Image Registry]
CNCF Landscape
14
What challenges still remain?
● Spark on k8s is still in its early days
● Single cluster scaling limit
● CRD operator choking limit
● Cluster control plane rollout pain points
● Pod churn and IP allocations throttling
● Default k8s scheduler limit
● ECR container registry reliability
15
Current scale
16
● 10s of PB in the data lake
● O(100k) batch jobs running daily
● ~1000s of EC2 nodes spanning multiple clusters and AZs
● ~1000s of workflows running daily
How Lyft scales Spark on K8s
[Diagram labels: # of Clusters; # of Namespaces; # of Pods; Pod Churn Rate; # of Nodes; Pod Size; Job:Pod ratio; IP Alloc Rate Limit; ECR Rate Limit; Affinity & Isolation; QoS & Quota; Pod Scheduler]
The Evolving Architecture
18
A Start of a Spark Job @ k8s
19
[Diagram labels: Resource Labels; Job; Cluster Pool; Cluster; Namespace Group; Namespace; Spark CRD; K8s Pods]
● (1) and (2) Dispatcher Gateway
● (3) Cluster Controller
● (4) Job Scheduler
● (5) Namespace Group Controller
● (6) Spark Operator
● (7) K8s Pod Scheduler
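The numbered path above (resource labels, then cluster pool, cluster, namespace group, and namespace) can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Cluster` and `Namespace` types, the `pool` and `group` label keys, and the least-loaded selection rule are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    name: str
    active_pods: int = 0


@dataclass
class Cluster:
    name: str
    in_rotation: bool = True
    # Namespace groups keyed by a hypothetical group name such as "etl".
    namespace_groups: dict = field(default_factory=dict)


def route_job(labels, cluster_pools):
    """Resolve (cluster name, namespace name) for a job from its resource labels."""
    pool = cluster_pools[labels["pool"]]
    group = labels["group"]
    candidates = [c for c in pool if c.in_rotation and group in c.namespace_groups]
    if not candidates:
        raise LookupError("no cluster in rotation serves this namespace group")

    def load(cluster):
        return sum(ns.active_pods for ns in cluster.namespace_groups[group])

    # Assumed policy: least-loaded cluster first, then least-loaded namespace.
    cluster = min(candidates, key=load)
    namespace = min(cluster.namespace_groups[group], key=lambda ns: ns.active_pods)
    return cluster.name, namespace.name


pools = {
    "pool-a": [
        Cluster("c1", namespace_groups={"etl": [Namespace("etl-1", 1200), Namespace("etl-2", 300)]}),
        Cluster("c2", namespace_groups={"etl": [Namespace("etl-1", 100)]}),
        Cluster("c3", in_rotation=False, namespace_groups={"etl": [Namespace("etl-1", 0)]}),
    ]
}
print(route_job({"pool": "pool-a", "group": "etl"}, pools))  # ('c2', 'etl-1')
```

In the deck, the chosen namespace is where the Spark CRD is created for the Spark Operator to act on; that step is left out of the sketch.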
Multiple Clusters
20
HA in Cluster Pool
21
[Diagram labels: Cluster Pool A; Cluster 1, Cluster 2, Cluster 3, Cluster 4]
● Cluster rotation within a cluster pool
● Automated provisioning of a new cluster, which is then (manually) added into rotation
● Throttle at the lower bound while a rotation is in progress
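One possible reading of the throttling rule is sketched below: while any cluster in the pool is being rotated, the pool admits only a reduced number of concurrent jobs. The function name, the per-cluster capacities, and the `min_fraction` lower bound are assumptions made for illustration; the slides do not specify how the bound is computed.

```python
def pool_capacity(clusters, rotating, min_fraction=0.5):
    """Return how many concurrent jobs the pool should admit.

    clusters: mapping of cluster name -> max concurrent jobs (hypothetical numbers)
    rotating: set of cluster names currently being rotated out
    min_fraction: assumed lower bound, as a fraction of full pool capacity
    """
    full = sum(clusters.values())
    if not rotating:
        return full
    # While a rotation is in progress, admit only the lower bound,
    lower_bound = int(full * min_fraction)
    remaining = sum(cap for name, cap in clusters.items() if name not in rotating)
    # and never more than the clusters still in rotation can hold.
    return min(lower_bound, remaining)


clusters = {"c1": 100, "c2": 100, "c3": 100, "c4": 100}
print(pool_capacity(clusters, rotating=set()))    # 400
print(pool_capacity(clusters, rotating={"c1"}))   # 200
```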
Multiple Namespaces (Groups)
22
[Diagram labels: Namespace 1, Namespace 2, Namespace 3, each with three Pods; Node A, Node B, Node C, Node D; Role1, Role1, Role2; Max Pod Size 1, Max Pod Size 2]
● A practical limit of ~3K active pods per namespace was observed
● Less preemption is required when namespaces are isolated by quota
● Different namespaces can map to different IAM roles and sidecar configurations for security and auditing purposes
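Quota isolation per namespace is typically expressed with a Kubernetes `ResourceQuota` object. The sketch below builds such a manifest as a Python dictionary. The pod cap mirrors the ~3K figure above, while the namespace name and the CPU and memory values are placeholders, not Lyft's settings.

```python
import json


def namespace_quota(namespace, max_pods=3000, cpu="4000", memory="16Ti"):
    """Build a ResourceQuota manifest capping pods and resource requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "pods": str(max_pods),        # cap on pods in the namespace
                "requests.cpu": cpu,          # placeholder value
                "requests.memory": memory,    # placeholder value
            }
        },
    }


# "spark-etl-1" is a hypothetical namespace name.
print(json.dumps(namespace_quota("spark-etl-1"), indent=2))
```

The printed JSON could be applied with `kubectl apply -f -`; mapping IAM roles and sidecars per namespace would need separate configuration that is not shown here.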
Pod Sharing
23
[Diagram labels: Job Controller; Spark Driver Pod; Spark Exec Pods; Job 2 Driver Pod; Job 2 Exec Pods; Job 3 Driver Pod; Job 3 Exec Pods; Shared Pods; Job 1, Job 2, Job 3, Job 4; AWS S3; Dep; Dedicated & Isolated Pods]
Separate DML from DDL
24
DDL Separation to reduce churn
25
Pod Priority and Preemption (WIP)
26
● Priority-based preemption
● Driver pod has higher priority than executor pod
● Experimental
[Diagram: the K8s Scheduler receives a new pod request for E5. Before: D1 D2 E1 E2 E3 E4. After: D1 D2 E5 E2 E3 E4. Evicted: E1.]
https://github.com/kubernetes/kubernetes/issues/71486
https://github.com/kubernetes/enhancements/issues/564
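The before/after picture can be reproduced with a toy model of priority-based preemption. This is a deliberately simplified sketch: the real Kubernetes scheduler preempts per node and takes more factors into account, and the priority values and the single shared capacity used here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int


def schedule(running, new_pod, capacity):
    """Return (running_after, evicted); evicted is None if nothing was preempted."""
    if len(running) < capacity:
        return running + [new_pod], None
    victim = min(running, key=lambda p: p.priority)
    if victim.priority >= new_pod.priority:
        # Nothing of lower priority to preempt; the new pod stays pending.
        return running, None
    after = [new_pod if p is victim else p for p in running]
    return after, victim


# Assumed priorities: drivers 1000; E1 belongs to a lower-priority job than E2..E5.
running = [Pod("D1", 1000), Pod("D2", 1000), Pod("E1", 100),
           Pod("E2", 200), Pod("E3", 200), Pod("E4", 200)]
after, evicted = schedule(running, Pod("E5", 200), capacity=6)
print([p.name for p in after], evicted.name)
# ['D1', 'D2', 'E5', 'E2', 'E3', 'E4'] E1
```

Giving drivers the highest priority means that executors are evicted before any driver, which matches the second bullet above.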
Taints and Tolerations (WIP)
27
[Diagram labels: Node A through Node F; pods P1 through P10; Controllers and Watchers; Job 1; Job 2; Core Nodes (Taint); Worker Nodes (Taint)]
● Other considerations: Node Labels and Node Selectors to separate GPU-based and CPU-based workloads
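How a taint keeps Spark pods off core nodes can be illustrated with a small re-implementation of the toleration matching rules. The `node-role` key and its `core` and `worker` values are hypothetical; the matching logic follows the documented Kubernetes behavior for the `Equal` and `Exists` operators, and it is a sketch rather than the scheduler's actual code.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations, node_taints):
    """A pod fits a node only if every NoSchedule/NoExecute taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


# Hypothetical taints for the two node groups in the diagram.
core_taint = {"key": "node-role", "value": "core", "effect": "NoSchedule"}
worker_taint = {"key": "node-role", "value": "worker", "effect": "NoSchedule"}
spark_pod = [{"key": "node-role", "operator": "Equal", "value": "worker", "effect": "NoSchedule"}]

print(schedulable(spark_pod, [worker_taint]))  # True
print(schedulable(spark_pod, [core_taint]))    # False
```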
Mutating Admission Hooks
28
[Diagram: Pod Request → K8s API HTTP Handler → Authn & Authz → Mutating admission controllers (mutating admission webhooks) → Schema validation → Validating admission controllers (validating admission webhooks) → etcd → k8s pod scheduler → kubelet → Node → Spark Pod (sidecars, config)]
credit: https://banzaicloud.com/blog/k8s-admission-webhooks/
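The core of a mutating admission webhook is a handler that receives an `AdmissionReview` and answers with a base64-encoded JSON Patch. The sketch below shows only that handler logic, with no HTTP server or TLS setup. The sidecar name and image are hypothetical and this is not Lyft's injector.

```python
import base64
import json

# Hypothetical sidecar container to inject into Spark pods.
SIDECAR = {"name": "log-agent", "image": "example.invalid/log-agent:latest"}


def mutate(admission_review):
    """Build an AdmissionReview response that appends SIDECAR to the pod's containers."""
    request = admission_review["request"]
    containers = request["object"]["spec"].get("containers", [])
    patch = []
    # Skip injection if the sidecar is already present, so the hook is idempotent.
    if all(c["name"] != SIDECAR["name"] for c in containers):
        patch.append({"op": "add", "path": "/spec/containers/-", "value": SIDECAR})
    response = {"uid": request["uid"], "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


review = {"request": {"uid": "abc", "object": {"spec": {"containers": [{"name": "spark-kubernetes-driver"}]}}}}
result = mutate(review)
print(json.loads(base64.b64decode(result["response"]["patch"])))
```

Because mutation happens in the API server before the pod is persisted, job authors would not need to list sidecars or shared config in their own Spark job specs.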
Custom k8s Pod scheduler for batch (WIP)
[Diagram labels: Default k8s scheduler: All Active Nodes; Predicates; Priorities; Round Robin. Dynamic policy-driven k8s scheduler: All Active Nodes; Predicates; Weight Engine; Placement Engine; Policies]
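A policy-driven placement step could look roughly like the sketch below: predicates filter the nodes, a weight engine scores the survivors with weighted policies, and placement picks the best score. The policy functions, node fields, and weights are all assumptions made for illustration; the slides do not describe the actual engines.

```python
def bin_pack(pod, node):
    """Score higher for nodes left with less free CPU (tighter packing)."""
    return 1.0 - (node["free_cpu"] - pod["cpu"]) / node["total_cpu"]


def spread(pod, node):
    """Score higher for nodes left with more free CPU (looser packing)."""
    return (node["free_cpu"] - pod["cpu"]) / node["total_cpu"]


def place(pod, nodes, policies):
    """Return the chosen node name, or None if no node passes the predicates."""
    # Predicates: keep only nodes with enough free CPU and memory.
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None

    # Weight engine: weighted sum over the active policies.
    def score(node):
        return sum(weight * policy(pod, node) for policy, weight in policies)

    # Placement engine: pick the best-scoring node.
    return max(feasible, key=score)["name"]


nodes = [
    {"name": "node-a", "total_cpu": 16, "free_cpu": 4, "free_mem": 8},
    {"name": "node-b", "total_cpu": 16, "free_cpu": 12, "free_mem": 32},
]
pod = {"cpu": 2, "mem": 4}
print(place(pod, nodes, [(bin_pack, 1.0)]))  # node-a
print(place(pod, nodes, [(spread, 1.0)]))    # node-b
```

Swapping the policy list at runtime changes placement without changing the filtering step, which is one way to read "dynamic policy driven".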
What about ECR reliability?
30
[Diagram labels: Node 1, Node 2, Node 3, each running Pods; DaemonSet + Docker in Docker; ECR Container Images]
Spark Job Config Overlays (DML)
31
[Diagram: Cluster Pool Defaults, Cluster Defaults, Spark Job User Specified Config, and Cluster and Namespace Overrides are combined into the Final Spark Job Config. Other labels: Config Composer & Event Watcher; Spark Operator]
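The overlay idea can be sketched as a layered dictionary merge in which later layers win. The precedence order below follows the order in which the layers appear on the slide, which is an assumption, as are the `compose_config` name and the example values. The configuration keys themselves are standard Spark properties.

```python
def compose_config(pool_defaults, cluster_defaults, user_config, overrides):
    """Merge config layers; later layers take precedence over earlier ones."""
    final = {}
    for layer in (pool_defaults, cluster_defaults, user_config, overrides):
        final.update(layer)
    return final


final = compose_config(
    pool_defaults={"spark.executor.memory": "4g", "spark.executor.instances": "2"},
    cluster_defaults={"spark.kubernetes.container.image": "example.invalid/spark:2.4"},
    user_config={"spark.executor.memory": "8g"},
    overrides={"spark.kubernetes.namespace": "spark-etl-1"},
)
print(final["spark.executor.memory"])       # 8g
print(final["spark.kubernetes.namespace"])  # spark-etl-1
```

Applying cluster and namespace overrides last means a user-supplied value cannot redirect a job to a different namespace than the one chosen for it.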
X-Ray of Job Controller
32
Controllers & Watchers
• Job scheduler
• Spark job config composer
• Namespace group controller
• k8s pod scheduler
• Service controllers (STS, Jupyter)
• K8s metrics & events watchers
• Spark job/crd events & metrics watchers
33
X-Ray of Spark Operator
34
Monitoring and Logging Toolbox
35
JMX
Provision & Automation
36
Kustomize Template
K8S Deploy
Sidecar injectors
Secrets injectors
DaemonSets
KIAM
Remaining work
● More intelligent & efficient job routing, scheduler, and parameter composer
● End-to-end serverless, self-serviceable, and user-oriented data compute mesh
● Fine-grained cost attribution
● Improved Docker image distribution
● Spark 3.0 & Kubernetes v1.14+
37
Key Takeaways
● Apache Spark can help unify different batch data compute use cases
● Kubernetes can help solve the dependency and multi-version requirements using its containerized approach
● Spark on Kubernetes can scale significantly by using a multi-cluster compute mesh approach with proper resource isolation and scheduling techniques
● Challenges remain when running Spark on Kubernetes at scale
38
Community
39
This effort would not be possible without the help from the open source and wider communities:
Q&A
40
Li Gao in/ligao101
41
42
Monitoring Example - OOM Kill
43
What about dependencies?
44
[Diagram labels: RTree Libraries; Data Codecs; Spatial Libraries]
3rd Party Vendor Limitations
45
● Proprietary patches
● Inconsistent bootstrap
● Release schedule
● Homogeneous environments
What about Python functions?
46
“I want to express my processing logic in Python functions with external geo libraries (i.e. GeoMesa) and interact with Hive tables” --- Lyft data engineer

SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft


Editor's Notes

  • #5: Different users and use cases: ML, streaming, real-time, batch, notebooks; multiple cloud platforms
  • #14: Declarative, predictable, and repeatable; operators add extensibility; multi-tenancy; container native
  • #15: CNCF is a vibrant community and supports numerous projects
  • #16: Patch rollout/updates for the CRD and control plane are still evolving; pod churn affects etcd, resources, TTLs, and IP allocation in EC2, for example
  • #46: What does "homogeneous environments" mean here?