SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft

Scaling Spark on Kubernetes
Li Gao

Introduction
Li Gao
Works in the Data Platform team at Lyft, currently leading
multiple Data Infra initiatives within Data Platform, including
the Spark on Kubernetes project.
Previously held tech leadership roles at Salesforce, Fitbit,
Groupon, and other startups.

Agenda
● Introduction of Data Landscape at Lyft
● The challenges we face
● How Apache Spark on Kubernetes can help
● Remaining work

Data Landscape
● Batch data Ingestion and ETL
● Data Streaming
● ML platforms
● Notebooks and BI tools
● Query and Visualization
● Operational Analytics
● Data Discovery & Lineage
● ML workflow orchestration
● Cloud Platforms

Data Infrastructure @ Lyft
[Diagram labels: Business Analysts, Data Scientists, AML; External and Internal Products and Services; Data Portal (Discovery, WF, SLx dashboard, etc.); Data Infra; Cloud Infra Services]

Batch Compute @ Lyft
[Diagram labels: Events; Ext Data; RDB/KV; Sys Events; Ingest Pipelines; AWS S3; Batch Compute Clusters; HMS; Presto, Hive Client, and BI Tools; Analysts; Engineers; Scientists; Services]

Evolving Batch Architecture
● 2016-2017: Vendor-based Hadoop
● 2017-2018: Hive on MR, Vendor Presto
● Mid 2018: Hive on Tez + Spark Adhoc
● Late 2018: Spark on Vendor GA
● Early-Mid 2019: Spark on K8s Alpha
● Future: Spark on K8s Beta & Preprod

Batch Compute Challenges
● 3rd Party vendor dependency limitations
● Data ETL expressed solely in SQL, sometimes in complex, hard-to-maintain SQL
● Complex logic expressed in Python that is hard to adapt to SQL
● Different dependencies and versions
● Resource load balancing for heterogeneous workloads

Is SQL the complete solution?

How Spark can help?
[Diagram labels: RDB/KV; Applications APIs; Environments; Data Sources and Data Sinks]

What challenges remain?
● Per-job custom dependencies and security context isolation
● Multi-version runtime requirements (Py3 vs. Py2, Spark versions)
● Still need to run on shared clusters for cost efficiency
● Mixed ML and ETL workloads

How Kubernetes can help?
[Diagram labels: Operators & Controllers; Pods, Ingress, Services; Namespaces; Pods; Immutability; Event driven & Declarative; Vibrant CNCF Community; Service Mesh; Multi-Tenancy Support; Image Registry]

What challenges still remain?
● Spark on k8s is still in its early days
● Single cluster scaling limit
● CRD operator choking limit
● Cluster control plane rollout pain points
● Pod churn and IP allocation throttling
● Default k8s scheduler limit
● ECR container registry reliability

Current scale
● 10s of PB in the data lake
● O(100k) batch jobs running daily
● ~1000s of EC2 nodes spanning multiple clusters and AZs
● ~1000s of workflows running daily

How Lyft scales Spark on K8s
[Diagram labels: # of Clusters; # of Namespaces; # of Pods; Pod Churn Rate; # of Nodes; Pod Size; Job:Pod ratio; IP Alloc Rate Limit; ECR Rate Limit; Affinity & Isolation; QoS & Quota; Pod Scheduler]

A Start of a Spark Job @ k8s
[Diagram labels, with steps (1) to (7) marked: Resource Labels; Job; Cluster Pool; Cluster; Namespace Group; Namespace; Spark CRD; K8s Pods]
● (1) and (2) Dispatcher Gateway
● (3) Cluster Controller
● (4) Job Scheduler
● (5) Namespace Group Controller
● (6) Spark Operator
● (7) K8s Pod Scheduler

HA in Cluster Pool
[Diagram: Cluster Pool A containing Cluster 1, Cluster 2, and Cluster 3, plus Cluster 4]
● Cluster rotation within a cluster pool
● Automated provisioning of a new cluster, which is then (manually) added into rotation
● Throttle at the lower bound while a rotation is in progress
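
The rotation behavior on this slide can be sketched roughly as follows. This is a minimal sketch, not Lyft's implementation: the `Cluster` and `ClusterPool` classes, the per-cluster `max_jobs` figure, and the reading of "throttle at lower bound" as capping admissions to the capacity of the clusters still in rotation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    max_jobs: int             # hypothetical per-cluster job capacity
    in_rotation: bool = True
    running: int = 0


@dataclass
class ClusterPool:
    clusters: list = field(default_factory=list)

    def capacity(self) -> int:
        # Only clusters currently in rotation count toward capacity, so the
        # pool's admission limit drops while a cluster is rotated out.
        return sum(c.max_jobs for c in self.clusters if c.in_rotation)

    def admit(self, job_id: str):
        """Route a job to the least-loaded in-rotation cluster, or return None
        to signal that the job should be throttled (queued) for now."""
        candidates = [c for c in self.clusters
                      if c.in_rotation and c.running < c.max_jobs]
        if not candidates:
            return None
        target = min(candidates, key=lambda c: c.running / c.max_jobs)
        target.running += 1
        return target.name


pool = ClusterPool([Cluster("cluster-1", 2), Cluster("cluster-2", 2)])
pool.clusters[0].in_rotation = False       # cluster-1 is being rotated out
placements = [pool.admit(f"job-{i}") for i in range(3)]
```

With one of two clusters out of rotation, the third job is not placed and would wait until capacity returns.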

Multiple Namespaces (Groups)
[Diagram: Namespace 1, Namespace 2, and Namespace 3, each holding several Pods, placed across Node A, Node B, Node C, and Node D; labels Role1, Role1, Role2; Max Pod Size 1, Max Pod Size 2]
● In practice, ~3K active pods per namespace observed
● Less preemption required when namespaces are isolated by quota
● Different namespaces can map to different IAM roles and sidecar configurations for security and auditing purposes
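
As an illustration of the points above, a namespace group controller might pick a namespace along these lines. This is a sketch under assumptions: the `Namespace` class, the `pick_namespace` function, the role names, and the use of the ~3K figure as a hard cap are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional

# The ~3K per-namespace figure from the slide, treated here as a hard cap (assumption).
MAX_ACTIVE_PODS = 3000


@dataclass
class Namespace:
    name: str
    iam_role: str       # hypothetical role name mapped to this namespace
    active_pods: int


def pick_namespace(group: list, iam_role: str, pods_needed: int) -> Optional[Namespace]:
    """Return the least-loaded namespace with the required role and enough headroom."""
    eligible = [
        ns for ns in group
        if ns.iam_role == iam_role
        and ns.active_pods + pods_needed <= MAX_ACTIVE_PODS
    ]
    return min(eligible, key=lambda ns: ns.active_pods, default=None)


group = [
    Namespace("ns-1", "role1", 2950),
    Namespace("ns-2", "role1", 1200),
    Namespace("ns-3", "role2", 100),
]
chosen = pick_namespace(group, "role1", pods_needed=101)
```

Here `ns-1` is skipped because the job would push it past the cap, and `ns-3` is skipped because it maps to a different role.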

Pod Sharing
[Diagram labels: Job Controller; Spark Driver Pod; Spark Exec Pods; Job 2 Driver Pod; Job 2 Exec Pods; Job 3 Driver Pod; Job 3 Exec Pods; Shared Pods; Job 1; Job 2; Job 3; Job 4; AWS S3; Dep; Dedicated & Isolated Pods]

DDL Separation to reduce churn

Pod Priority and Preemption (WIP)
● Priority-based preemption
● Driver pod has higher priority than executor pod
● Experimental
[Diagram: the K8s Scheduler receives a new pod request (E5). Before: D1 D2 E1 E2 E3 E4. After: D1 D2 E5 E2 E3 E4, with E1 evicted]
https://github.com/kubernetes/kubernetes/issues/71486
https://github.com/kubernetes/enhancements/issues/564
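
The idea of priority-based preemption can be illustrated with a deliberately simplified model. This is not the Kubernetes scheduler's algorithm: it assumes a single node with a fixed number of pod slots, and the `Pod` class, the `schedule` function, and the priority values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int


DRIVER, EXECUTOR = 100, 10   # hypothetical priority values: drivers outrank executors


def schedule(running: list, capacity: int, new_pod: Pod):
    """Place new_pod; if the node is full, evict the lowest-priority pod that
    ranks strictly below new_pod. Returns (running_after, evicted_or_None)."""
    if len(running) < capacity:
        return running + [new_pod], None
    victim = min(running, key=lambda p: p.priority)
    if victim.priority >= new_pod.priority:
        return running, None          # nothing can be preempted; new_pod stays pending
    remaining = [p for p in running if p is not victim]
    return remaining + [new_pod], victim


running = [Pod("D1", DRIVER), Pod("E1", EXECUTOR), Pod("E2", EXECUTOR)]
after, evicted = schedule(running, capacity=3, new_pod=Pod("D2", DRIVER))
```

A new driver evicts an executor when the node is full, while a new executor arriving at a full node of equal or higher priority pods stays pending.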

Taints and Tolerations (WIP)
[Diagram: Node A to Node F running pods P1 to P10; Core Nodes (Taint) host Controllers and Watchers; Worker Nodes (Taint) host Job 1 and Job 2]
● Other considerations: Node Labels and Node Selectors to separate GPU- and CPU-based workloads
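
The core/worker separation relies on the rule that a pod may land on a tainted node only if it tolerates the taint. The sketch below models that matching rule in simplified form (it ignores the empty-key wildcard and the `NoExecute` and `PreferNoSchedule` effects). The taint key `node-role` and its values are assumptions for illustration, not names from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key != taint.key:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def can_schedule(tolerations, node_taints) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
        if taint.effect == "NoSchedule"
    )


core_node = [Taint("node-role", "core")]        # hypothetical taint on core nodes
worker_node = [Taint("node-role", "worker")]    # hypothetical taint on worker nodes
executor_tolerations = [Toleration("node-role", value="worker")]
```

With these values, a Spark executor pod can be placed on worker nodes but is kept off the core nodes that run controllers and watchers.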

Mutating Admission Hooks
[Diagram: Pod Request → K8S API HTTP Handler → Authn & Authz → Mutating admission controllers (calling mutating admission webhooks) → Schema validation → Validating admission controllers (calling validating admission webhooks) → ETCD; the k8s pod scheduler and kubelet then start the Spark Pod on a Node; the mutated Spark Pod carries sidecars and config]
Credit: https://banzaicloud.com/blog/k8s-admission-webhooks/
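
For readers unfamiliar with mutating webhooks, the sketch below shows the shape of the response such a webhook returns to the API server: an `AdmissionReview` whose `patch` field is a base64-encoded JSON Patch. The sidecar name and image are hypothetical, and the function only builds the response body (no HTTP server). The `admission.k8s.io/v1` API version is used here; older clusters use `admission.k8s.io/v1beta1`.

```python
import base64
import json


def mutate(admission_review: dict) -> dict:
    """Build an AdmissionReview response that appends a sidecar container."""
    request = admission_review["request"]
    sidecar = {
        "name": "log-forwarder",                             # hypothetical sidecar
        "image": "registry.example.com/log-forwarder:1.0",   # hypothetical image
    }
    # JSON Patch (RFC 6902): "/-" appends to the existing containers array.
    patch = [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],          # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


review = {"request": {"uid": "abc-123", "object": {"kind": "Pod"}}}
response = mutate(review)
```

Because the patch is applied before the object is persisted, the Spark job itself does not need to know about the injected sidecar.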

Custom k8s Pod scheduler for batch (WIP)
[Diagram: Default k8s scheduler: All Active Nodes → Predicates → Priorities → Round Robin. Dynamic Policy Driven k8s scheduler: All Active Nodes → Predicates → Weight Engine → Placement Engine, driven by Policies]
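
The filter-then-weigh-then-place structure of a policy-driven scheduler can be sketched as below. This is only an illustration of the general pattern: the `Node` fields, the policy keys `cpu_weight` and `mem_weight`, and the bin-packing score are assumptions, not the design presented in the talk.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem_gb: float
    labels: dict


def feasible(node: Node, cpu: float, mem_gb: float, selector: dict) -> bool:
    # Predicate stage: hard constraints that a node must satisfy.
    fits = node.free_cpu >= cpu and node.free_mem_gb >= mem_gb
    return fits and all(node.labels.get(k) == v for k, v in selector.items())


def score(node: Node, cpu: float, mem_gb: float, policy: dict) -> float:
    # Weight stage: the policy decides how much leftover CPU vs. memory matters.
    # The score is the negated weighted leftover, so tighter packing scores higher.
    leftover_cpu = node.free_cpu - cpu
    leftover_mem = node.free_mem_gb - mem_gb
    return -(policy["cpu_weight"] * leftover_cpu + policy["mem_weight"] * leftover_mem)


def place(nodes, cpu, mem_gb, selector, policy):
    # Placement stage: pick the best-scoring feasible node, or None if nothing fits.
    candidates = [n for n in nodes if feasible(n, cpu, mem_gb, selector)]
    return max(candidates, key=lambda n: score(n, cpu, mem_gb, policy), default=None)


nodes = [
    Node("node-a", 16, 64, {"pool": "batch"}),
    Node("node-b", 4, 16, {"pool": "batch"}),
    Node("node-c", 32, 128, {"pool": "core"}),
]
policy = {"cpu_weight": 1.0, "mem_weight": 0.25}   # hypothetical policy
chosen = place(nodes, cpu=2, mem_gb=8, selector={"pool": "batch"}, policy=policy)
```

Swapping the policy dictionary changes placement behavior without touching the predicate stage, which is the appeal of keeping policies separate from the engines.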

What about ECR reliability?
[Diagram: Node 1, Node 2, and Node 3, each running Pods; DaemonSet + Docker In Docker; ECR Container Images]

Spark Job Config Overlays (DML)
[Diagram: Cluster Pool Defaults, Cluster Defaults, Spark Job User Specified Config, and Cluster and Namespace Overrides feed the Config Composer & Event Watcher, which produces the Final Spark Job Config for the Spark Operator]
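
Layered config composition of this kind can be sketched as a left-to-right merge. The precedence order (pool defaults, then cluster defaults, then user config, then cluster and namespace overrides last) is an assumption based on the order the layers are listed on the slide; the `compose` function and all values are hypothetical, while the keys are standard Spark properties.

```python
def compose(*layers: dict) -> dict:
    """Merge config layers left to right; later layers override earlier ones."""
    final = {}
    for layer in layers:
        final.update(layer)
    return final


# The keys are real Spark properties; the values and layer contents are made up.
cluster_pool_defaults = {"spark.executor.memory": "4g", "spark.executor.instances": "2"}
cluster_defaults = {"spark.kubernetes.container.image": "registry.example.com/spark:2.4"}
user_config = {"spark.executor.instances": "50", "spark.executor.memory": "8g"}
namespace_overrides = {"spark.kubernetes.namespace": "ns-1", "spark.executor.instances": "20"}

final_config = compose(
    cluster_pool_defaults, cluster_defaults, user_config, namespace_overrides
)
```

Placing the overrides last lets the platform cap a user's request (here, executor instances) while still honoring settings the platform does not constrain (here, executor memory).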

Controllers & Watchers
• Job scheduler
• Spark job config composer
• Namespace group controller
• k8s pod scheduler
• Service controllers (STS, Jupyter)
• K8s metrics & events watchers
• Spark job/crd events & metrics watchers

Monitoring and Logging Toolbox
JMX

Provision & Automation
Kustomize Template
K8S Deploy
Sidecar injectors
Secrets injectors
DaemonSets
KIAM

Remaining work
● More intelligent & efficient job routing, scheduler and
parameter composer
● End-to-End serverless, self-serviceable, and user-oriented
data compute mesh
● Fine grained cost attribution
● Improved docker image distribution
● Spark 3.0 & Kubernetes v1.14+

Key Takeaways
● Apache Spark can help unify different batch data compute
use cases
● Kubernetes can help solve the dependency and multi-version
requirements using its containerized approach
● Spark on Kubernetes can scale significantly by using a
multi-cluster compute mesh approach with proper resource
isolation and scheduling techniques
● Challenges remain when running Spark on Kubernetes at
scale

Community
This effort would not be possible
without the help from the open
source and wider communities:

Monitoring Example - OOM Kill

What about dependencies?
[Diagram labels: RTree Libraries; Data Codecs; Spatial Libraries]

3rd Party Vendor Limitations
● Proprietary patches
● Inconsistent bootstrap
● Release schedule
● Homogeneous environments

What about Python functions?
“I want to express my processing logic in python functions with
external geo libraries (i.e. Geomesa) and interact with Hive
tables” --- Lyft data engineer
