Migrating to Spark at Netflix
Ryan Blue
Spark Summit 2019
Spark at Netflix
● ETL was mostly written in Pig, with some in Hive
● Pipelines required data engineering
● Data engineers had to understand the processing engine
Long ago . . .
Today: job executions (chart)
Today: cluster runtime (chart)
Today: S3 bytes read and written (chart)
● Spark is > 90% of job executions – high tens-of-thousands daily
● Data platform is easier to use and more efficient
● Customers from all parts of the business
Today
How did we get there?
● High-profile Spark features: DataFrames, codegen, etc.
● S3 optimizations and committers
● Parquet filtering, tuning, and compression
● Notebook environment
Not included
Spark deployments
● Rebase
○ Pull in a new version
○ Easy to get new features
○ Easy to break things
● Backport
○ Pick only what’s needed
○ Time consuming
○ Safe?
Following upstream Spark
● Maintain supported versions in parallel using backports
● Periodic rebase to add new minor versions: 1.6, 2.0, 2.1, 2.3
● Recommend version based on actual use and experience
● Requires patching job submission
Netflix: Parallel branches
● Easily test another branch before investing migration time
● Avoids coordinating versions across major applications
● Fast iteration: deploy changes several times per week
Benefits of parallel branches
● Unstable branches
● Nightly canaries for stable and unstable
● CI runs unit tests for unstable
● Integration tests validate every deployment
Testing
● 1.6 – scale problems
● 2.0 – a little too unpolished
● 2.1 – solid, with some additional love
● 2.3 – slow migration, faster in some cases
Supported versions
Challenges
● 1.6 is unstable above 500 executors
○ Use of the Actor model caused coarse locking
○ RPC dependencies make lock issues worse
○ Runaway retry storms
● Spark needs distributed tracing
Stability
● Much better in 2.1, plus patches
○ Remove block status data from heartbeats (SPARK-20084)
○ Multi-threaded listener bus (SPARK-18838)
○ Unstable executor requests (SPARK-20540)
● 2.1 and 2.3 still have problems with 100,000+ tasks
○ Applications hang after shutdown
○ Increase job maxPartitionBytes or coalesce
Stability
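The "increase maxPartitionBytes or coalesce" advice can be sanity-checked with quick arithmetic: Spark creates roughly one scan task per maxPartitionBytes of input, so raising the setting directly shrinks the task count. A minimal sketch, with hypothetical input sizes:

```python
def estimate_scan_tasks(input_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Approximate number of map tasks for a file scan:
    one task per maxPartitionBytes of input (ceiling division)."""
    return -(-input_bytes // max_partition_bytes)

# A 50 TiB input at the 128 MB default is ~400k tasks -- well past the
# point where the driver struggles; quadrupling maxPartitionBytes cuts
# the count to a quarter.
tasks_default = estimate_scan_tasks(50 * 1024**4)
tasks_tuned = estimate_scan_tasks(50 * 1024**4, 512 * 1024 * 1024)
print(tasks_default, tasks_tuned)  # 409600 102400
```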
● Happen all the time at scale
● Scale in several dimensions
○ Large clusters, lots of disks to fail
○ High tens-of-thousands of executions
○ Many executors, many tasks, diverse workloads
Unlikely problems
● Fix CommitCoordinator and OutputCommitter problems
● Turn off YARN preemption in production
● Use cgroups to contain greedy apps
● Use general-purpose features
○ Blacklisting to avoid cascading failure
○ Speculative execution to tolerate slow nodes
○ Adaptive execution reduces risk
Unlikely problems
● Fix persistent OOM causes
○ Use less driver memory for broadcast joins (SPARK-22170)
○ Add PySpark memory region and limits (SPARK-25004)
○ Base stats on row count, not size on disk
Memory management
● Educate users about memory regions
○ Spark memory vs JVM memory vs overhead
○ Know what region fixes your problem (e.g., spilling)
○ Never set spark.executor.memory without also setting spark.memory.fraction
Memory management
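The "know your regions" point can be made concrete with the sizing rule from the Spark docs: the unified execution-plus-storage region is spark.memory.fraction of the heap after a reserved chunk (300 MB in Spark 2.x; 0.6 default fraction). A sketch of that arithmetic:

```python
# Sketch of Spark 2.x unified memory sizing; constants are the documented
# defaults, not Netflix-specific values.
RESERVED_BYTES = 300 * 1024 * 1024

def unified_memory_bytes(executor_memory_bytes, memory_fraction=0.6):
    """Bytes available for Spark execution + storage inside the JVM heap."""
    return int((executor_memory_bytes - RESERVED_BYTES) * memory_fraction)

# Raising spark.executor.memory alone grows the Spark region and the
# user-code region together; only spark.memory.fraction shifts the
# balance between them -- hence the rule above.
heap = 8 * 1024**3  # 8 GiB executor heap
print(unified_memory_bytes(heap))       # ~4.6 GiB for Spark
print(unified_memory_bytes(heap, 0.4))  # shift more heap to user code
```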
Best practices
● Avoid RDDs
○ Kryo problems plagued 1.6 apps
○ Let the optimizer improve jobs over time
● Aggressively broadcast
○ Remove the broadcast timeout
○ Set broadcast threshold much higher
Basics
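"Aggressively broadcast" maps onto two standard Spark SQL settings. The values below are illustrative examples, not Netflix's production numbers; the keys are the stock config names:

```python
# Hypothetical session settings illustrating "aggressively broadcast".
broadcast_conf = {
    # default is 10 MB; raising it makes many more joins broadcast joins
    "spark.sql.autoBroadcastJoinThreshold": str(100 * 1024 * 1024),
    # default is 300 s; a very large value effectively removes the timeout
    "spark.sql.broadcastTimeout": "36000",
}
# Apply each with SparkSession.builder.config(key, value) in PySpark.
print(broadcast_conf["spark.sql.autoBroadcastJoinThreshold"])  # 104857600
```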
● 3 rules:
○ Don’t copy configuration
○ If you don’t know what it does, don’t change it
○ Never change timeouts
● Document defaults and recommendations
Configuration
● Know how to control parallelism
○ spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes
○ repartition vs coalesce
● Use the least-intrusive option
○ Set shuffle parallelism high and use adaptive execution
○ Allow Spark to improve
Parallelism
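The repartition-vs-coalesce distinction can be shown with a toy model (illustrative only; Spark's real repartition shuffles data across the cluster): coalesce merges adjacent partitions cheaply but inherits their skew, while repartition rebalances at the cost of a full shuffle.

```python
def coalesce(partitions, n):
    """Merge adjacent partitions down to n without redistributing rows --
    cheap, but can leave the merged partitions unbalanced."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(rows, n):
    """Redistribute all rows evenly -- balanced, but needs a full shuffle."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print([len(p) for p in coalesce(skewed, 2)])        # [7, 2] -- still skewed
rows = [r for p in skewed for r in p]
print([len(p) for p in repartition(rows, 2)])       # [5, 4] -- balanced
```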
● Keep tasks in low tens-of-thousands
○ Too many tasks and the driver can’t handle heartbeats
○ Jobs hang for 10+ minutes after shutdown
● Reduce pressure on shuffle service
○ map tasks * reduce tasks = shuffle shards
Avoid wide stages
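The slide's arithmetic is worth spelling out: every map task writes one shuffle block per reduce task, so shard count is the product of the two. The numbers below are hypothetical:

```python
def shuffle_shards(map_tasks, reduce_tasks):
    """One shuffle block per (map task, reduce task) pair."""
    return map_tasks * reduce_tasks

# A wide stage with 50k maps and 50k reduces produces 2.5 billion shards;
# keeping both sides in the low tens of thousands keeps the shuffle
# service's load manageable.
print(shuffle_shards(50_000, 50_000))  # 2500000000
print(shuffle_shards(10_000, 5_000))   # 50000000
```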
● Fixed --num-executors accidents (SPARK-13723)
● Use materialize instead of caching
○ Materialize: convert to RDD, back to DF, and count
○ Stores cache data in shuffle servers
○ Also avoids over-optimization
Dynamic Allocation
● Add ORDER BY
○ Partition columns, filter columns, and one high cardinality column
● Benefits
○ Cluster by partition columns – minimize output files
○ Cluster by common filter columns – faster reads
○ Automatic skew estimation – faster writes (wall time)
● Needs adaptive execution support
Sort before writing
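A plain-Python sketch of why the suggested sort helps: ordering rows by (partition column, filter column, high-cardinality column) clusters each output partition's rows into contiguous runs, so each writer touches few files. Column names here are hypothetical:

```python
rows = [
    {"dateint": 20190425, "region": "us", "event_id": 7},
    {"dateint": 20190424, "region": "eu", "event_id": 3},
    {"dateint": 20190425, "region": "eu", "event_id": 1},
    {"dateint": 20190424, "region": "us", "event_id": 9},
]
# Sort key: partition column, then filter column, then a high-cardinality
# column to break ties and spread skew across tasks.
rows.sort(key=lambda r: (r["dateint"], r["region"], r["event_id"]))

# After the sort, a writer sees each (dateint, region) partition as one
# contiguous run instead of opening a file per partition it happens upon.
print([(r["dateint"], r["region"]) for r in rows])
# [(20190424, 'eu'), (20190424, 'us'), (20190425, 'eu'), (20190425, 'us')]
```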
Current problems
● Easy to overload one node
○ Skewed data, not enough threads, GC
● Prevents graceful shrink
● Causes huge runtime variance
Shuffle service
● Collect is wasteful
○ Iterate through compressed result blocks to collect
● Configuration is confusing
○ Memory fraction is often ignored
○ Simpler is better
● Should build broadcast tables on executors
Memory management
● Forked the write path for 2.x releases
○ Consistent rules across “datasource” and Hive tables
○ Remove unsafe operations, like implicit unsafe casts
○ Dynamic partition overwrites and Netflix “batch” pattern
● Fix upstream behavior and consistency with DSv2
● Fix table usability with Iceberg
○ Schema evolution and partitioning
DataSourceV2
Thank you!
Questions?
