Tracing the breadcrumbs:
Spark workload diagnostics
Kris Mok @rednaxelafx
Cheng Lian @liancheng
About us
Kris Mok
▪ Sr. software engineer at Databricks
▪ Worked on OpenJDK HotSpot VM & Zing VM implementations
Cheng Lian
▪ Sr. software engineer at Databricks
▪ Apache Spark PMC member
▪ Apache Parquet committer
Unified data analytics platform for accelerating innovation across
data science, data engineering, and business analytics
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners
Distributed applications are hard
▪ Have you ever hit the following categories of issues in your distributed
Spark applications?
▪ Mysterious performance regression
▪ Mysterious job hang
Distributed applications are hard
▪ Distributed applications are inherently hard to
▪ Develop
▪ Tune
▪ Diagnose
▪ Due to
▪ Longer iteration cycle
▪ Variety of input data (volume, quality, distribution, etc.)
▪ Broader range of infra/external dependencies (network, cloud services, etc.)
▪ Fractured crime scenes (incomplete and scattered logs)
What this talk is about
▪ Demonstrate tools and methodologies for diagnosing distributed Spark applications, with real-world use cases.
Performance Regression
Symptom
A query runs slower in a newer version of Spark than in an older version.
Check the Spark SQL query plan (one way to diff plans across versions is sketched after this list):
▪ If the query plan has changed
▪ the regression likely comes from a Catalyst optimizer change
▪ If the query plan is the same
▪ the regression could be coming from various sources
▪ More time spent in query optimization / compilation?
▪ More time spent in scheduling?
▪ More time spent in network operations?
▪ More time spent on task execution?
▪ More time spent on GC?
▪ ...
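For illustration, one way to compare plans across versions is to capture the optimized plan text on each side and diff the two files (a minimal Scala sketch for spark-shell; tpcdsQ67 is a placeholder for the query under test):

// run once per Spark version, then diff the resulting files
val plan = spark.sql(tpcdsQ67).queryExecution.optimizedPlan.treeString
java.nio.file.Files.write(java.nio.file.Paths.get("/tmp/plan.txt"), plan.getBytes)

spark.sql(tpcdsQ67).explain(true) similarly prints the parsed, analyzed, optimized, and physical plans.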
Tools and methodologies demonstrated
▪ Build a benchmark to reproduce the performance regression
▪ Gather performance data using a profiler
▪ Conventional JVM profilers (JProfiler, YourKit, jvisualvm, etc.)
▪ Java Flight Recorder
▪ async-profiler and Flame Graph (example invocation after this list)
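As an illustration, attaching async-profiler to a running executor JVM to collect a CPU flame graph might look like this (a hedged sketch; flags and output formats vary across async-profiler versions, and <executor-pid> is a placeholder):

./profiler.sh -e cpu -d 60 -f /tmp/executor-flamegraph.svg <executor-pid>

Java Flight Recorder can be enabled via JVM options on the executors, e.g. through spark.executor.extraJavaOptions with -XX:+FlightRecorder and -XX:StartFlightRecording=... (exact flags depend on the JDK version).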
Case study
▪ A performance regression in Spark 2.4 development
Symptom
Databricks’ Spark Benchmarking team’s performance sign-off for DBR 5.0-beta found a significant performance regression vs. DBR 4.3.
Multiple TPC-DS queries were slower, e.g. q67.
FlameGraph: DBR 4.3 (Spark 2.3-based)
FlameGraph: DBR 5.0-beta (Spark 2.4-SNAPSHOT-based)
Zoom in on the difference in hot spot
DBR 4.3
DBR 5.0-beta
Zoom in on the difference in hot spot
DBR 4.3: hot loop calling a monomorphic function; also avoids an extra buffer copy
DBR 5.0-beta: hot loop calling a polymorphic function; incurs an extra buffer copy
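A hedged Scala sketch of the general effect (not the actual Spark 2.4 code): when only one implementation ever reaches the call site in the hot loop, the JIT can devirtualize and inline it; once several implementations flow through the same call site, the call stays virtual and related optimizations (such as eliminating an intermediate buffer copy) may no longer apply.

trait ColumnReader { def read(i: Int): Long }
final class LongColumnReader(values: Array[Long]) extends ColumnReader {
  def read(i: Int): Long = values(i)
}

// If only LongColumnReader is ever passed in, the call site is monomorphic and easily inlined.
// If several ColumnReader implementations are passed in, the call site becomes polymorphic.
def sum(reader: ColumnReader, n: Int): Long = {
  var s = 0L
  var i = 0
  while (i < n) { s += reader.read(i); i += 1 } // hot loop
  s
}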
Job hangs
Symptoms
A Spark job hangs.
Tools and methodologies demonstrated
▪ Thread dump from Spark UI
▪ Network debugging
▪ JVM debugging
▪ Log exploration and visualization
Case study 1
▪ Shuffle fetch from a dead executor causes the entire cluster to hang
The customer found a lot of exceptions in the Spark logs:
org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply from null in 120 seconds.
“During about the same time (16:35 - 16:45) when the exceptions happened, the entire
cluster hung (extremely low cluster usage).”
Early triaging questions
▪ Did anything special happen before the cluster hang?
▪ That might be the cause of the hang.
▪ Was anything happening while the cluster hung?
▪ Was it completely silent, or
▪ was it busy doing something?
Tools
▪ Spark History Server
▪ Executor, job, stage, and task events visualization
▪ Spark logs in Delta Lake
▪ Within Databricks, with the consent of our customers, we ETL Spark logs into
Delta tables for fast exploration
▪ Interactive notebook environment for log exploration and visualization (sketched after this list)
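A minimal sketch of what this looks like in a notebook, assuming a hypothetical Delta table of parsed log lines with ts, host, role, level, and message columns (illustrative names, not an actual Databricks schema):

val logs = spark.read.format("delta").load("/mnt/logs/spark-cluster") // hypothetical path
logs.createOrReplaceTempView("spark_logs")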
Spark History Server - Historical Spark UI
▪ Executor 29 was removed around 16:36, right before the cluster hung
▪ Executor 103 was added around 16:43, right before the cluster became active again
Executor and job events
Cluster hanging
Spark logs exploration and visualization
▪ Checking per-minute driver-side log message counts (query sketched below)
▪ The driver turned out to be quiet during the cluster hang
Checking driver activities
Cluster hanging
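For instance, the per-minute driver-side counts can come from a simple aggregation over the hypothetical spark_logs view introduced above:

spark.sql("""
  SELECT date_trunc('minute', ts) AS minute, count(*) AS messages
  FROM spark_logs
  WHERE role = 'driver'
  GROUP BY 1
  ORDER BY 1
""").show()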
Spark logs exploration and visualization
▪ Checking per-minute executor-side log message counts
▪ Mild executor-side activity during the cluster hang
Checking executor activities
Cluster hanging
Spark logs exploration and visualization
▪ Incrementally filter out logs of various retry and timeout events (filter sketched below)
▪ Empty result set, so no new tasks scheduled during this time
▪ But already scheduled tasks could be running quietly
Zoom into the cluster hang period
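A sketch of the incremental filtering against the same hypothetical spark_logs view (the hang window and message patterns are illustrative placeholders):

spark.sql("""
  SELECT ts, host, message
  FROM spark_logs
  WHERE ts BETWEEN '<hang start>' AND '<hang end>'
    AND message NOT LIKE '%Retrying%'
    AND message NOT LIKE '%TimeoutException%'
""").show(truncate = false)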
Spark logs exploration and visualization
▪ Checking per-minute executor-side log message counts
▪ A clear pattern repeated 3 times, every 120 seconds
▪ Turned out to be executor-side shuffle connection timeout events
Zoom into the cluster hang period
Conclusion
▪ The behavior is expected
▪ Later investigation revealed that
▪ A large stage consisting of 2,000 tasks happened to occupy all CPU cores
▪ These tasks were waiting for shuffle map output from the failed executor until timeout
▪ So the cluster appeared to be “hanging”
▪ After executor 29 was lost, other active executors retried connecting to it 3 times, once every 120 seconds, which conforms to the default values of the following two Spark configurations (see the sketch after this list):
▪ spark.shuffle.io.connectionTimeout
▪ spark.shuffle.io.maxRetries
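For reference, these two settings could be adjusted when submitting the job (a sketch; the values shown are the defaults, with connectionTimeout falling back to spark.network.timeout):

spark-submit ... \
  --conf spark.shuffle.io.maxRetries=3 \
  --conf spark.shuffle.io.connectionTimeout=120s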
Case study 2
A customer reported that a Spark SQL query had been stuck for multiple hours, and gave us permission to do live debugging.
Through the Spark UI, we determined that the query was almost done, and the only tasks still running were all in the final stage, writing out results.
Get thread dump from executor via Spark UI
...
Relevant threads’ stack traces
Obviously stuck on a socket read
Find executor thread via thread name
Are there any zombie connections?
Run command: netstat -o
tcp6 0 0 <src_ip>:36332 <dest_ip>:https ESTABLISHED off (0.00/0/0)
Is the thread related to the zombie connection?
Introspect Java stack and objects via CLHSDB
(Command-Line HotSpot Debugger)
java -cp .:$JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB
CLHSDB session example
Inspect threads list (can also use jstack or pstack in CLHSDB)
hsdb> jseval "jvm.threads"
{Thread (address=0x00007f06648fae08, name=Attach Listener),
Thread (address=0x00007f065fd538b8, name=pool-339-thread-7),
Thread (address=0x00007f065fcb9cc8, name=pool-339-thread-6),
Thread (address=0x00007efc9249fef0, name=pool-339-thread-5),
Thread (address=0x00007efc9249e8b8, name=pool-339-thread-4),
...}
CLHSDB session example
Inspect stack frames of a given thread
hsdb> jseval "jvm.threads[4].frames"
{Frame
(method=java.net.SocketInputStream.socketRead0(java.io.FileDescriptor,
byte[], int, int, int), bci=0, line=0),
Frame
(method=java.net.SocketInputStream.socketRead(java.io.FileDescriptor,
byte[], int, int, int), bci=8, line=116),
...}
CLHSDB session example
Inspect the receiver object of a specific frame
hsdb> jseval
"sa.threads.first().next().next().next().next().getLastJavaVFrameDbg().javaSender().java
Sender().locals.get(0).print()"
<0x00007f06648fb090>
hsdb> inspect 0x00007f06648fb090
instance of Oop for java/net/SocketInputStream @ 0x00007f06648fb090 (size = 88)
_mark: 29
_metadata._klass: InstanceKlass for java/net/SocketInputStream
fd: Oop for java/io/FileDescriptor @ 0x00007f06648fb068 Oop for java/io/FileDescriptor
path: null
channel: null
closeLock: Oop for java/lang/Object @ 0x00007f0668176858
closed: false
eof: false
impl: Oop for java/net/SocksSocketImpl @ 0x00007f0666569b08
...
CLHSDB session example
Inspect the SocksSocketImpl object that we care about
hsdb> inspect 0x00007f0666569b08
instance of Oop for java/net/SocksSocketImpl @ 0x00007f0666569b08 (size = 168)
_mark: 139618986514461
_metadata._klass: InstanceKlass for java/net/SocksSocketImpl
socket: Oop for sun/security/ssl/SSLSocketImpl @ 0x00007f06648fb0e8
serverSocket: null
fd: Oop for java/io/FileDescriptor @ 0x00007f06648fb068
address: Oop for java/net/Inet4Address @ 0x00007efc928a4e00
port: 443
localport: 36332
timeout: 0
...
Found the local port matching the zombie connection,
and found “timeout = 0” (i.e., the socket read can block indefinitely)
Run GDB and attach to the Java process.
Run t a a bt (short for thread apply all backtrace)
GDB example
...
Thread 100 (Thread 0x7efb1926e700 (LWP 3166)):
#0 0x00007f07417252bf in __libc_recv (fd=fd@entry=636, buf=buf@entry=0x7efb1925ca70, n=n@entry=5,
flags=flags@entry=0)
at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
#1 0x00007efb9944b25d in NET_Read (__flags=0, __n=5, __buf=0x7efb1925ca70, __fd=636) at
/usr/include/x86_64-linux-gnu/bits/socket2.h:44
#2 0x00007efb9944b25d in NET_Read (s=s@entry=636, buf=buf@entry=0x7efb1925ca70, len=len@entry=5)
at
/build/openjdk-8-lTwZJE/openjdk-8-8u181-b13/src/jdk/src/solaris/native/java/net/linux_close.c:273
#3 0x00007efb9944ab8e in Java_java_net_SocketInputStream_socketRead0 (env=0x7efb941859e0,
this=<optimized out>, fdObj=<optimized out>, data=0x7efb1926cad8, off=0, len=5, timeout=0)
...
Conclusion
▪ Use Spark UI to identify at what stage a query is running, which tasks
are still running (or have gotten stuck)
▪ Use Spark UI to get a thread dump on the executor running the stuck
task to get an idea of what it’s doing
▪ When a task seems to be stuck on network I/O, use netstat to check if there are any connections in a bad state
▪ It’s possible to introspect JVM state (thread stacks and heap) via CLHSDB (or jhsdb, officially supported since JDK 9; see the sketch after this list)
▪ Native stack frame state can be introspected via GDB
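For example, on JDK 9+ the same Serviceability Agent functionality is exposed through jhsdb (a sketch; <pid> is the target JVM):

jhsdb clhsdb --pid <pid>            # interactive CLHSDB session
jhsdb jstack --pid <pid> --mixed    # mixed Java/native stack traces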
P.S. This particular bug was caused by JDK-8238579
Q&A