Tracing the breadcrumbs:
Spark workload diagnostics
Kris Mok @rednaxelafx
Cheng Lian @liancheng
About us
Kris Mok
▪ Sr. software engineer at Databricks
▪ Worked on OpenJDK HotSpot VM & Zing VM implementations
Cheng Lian
▪ Sr. software engineer at Databricks
▪ Apache Spark PMC member
▪ Apache Parquet committer
Unified data analytics platform for accelerating innovation across
data science, data engineering, and business analytics
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners
Distributed applications are hard
▪ Have you ever hit the following categories of issues in your distributed
Spark applications?
▪ Mysterious performance regression
▪ Mysterious job hang
Distributed applications are hard
▪ Distributed applications are inherently hard to
▪ Develop
▪ Tune
▪ Diagnose
▪ Due to
▪ Longer iteration cycle
▪ Variety of input data (volume, quality, distribution, etc.)
▪ Broader range of infra/external dependencies (network, cloud services, etc.)
▪ Fractured crime scenes (incomplete and scattered logs)
What this talk is about
▪ Demonstrate tools and methodologies for diagnosing distributed Spark applications, with real-world use cases.
Performance Regression
Symptom
A query runs slower in a newer version of Spark than in an older version.
Check the Spark SQL query plan (one way to diff plans across versions is sketched after this list):
▪ If the query plan has changed
▪ the regression likely comes from a Catalyst optimizer change
▪ If the query plan is the same
▪ the regression could be coming from various sources
▪ More time spent in query optimization / compilation?
▪ More time spent in scheduling?
▪ More time spent in network operations?
▪ More time spent on task execution?
▪ More time spent on GC?
▪ ...
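For illustration, one way to compare plans across versions is to capture the optimized plan text on each side and diff the two files (a minimal Scala sketch for spark-shell; tpcdsQ67 is a placeholder for the query under test):

// run once per Spark version, then diff the resulting files
val plan = spark.sql(tpcdsQ67).queryExecution.optimizedPlan.treeString
java.nio.file.Files.write(java.nio.file.Paths.get("/tmp/plan.txt"), plan.getBytes)

spark.sql(tpcdsQ67).explain(true) similarly prints the parsed, analyzed, optimized, and physical plans.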
Tools and methodologies demonstrated
▪ Build a benchmark to reproduce the performance regression
▪ Gather performance data using a profiler
▪ Conventional JVM profilers (JProfiler, YourKit, jvisualvm, etc.)
▪ Java Flight Recorder
▪ async-profiler and Flame Graph (example invocation after this list)
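As an illustration, attaching async-profiler to a running executor JVM to collect a CPU flame graph might look like this (a hedged sketch; flags and output formats vary across async-profiler versions, and <executor-pid> is a placeholder):

./profiler.sh -e cpu -d 60 -f /tmp/executor-flamegraph.svg <executor-pid>

Java Flight Recorder can be enabled via JVM options on the executors, e.g. through spark.executor.extraJavaOptions with -XX:+FlightRecorder and -XX:StartFlightRecording=... (exact flags depend on the JDK version).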
Case study
▪ A performance regression in Spark 2.4 development
Symptom
Databricks’ Spark Benchmarking team’s performance sign-off for DBR 5.0-beta found a significant performance regression vs. DBR 4.3.
Multiple TPC-DS queries were slower, e.g. q67.
FlameGraph: DBR 4.3 (Spark 2.3-based)
FlameGraph: DBR 5.0-beta (Spark 2.4-SNAPSHOT-based)
Zoom in on the difference in hot spot
DBR 4.3
DBR 5.0-beta
Zoom in on the difference in hot spot
DBR 4.3: hot loop calling a monomorphic function; also avoids an extra buffer copy
DBR 5.0-beta: hot loop calling a polymorphic function; incurs an extra buffer copy
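A hedged Scala sketch of the general effect (not the actual Spark 2.4 code): when only one implementation ever reaches the call site in the hot loop, the JIT can devirtualize and inline it; once several implementations flow through the same call site, the call stays virtual and related optimizations (such as eliminating an intermediate buffer copy) may no longer apply.

trait ColumnReader { def read(i: Int): Long }
final class LongColumnReader(values: Array[Long]) extends ColumnReader {
  def read(i: Int): Long = values(i)
}

// If only LongColumnReader is ever passed in, the call site is monomorphic and easily inlined.
// If several ColumnReader implementations are passed in, the call site becomes polymorphic.
def sum(reader: ColumnReader, n: Int): Long = {
  var s = 0L
  var i = 0
  while (i < n) { s += reader.read(i); i += 1 } // hot loop
  s
}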
Job hangs
Symptoms
A Spark job hangs.
Tools and methodologies demonstrated
▪ Thread dump from Spark UI
▪ Network debugging
▪ JVM debugging
▪ Log exploration and visualization
Case study 1
▪ Shuffle fetch from a dead executor causes the entire cluster to hang
The customer found a lot of exceptions in the Spark logs:
org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply from null in 120 seconds.
“During about the same time (16:35 - 16:45) when the exceptions happened, the entire
cluster hung (extremely low cluster usage).”
Early triaging questions
▪ Did anything special happen before the cluster hang?
▪ That might be the cause of the hang.
▪ Was anything happening while the cluster hung?
▪ Was it completely silent, or
▪ was it busy doing something?
Tools
▪ Spark History Server
▪ Executor, job, stage, and task events visualization
▪ Spark logs in Delta Lake
▪ Within Databricks, with the consent of our customers, we ETL Spark logs into
Delta tables for fast exploration
▪ Interactive notebook environment for log exploration and visualization (sketched after this list)
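A minimal sketch of what this looks like in a notebook, assuming a hypothetical Delta table of parsed log lines with ts, host, role, level, and message columns (illustrative names, not an actual Databricks schema):

val logs = spark.read.format("delta").load("/mnt/logs/spark-cluster") // hypothetical path
logs.createOrReplaceTempView("spark_logs")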
Spark History Server - Historical Spark UI
▪ Executor 29 was removed around 16:36, right before the cluster hung
▪ Executor 103 was added around 16:43, right before the cluster became active again
Executor and job events
Cluster hanging
Spark logs exploration and visualization
▪ Checking per-minute driver-side log message counts (query sketched below)
▪ The driver turned out to be quiet during the cluster hang
Checking driver activities
Cluster hanging
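For instance, the per-minute driver-side counts can come from a simple aggregation over the hypothetical spark_logs view introduced above:

spark.sql("""
  SELECT date_trunc('minute', ts) AS minute, count(*) AS messages
  FROM spark_logs
  WHERE role = 'driver'
  GROUP BY 1
  ORDER BY 1
""").show()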
Spark logs exploration and visualization
▪ Checking per-minute executor-side log message counts
▪ Mild executor-side activity during the cluster hang
Checking executor activities
Cluster hanging
Spark logs exploration and visualization
▪ Incrementally filter out logs of various retry and timeout events (filter sketched below)
▪ Empty result set, so no new tasks scheduled during this time
▪ But already scheduled tasks could be running quietly
Zoom into the cluster hang period
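A sketch of the incremental filtering against the same hypothetical spark_logs view (the hang window and message patterns are illustrative placeholders):

spark.sql("""
  SELECT ts, host, message
  FROM spark_logs
  WHERE ts BETWEEN '<hang start>' AND '<hang end>'
    AND message NOT LIKE '%Retrying%'
    AND message NOT LIKE '%TimeoutException%'
""").show(truncate = false)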
Spark logs exploration and visualization
▪ Checking per-minute executor-side log message counts
▪ A clear pattern repeated 3 times, every 120 seconds
▪ Turned out to be executor-side shuffle connection timeout events
Zoom into the cluster hang period
Conclusion
▪ The behavior is expected
▪ Later investigation revealed that
▪ A large stage consisting of 2,000 tasks happened to occupy all CPU cores
▪ These tasks were waiting for shuffle map output from the failed executor until timeout
▪ So the cluster appeared to be “hanging”
▪ After executor 29 was lost, other active executors retried connecting to it 3 times, once every 120 seconds, which conforms to the default values of the following two Spark configurations (see the sketch after this list):
▪ spark.shuffle.io.connectionTimeout
▪ spark.shuffle.io.maxRetries
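For reference, these two settings could be adjusted when submitting the job (a sketch; the values shown are the defaults, with connectionTimeout falling back to spark.network.timeout):

spark-submit ... \
  --conf spark.shuffle.io.maxRetries=3 \
  --conf spark.shuffle.io.connectionTimeout=120s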
Case study 2
A customer reported that a Spark SQL query had been stuck for multiple hours, and gave us permission to do live debugging.
Through the Spark UI, we determined that the query was almost done, and the only tasks still running were all in the final stage, writing out results.
Get thread dump from executor via Spark UI
...
Relevant threads’ stack traces
Obviously stuck on a socket read
Find executor thread via thread name
Are there any zombie connections?
Run command: netstat -o
tcp6 0 0 <src_ip>:36332 <dest_ip>:https ESTABLISHED off (0.00/0/0)
Is the thread related to the zombie connection?
Introspect Java stack and objects via CLHSDB
(Command-Line HotSpot Debugger)
java -cp .:$JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB
CLHSDB session example
Inspect threads list (can also use jstack or pstack in CLHSDB)
hsdb> jseval "jvm.threads"
{Thread (address=0x00007f06648fae08, name=Attach Listener),
Thread (address=0x00007f065fd538b8, name=pool-339-thread-7),
Thread (address=0x00007f065fcb9cc8, name=pool-339-thread-6),
Thread (address=0x00007efc9249fef0, name=pool-339-thread-5),
Thread (address=0x00007efc9249e8b8, name=pool-339-thread-4),
...}
CLHSDB session example
Inspect stack frames of a given thread
hsdb> jseval "jvm.threads[4].frames"
{Frame
(method=java.net.SocketInputStream.socketRead0(java.io.FileDescriptor,
byte[], int, int, int), bci=0, line=0),
Frame
(method=java.net.SocketInputStream.socketRead(java.io.FileDescriptor,
byte[], int, int, int), bci=8, line=116),
...}
CLHSDB session example
Inspect the receiver object of a specific frame
hsdb> jseval
"sa.threads.first().next().next().next().next().getLastJavaVFrameDbg().javaSender().java
Sender().locals.get(0).print()"
<0x00007f06648fb090>
hsdb> inspect 0x00007f06648fb090
instance of Oop for java/net/SocketInputStream @ 0x00007f06648fb090 (size = 88)
_mark: 29
_metadata._klass: InstanceKlass for java/net/SocketInputStream
fd: Oop for java/io/FileDescriptor @ 0x00007f06648fb068 Oop for java/io/FileDescriptor
path: null
channel: null
closeLock: Oop for java/lang/Object @ 0x00007f0668176858
closed: false
eof: false
impl: Oop for java/net/SocksSocketImpl @ 0x00007f0666569b08
...
CLHSDB session example
Inspect the SocksSocketImpl object that we care about
hsdb> inspect 0x00007f0666569b08
instance of Oop for java/net/SocksSocketImpl @ 0x00007f0666569b08 (size = 168)
_mark: 139618986514461
_metadata._klass: InstanceKlass for java/net/SocksSocketImpl
socket: Oop for sun/security/ssl/SSLSocketImpl @ 0x00007f06648fb0e8
serverSocket: null
fd: Oop for java/io/FileDescriptor @ 0x00007f06648fb068
address: Oop for java/net/Inet4Address @ 0x00007efc928a4e00
port: 443
localport: 36332
timeout: 0
...
Found the local port matching the zombie connection,
and found “timeout = 0” (i.e., the socket read can block indefinitely)
Run GDB and attach to the Java process.
Run t a a bt (short for thread apply all backtrace)
GDB example
...
Thread 100 (Thread 0x7efb1926e700 (LWP 3166)):
#0 0x00007f07417252bf in __libc_recv (fd=fd@entry=636, buf=buf@entry=0x7efb1925ca70, n=n@entry=5,
flags=flags@entry=0)
at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
#1 0x00007efb9944b25d in NET_Read (__flags=0, __n=5, __buf=0x7efb1925ca70, __fd=636) at
/usr/include/x86_64-linux-gnu/bits/socket2.h:44
#2 0x00007efb9944b25d in NET_Read (s=s@entry=636, buf=buf@entry=0x7efb1925ca70, len=len@entry=5)
at
/build/openjdk-8-lTwZJE/openjdk-8-8u181-b13/src/jdk/src/solaris/native/java/net/linux_close.c:273
#3 0x00007efb9944ab8e in Java_java_net_SocketInputStream_socketRead0 (env=0x7efb941859e0,
this=<optimized out>, fdObj=<optimized out>, data=0x7efb1926cad8, off=0, len=5, timeout=0)
...
Conclusion
▪ Use Spark UI to identify at what stage a query is running, which tasks
are still running (or have gotten stuck)
▪ Use Spark UI to get a thread dump on the executor running the stuck
task to get an idea of what it’s doing
▪ When a task seems to be stuck on network I/O, use netstat to check if there are any connections in a bad state
▪ It’s possible to introspect JVM state (thread stacks and heap) via CLHSDB (or jhsdb, officially supported since JDK 9; see the sketch after this list)
▪ Native stack frame state can be introspected via GDB
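For example, on JDK 9+ the same Serviceability Agent functionality is exposed through jhsdb (a sketch; <pid> is the target JVM):

jhsdb clhsdb --pid <pid>            # interactive CLHSDB session
jhsdb jstack --pid <pid> --mixed    # mixed Java/native stack traces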
P.S. This particular bug was caused by JDK-8238579
Q&A