Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Rosen, Databricks)

Josh Rosen (@jshrsn)
June 16, 2015

About Databricks
Offers a hosted service:
•  Spark on EC2
•  Notebooks
•  Plot visualizations
•  Cluster management
•  Scheduled jobs
Founded by the creators of Spark and remains the largest contributor

Goals of Project Tungsten
Substantially improve the memory and CPU efficiency of Spark applications.
Push performance closer to the limits of modern hardware.

In this talk
• Motivation: why we’re focusing on compute instead of IO
• How Tungsten optimizes memory + CPU
• Case study: aggregation
• Case study: record sorting
• Performance results
• Roadmap + next steps

Many big data workloads are now compute bound
NSDI’15:
•  “Network optimizations can only reduce job completion time by
a median of at most 2%.”
•  “Optimizing or eliminating disk accesses can only reduce job
completion time by a median of at most 19%.”
•  We’ve observed similar characteristics in many Databricks Cloud
customer workloads.

Why is CPU the new bottleneck?
•  Hardware has improved:
–  Increasingly large aggregate IO bandwidth, such as 10Gbps links in networks
–  High bandwidth SSDs or striped HDD arrays for storage
•  Spark’s IO has been optimized:
–  Many workloads now avoid significant disk IO by pruning input data that is not needed in a given job
–  New shuffle and network layer implementations
•  Data formats have improved:
–  Parquet, binary data formats
•  Serialization and hashing are CPU-bound bottlenecks

How Tungsten improves CPU & memory efficiency
•  Memory Management and Binary Processing: leverage application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection
•  Cache-aware computation: algorithms and data structures to exploit memory hierarchy
•  Code generation: exploit modern compilers and CPUs; allow efficient operation directly on binary data

The overheads of Java objects
Example string: “abcd”
•  Native: 4 bytes with UTF-8 encoding
•  Java: 48 bytes
java.lang.String object internals:
OFFSET  SIZE  TYPE    DESCRIPTION      VALUE
0       4             (object header)  ...
12      4     char[]  String.value     []
16      4     int     String.hash      0
20      4     int     String.hash32    0
Instance size: 24 bytes (reported by Instrumentation API)
12-byte object header
8-byte hashcode
20 bytes of overhead + 8 bytes for chars
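
The arithmetic behind the 48-byte figure can be checked in a few lines of Python. This is only a rough sketch: the totals (48 bytes overall, 24 bytes for the String instance) are taken from the slide, while the split of the remaining char[] into payload and overhead is inferred from those totals and from Java's 2-byte char, not read from the slide.

```python
text = "abcd"
native_bytes = len(text.encode("utf-8"))   # UTF-8 size of the string

JAVA_TOTAL = 48        # total Java footprint quoted on the slide
STRING_INSTANCE = 24   # String instance size quoted on the slide

char_array = JAVA_TOTAL - STRING_INSTANCE        # bytes left for the char[]
char_payload = len(text) * 2                     # Java chars are 2 bytes each
char_array_overhead = char_array - char_payload  # inferred, not on the slide

blowup = JAVA_TOTAL / native_bytes               # Java size relative to native
print(native_bytes, char_array, char_payload, char_array_overhead, blowup)
```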

Garbage collection challenges
•  Many big data workloads create objects in ways that are
unfriendly to regular Java GC.
•  Guest blog on GC tuning: tinyurl.com/db-gc-tuning
[Figure: JVM heap layout. Young Generation (eden plus survivor spaces S0 and S1), Old Generation (tenured), Permanent Generation (permanent)]

sun.misc.Unsafe
•  JVM internal API for directly manipulating memory without safety checks (hence “unsafe”)
•  We use this API to build data structures in both on- and off-heap memory
[Figure: data structures with pointers vs. flat data structures, with complex examples]

Java object-based row representation
3 fields of type (int, string, string) with value (123, “data”, “bricks”)
[Figure: GenericMutableRow → Array → BoxedInteger(123), String(“data”), String(“bricks”)]
5+ objects; high space overhead; expensive hashCode()

Tungsten’s UnsafeRow format
•  Bit set for tracking null values
•  Every column appears in the fixed-length values region:
–  Small values are inlined
–  For variable-length values (strings), we store a relative offset into the variable-length data section
•  Rows are always 8-byte word aligned (size is a multiple of 8 bytes)
•  Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
[Figure: row layout: null bit set (1 bit/field) | values (8 bytes/field) | variable length data; the value slot of a variable-length field holds the offset to its var. length data]

Example of an UnsafeRow
(123, “data”, “bricks”)
0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
•  0x0: null tracking bitmap
•  32L, 48L: offsets to var. length data
•  4, 6: field lengths
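
The row layout described above can be sketched with Python's struct module. This is an illustration under stated assumptions, not Spark's implementation: words are little-endian, and each variable-length value is preceded by an 8-byte length word and padded to a multiple of 8 bytes (an inference from the 32L and 48L offsets in the example). The names encode_row and decode_row are hypothetical.

```python
import struct

WORD = 8  # rows are 8-byte word aligned


def _pad_to_word(n):
    """Round n up to the next multiple of the word size."""
    return (n + WORD - 1) // WORD * WORD


def encode_row(values):
    """Encode ints, strings and None into one flat byte string."""
    n = len(values)
    bitset_size = (n + 63) // 64 * WORD     # 1 bit per field, whole words
    var_start = bitset_size + n * WORD      # end of the fixed-length region
    fixed = bytearray(var_start)
    var = bytearray()
    for i, value in enumerate(values):
        slot = bitset_size + i * WORD
        if value is None:
            fixed[i // 8] |= 1 << (i % 8)   # mark the field as null
        elif isinstance(value, int):
            struct.pack_into("<q", fixed, slot, value)  # small value, inlined
        else:
            data = value.encode("utf-8")
            # The fixed slot stores the offset, relative to the row start.
            struct.pack_into("<q", fixed, slot, var_start + len(var))
            var += struct.pack("<q", len(data))          # assumed length word
            var += data.ljust(_pad_to_word(len(data)), b"\x00")
    return bytes(fixed + var)


def decode_row(buf, types):
    """Decode a row given its field types ("int" or "str")."""
    n = len(types)
    bitset_size = (n + 63) // 64 * WORD
    out = []
    for i, field_type in enumerate(types):
        if (buf[i // 8] >> (i % 8)) & 1:
            out.append(None)
            continue
        (word,) = struct.unpack_from("<q", buf, bitset_size + i * WORD)
        if field_type == "int":
            out.append(word)
        else:
            (length,) = struct.unpack_from("<q", buf, word)
            start = word + WORD
            out.append(buf[start:start + length].decode("utf-8"))
    return tuple(out)


row = encode_row((123, "data", "bricks"))
```

Because the encoding is deterministic, two rows with equal field values produce identical bytes, so they can be compared or hashed without decoding.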

How we encode memory addresses
•  Off heap: addresses are raw memory pointers.
•  On heap: addresses are base object + offset pairs.
•  We use our own “page table” abstraction to enable more compact encoding of on-heap addresses:
[Figure: a page table with entries 0 … N–1, each referring to a data page (Java object); an encoded address consists of a page and an offset in page]
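
One way to picture the page-table encoding is to pack a page number and an offset into a single 64-bit value. The sketch below assumes a 13-bit page number and a 51-bit offset purely for illustration; the slide does not state the split. PageTable, encode_address and decode_address are hypothetical names, and a bytearray stands in for a data page.

```python
PAGE_NUMBER_BITS = 13                 # assumed split, for illustration only
OFFSET_BITS = 64 - PAGE_NUMBER_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_address(page_number, offset_in_page):
    """Pack (page, offset in page) into one 64-bit integer."""
    if not 0 <= page_number < (1 << PAGE_NUMBER_BITS):
        raise ValueError("page number out of range")
    if not 0 <= offset_in_page <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (page_number << OFFSET_BITS) | offset_in_page


def decode_address(address):
    """Split an encoded address back into (page, offset in page)."""
    return address >> OFFSET_BITS, address & OFFSET_MASK


class PageTable:
    """Maps page numbers to data pages (bytearrays stand in for them)."""

    def __init__(self):
        self.pages = []

    def allocate_page(self, size):
        if len(self.pages) >= (1 << PAGE_NUMBER_BITS):
            raise MemoryError("page table is full")
        self.pages.append(bytearray(size))
        return len(self.pages) - 1

    def write(self, address, data):
        page_number, offset = decode_address(address)
        page = self.pages[page_number]
        if offset + len(data) > len(page):
            raise IndexError("write past the end of the page")
        page[offset:offset + len(data)] = data

    def read(self, address, length):
        page_number, offset = decode_address(address)
        return bytes(self.pages[page_number][offset:offset + length])


table = PageTable()
table.allocate_page(64)
second_page = table.allocate_page(64)
address = encode_address(second_page, 16)
table.write(address, b"bricks")
```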

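java.util.HashMap
[Figure: an array of buckets pointing to entry objects; each entry holds a key ptr, a value ptr and a next reference, and the key and value are separate objects]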
•  Huge object overheads
•  Poor memory locality
•  Size estimation is hard

Tungsten’s BytesToBytesMap
[Figure: an array of (hc, ptr) entries; each ptr refers to a key/value record, and the records are stored one after another in a memory page]
•  Low space overheads
•  Good memory locality, especially for scans
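
A minimal sketch of the idea, not Spark's BytesToBytesMap: keys and values are appended back to back in one byte buffer, and a separate array holds only a hash code and an offset per entry. Linear probing, the CRC32 hash, the fixed capacity and the 4-byte length headers are all assumptions chosen to keep the example short.

```python
import struct
import zlib


class BytesToBytesMapSketch:
    HEADER = struct.Struct("<ii")  # key length, value length

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (hash code, record offset)
        self.page = bytearray()         # all records, stored contiguously
        self.size = 0

    def _record(self, offset):
        key_len, value_len = self.HEADER.unpack_from(self.page, offset)
        key_start = offset + self.HEADER.size
        return key_start, key_len, key_start + key_len, value_len

    def _find_slot(self, key):
        hc = zlib.crc32(key)
        pos = hc % self.capacity
        while self.slots[pos] is not None:
            slot_hc, offset = self.slots[pos]
            if slot_hc == hc:  # cheap check before touching the record
                key_start, key_len, _, _ = self._record(offset)
                if self.page[key_start:key_start + key_len] == key:
                    return pos, hc, offset
            pos = (pos + 1) % self.capacity  # linear probing
        return pos, hc, None

    def put(self, key, value):
        pos, hc, offset = self._find_slot(key)
        if offset is not None:
            _, _, value_start, value_len = self._record(offset)
            if len(value) != value_len:
                raise ValueError("in-place update needs a same-sized value")
            self.page[value_start:value_start + value_len] = value
            return
        if self.size + 1 >= self.capacity:
            raise MemoryError("sketch does not grow; one slot stays empty")
        self.slots[pos] = (hc, len(self.page))
        self.page += self.HEADER.pack(len(key), len(value)) + key + value
        self.size += 1

    def get(self, key):
        _, _, offset = self._find_slot(key)
        if offset is None:
            return None
        _, _, value_start, value_len = self._record(offset)
        return bytes(self.page[value_start:value_start + value_len])

    def scan(self):
        """Walk the page sequentially, in insertion order."""
        offset = 0
        while offset < len(self.page):
            key_start, _, value_start, value_len = self._record(offset)
            yield (bytes(self.page[key_start:value_start]),
                   bytes(self.page[value_start:value_start + value_len]))
            offset = value_start + value_len
```

The scan never consults the slot array: it reads the page front to back, which is the access pattern the slide credits with good memory locality.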

Code generation
•  Generic evaluation of expression logic
is very expensive on the JVM
–  Virtual function calls
–  Branches based on expression type
–  Object creation due to primitive boxing
–  Memory consumption by boxed
primitive objects
•  Generating custom bytecode can
eliminate these overheads
Evaluating “SELECT a + a + a” (query time in seconds):
Hand written            9.33
Code gen                9.36
Interpreted Projection  36.65

Code generation
•  Project Tungsten uses the Janino compiler to reduce code generation time.
•  Spark 1.5 will greatly expand the number of expressions that support code
generation:
–  SPARK-8159
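
The contrast between interpreted evaluation and code generation can be shown with a small Python analogue. Tungsten generates code for the JVM and compiles it with Janino; the sketch below instead builds Python source text and compiles it with eval, so it illustrates the idea only. Attr, Add and compile_expr are hypothetical names.

```python
class Attr:
    """Reference to a column of the input row."""

    def __init__(self, name):
        self.name = name

    def eval(self, row):               # interpreted path
        return row[self.name]

    def gen(self):                     # code generation path
        return f"row[{self.name!r}]"


class Add:
    """Addition of two sub-expressions."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def eval(self, row):
        # One dynamic dispatch per tree node, for every row.
        return self.left.eval(row) + self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} + {self.right.gen()})"


def compile_expr(expr):
    """Turn an expression tree into a single compiled function."""
    source = f"lambda row: {expr.gen()}"
    return eval(source), source


# SELECT a + a + a
expr = Add(Add(Attr("a"), Attr("a")), Attr("a"))
generated, source = compile_expr(expr)
```

The interpreted path walks the tree for every row, while the generated function evaluates one flat expression with no per-node dispatch.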

Example: aggregation optimizations in
DataFrames and Spark SQL
df.groupBy("department").agg(max("age"), sum("expense"))

[Figure (SPARK-7080): Input Row → project → Grouping Key → convert → UnsafeRow → probe → BytesToBytesMap; Update Aggregates updates the Agg. Result in place; results are read back with a scan]
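
The aggregation flow (project the grouping key, convert it to bytes, probe the map, update the aggregates in place) can be sketched as follows for the max("age"), sum("expense") query. A Python dict keyed by bytes stands in for the BytesToBytesMap, and the 16-byte buffer layout is an assumption; the function name aggregate is hypothetical.

```python
import struct

AGG = struct.Struct("<qq")  # assumed buffer layout: (max age, sum expense)


def aggregate(rows):
    buffers = {}  # stand-in for the binary hash map: key bytes -> buffer
    for row in rows:
        key = row["department"].encode("utf-8")   # project + convert
        buf = buffers.get(key)                    # probe
        if buf is None:
            # First row of the group: start from its age and a zero sum.
            buf = buffers[key] = bytearray(AGG.pack(row["age"], 0))
        max_age, total = AGG.unpack_from(buf)
        # Update in place: the same buffer is overwritten, not reallocated.
        AGG.pack_into(buf, 0, max(max_age, row["age"]), total + row["expense"])
    # Scan the map to produce the final results.
    return {key.decode("utf-8"): AGG.unpack_from(buf)
            for key, buf in buffers.items()}


result = aggregate([
    {"department": "eng", "age": 31, "expense": 100},
    {"department": "ops", "age": 45, "expense": 20},
    {"department": "eng", "age": 28, "expense": 50},
])
```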

Optimized record sorting in Spark SQL + DataFrames (SPARK-7082)
•  AlphaSort-style prefix sort:
–  Store prefixes of sort keys inside the sort pointer array
–  During sort, compare prefixes to short-circuit and avoid full record comparisons
•  Use this to build external sort-merge join to support joins larger than memory
[Figure: naïve layout (pointer → record) vs. cache friendly layout (key prefix + pointer → record)]
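
The prefix-sort idea can be sketched in a few lines: the sort array holds (key prefix, pointer) pairs, and the full record is consulted only when two prefixes tie. The 8-byte big-endian prefix and the function names are assumptions for illustration, and list indexes stand in for record pointers.

```python
from functools import cmp_to_key

PREFIX_BYTES = 8  # assumed prefix width


def key_prefix(key):
    """First bytes of the key as an integer that preserves byte order."""
    return int.from_bytes(key[:PREFIX_BYTES].ljust(PREFIX_BYTES, b"\x00"), "big")


def prefix_sort(records):
    """Sort byte-string records; also count the full record comparisons."""
    full_comparisons = 0
    # The sort array stores the prefix next to the "pointer" (an index here).
    entries = [(key_prefix(record), i) for i, record in enumerate(records)]

    def compare(a, b):
        nonlocal full_comparisons
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1   # decided by the prefix alone
        full_comparisons += 1                 # tie: follow the pointers
        ra, rb = records[a[1]], records[b[1]]
        return (ra > rb) - (ra < rb)

    entries.sort(key=cmp_to_key(compare))
    return [records[i] for _, i in entries], full_comparisons


records = [b"pear", b"apple", b"applesauce-zzz", b"applesauce-aaa", b"fig"]
ordered, full_comparisons = prefix_sort(records)
```

Only the two "applesauce" keys share an 8-byte prefix here, so most comparisons are settled without dereferencing a record.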

Initial performance results for agg. query
[Figure: run time (seconds) vs. relative data set size (1x, 2x, 4x, 8x, 16x) for four configurations: Default, Code Gen, Tungsten onheap, Tungsten offheap]

Initial performance results for agg. query
[Figure: average GC time per node (seconds) vs. relative data set size (1x, 2x, 4x, 8x, 16x) for four configurations: Default, Code Gen, Tungsten onheap, Tungsten offheap]

Project Tungsten Roadmap
Spark 1.4
•  Binary processing for aggregation in Spark SQL / DataFrames
•  New Tungsten shuffle manager
•  Compression & serialization optimizations
Spark 1.5
•  Optimized code generation
•  Optimized sorting in Spark SQL / DataFrames
•  End-to-end processing using binary data representations
•  External aggregation
Spark 1.6
•  Vectorized / batched processing
•  ???

Which Spark jobs can benefit from Tungsten?
•  DataFrames
–  Java
–  Scala
–  Python
–  R
•  Spark SQL queries
•  Some Spark RDD API programs, via general serialization + compression
optimizations
logs.join(
    users,
    logs.userId == users.userId,
    "left_outer"
).groupBy("userId").agg({"*": "count"})

How to enable all of Spark 1.4’s Tungsten optimizations
spark.sql.codegen = true
spark.sql.unsafe.enabled = true
spark.shuffle.manager = tungsten-sort
Warning! These features are experimental in 1.4!
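
One way to apply the three settings is to pass them to spark-submit as --conf key=value pairs. The helper below only builds the command line; the function name spark_submit_command and the script name my_job.py are hypothetical.

```python
TUNGSTEN_SETTINGS = {
    "spark.sql.codegen": "true",
    "spark.sql.unsafe.enabled": "true",
    "spark.shuffle.manager": "tungsten-sort",
}


def spark_submit_command(app, settings=TUNGSTEN_SETTINGS):
    """Build a spark-submit argument list with one --conf per setting."""
    command = ["spark-submit"]
    for key, value in settings.items():
        command += ["--conf", f"{key}={value}"]
    command.append(app)  # the application script goes last
    return command


command = spark_submit_command("my_job.py")  # hypothetical script name
print(" ".join(command))
```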

Thank you.
Follow our progress on JIRA: SPARK-7075
