Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! with Tiark Rompf

Tiark Rompf
Purdue University

Scala: Martin Odersky + the Scala team
Rompf, Iulian Dragos, Adriaan Moors, Gilles Dubochet, Philipp Haller, Lukas Rytz, Ingo Maier, Antonio Cunei, Donna Malayeri, Miguel Garcia, Hubert Plociniczak, Aleksandar Pro
st: Geoffrey Washburn, Stéphane Micheloud, Lex Spoon, Sean McDirmid, Burak Emir, Nikolay Mihaylov, Philippe Altherr, Vincent Cremet, Michel Schinz, Erik Stenman, Matthias
external/visiting contributors: Paul Phillips, Miles Sabin, Stepan Koltsov and others

Spark Architecture
[Figure: Spark SQL stack. User programs (Java, Scala, Python, R) and SQL clients (JDBC, Console) sit on top of Spark SQL (DataFrame API, Catalyst Optimizer, Code Generation), which runs on Spark's Resilient Distributed Datasets.]


Flare: a New Back-end for Spark
[Figure: four configurations: (a) Spark SQL, (b) Flare Level 1, (c) Flare Level 2, (d) Flare Level 3. Components shown include Flare's Code Generation, Flare's Runtime, query-plan export, and JNI; for Level 3, front-ends (OptiQL, OptiML, OptiGraph) are lowered to DMLL and compiled by Delite's back-end through LMS code generation to optimized Scala and C, then executed as native code by Delite's Runtime.]
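To make the benefit of query compilation concrete, here is a hedged sketch in plain Scala contrasting two ways of evaluating a TPC-H Q6-style filter-and-sum over columnar data. It is not Flare's generated code (Flare emits native code); it only contrasts the evaluation styles. The table layout (`Lineitem`), the object `Q6Sketch`, and the integer date encoding are hypothetical. The first version composes generic operators through iterators and closures, roughly how an interpreted engine proceeds row by row; the second fuses the whole query into one tight loop, which is the shape a query compiler aims to emit.

```scala
// Hypothetical columnar table: one array per column (assumed layout, not Flare's).
final case class Lineitem(
    shipdate: Array[Int],         // ship date as an integer day number (assumed encoding)
    discount: Array[Double],
    quantity: Array[Double],
    extendedprice: Array[Double]
)

object Q6Sketch {
  // Interpreted style: generic filter/map operators and a list of predicate
  // closures, evaluated one row at a time through iterator calls.
  def interpreted(t: Lineitem, lo: Int, hi: Int): Double = {
    val preds: Seq[Int => Boolean] = Seq(
      (i: Int) => t.shipdate(i) >= lo,
      (i: Int) => t.shipdate(i) < hi,
      (i: Int) => t.discount(i) >= 0.05 && t.discount(i) <= 0.07,
      (i: Int) => t.quantity(i) < 24
    )
    Iterator.range(0, t.shipdate.length)
      .filter(i => preds.forall(p => p(i)))
      .map(i => t.extendedprice(i) * t.discount(i))
      .sum
  }

  // Compiled style: the same query fused into a single loop with the
  // predicates inlined and a scalar accumulator, no per-row closure calls.
  def fused(t: Lineitem, lo: Int, hi: Int): Double = {
    var revenue = 0.0
    var i = 0
    val n = t.shipdate.length
    while (i < n) {
      val d = t.discount(i)
      if (t.shipdate(i) >= lo && t.shipdate(i) < hi &&
          d >= 0.05 && d <= 0.07 && t.quantity(i) < 24)
        revenue += t.extendedprice(i) * d
      i += 1
    }
    revenue
  }
}
```

Both functions return the same revenue; the fused loop simply removes the per-row interpretation overhead, which is the kind of overhead native compilation targets.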

Single-Core Running Time: TPC-H
Absolute running time in milliseconds (ms) for PostgreSQL, Spark, HyPer, and Flare at SF10
[Figure: running time (ms, log scale from 1 to 10^6) for TPC-H queries Q1 to Q22; series PostgreSQL, Spark, HyPer, Flare]

Apache Parquet Format
[Figure: speedup (log scale) per TPC-H query for Spark CSV, Spark Parquet, Flare CSV, and Flare Parquet]
Running time (ms) per TPC-H query:

Query             Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9    Q10    Q11
Spark CSV      16762  12244  21730  19836  19316  12278  24484  17726  30050  29533   5224
Spark Parquet   3728  13520   9099   6083   8706    535  13555   5512  19413  21822   3926
Flare CSV        641    168    757    698    758    568    788    875   1417    854    128
Flare Parquet    187     17    125    127    151     99    183    160    698    309      9

Query            Q12    Q13    Q14    Q15    Q16    Q17    Q18    Q19    Q20    Q21    Q22
Spark CSV      21688   8554  12962  26721  12941  24690  27012  12409  19369  57330   7050
Spark Parquet   5570   7034    719   4506  21834   5176   6757   2681   8562  25089   5295
Flare CSV        701    388    573    551    150   1426   1229    605    792   1868    178
Flare Parquet    133    246     86     88     66    264    181    178    165    324     22
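The Parquet timings in the table above can be turned into per-query speedups with simple arithmetic: speedup = Spark Parquet time / Flare Parquet time. The sketch below does exactly that using only the numbers from the slide, and summarizes them with a geometric mean, a common (but here assumed, not stated on the slide) way to average ratios across queries. The object name `ParquetSpeedup` is hypothetical.

```scala
object ParquetSpeedup {
  // Running times (ms) copied from the slide's table, queries Q1..Q22 in order.
  val sparkParquet: Vector[Int] = Vector(
    3728, 13520, 9099, 6083, 8706, 535, 13555, 5512, 19413, 21822, 3926,
    5570, 7034, 719, 4506, 21834, 5176, 6757, 2681, 8562, 25089, 5295)
  val flareParquet: Vector[Int] = Vector(
    187, 17, 125, 127, 151, 99, 183, 160, 698, 309, 9,
    133, 246, 86, 88, 66, 264, 181, 178, 165, 324, 22)
  val queries: Vector[String] = (1 to 22).map(q => s"Q$q").toVector

  // Per-query speedup of Flare over Spark when both read Parquet.
  val speedups: Vector[(String, Double)] =
    queries.indices.map(i => queries(i) -> sparkParquet(i).toDouble / flareParquet(i)).toVector

  // Geometric mean of the ratios: exp(mean(log(speedup))).
  val geoMean: Double =
    math.exp(speedups.map { case (_, s) => math.log(s) }.sum / speedups.size)
}
```

For example, Q6 gives 535 / 99, roughly 5.4x (the smallest gap), while Q2 gives 13520 / 17, roughly 795x (the largest).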

Parallel Scaling Experiment
Scaling up Flare and Spark SQL in SF20
[Figure: running time (ms) and speedup vs. # cores (1, 2, 4, 8, 16, 32) for Q6 (aggregate), Q13 (outer join), Q14 (join), and Q22 (semi/anti-join); series Spark SQL and Flare Level 2]
Hardware: Single NUMA machine with 4 sockets, 12 Xeon E5-4657L cores per socket, and 256GB RAM per socket (1 TB total).
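Scaling an aggregate such as Q6 across cores usually relies on data-parallel partial aggregation: each worker scans its own contiguous slice of the columns into a private accumulator, and the partial results are merged at the end, so workers share no mutable state. The following is a minimal sketch of that pattern on the JVM under stated assumptions: a `java.util.concurrent` thread pool stands in for Flare's native runtime, and the names `ParallelAggregate` and `sumParallel` are hypothetical.

```scala
import java.util.concurrent.{Callable, Executors, Future}

object ParallelAggregate {
  // Sums values(i) * weights(i) over rows where keep(i) holds, splitting the
  // rows into `threads` contiguous partitions that are aggregated independently.
  def sumParallel(values: Array[Double], weights: Array[Double],
                  keep: Int => Boolean, threads: Int): Double = {
    require(threads >= 1, "need at least one thread")
    val n = values.length
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val chunk = (n + threads - 1) / threads // ceiling division
      val tasks: Seq[Future[Double]] = (0 until threads).map { t =>
        val start = math.min(n, t * chunk)
        val end = math.min(n, start + chunk)
        pool.submit(new Callable[Double] {
          def call(): Double = {
            var acc = 0.0 // private partial aggregate: no locks, no sharing
            var i = start
            while (i < end) {
              if (keep(i)) acc += values(i) * weights(i)
              i += 1
            }
            acc
          }
        })
      }
      tasks.map(_.get()).sum // merge the partial results
    } finally pool.shutdown()
  }
}
```

Because the merge step is tiny compared to the scans, this pattern scales close to linearly until memory bandwidth becomes the bottleneck; joins such as Q13, Q14, and Q22 need more coordination (for example, a shared or partitioned hash table), which this sketch does not cover.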

NUMA Optimization
[Figure: cores 0 to 15 accessing columnar data in memory]

NUMA Optimization
[Figure: running time (ms) vs. # cores (1, 18, 36, 72) for Q1 and Q6 with threads pinned to one, two, and four sockets; bars are annotated with speedups]
Scaling up Flare for SF100 with NUMA optimizations on different configurations: threads pinned to one, two, and four sockets
• Q6 performs better when threads are dispatched across different sockets.
• Q1 is computation-bound, so thread placement has little effect.
• Scaling Q1 and Q6 up to 72 cores (the capacity of the machine) yields maximum speedups of 46x and 58x, respectively.
Hardware: Single NUMA machine with 4 sockets, 12 Xeon E5-4657L cores per socket, and
256GB RAM per socket (1 TB total).
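The bullets above contrast keeping threads on one socket with spreading them across sockets. The hedged sketch below only computes such a placement plan: for each worker thread, the socket it would be pinned to under a compact or a spread policy, and the contiguous row range of the columnar data it owns (which would ideally live in that socket's local memory). The names `NumaPlacement`, `Compact`, `Spread`, and `plan` are hypothetical. Actually pinning threads and allocating socket-local memory needs OS facilities such as `numactl`, `pthread_setaffinity_np`, or libnuma, which are outside the JVM standard library and are not shown.

```scala
object NumaPlacement {
  sealed trait Policy
  case object Compact extends Policy // fill one socket before using the next
  case object Spread extends Policy  // round-robin threads across all sockets

  // One worker: its id, the socket it should run on, and its row range [rowStart, rowEnd).
  final case class Assignment(thread: Int, socket: Int, rowStart: Int, rowEnd: Int)

  // Computes a placement plan only; it does not pin threads or allocate memory.
  def plan(threads: Int, sockets: Int, coresPerSocket: Int,
           rows: Int, policy: Policy): Vector[Assignment] = {
    require(threads >= 1 && threads <= sockets * coresPerSocket,
      "threads must fit on the machine")
    val chunk = (rows + threads - 1) / threads
    (0 until threads).map { t =>
      val socket = policy match {
        case Compact => t / coresPerSocket
        case Spread  => t % sockets
      }
      val start = math.min(rows, t * chunk)
      Assignment(t, socket, start, math.min(rows, start + chunk))
    }.toVector
  }
}
```

With 4 threads on a 4-socket machine, `Compact` places all of them on socket 0, while `Spread` uses sockets 0 to 3; the spread layout lets each partition draw on a different socket's memory controller.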

Heterogeneous Workloads: UDFs and ML Kernels

Example: k-Means Clustering
untilconverged(mu, tol) { mu =>
  // calculate distances to current centroids and
  // assign each point i to its nearest centroid
  val c = (0::m) { i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex
  }
  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*) { i =>
    // sum the points j assigned to cluster i, and count them
    val (weightedpoints, points) = sum(0,m) { j =>
      if (c(j) == i) (x(j),1)
    }
    if (points == 0) Vector.zeros(n)
    else weightedpoints / points
  }
  newMu
}
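For readers without the DSL at hand, the same algorithm (Lloyd's k-means iteration) can be written in plain Scala. This is a minimal, sequential sketch, not the DSL's implementation: points are `Array[Double]`, the convergence test (every centroid moving at most `tol` in Euclidean distance) and the `maxIter` cap are assumptions, and the names `KMeansSketch`, `step`, and `run` are hypothetical.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double = {
    var s = 0.0
    var d = 0
    while (d < a.length) { val diff = a(d) - b(d); s += diff * diff; d += 1 }
    s
  }

  // One iteration: assign each point to its nearest centroid (c),
  // then move each centroid to the mean of the points assigned to it.
  def step(x: Array[Point], mu: Array[Point]): Array[Point] = {
    val n = mu(0).length
    val c: Array[Int] = x.map(p => mu.indices.minBy(j => sqDist(p, mu(j))))
    Array.tabulate(mu.length) { i =>
      val members = x.indices.filter(j => c(j) == i)
      if (members.isEmpty) Array.fill(n)(0.0) // same choice as Vector.zeros(n) above
      else Array.tabulate(n)(d => members.map(j => x(j)(d)).sum / members.size)
    }
  }

  // Repeat until no centroid moves more than tol, or maxIter is reached.
  def run(x: Array[Point], init: Array[Point], tol: Double, maxIter: Int = 100): Array[Point] = {
    var mu = init
    var iter = 0
    var converged = false
    while (!converged && iter < maxIter) {
      val newMu = step(x, mu)
      converged = mu.indices.forall(i => math.sqrt(sqDist(mu(i), newMu(i))) <= tol)
      mu = newMu
      iter += 1
    }
    mu
  }
}
```

The DSL version expresses the same two phases as parallel collection operations, which is what lets a compiler such as Delite's back-end parallelize and specialize them; this plain version is only meant to make the semantics explicit.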

Flare Level 3: ML Performance
[Figure: speedup vs. # cores (1, 12, 24, 48) for GDA, Gene, k-means, and LogReg; series Spark, C++ (no pin), C++ (pin), C++ (NUMA)]
Level 3: Machine learning kernels, scaling on shared-memory NUMA with thread pinning and data partitioning

Flare Level 3: ML Performance
[Figure: speedup over Spark for LogReg (3.4 GB and 17 GB inputs) and k-means (1.7 GB and 17 GB inputs), and for k-means and LogReg on a GPU cluster; series Spark, Delite-CPU, Delite-GPU]
Level 3: Machine learning kernels run on a 20-node Amazon cluster (left, center) and on a 4-node GPU cluster connected within a single rack (right).

Relational + ML
/* TensorFlow inference as UDF */
val q = spark.sql("""select ... from data
  where class = findNearestCluster(...)
  group by class""")
flare(q).show

FLARE TEAM: Grégory Essertel, Ruby Tahboub, James Decker

Thank You.
Web: flaredata.github.io
Twitter: @flaredata
