Enabling Vectorized Engine
in Apache Spark
Kazuaki Ishizaki
IBM Research - Tokyo
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Apache Spark committer since 2018/9 (SQL module)
▪ Worked on IBM Java (now OpenJ9) since 1996
– Technical lead for the just-in-time compiler for PowerPC
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– https://www.slideshare.net/ishizaki/
2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
Table of Contents
▪ What are vectorization and SIMD?
– How can SIMD improve performance?
▪ What is VectorAPI?
– Why can’t the current Spark use SIMD?
▪ How to use SIMD with performance analysis
1. Replace external libraries
2. Use vectorized runtime routines such as sort
3. Generate vectorized Java code from a given SQL query by Catalyst
What is Vectorization?
▪ Do multiple jobs in a batch to improve performance
– Read multiple rows at a time
– Compute multiple rows at a time
[Diagram: scalar processing reads one row of a table at a time; vectorization reads four rows at a time.]
▪ Spark has already implemented several vectorizations
– Vectorized Parquet Reader
– Vectorized ORC Reader
– Pandas UDF (a.k.a. vectorized UDF)
What is SIMD?
▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
– Increases the parallelism within an instruction (8x in the example)
[Diagram: a scalar instruction (add gr1,gr2,gr3) produces one result per instruction (A0 + B0 = C0); a SIMD instruction (vadd vr1,vr2,vr3) operates on vector registers and adds eight pairs at once (A0..A7 + B0..B7 = C0..C7).]
▪ SIMD can be used to implement vectorization
SIMD is Used in Various BigData Software
▪ Database
– DB2, Oracle, PostgreSQL, …
▪ SQL Query Engine
– Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, …
Why Doesn't Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee whether a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM to generate SIMD
instructions, or not

for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
Java code: SIMD may be generated or not

The JVM may produce slower scalar code:
for (int i = 0; i < n; i++) {
  load r1, a[i * 4]
  load r2, b[i * 4]
  add r3, r1, r2
  store r3, c[i * 4]
}

or faster SIMD code:
for (int i = 0; i < n / 8; i++) {
  vload vr1, a[i * 4 * 8]
  vload vr2, b[i * 4 * 8]
  vadd vr3, vr1, vr2
  vstore vr3, c[i * 4 * 8]
}
New Approach: VectorAPI
▪ VectorAPI can guarantee that the generated code uses SIMD

import jdk.incubator.vector.*;
int a[], b[], c[];
...
for (int i = 0; i < n; i += SPECIES.length()) { // SPECIES.length() = SIMD length (e.g. 8)
  var va = IntVector.fromArray(SPECIES, a, i);
  var vb = IntVector.fromArray(SPECIES, b, i);
  var vc = va.add(vb);
  vc.intoArray(c, i);
}
VectorAPI: SIMD is always generated

for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
Scalar code: SIMD may be generated or not
New Approach: VectorAPI
▪ The VectorAPI loop above compiles to native SIMD code like:

for (int i = 0; i < n / 8; i++) {
  vload vr1, a[i * 4 * 8]
  vload vr2, b[i * 4 * 8]
  vadd vr3, vr1, vr2
  vstore vr3, c[i * 4 * 8]
}
Pseudo native SIMD code
Where and How We Can Use SIMD in Spark
▪ External library – Write VectorAPI code by hand
– BLAS library (matrix operation)
▪ SPARK-33882
▪ Internal library – Write VectorAPI code by hand
– Sort, Join, …
▪ Generated code at runtime – Generate VectorAPI code by Catalyst
– Catalyst translates a DataFrame program into a Java program
External Library
Three Approaches
▪ JNI (Java Native Interface) library
– Call a highly optimized binary (e.g. written in C or Fortran) through a JNI library
▪ SIMD code
– Call Java VectorAPI code if JVM supports VectorAPI
▪ Scalar code
– Call naïve Java code that runs on all JVMs
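A minimal sketch of how these three approaches can be layered behind one dispatch point. The class names JniBlas and VectorApiBlas are hypothetical (not Spark's or any library's actual API); only the scalar fallback is implemented, and each faster candidate is probed and skipped if unavailable:

```java
// Sketch of the three-level dispatch described above.
interface Blas {
    void daxpy(int n, double alpha, double[] x, double[] y);
}

class ScalarBlas implements Blas {
    // Naive Java code that runs on any JVM: the backup path
    public void daxpy(int n, double alpha, double[] x, double[] y) {
        for (int i = 0; i < n; i++) {
            y[i] += alpha * x[i];
        }
    }
}

public class BlasLoader {
    // Probe the fastest implementation first; each candidate may be
    // unavailable (missing native library, JVM without jdk.incubator.vector)
    public static Blas load() {
        for (String cls : new String[] {"JniBlas", "VectorApiBlas"}) {
            try {
                return (Blas) Class.forName(cls).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // fall through to the next candidate
            }
        }
        return new ScalarBlas();
    }

    public static void main(String[] args) {
        Blas blas = load();          // falls back to ScalarBlas here
        double[] y = {1.0, 1.0};
        blas.daxpy(2, 2.0, new double[] {1.0, 2.0}, y);
        System.out.println(y[0] + " " + y[1]); // 3.0 5.0
    }
}
```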
Implementation using VectorAPI
▪ An example of matrix operation kernels
// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx, double[] y, int incy) {
...
DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
int i = 0;
// vectorized part
for (; i < DMAX.loopBound(n); i += DMAX.length()) {
DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
vx.fma(valpha, vy).intoArray(y, i);
}
// residual part
for (; i < n; i += 1) {
y[i] += alpha * x[i];
}
...
}
SPARK-33882
Benchmark for Large-size Data
▪ JNI achieves the best performance
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Algorithm            Data size (double type)    Elapsed time (ms): JNI / VectorAPI / Scalar
daxpy (Y += a * X)   10,000,000                 1.3 / 14.6 / 18.2
dgemm (Z = X * Y)    1000x1000 * 1000x100       1.3 / 40.6 / 81.1
Benchmark for Small-size Data
▪ VectorAPI achieves the best performance
Algorithm            Data size (double type)    Elapsed time (ns): JNI / VectorAPI / Scalar
daxpy (Y += a * X)   256                        118 / 27 / 140
dgemm (Z = X * Y)    8x8 * 8x8                  555 / 365 / 679
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Summary of Three Approaches
Approach      Performance   Overhead                          Portability                Choice
JNI library   Best          High (data copy between the       Requires a native library  Good for large data
                            Java heap and native memory)
SIMD code     Moderate      No                                Java 16 or later           Good for small data; better than scalar code
Scalar code   Slow          No                                Any Java version           Backup path
Internal Library
Lots of Research for SIMD Sort and Join
What Sort Algorithm We Can Use
▪ Current Spark uses these algorithms, without SIMD:
– Radix sort
– Tim sort
▪ SIMD sort algorithms in existing research:
– AA-Sort (fast for data in the CPU data cache)
▪ Comb sort
▪ Merge sort
– Merge sort
– Quick sort
– …
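As a reference point, here is a scalar sketch of comb sort, the algorithm AA-Sort vectorizes for data that fits in the CPU data cache. This is plain Java, not the SIMD version; a SIMD variant performs the inner compare-and-swap on several elements per instruction:

```java
import java.util.Arrays;

public class CombSortDemo {
    // Scalar comb sort: shrink the gap by a factor of ~1.3 each pass;
    // once the gap reaches 1 it behaves like bubble sort and stops when
    // a full pass performs no swaps.
    static void combSort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (int) (gap / 1.3));
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] a = {5, 3, 8, 1, 9, 2};
        combSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 5, 8, 9]
    }
}
```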
Comb Sort is 2.5x Faster than Tim Sort
Sort 1,048,576 {key, value} long pairs (shorter is better):
Radix sort (Scalar)   84 ms
Comb sort (SIMD)     117 ms
Tim sort (Scalar)    292 ms
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Radix Sort is 1.4x Faster than Comb Sort
▪ Radix sort has a lower order of complexity than Comb sort
– O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions
(Same benchmark as above: Radix sort (Scalar) 84 ms vs. Comb sort (SIMD) 117 ms)
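For comparison, an LSD radix sort on signed 64-bit keys makes a fixed eight counting-sort passes (one per byte), which is where the O(N) behavior comes from. A plain-Java sketch, not Spark's actual implementation:

```java
import java.util.Arrays;

public class RadixSortDemo {
    // LSD radix sort on signed long keys: eight stable counting-sort
    // passes, one per byte. XOR with Long.MIN_VALUE flips the sign bit
    // so that signed order matches unsigned byte order.
    static void radixSort(long[] a) {
        long[] buf = new long[a.length];
        for (int shift = 0; shift < 64; shift += 8) {
            int[] count = new int[257];
            for (long v : a) {
                count[(int) (((v ^ Long.MIN_VALUE) >>> shift) & 0xFF) + 1]++;
            }
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i]; // prefix sums -> start offsets
            }
            for (long v : a) {
                buf[count[(int) (((v ^ Long.MIN_VALUE) >>> shift) & 0xFF)]++] = v;
            }
            System.arraycopy(buf, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        long[] a = {-1, 3, -5, 2};
        radixSort(a);
        System.out.println(Arrays.toString(a)); // [-5, -1, 2, 3]
    }
}
```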
Sort a Pair of Key and Value
▪ Compare two 64-bit keys and keep the pair with the smaller key
– This is a frequently executed operation
[Diagram: in0 = {1,-1}, {7,-7}; in1 = {5,-5}, {3,-3}; out = {1,-1}, {3,-3}]
▪ Sort the first pair: compare the keys (1 < 5) and keep {1,-1}
▪ Sort the second pair: compare the keys (7 > 3) and keep {3,-3}
Parallel Sort of Pairs using SIMD
▪ In parallel, compare the 64-bit keys of both pairs and select the pairs
with the smaller keys at once
– An example with a 256-bit-wide instruction
[Diagram: one SIMD comparison evaluates 1 < 5 and 7 > 3 simultaneously, producing out = {1,-1}, {3,-3}]
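A scalar Java equivalent of this pair selection, with each pair stored inline as two {key, value} slots in a long array (array layout and method name are illustrative). The SIMD versions below replace the per-pair branch with one compare plus blends:

```java
import java.util.Arrays;

public class PairMinDemo {
    // Each pair occupies two consecutive slots {key, value}; keep the
    // pair whose key is smaller.
    static long[] minPairs(long[] in0, long[] in1) {
        long[] out = new long[in0.length];
        for (int i = 0; i < out.length; i += 2) {
            if (in0[i] <= in1[i]) {
                out[i] = in0[i]; out[i + 1] = in0[i + 1];
            } else {
                out[i] = in1[i]; out[i + 1] = in1[i + 1];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The slide's example: keys 1 vs 5 and 7 vs 3
        long[] in0 = {1, -1, 7, -7};
        long[] in1 = {5, -5, 3, -3};
        System.out.println(Arrays.toString(minPairs(in0, in1))); // [1, -1, 3, -3]
    }
}
```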
No Shuffle in the C Version
▪ The result of the comparison is a mask that can be logically shifted, so
no shuffle is needed

__mmask8 mask = 0b10101010;
void swapPair(__m256i *x) {
  __mmask8 maska, maskb, maskA, maskB;
  __m256i t0, t4;
  maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
  maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
  // propagate each key-lane result to its value lane by a mask shift
  maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
  maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
  t0    = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
  t4    = _mm256_mask_blend_epi64(maskB, x[12], x[4]);
  x[8]  = _mm256_mask_blend_epi64(maskA, x[0], x[8]);
  x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]);
  x[0]  = t0;
  x[4]  = t4;
}
0 shuffle + 6 shift/or + 2 compare instructions
[Diagram: comparing the key lanes of x[0-3] = {1,-1},{7,-7} with x[4-7] = {5,-5},{3,-3} sets maska in the key lanes only; OR-ing maska with its shifted copy yields maskA, which covers both lanes of each pair.]
Reducing the number of shuffle instructions is an important optimization on x86_64 (“reduce port 5 pressure”)
4 Shuffles in the VectorAPI Version
▪ Since the result of the comparison (VectorMask) cannot be shifted, all
four vectors must be rearranged before the comparison

final VectorShuffle<Long> pair =
  VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
  LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
  xa = load x[i+0 … i+3];   xb = load x[i+4 … i+7];
  ya = load x[i+8 … i+11];  yb = load x[i+12 … i+15];
  xpa = xa.rearrange(pair);
  xpb = xb.rearrange(pair);
  ypa = ya.rearrange(pair);
  ypb = yb.rearrange(pair);
  VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
  VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
  xs = xa.blend(ya, maskA);
  xt = xb.blend(yb, maskB);
  ys = ya.blend(xa, maskA);
  yt = yb.blend(xb, maskB);
  xs.store(x[i+0 … i+3]);   xt.store(x[i+4 … i+7]);
  ys.store(x[i+8 … i+11]);  yt.store(x[i+12 … i+15]);
}
4 shuffle + 2 compare instructions (pseudocode: load/store stand for fromArray/intoArray)
[Diagram: xa = {1,-1},{7,-7} is rearranged so each key occupies both lanes of its pair ({1,1},{7,7}); the other input is rearranged to {5,5},{3,3}; a full-vector compare then yields maskA covering both lanes of each pair.]
Where is the Bottleneck in a Spark Sort Program?
▪ Most of the time is spent outside the sort routine

Sort algorithm   Elapsed time (ms)   Estimated time with SIMD (ms)
Radix sort       563                 563
Tim sort         757                 587
(Radix sort itself took 84 ms in the previous benchmark)

val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Sort Requires Additional Operation
▪ df.sort() always involves a costly exchange operation
– Data transfer among nodes
== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
+- InMemoryTableScan [a#5L]
+- ...
Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
– Its order is O(N), where N is the number of elements
▪ The sort operation involves other costly operations
▪ There is room for VectorAPI to exploit platform-specific SIMD
instructions
Generated Code
How is a DataFrame Program Translated?
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
class … {
…
}
DataFrame source program
Generated Java code
Catalyst Translates into Java Code
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program (the code above)

Catalyst pipeline: Create Logical Plans → Optimize Logical Plans →
Create Physical Plans → Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code
Current Generated Code
▪ Data is read in a vectorized style, but the computation is executed
row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      colA = columnarBatch.column(0);
      colB = columnarBatch.column(1);
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    float valA = colA.getFloat(batchIdx);
    float valB = colB.getFloat(batchIdx);
    float val0 = valA + valB;
    float val1 = valA * valB;
    appendRow(Row(val0, val1));
    batchIdx++;
  }
}
Simplified generated code
Computation is Inefficient in Current Code
▪ Reading data in a vectorized style (BatchRead) is efficient, but the
generated code still computes, and emits, one row at a time in
processNext
Prototyped Generated Code
▪ Data is read and computed in a vectorized style; output is still emitted
row by row

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fa = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fa.add(fb);
        FloatVector v1 = fa.mul(fb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Read data in a vector style; compute data in a vector style; put data at a row
Enhanced Code Generation in Catalyst
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program (the code above)

Catalyst pipeline: Create Logical Plans → Optimize Logical Plans →
Create Physical Plans → Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code with vectorized computation
Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
Using Scalar Variables
▪ Perform the computation for multiple rows in a batch

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += 1) {
        float valA = va[i];
        float valB = vb[i];
        col0[i] = valA + valB;
        col1[i] = valA * valB;
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
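Stripped of the Spark scaffolding, the batched computation above boils down to this plain-Java sketch (array and method names are illustrative, not actual Catalyst output):

```java
import java.util.Arrays;

public class BatchComputeDemo {
    // Compute both projections (a+b and a*b) for a whole column batch
    // before any row is emitted: the core of the vectorized-scalar code
    static void computeBatch(float[] va, float[] vb, float[] col0, float[] col1) {
        for (int i = 0; i < va.length; i++) {
            col0[i] = va[i] + vb[i];
            col1[i] = va[i] * vb[i];
        }
    }

    public static void main(String[] args) {
        float[] va = {1f, 2f, 3f};
        float[] vb = {2f, 4f, 6f};
        float[] col0 = new float[3], col1 = new float[3];
        computeBatch(va, vb, col0, col1);
        System.out.println(Arrays.toString(col0) + " " + Arrays.toString(col1));
    }
}
```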
Using VectorAPI
▪ Perform the computation for multiple rows in a batch using SIMD

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fa = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fa.add(fb);
        FloatVector v1 = fa.mul(fb);
        v0.intoArray(col0, i); v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Up to 1.7x Faster at a Micro Benchmark
▪ The vectorized versions achieve up to a 1.7x performance improvement
▪ The SIMD version achieves about a 1.3x improvement over the vectorized
scalar version

Current version       34.2 ms
Vectorized (Scalar)   26.6 ms
Vectorized (SIMD)     20.0 ms
Shorter is better

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
         .toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
2.8x Faster at a Nano Benchmark
▪ Performs the same computation as in the previous benchmark
– Add and multiply operations on 16384 float elements
void scalar(float a[], float b[],
float c[], float d[],
int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
d[i] = a[i] * b[i];
}
}
void simd(float a[], float b[], float c[],
float d[], int n) {
for (int i = 0; i < n; i += SPECIES.length()) {
FloatVector va = FloatVector
.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector
.fromArray(SPECIES, b, i);
FloatVector vc = va.add(vb);
FloatVector vd = va.mul(vb);
vc.intoArray(c, i);
vd.intoArray(d, i);
}
}
Scalar version SIMD version
2.8x faster
Now, Putting Data is the Bottleneck
▪ Data is read and computed in a vectorized style, but processNext still
emits the output one row at a time (one appendRow per row), just as in
the prototyped generated code shown earlier
Lessons Learned
▪ Vectorizing the computation is effective
▪ Using SIMD is also effective, but the improvement is not huge
▪ There is room to improve performance at the interface between the
generated code and its successor unit
Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD in three areas
– Good result for the matrix library (SPARK-33882 has been merged)
▪ Better than the pure-Java implementation
▪ Better than the native implementation for small data
– Room to improve the performance of the sort program
▪ VectorAPI implementation in the Java virtual machine
▪ Other parts to be improved in Apache Spark
– Good result for Catalyst
▪ Vectorizing the computation is effective
▪ The interface between computation units is important for performance
• c.f. “Vectorized Query Execution in Apache Spark at Facebook”, 2019
Visit https://www.slideshare.net/ishizaki if you are interested in this slide

Enabling Vectorized Engine in Apache Spark

  • 1.
    Enabling Vectorized Engine inApache Spark Kazuaki Ishizaki IBM Research - Tokyo
  • 2.
    About Me –Kazuaki Ishizaki ▪ Researcher at IBM Research – Tokyo https://ibm.biz/ishizaki – Compiler optimization, language runtime, and parallel processing ▪ Apache Spark committer from 2018/9 (SQL module) ▪ Work for IBM Java (Open J9, now) from 1996 – Technical lead for Just-in-time compiler for PowerPC ▪ ACM Distinguished Member ▪ SNS – @kiszk – https://www.slideshare.net/ishizaki/ 2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 3.
    Table of Contents ▪What are vectorization and SIMD? – How can SIMD improve performance? ▪ What is VectorAPI? – Why can’t the current Spark use SIMD? ▪ How to use SIMD with performance analysis 1. Replace external libraries 2. Use vectorized runtime routines such as sort 3. Generate vectorized Java code from a given SQL query by Catalyst 3 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 4.
    What is Vectorization? ▪Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time 4 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Scalar Vectorization Read one row at a time Read four rows at a time table table
  • 5.
    What is Vectorization? ▪Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time ▪ Spark already implemented multiple vectorizations – Vectorized Parquet Reader – Vectorized ORC Reader – Pandas UDF (a.k.a. vectorized UDF) 5 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 6.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double What is SIMD? 6 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 7.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction (8x in the example) What is SIMD? 7 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Vector register SIMD instruction A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 add add add add input 1 input 2 output add gr1,gr2,gr3 vadd vr1,vr2,vr3 Scalar instruction SIMD instruction A4 A5 A6 A7 B4 B5 B6 B7 C4 C5 C6 C7 add add add add A0 B0 C0 add input 1 input 2 output
  • 8.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction ▪ SIMD can be used to implement vectorization What is SIMD? 8 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 9.
    SIMD is Usedin Various BigData Software ▪ Database – DB2, Oracle, PostgreSQL, … ▪ SQL Query Engine – Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, … 9 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 10.
    Why Current SparkDoes Not Use SIMD? ▪ Java Virtual Machine (JVM) cannot ensure whether a given Java program will use SIMD 10 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code
  • 11.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 11 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not JVM
  • 12.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 12 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } Slower scalar code JVM
  • 13.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 13 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, a[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Faster SIMD code Slower scalar code JVM
  • 14.
    New Approach: VectorAPI ▪VectorAPI can ensure the generated code will use SIMD 14 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI SIMD can be always generated for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Scalar code SIMD may be generated or not SIMD length (e.g. 8)
  • 15.
    New Approach: VectorAPI ▪VectorAPI can ensure the generated code will use SIMD 15 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, a[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Pseudo native SIMD code
  • 16.
    Where We CanUse SIMD in Spark 16 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 17.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 17 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 18.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … 18 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 19.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … ▪ Generated code at runtime – Java program translated from DataFrame program by Catalyst 19 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 20.
    Where and HowWe Can Use SIMD in Spark ▪ External library – Write VectorAPI code by hand – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Write VectorAPI code by hand – Sort, Join, … ▪ Generated code at runtime – Generate VectorAPI code by Catalyst – Catalyst translates DataFrame program info Java program 20 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 21.
    External Library More texton one line in this location if needed
  • 22.
    Three Approaches ▪ JNI(Java Native Interface) library – Call highly-optimized binary (e.g. written in C or Fortran) thru JNI library ▪ SIMD code – Call Java VectorAPI code if JVM supports VectorAPI ▪ Scalar code – Call naïve Java code that runs on all JVMs 22 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
Implementation using VectorAPI
▪ An example of matrix operation kernels (SPARK-33882)

// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx,
                  double[] y, int incy) {
  ...
  DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
  int i = 0;
  // vectorized part
  for (; i < DMAX.loopBound(n); i += DMAX.length()) {
    DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
    DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
    vx.fma(valpha, vy).intoArray(y, i);
  }
  // residual part
  for (; i < n; i += 1) {
    y[i] += alpha * x[i];
  }
  ...
}
Benchmark for Large-size Data
▪ JNI achieves the best performance

Elapsed time (ms):
  Algorithm            Data size (double type)    JNI    VectorAPI   Scalar
  daxpy (Y += a * X)   10,000,000                 1.3    14.6        18.2
  dgemm (Z = X * Y)    1000x1000 * 1000x100       1.3    40.6        81.1

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Benchmark for Small-size Data
▪ VectorAPI achieves the best performance

Elapsed time (ns):
  Algorithm            Data size (double type)    JNI    VectorAPI   Scalar
  daxpy (Y += a * X)   256                        118    27          140
  dgemm (Z = X * Y)    8x8 * 8x8                  555    365         679

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Summary of Three Approaches
▪ JNI library – Performance: best; Overhead: high (data copy between Java heap and native memory); Portability: requires a native library; Choice: good for large data
▪ SIMD code – Performance: moderate; Overhead: none; Portability: Java 16 or later; Choice: good for small data, better than scalar code
▪ Scalar code – Performance: slow; Overhead: none; Portability: any Java version; Choice: backup path
Internal Library
Lots of Research on SIMD Sort and Join
What Sort Algorithm We Can Use
▪ Current Spark uses, without SIMD:
  – Radix sort
  – Tim sort
▪ SIMD sort algorithms in existing research:
  – AA-Sort
    ▪ Comb sort (fast for data in the CPU data cache)
    ▪ Merge sort
  – Merge sort
  – Quick sort
  – …
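A scalar comb sort, the in-cache building block of AA-Sort, can be sketched as follows. This is an illustrative version, not Spark's or AA-Sort's implementation; the SIMD variant applies the same gap-based compare-exchange across whole vector lanes at once.

```java
import java.util.Arrays;

// Minimal scalar comb sort: bubble-sort-like passes with a shrinking gap.
// The gap-based compare-exchange maps naturally onto SIMD lanes, which is
// why AA-Sort uses comb sort for blocks that fit in the CPU data cache.
public class CombSort {
    static void combSort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (int) (gap / 1.3));  // shrink factor of ~1.3
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {           // compare-exchange at distance gap
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] a = {5, 1, 4, 2, 3};
        combSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```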
Comb Sort is 2.5x Faster than Tim Sort

Sort 1,048,576 long {key, value} pairs (shorter is better):
  Radix sort (Scalar):   84ms
  Comb sort (SIMD):     117ms
  Tim sort (Scalar):    292ms

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Radix Sort is 1.4x Faster than Comb Sort
▪ Radix sort's asymptotic complexity is lower than Comb sort's: O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions

Same benchmark as the previous slide:
  Radix sort (Scalar):   84ms
  Comb sort (SIMD):     117ms
  Tim sort (Scalar):    292ms
Sort a Pair of Key and Value
▪ Compare two 64-bit keys and take the pair with the smaller key
  – This is a frequently executed operation
▪ Example with in0 = {1, -1, 7, -7} and in1 = {5, -5, 3, -3}:
  – First pair: 1 < 5, so take {1, -1}
  – Second pair: 7 > 3, so take {3, -3}
  – out = {1, -1, 3, -3}
Parallel Sort of a Pair using SIMD
▪ In parallel, compare two 64-bit keys and take the pair with the smaller key at once
  – With a 256-bit-wide instruction, both comparisons (1 < 5 and 7 > 3) happen in a single instruction
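The scalar equivalent of this pairwise-minimum step, using the same {key, value} layout as the figures above, can be sketched as follows. This is an illustration only; the 256-bit SIMD version performs two such pair comparisons per instruction.

```java
import java.util.Arrays;

// Scalar sketch of the pairwise minimum: given two arrays of {key, value}
// pairs laid out as [k0, v0, k1, v1, ...], keep the pair with the smaller
// key at each position. Keys are compared; values just ride along.
public class PairMin {
    static long[] minPairs(long[] in0, long[] in1) {
        long[] out = new long[in0.length];
        for (int i = 0; i < in0.length; i += 2) {  // step over {key, value}
            boolean take0 = in0[i] < in1[i];       // compare keys only
            out[i]     = take0 ? in0[i]     : in1[i];
            out[i + 1] = take0 ? in0[i + 1] : in1[i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] in0 = {1, -1, 7, -7};
        long[] in1 = {5, -5, 3, -3};
        System.out.println(Arrays.toString(minPairs(in0, in1))); // [1, -1, 3, -3]
    }
}
```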
No Shuffle in the C Version
▪ The result of the compare (a mask register) can be logically shifted without a shuffle
▪ Reducing the number of shuffle instructions is an important optimization on x86_64 ("reduce port 5 pressure")

__mmask8 mask = 0b10101010;
void shufflePair(__m256i *x) {
  __mmask8 maska, maskb, maskA, maskB;
  maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
  maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
  maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
  maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
  x[0] = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
  x[4] = _mm256_mask_blend_epi64(maskA, x[12], x[4]);
  x[8] = _mm256_mask_blend_epi64(maskB, x[0], x[8]);
  x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]);
}

0 shuffle + 6 shift/or + 2 compare instructions
4 Shuffles in the VectorAPI Version
▪ Since the result of the comparison (VectorMask) cannot be shifted, all four values must be shuffled (rearranged) before the comparison

final VectorShuffle pair = VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
  LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
  xa = load x[i+0 … i+3];   xb = load x[i+4 … i+7];
  ya = load x[i+8 … i+11];  yb = load x[i+12 … i+15];
  xpa = xa.rearrange(pair); xpb = xb.rearrange(pair);
  ypa = ya.rearrange(pair); ypb = yb.rearrange(pair);
  VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
  VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
  xs = xa.blend(ya, maskA); xt = xb.blend(yb, maskB);
  ys = ya.blend(xa, maskA); yt = yb.blend(xb, maskB);
  xs.store(x[i+0 … i+3]);   xt.store(x[i+4 … i+7]);
  ys.store(x[i+8 … i+11]);  yt.store(x[i+12 … i+15]);
}

4 shuffle + 2 compare instructions
Where is the Bottleneck in the Spark Sort Program?
▪ Most of the time is spent outside the sort routine itself

val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()

  Sort algorithm   Elapsed time (ms)   Estimated time with SIMD (ms)
  Radix sort       563                 563
  Tim sort         757                 587

(Radix sort itself took only 84ms in the previous benchmark)
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Sort Requires an Additional Operation
▪ df.sort() always involves a costly Exchange operation
  – Data transfer among nodes

== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
   +- InMemoryTableScan [a#5L]
      +- ...
Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
  – Order is O(N), where N is the number of elements
▪ The sort operation involves other costly operations
▪ There is room for VectorAPI to exploit platform-specific SIMD instructions
Generated Code
How is a DataFrame Program Translated?

DataFrame source program:
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat)).toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()

→ Generated Java code: class … { … }
Catalyst Translates into Java Code
▪ Catalyst takes the DataFrame source program from the previous slide and emits Java code
▪ Catalyst pipeline: Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code
Current Generated Code
▪ Data is read in a vector (columnar) style, but the computation is executed row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      colA = columnarBatch.column(0);
      colB = columnarBatch.column(1);
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    float valA = colA.getFloat(batchIdx);
    float valB = colB.getFloat(batchIdx);
    float val0 = valA + valB;
    float val1 = valA * valB;
    appendRow(Row(val0, val1));
    batchIdx++;
  }
}
Simplified generated code
Computation is Inefficient in the Current Code
▪ Same generated code as the previous slide, annotated:
  – BatchRead(): reads data in a vector style (efficient)
  – processNext(): computes data at a row and puts data at a row (inefficient)
Prototyped Generated Code
▪ Read and compute data in a vector style; putting data is still row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fva = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fva.add(fvb);
        FloatVector v1 = fva.mul(fvb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));  // put data at a row
    batchIdx++;
  }
}
Enhanced Code Generation in Catalyst
▪ The Catalyst pipeline is unchanged (Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code)
▪ The "Generate Java code" step now emits Java code with vectorized computation
Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
Using Scalar Variables
▪ Perform the computation for multiple rows in a batch

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE],
        col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += 1) {
        float valA = va[i];
        float valB = vb[i];
        col0[i] = valA + valB;
        col1[i] = valA * valB;
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
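The difference between the current row-at-a-time shape and the batched shape can be demonstrated with a standalone sketch. All names here are hypothetical, not generated code; it only checks that the two shapes compute the same selectExpr("a+b", "a*b") results.

```java
// Contrast the two code shapes: row-at-a-time (current codegen) versus
// batch-at-a-time into column arrays (the prototype). Both evaluate
// "a+b" and "a*b" over float columns and must agree element-wise.
public class BatchedEval {
    // Current style: compute one output row per call.
    static float[] rowAtATime(float a, float b) {
        return new float[] { a + b, a * b };
    }

    // Prototype style: compute the whole batch into output columns first.
    static void batch(float[] a, float[] b, float[] col0, float[] col1) {
        for (int i = 0; i < a.length; i++) {
            col0[i] = a[i] + b[i];
            col1[i] = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f}, b = {2f, 4f, 6f};
        float[] col0 = new float[3], col1 = new float[3];
        batch(a, b, col0, col1);
        for (int i = 0; i < 3; i++) {  // both shapes agree on every row
            float[] row = rowAtATime(a[i], b[i]);
            if (row[0] != col0[i] || row[1] != col1[i]) throw new AssertionError();
        }
        System.out.println(col0[2] + " " + col1[2]); // 9.0 18.0
    }
}
```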
Using VectorAPI
▪ Perform the computation for multiple rows in a batch using SIMD

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE],
        col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fva = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fva.add(fvb);
        FloatVector v1 = fva.mul(fvb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Up to 1.7x Faster in a Micro Benchmark
▪ The vectorized versions achieve up to 1.7x performance improvement
▪ The SIMD version is about 1.2x faster than the vectorized scalar version

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat)).toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()

Shorter is better:
  Current version:      34.2ms
  Vectorized (Scalar):  26.6ms
  Vectorized (SIMD):    20.0ms

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
2.8x Faster in a Nano Benchmark
▪ Performs the same computation as the previous benchmark
  – Add and multiply operations over 16,384 float elements
▪ The SIMD version is 2.8x faster than the scalar version

// Scalar version
void scalar(float a[], float b[], float c[], float d[], int n) {
  for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
    d[i] = a[i] * b[i];
  }
}

// SIMD version (2.8x faster)
void simd(float a[], float b[], float c[], float d[], int n) {
  for (int i = 0; i < n; i += SPECIES.length()) {
    FloatVector va = FloatVector.fromArray(SPECIES, a, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
    FloatVector vc = va.add(vb);
    FloatVector vd = va.mul(vb);
    vc.intoArray(c, i);
    vd.intoArray(d, i);
  }
}
Now, Putting Data is the Bottleneck
▪ Same prototyped code as before: reading and computing data are in a vector style, but putting data is still row by row
  – processNext() still appends one Row at a time to the successor operator
Lessons Learned
▪ Vectorizing the computation is effective
▪ Using SIMD is also effective, but not a huge improvement
▪ There is room to improve performance at the interface between the generated code and its successor unit
Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD in three areas
  – Good result for the matrix library (SPARK-33882 has been merged)
    ▪ Better than the scalar Java implementation
    ▪ Better for small data than the native implementation
  – Room to improve the performance of the sort program
    ▪ VectorAPI implementation in the Java virtual machine
    ▪ Other parts to be improved in Apache Spark
  – Good result for Catalyst
    ▪ Vectorizing the computation is effective
    ▪ The interface between computation units is important for performance
      • c.f. "Vectorized Query Execution in Apache Spark at Facebook", 2019
▪ Visit https://www.slideshare.net/ishizaki if you are interested in this slide