Enabling Vectorized Engine
in Apache Spark
Kazuaki Ishizaki
IBM Research - Tokyo
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Apache Spark committer since 2018/9 (SQL module)
▪ Worked on IBM Java (now OpenJ9) since 1996
– Technical lead for the just-in-time compiler for PowerPC
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– https://www.slideshare.net/ishizaki/
2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
Table of Contents
▪ What are vectorization and SIMD?
– How can SIMD improve performance?
▪ What is VectorAPI?
– Why can’t the current Spark use SIMD?
▪ How to use SIMD with performance analysis
1. Replace external libraries
2. Use vectorized runtime routines such as sort
3. Generate vectorized Java code from a given SQL query by Catalyst
What is Vectorization?
▪ Do multiple jobs in a batch to improve performance
– Read multiple rows at a time
– Compute multiple rows at a time
[Diagram: scalar processing reads one row of a table at a time; vectorization reads four rows at a time.]
▪ Spark has already implemented several vectorizations
– Vectorized Parquet Reader
– Vectorized ORC Reader
– Pandas UDF (a.k.a. vectorized UDF)
What is SIMD?
▪ Apply the same operation to multiple primitive-type data elements in one
instruction (Single Instruction Multiple Data: SIMD)
– Boolean, Short, Integer, Long, Float, and Double
– Increases the parallelism within an instruction (8x in the example)
[Diagram: a scalar instruction (add gr1,gr2,gr3) produces one result per instruction (A0 + B0 = C0); a SIMD instruction (vadd vr1,vr2,vr3) operates on vector registers and adds eight pairs at once (A0..A7 + B0..B7 = C0..C7).]
▪ SIMD can be used to implement vectorization
SIMD is Used in Various BigData Software
▪ Database
– DB2, Oracle, PostgreSQL, …
▪ SQL Query Engine
– Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, …
Why Doesn't Current Spark Use SIMD?
▪ The Java Virtual Machine (JVM) cannot guarantee whether a given Java
program will use SIMD
– We rely on the HotSpot compiler in the JVM to generate SIMD
instructions, or not

for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
Java code: SIMD may be generated or not

The JVM may produce slower scalar code:
for (int i = 0; i < n; i++) {
  load r1, a[i * 4]
  load r2, b[i * 4]
  add r3, r1, r2
  store r3, c[i * 4]
}

or faster SIMD code:
for (int i = 0; i < n / 8; i++) {
  vload vr1, a[i * 4 * 8]
  vload vr2, b[i * 4 * 8]
  vadd vr3, vr1, vr2
  vstore vr3, c[i * 4 * 8]
}
New Approach: VectorAPI
▪ VectorAPI can guarantee that the generated code uses SIMD

import jdk.incubator.vector.*;
int a[], b[], c[];
...
for (int i = 0; i < n; i += SPECIES.length()) { // SPECIES.length() = SIMD length (e.g. 8)
  var va = IntVector.fromArray(SPECIES, a, i);
  var vb = IntVector.fromArray(SPECIES, b, i);
  var vc = va.add(vb);
  vc.intoArray(c, i);
}
VectorAPI: SIMD is always generated

for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
Scalar code: SIMD may be generated or not
New Approach: VectorAPI
▪ The VectorAPI loop above compiles to native SIMD code like:

for (int i = 0; i < n / 8; i++) {
  vload vr1, a[i * 4 * 8]
  vload vr2, b[i * 4 * 8]
  vadd vr3, vr1, vr2
  vstore vr3, c[i * 4 * 8]
}
Pseudo native SIMD code
Where and How We Can Use SIMD in Spark
▪ External library – Write VectorAPI code by hand
– BLAS library (matrix operation)
▪ SPARK-33882
▪ Internal library – Write VectorAPI code by hand
– Sort, Join, …
▪ Generated code at runtime – Generate VectorAPI code by Catalyst
– Catalyst translates a DataFrame program into a Java program
External Library
Three Approaches
▪ JNI (Java Native Interface) library
– Call a highly optimized binary (e.g. written in C or Fortran) through a JNI library
▪ SIMD code
– Call Java VectorAPI code if JVM supports VectorAPI
▪ Scalar code
– Call naïve Java code that runs on all JVMs
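A minimal sketch of how these three approaches can be layered behind one dispatch point. The class names JniBlas and VectorApiBlas are hypothetical (not Spark's or any library's actual API); only the scalar fallback is implemented, and each faster candidate is probed and skipped if unavailable:

```java
// Sketch of the three-level dispatch described above.
interface Blas {
    void daxpy(int n, double alpha, double[] x, double[] y);
}

class ScalarBlas implements Blas {
    // Naive Java code that runs on any JVM: the backup path
    public void daxpy(int n, double alpha, double[] x, double[] y) {
        for (int i = 0; i < n; i++) {
            y[i] += alpha * x[i];
        }
    }
}

public class BlasLoader {
    // Probe the fastest implementation first; each candidate may be
    // unavailable (missing native library, JVM without jdk.incubator.vector)
    public static Blas load() {
        for (String cls : new String[] {"JniBlas", "VectorApiBlas"}) {
            try {
                return (Blas) Class.forName(cls).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // fall through to the next candidate
            }
        }
        return new ScalarBlas();
    }

    public static void main(String[] args) {
        Blas blas = load();          // falls back to ScalarBlas here
        double[] y = {1.0, 1.0};
        blas.daxpy(2, 2.0, new double[] {1.0, 2.0}, y);
        System.out.println(y[0] + " " + y[1]); // 3.0 5.0
    }
}
```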
Implementation using VectorAPI
▪ An example of matrix operation kernels
// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx, double[] y, int incy) {
...
DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
int i = 0;
// vectorized part
for (; i < DMAX.loopBound(n); i += DMAX.length()) {
DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
vx.fma(valpha, vy).intoArray(y, i);
}
// residual part
for (; i < n; i += 1) {
y[i] += alpha * x[i];
}
...
}
SPARK-33882
Benchmark for Large-size Data
▪ JNI achieves the best performance
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Algorithm            Data size (double type)    Elapsed time (ms): JNI / VectorAPI / Scalar
daxpy (Y += a * X)   10,000,000                 1.3 / 14.6 / 18.2
dgemm (Z = X * Y)    1000x1000 * 1000x100       1.3 / 40.6 / 81.1
Benchmark for Small-size Data
▪ VectorAPI achieves the best performance
Algorithm            Data size (double type)    Elapsed time (ns): JNI / VectorAPI / Scalar
daxpy (Y += a * X)   256                        118 / 27 / 140
dgemm (Z = X * Y)    8x8 * 8x8                  555 / 365 / 679
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Summary of Three Approaches
Approach      Performance   Overhead                          Portability                Choice
JNI library   Best          High (data copy between the       Requires a native library  Good for large data
                            Java heap and native memory)
SIMD code     Moderate      No                                Java 16 or later           Good for small data; better than scalar code
Scalar code   Slow          No                                Any Java version           Backup path
Internal Library
Lots of Research for SIMD Sort and Join
What Sort Algorithm We Can Use
▪ Current Spark uses these algorithms, without SIMD:
– Radix sort
– Tim sort
▪ SIMD sort algorithms in existing research:
– AA-Sort (fast for data in the CPU data cache)
▪ Comb sort
▪ Merge sort
– Merge sort
– Quick sort
– …
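As a reference point, here is a scalar sketch of comb sort, the algorithm AA-Sort vectorizes for data that fits in the CPU data cache. This is plain Java, not the SIMD version; a SIMD variant performs the inner compare-and-swap on several elements per instruction:

```java
import java.util.Arrays;

public class CombSortDemo {
    // Scalar comb sort: shrink the gap by a factor of ~1.3 each pass;
    // once the gap reaches 1 it behaves like bubble sort and stops when
    // a full pass performs no swaps.
    static void combSort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (int) (gap / 1.3));
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] a = {5, 3, 8, 1, 9, 2};
        combSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 5, 8, 9]
    }
}
```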
Comb Sort is 2.5x Faster than Tim Sort
Sort 1,048,576 {key, value} long pairs (shorter is better):
Radix sort (Scalar)   84 ms
Comb sort (SIMD)     117 ms
Tim sort (Scalar)    292 ms
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Radix Sort is 1.4x Faster than Comb Sort
▪ Radix sort has a lower order of complexity than Comb sort
– O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions
(Same benchmark as above: Radix sort (Scalar) 84 ms vs. Comb sort (SIMD) 117 ms)
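For comparison, an LSD radix sort on signed 64-bit keys makes a fixed eight counting-sort passes (one per byte), which is where the O(N) behavior comes from. A plain-Java sketch, not Spark's actual implementation:

```java
import java.util.Arrays;

public class RadixSortDemo {
    // LSD radix sort on signed long keys: eight stable counting-sort
    // passes, one per byte. XOR with Long.MIN_VALUE flips the sign bit
    // so that signed order matches unsigned byte order.
    static void radixSort(long[] a) {
        long[] buf = new long[a.length];
        for (int shift = 0; shift < 64; shift += 8) {
            int[] count = new int[257];
            for (long v : a) {
                count[(int) (((v ^ Long.MIN_VALUE) >>> shift) & 0xFF) + 1]++;
            }
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i]; // prefix sums -> start offsets
            }
            for (long v : a) {
                buf[count[(int) (((v ^ Long.MIN_VALUE) >>> shift) & 0xFF)]++] = v;
            }
            System.arraycopy(buf, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        long[] a = {-1, 3, -5, 2};
        radixSort(a);
        System.out.println(Arrays.toString(a)); // [-5, -1, 2, 3]
    }
}
```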
Sort a Pair of Key and Value
▪ Compare two 64-bit keys and keep the pair with the smaller key
– This is a frequently executed operation
[Diagram: in0 = {1,-1}, {7,-7}; in1 = {5,-5}, {3,-3}; out = {1,-1}, {3,-3}]
▪ Sort the first pair: compare the keys (1 < 5) and keep {1,-1}
▪ Sort the second pair: compare the keys (7 > 3) and keep {3,-3}
Parallel Sort of Pairs using SIMD
▪ In parallel, compare the 64-bit keys of both pairs and select the pairs
with the smaller keys at once
– An example with a 256-bit-wide instruction
[Diagram: one SIMD comparison evaluates 1 < 5 and 7 > 3 simultaneously, producing out = {1,-1}, {3,-3}]
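A scalar Java equivalent of this pair selection, with each pair stored inline as two {key, value} slots in a long array (array layout and method name are illustrative). The SIMD versions below replace the per-pair branch with one compare plus blends:

```java
import java.util.Arrays;

public class PairMinDemo {
    // Each pair occupies two consecutive slots {key, value}; keep the
    // pair whose key is smaller.
    static long[] minPairs(long[] in0, long[] in1) {
        long[] out = new long[in0.length];
        for (int i = 0; i < out.length; i += 2) {
            if (in0[i] <= in1[i]) {
                out[i] = in0[i]; out[i + 1] = in0[i + 1];
            } else {
                out[i] = in1[i]; out[i + 1] = in1[i + 1];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The slide's example: keys 1 vs 5 and 7 vs 3
        long[] in0 = {1, -1, 7, -7};
        long[] in1 = {5, -5, 3, -3};
        System.out.println(Arrays.toString(minPairs(in0, in1))); // [1, -1, 3, -3]
    }
}
```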
No Shuffle in the C Version
▪ The result of the comparison is a mask that can be logically shifted, so
no shuffle is needed

__mmask8 mask = 0b10101010;
void swapPair(__m256i *x) {
  __mmask8 maska, maskb, maskA, maskB;
  __m256i t0, t4;
  maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
  maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
  // propagate each key-lane result to its value lane by a mask shift
  maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
  maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
  t0    = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
  t4    = _mm256_mask_blend_epi64(maskB, x[12], x[4]);
  x[8]  = _mm256_mask_blend_epi64(maskA, x[0], x[8]);
  x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]);
  x[0]  = t0;
  x[4]  = t4;
}
0 shuffle + 6 shift/or + 2 compare instructions
[Diagram: comparing the key lanes of x[0-3] = {1,-1},{7,-7} with x[4-7] = {5,-5},{3,-3} sets maska in the key lanes only; OR-ing maska with its shifted copy yields maskA, which covers both lanes of each pair.]
Reducing the number of shuffle instructions is an important optimization on x86_64 (“reduce port 5 pressure”)
4 Shuffles in the VectorAPI Version
▪ Since the result of the comparison (VectorMask) cannot be shifted, all
four vectors must be rearranged before the comparison

final VectorShuffle<Long> pair =
  VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
  LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
  xa = load x[i+0 … i+3];   xb = load x[i+4 … i+7];
  ya = load x[i+8 … i+11];  yb = load x[i+12 … i+15];
  xpa = xa.rearrange(pair);
  xpb = xb.rearrange(pair);
  ypa = ya.rearrange(pair);
  ypb = yb.rearrange(pair);
  VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
  VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
  xs = xa.blend(ya, maskA);
  xt = xb.blend(yb, maskB);
  ys = ya.blend(xa, maskA);
  yt = yb.blend(xb, maskB);
  xs.store(x[i+0 … i+3]);   xt.store(x[i+4 … i+7]);
  ys.store(x[i+8 … i+11]);  yt.store(x[i+12 … i+15]);
}
4 shuffle + 2 compare instructions (pseudocode: load/store stand for fromArray/intoArray)
[Diagram: xa = {1,-1},{7,-7} is rearranged so each key occupies both lanes of its pair ({1,1},{7,7}); the other input is rearranged to {5,5},{3,3}; a full-vector compare then yields maskA covering both lanes of each pair.]
Where is the Bottleneck in a Spark Sort Program?
▪ Most of the time is spent outside the sort routine

Sort algorithm   Elapsed time (ms)   Estimated time with SIMD (ms)
Radix sort       563                 563
Tim sort         757                 587
(Radix sort itself took 84 ms in the previous benchmark)

val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Sort Requires Additional Operation
▪ df.sort() always involves a costly exchange operation
– Data transfer among nodes
== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
+- InMemoryTableScan [a#5L]
+- ...
Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
– Its order is O(N), where N is the number of elements
▪ The sort operation involves other costly operations
▪ There is room for VectorAPI to exploit platform-specific SIMD
instructions
Generated Code
How is a DataFrame Program Translated?
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
class … {
…
}
DataFrame source program
Generated Java code
Catalyst Translates into Java Code
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program (the code above)

Catalyst pipeline: Create Logical Plans → Optimize Logical Plans →
Create Physical Plans → Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code
Current Generated Code
▪ Data is read in a vectorized style, but the computation is executed
row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      colA = columnarBatch.column(0);
      colB = columnarBatch.column(1);
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    float valA = colA.getFloat(batchIdx);
    float valB = colB.getFloat(batchIdx);
    float val0 = valA + valB;
    float val1 = valA * valB;
    appendRow(Row(val0, val1));
    batchIdx++;
  }
}
Simplified generated code
Computation is Inefficient in Current Code
▪ Reading data in a vectorized style (BatchRead) is efficient, but the
generated code still computes, and emits, one row at a time in
processNext
Prototyped Generated Code
▪ Data is read and computed in a vectorized style; output is still emitted
row by row

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fa = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fa.add(fb);
        FloatVector v1 = fa.mul(fb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Read data in a vector style; compute data in a vector style; put data at a row
Enhanced Code Generation in Catalyst
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
.toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()
DataFrame source program (the code above)

Catalyst pipeline: Create Logical Plans → Optimize Logical Plans →
Create Physical Plans → Select Physical Plans → Generate Java code

class … {
  …
}
Generated Java code with vectorized computation
Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
Using Scalar Variables
▪ Perform the computation for multiple rows in a batch

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += 1) {
        float valA = va[i];
        float valB = vb[i];
        col0[i] = valA + valB;
        col1[i] = valA * valB;
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
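Stripped of the Spark scaffolding, the batched computation above boils down to this plain-Java sketch (array and method names are illustrative, not actual Catalyst output):

```java
import java.util.Arrays;

public class BatchComputeDemo {
    // Compute both projections (a+b and a*b) for a whole column batch
    // before any row is emitted: the core of the vectorized-scalar code
    static void computeBatch(float[] va, float[] vb, float[] col0, float[] col1) {
        for (int i = 0; i < va.length; i++) {
            col0[i] = va[i] + vb[i];
            col1[i] = va[i] * vb[i];
        }
    }

    public static void main(String[] args) {
        float[] va = {1f, 2f, 3f};
        float[] vb = {2f, 4f, 6f};
        float[] col0 = new float[3], col1 = new float[3];
        computeBatch(va, vb, col0, col1);
        System.out.println(Arrays.toString(col0) + " " + Arrays.toString(col1));
    }
}
```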
Using VectorAPI
▪ Perform the computation for multiple rows in a batch using SIMD

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE], col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fa = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fa.add(fb);
        FloatVector v1 = fa.mul(fb);
        v0.intoArray(col0, i); v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Up to 1.7x Faster at a Micro Benchmark
▪ The vectorized versions achieve up to a 1.7x performance improvement
▪ The SIMD version achieves about a 1.3x improvement over the vectorized
scalar version

Current version       34.2 ms
Vectorized (Scalar)   26.6 ms
Vectorized (SIMD)     20.0 ms
Shorter is better

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat))
         .toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
2.8x Faster at a Nano Benchmark
▪ Performs the same computation as in the previous benchmark
– Add and multiply operations on 16384 float elements
void scalar(float a[], float b[],
float c[], float d[],
int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
d[i] = a[i] * b[i];
}
}
void simd(float a[], float b[], float c[],
float d[], int n) {
for (int i = 0; i < n; i += SPECIES.length()) {
FloatVector va = FloatVector
.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector
.fromArray(SPECIES, b, i);
FloatVector vc = va.add(vb);
FloatVector vd = va.mul(vb);
vc.intoArray(c, i);
vd.intoArray(d, i);
}
}
Scalar version SIMD version
2.8x faster
Now, Putting Data is the Bottleneck
▪ Data is read and computed in a vectorized style, but processNext still
emits the output one row at a time (one appendRow per row), just as in
the prototyped generated code shown earlier
Lessons Learned
▪ Vectorizing the computation is effective
▪ Using SIMD is also effective, but the improvement is not huge
▪ There is room to improve performance at the interface between the
generated code and its successor unit
Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD in three areas
– Good result for the matrix library (SPARK-33882 has been merged)
▪ Better than the pure-Java implementation
▪ Better than the native implementation for small data
– Room to improve the performance of the sort program
▪ VectorAPI implementation in the Java virtual machine
▪ Other parts to be improved in Apache Spark
– Good result for Catalyst
▪ Vectorizing the computation is effective
▪ The interface between computation units is important for performance
• c.f. “Vectorized Query Execution in Apache Spark at Facebook”, 2019
Visit https://www.slideshare.net/ishizaki if you are interested in this slide

Enabling Vectorized Engine in Apache Spark

  • 1.
    Enabling Vectorized Engine inApache Spark Kazuaki Ishizaki IBM Research - Tokyo
  • 2.
    About Me –Kazuaki Ishizaki ▪ Researcher at IBM Research – Tokyo https://ibm.biz/ishizaki – Compiler optimization, language runtime, and parallel processing ▪ Apache Spark committer from 2018/9 (SQL module) ▪ Work for IBM Java (Open J9, now) from 1996 – Technical lead for Just-in-time compiler for PowerPC ▪ ACM Distinguished Member ▪ SNS – @kiszk – https://www.slideshare.net/ishizaki/ 2 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 3.
    Table of Contents ▪What are vectorization and SIMD? – How can SIMD improve performance? ▪ What is VectorAPI? – Why can’t the current Spark use SIMD? ▪ How to use SIMD with performance analysis 1. Replace external libraries 2. Use vectorized runtime routines such as sort 3. Generate vectorized Java code from a given SQL query by Catalyst 3 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 4.
    What is Vectorization? ▪Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time 4 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Scalar Vectorization Read one row at a time Read four rows at a time table table
  • 5.
    What is Vectorization? ▪Do multiple jobs in a batch to improve performance – Read multiple rows at a time – Compute multiple rows at a time ▪ Spark already implemented multiple vectorizations – Vectorized Parquet Reader – Vectorized ORC Reader – Pandas UDF (a.k.a. vectorized UDF) 5 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 6.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double What is SIMD? 6 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 7.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction (8x in the example) What is SIMD? 7 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki Vector register SIMD instruction A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 add add add add input 1 input 2 output add gr1,gr2,gr3 vadd vr1,vr2,vr3 Scalar instruction SIMD instruction A4 A5 A6 A7 B4 B5 B6 B7 C4 C5 C6 C7 add add add add A0 B0 C0 add input 1 input 2 output
  • 8.
    ▪ Apply thesame operation to primitive-type multiple data in an instruction (Single Instruction Multiple Data: SIMD) – Boolean, Short, Integer, Long, Float, and Double – Increase the parallelism in an instruction ▪ SIMD can be used to implement vectorization What is SIMD? 8 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 9.
    SIMD is Usedin Various BigData Software ▪ Database – DB2, Oracle, PostgreSQL, … ▪ SQL Query Engine – Delta Engine in Databricks Runtime, Apache Impala, Apache Drill, … 9 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 10.
    Why Current SparkDoes Not Use SIMD? ▪ Java Virtual Machine (JVM) cannot ensure whether a given Java program will use SIMD 10 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code
  • 11.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 11 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not JVM
  • 12.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 12 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } Slower scalar code JVM
  • 13.
    Why Current SparkDo Not Use SIMD? ▪ Java Virtual Machine (JVM) can not ensure whether a given Java program will use SIMD – We rely on HotSpot compiler in JVM to generate SIMD instructions or not 13 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Java code SIMD may be generated or not for (int i = 0; i < n; i++) { load r1, a[i * 4] load r2, b[i * 4] add r3, r1, r2 store r3, c[i * 4] } for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, a[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Faster SIMD code Slower scalar code JVM
  • 14.
    New Approach: VectorAPI ▪VectorAPI can ensure the generated code will use SIMD 14 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI SIMD can be always generated for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } Scalar code SIMD may be generated or not SIMD length (e.g. 8)
  • 15.
    New Approach: VectorAPI ▪VectorAPI can ensure the generated code will use SIMD 15 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki import jdk.incubator.vector.*; int a[], b[], c[]; ... for (int i = 0; i < n; i += SPECIES.length()) { var va = IntVector.fromArray(SPECIES, a, i); var vb = IntVector.fromArray(SPECIES, b, i); var vc = va.add(vb); vc.intoArray(c, i); } VectorAPI for (int i = 0; i < n / 8; i++) { vload vr1, a[i * 4 * 8] vload vr2, a[i * 4 * 8] vadd vr3, vr1, vr2 vstore vr3, c[i * 4 * 8] } Pseudo native SIMD code
  • 16.
    Where We CanUse SIMD in Spark 16 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 17.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 17 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 18.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … 18 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 19.
    Where We CanUse SIMD in Spark ▪ External library – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Sort, Join, … ▪ Generated code at runtime – Java program translated from DataFrame program by Catalyst 19 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 20.
    Where and HowWe Can Use SIMD in Spark ▪ External library – Write VectorAPI code by hand – BLAS library (matrix operation) ▪ SPARK-33882 ▪ Internal library – Write VectorAPI code by hand – Sort, Join, … ▪ Generated code at runtime – Generate VectorAPI code by Catalyst – Catalyst translates DataFrame program info Java program 20 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
  • 21.
    External Library More texton one line in this location if needed
  • 22.
    Three Approaches ▪ JNI(Java Native Interface) library – Call highly-optimized binary (e.g. written in C or Fortran) thru JNI library ▪ SIMD code – Call Java VectorAPI code if JVM supports VectorAPI ▪ Scalar code – Call naïve Java code that runs on all JVMs 22 Enabling Vectorized Engine in Apache Spark - Kazuaki Ishizaki
Implementation using VectorAPI
▪ An example of matrix operation kernels (SPARK-33882)

// y += alpha * x
public void daxpy(int n, double alpha, double[] x, int incx,
                  double[] y, int incy) {
  ...
  DoubleVector valpha = DoubleVector.broadcast(DMAX, alpha);
  int i = 0;
  // vectorized part
  for (; i < DMAX.loopBound(n); i += DMAX.length()) {
    DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
    DoubleVector vy = DoubleVector.fromArray(DMAX, y, i);
    vx.fma(valpha, vy).intoArray(y, i);
  }
  // residual part
  for (; i < n; i += 1) {
    y[i] += alpha * x[i];
  }
  ...
}
Benchmark for Large-size Data
▪ JNI achieves the best performance

Elapsed time (ms):
  Algorithm            Data size (double type)    JNI    VectorAPI   Scalar
  daxpy (Y += a * X)   10,000,000                 1.3    14.6        18.2
  dgemm (Z = X * Y)    1000x1000 * 1000x100       1.3    40.6        81.1

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Benchmark for Small-size Data
▪ VectorAPI achieves the best performance

Elapsed time (ns):
  Algorithm            Data size (double type)    JNI    VectorAPI   Scalar
  daxpy (Y += a * X)   256                        118    27          140
  dgemm (Z = X * Y)    8x8 * 8x8                  555    365         679

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 4.15.0-115-generic
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Summary of Three Approaches
▪ JNI library – Performance: best; Overhead: high (data copy between Java heap and native memory); Portability: requires a native library; Choice: good for large data
▪ SIMD code – Performance: moderate; Overhead: none; Portability: Java 16 or later; Choice: good for small data, better than scalar code
▪ Scalar code – Performance: slow; Overhead: none; Portability: any Java version; Choice: backup path
Internal Library
Lots of Research on SIMD Sort and Join
What Sort Algorithm We Can Use
▪ Current Spark uses, without SIMD:
  – Radix sort
  – Tim sort
▪ SIMD sort algorithms in existing research:
  – AA-Sort
    ▪ Comb sort (fast for data in the CPU data cache)
    ▪ Merge sort
  – Merge sort
  – Quick sort
  – …
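A scalar comb sort, the in-cache building block of AA-Sort, can be sketched as follows. This is an illustrative version, not Spark's or AA-Sort's implementation; the SIMD variant applies the same gap-based compare-exchange across whole vector lanes at once.

```java
import java.util.Arrays;

// Minimal scalar comb sort: bubble-sort-like passes with a shrinking gap.
// The gap-based compare-exchange maps naturally onto SIMD lanes, which is
// why AA-Sort uses comb sort for blocks that fit in the CPU data cache.
public class CombSort {
    static void combSort(long[] a) {
        int gap = a.length;
        boolean swapped = true;
        while (gap > 1 || swapped) {
            gap = Math.max(1, (int) (gap / 1.3));  // shrink factor of ~1.3
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i] > a[i + gap]) {           // compare-exchange at distance gap
                    long t = a[i]; a[i] = a[i + gap]; a[i + gap] = t;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] a = {5, 1, 4, 2, 3};
        combSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```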
Comb Sort is 2.5x Faster than Tim Sort

Sort 1,048,576 long {key, value} pairs (shorter is better):
  Radix sort (Scalar):   84ms
  Comb sort (SIMD):     117ms
  Tim sort (Scalar):    292ms

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Radix Sort is 1.4x Faster than Comb Sort
▪ Radix sort's asymptotic complexity is lower than Comb sort's: O(N) vs. O(N log N)
▪ VectorAPI cannot exploit platform-specific SIMD instructions

Same benchmark as the previous slide:
  Radix sort (Scalar):   84ms
  Comb sort (SIMD):     117ms
  Tim sort (Scalar):    292ms
Sort a Pair of Key and Value
▪ Compare two 64-bit keys and take the pair with the smaller key
  – This is a frequently executed operation
▪ Example with in0 = {1, -1, 7, -7} and in1 = {5, -5, 3, -3}:
  – First pair: 1 < 5, so take {1, -1}
  – Second pair: 7 > 3, so take {3, -3}
  – out = {1, -1, 3, -3}
Parallel Sort of a Pair using SIMD
▪ In parallel, compare two 64-bit keys and take the pair with the smaller key at once
  – With a 256-bit-wide instruction, both comparisons (1 < 5 and 7 > 3) happen in a single instruction
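The scalar equivalent of this pairwise-minimum step, using the same {key, value} layout as the figures above, can be sketched as follows. This is an illustration only; the 256-bit SIMD version performs two such pair comparisons per instruction.

```java
import java.util.Arrays;

// Scalar sketch of the pairwise minimum: given two arrays of {key, value}
// pairs laid out as [k0, v0, k1, v1, ...], keep the pair with the smaller
// key at each position. Keys are compared; values just ride along.
public class PairMin {
    static long[] minPairs(long[] in0, long[] in1) {
        long[] out = new long[in0.length];
        for (int i = 0; i < in0.length; i += 2) {  // step over {key, value}
            boolean take0 = in0[i] < in1[i];       // compare keys only
            out[i]     = take0 ? in0[i]     : in1[i];
            out[i + 1] = take0 ? in0[i + 1] : in1[i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] in0 = {1, -1, 7, -7};
        long[] in1 = {5, -5, 3, -3};
        System.out.println(Arrays.toString(minPairs(in0, in1))); // [1, -1, 3, -3]
    }
}
```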
No Shuffle in the C Version
▪ The result of the compare (a mask register) can be logically shifted without a shuffle
▪ Reducing the number of shuffle instructions is an important optimization on x86_64 ("reduce port 5 pressure")

__mmask8 mask = 0b10101010;
void shufflePair(__m256i *x) {
  __mmask8 maska, maskb, maskA, maskB;
  maska = _kand_mask8(_mm256_cmpgt_epi64_mask(x[0], x[8]), mask);
  maskb = _kand_mask8(_mm256_cmpgt_epi64_mask(x[4], x[12]), mask);
  maskA = _kor_mask8(maska, _kshiftli_mask8(maska, 1));
  maskB = _kor_mask8(maskb, _kshiftli_mask8(maskb, 1));
  x[0] = _mm256_mask_blend_epi64(maskA, x[8], x[0]);
  x[4] = _mm256_mask_blend_epi64(maskA, x[12], x[4]);
  x[8] = _mm256_mask_blend_epi64(maskB, x[0], x[8]);
  x[12] = _mm256_mask_blend_epi64(maskB, x[4], x[12]);
}

0 shuffle + 6 shift/or + 2 compare instructions
4 Shuffles in the VectorAPI Version
▪ Since the result of the comparison (VectorMask) cannot be shifted, all four values must be shuffled (rearranged) before the comparison

final VectorShuffle pair = VectorShuffle.fromValues(SPECIES_256, 0, 0, 2, 2);
private void swapPair(long x[], int i) {
  LongVector xa, xb, ya, yb, xpa, xpb, ypa, ypb, xs, xt, ys, yt;
  xa = load x[i+0 … i+3];   xb = load x[i+4 … i+7];
  ya = load x[i+8 … i+11];  yb = load x[i+12 … i+15];
  xpa = xa.rearrange(pair); xpb = xb.rearrange(pair);
  ypa = ya.rearrange(pair); ypb = yb.rearrange(pair);
  VectorMask<Long> maskA = xpa.compare(VectorOperators.GT, ypa);
  VectorMask<Long> maskB = xpb.compare(VectorOperators.GT, ypb);
  xs = xa.blend(ya, maskA); xt = xb.blend(yb, maskB);
  ys = ya.blend(xa, maskA); yt = yb.blend(xb, maskB);
  xs.store(x[i+0 … i+3]);   xt.store(x[i+4 … i+7]);
  ys.store(x[i+8 … i+11]);  yt.store(x[i+12 … i+15]);
}

4 shuffle + 2 compare instructions
Where is the Bottleneck in the Spark Sort Program?
▪ Most of the time is spent outside the sort routine itself

val N = 1048576
val p = spark.sparkContext.parallelize(1 to N, 1)
val df = p.map(_ => -1 * rand.nextLong).toDF("a")
df.cache
df.count
// start measuring time
df.sort("a").noop()

  Sort algorithm   Elapsed time (ms)   Estimated time with SIMD (ms)
  Radix sort       563                 563
  Tim sort         757                 587

(Radix sort itself took only 84ms in the previous benchmark)
OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Sort Requires an Additional Operation
▪ df.sort() always involves a costly Exchange operation
  – Data transfer among nodes

== Physical Plan ==
Sort [a#5L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#5L ASC NULLS FIRST, 200), ..., [id=#54]
   +- InMemoryTableScan [a#5L]
      +- ...
Lessons Learned
▪ SIMD Comb sort is faster than the current Tim sort
▪ Radix sort is smart
  – Order is O(N), where N is the number of elements
▪ The sort operation involves other costly operations
▪ There is room for VectorAPI to exploit platform-specific SIMD instructions
Generated Code
How is a DataFrame Program Translated?

DataFrame source program:
val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat)).toDF("a", "b")
df.cache
df.count
df.selectExpr("a+b", "a*b").noop()

→ Generated Java code: class … { … }
Catalyst Translates into Java Code
▪ Catalyst takes the DataFrame source program from the previous slide and emits Java code
▪ Catalyst pipeline: Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code
Current Generated Code
▪ Data is read in a vector (columnar) style, but the computation is executed row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      colA = columnarBatch.column(0);
      colB = columnarBatch.column(1);
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    float valA = colA.getFloat(batchIdx);
    float valB = colB.getFloat(batchIdx);
    float val0 = valA + valB;
    float val1 = valA * valB;
    appendRow(Row(val0, val1));
    batchIdx++;
  }
}
Simplified generated code
Computation is Inefficient in the Current Code
▪ Same generated code as the previous slide, annotated:
  – BatchRead(): reads data in a vector style (efficient)
  – processNext(): computes data at a row and puts data at a row (inefficient)
Prototyped Generated Code
▪ Read and compute data in a vector style; putting data is still row by row

class GeneratedCodeGenStage {
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      // compute data using VectorAPI
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fva = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fva.add(fvb);
        FloatVector v1 = fva.mul(fvb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (columnarBatch == null) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));  // put data at a row
    batchIdx++;
  }
}
Enhanced Code Generation in Catalyst
▪ The Catalyst pipeline is unchanged (Create Logical Plans → Optimize Logical Plans → Create Physical Plans → Select Physical Plans → Generate Java code)
▪ The "Generate Java code" step now emits Java code with vectorized computation
Prototyped Two Code Generations
▪ Perform computations using scalar variables
▪ Perform computations using VectorAPI
Using Scalar Variables
▪ Perform the computation for multiple rows in a batch

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE],
        col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += 1) {
        float valA = va[i];
        float valB = vb[i];
        col0[i] = valA + valB;
        col1[i] = valA * valB;
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
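The difference between the current row-at-a-time shape and the batched shape can be demonstrated with a standalone sketch. All names here are hypothetical, not generated code; it only checks that the two shapes compute the same selectExpr("a+b", "a*b") results.

```java
// Contrast the two code shapes: row-at-a-time (current codegen) versus
// batch-at-a-time into column arrays (the prototype). Both evaluate
// "a+b" and "a*b" over float columns and must agree element-wise.
public class BatchedEval {
    // Current style: compute one output row per call.
    static float[] rowAtATime(float a, float b) {
        return new float[] { a + b, a * b };
    }

    // Prototype style: compute the whole batch into output columns first.
    static void batch(float[] a, float[] b, float[] col0, float[] col1) {
        for (int i = 0; i < a.length; i++) {
            col0[i] = a[i] + b[i];
            col1[i] = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f}, b = {2f, 4f, 6f};
        float[] col0 = new float[3], col1 = new float[3];
        batch(a, b, col0, col1);
        for (int i = 0; i < 3; i++) {  // both shapes agree on every row
            float[] row = rowAtATime(a[i], b[i]);
            if (row[0] != col0[i] || row[1] != col1[i]) throw new AssertionError();
        }
        System.out.println(col0[2] + " " + col1[2]); // 9.0 18.0
    }
}
```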
Using VectorAPI
▪ Perform the computation for multiple rows in a batch using SIMD

class GeneratedCodeGenStage {
  float col0[] = new float[COLUMN_BATCH_SIZE],
        col1[] = new float[COLUMN_BATCH_SIZE];
  void BatchRead() {
    if (iterator.hasNext()) {
      columnarBatch = iterator.next();
      batchIdx = 0;
      ColumnVector colA = columnarBatch.column(0);
      ColumnVector colB = columnarBatch.column(1);
      float va[] = colA.getFloats(), vb[] = colB.getFloats();
      for (int i = 0; i < columnarBatch.size(); i += SPECIES.length()) {
        FloatVector fva = FloatVector.fromArray(SPECIES, va, i);
        FloatVector fvb = FloatVector.fromArray(SPECIES, vb, i);
        FloatVector v0 = fva.add(fvb);
        FloatVector v1 = fva.mul(fvb);
        v0.intoArray(col0, i);
        v1.intoArray(col1, i);
      }
    }
  }
  void processNext() {
    if (batchIdx == columnarBatch.size()) { BatchRead(); }
    appendRow(Row(col0[batchIdx], col1[batchIdx]));
    batchIdx++;
  }
}
Simplified generated code
Up to 1.7x Faster in a Micro Benchmark
▪ The vectorized versions achieve up to 1.7x performance improvement
▪ The SIMD version is about 1.2x faster than the vectorized scalar version

val N = 16384
val p = sparkContext.parallelize(1 to N, 1)
val df = p.map(i => (i.toFloat, 2*i.toFloat)).toDF("a", "b")
df.cache
df.count
// start measuring time
df.selectExpr("a+b", "a*b").noop()

Shorter is better:
  Current version:      34.2ms
  Vectorized (Scalar):  26.6ms
  Vectorized (SIMD):    20.0ms

OpenJDK 64-Bit Server VM 16.0.1+9-24 on Linux 3.10.0-1160.15.2.el7.x86_64
Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
2.8x Faster in a Nano Benchmark
▪ Performs the same computation as the previous benchmark
  – Add and multiply operations over 16,384 float elements
▪ The SIMD version is 2.8x faster than the scalar version

// Scalar version
void scalar(float a[], float b[], float c[], float d[], int n) {
  for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
    d[i] = a[i] * b[i];
  }
}

// SIMD version (2.8x faster)
void simd(float a[], float b[], float c[], float d[], int n) {
  for (int i = 0; i < n; i += SPECIES.length()) {
    FloatVector va = FloatVector.fromArray(SPECIES, a, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
    FloatVector vc = va.add(vb);
    FloatVector vd = va.mul(vb);
    vc.intoArray(c, i);
    vd.intoArray(d, i);
  }
}
Now, Putting Data is the Bottleneck
▪ Same prototyped code as before: reading and computing data are in a vector style, but putting data is still row by row
  – processNext() still appends one Row at a time to the successor operator
Lessons Learned
▪ Vectorizing the computation is effective
▪ Using SIMD is also effective, but not a huge improvement
▪ There is room to improve performance at the interface between the generated code and its successor unit
Takeaway
▪ How we can use SIMD instructions in Java
▪ Use SIMD in three areas
  – Good result for the matrix library (SPARK-33882 has been merged)
    ▪ Better than the scalar Java implementation
    ▪ Better for small data than the native implementation
  – Room to improve the performance of the sort program
    ▪ VectorAPI implementation in the Java virtual machine
    ▪ Other parts to be improved in Apache Spark
  – Good result for Catalyst
    ▪ Vectorizing the computation is effective
    ▪ The interface between computation units is important for performance
      • c.f. "Vectorized Query Execution in Apache Spark at Facebook", 2019
▪ Visit https://www.slideshare.net/ishizaki if you are interested in this slide