Exploiting GPUs for Columnar DataFrames
Kiran Lonikar
About Myself: Kiran Lonikar
● Presently working as Staff Engineer with Informatica, Bangalore
  ○ Keep track of technology trends
  ○ Work on futuristic products/features
● Passionate about new technologies, gadgets and healthy food
● Education:
  ○ Indian Institute of Technology, Bombay (1992)
  ○ Indian Institute of Science, Bangalore (1994)
About Informatica
• Put Potential of Data to work. Informatica helps you make data ready for use in any way possible, so you can put truly great data at the center of everything you do.
• The #1 Independent Leader in Data Integration
• Focus on Big Data, Master Data Management, Cloud Integration and Data Security
• Founded: 1993
• Revenue 2014: $1.048 billion
• Employees: ~3700
• Partners: 500+
  – Major SI, ISV, OEM and On-Demand Leaders
(Chart: Annual total revenue, $ millions, 2005-2014; total revenue CAGR = 16%.)
* A reconciliation of GAAP and non-GAAP results is provided in the Appendix section, as well as on Informatica's Investor Relations website
Agenda
● Introducing GPUs
● Existing applications in Big Data
● CPU the new bottleneck
● Project Tungsten
● Proposal: Extending Tungsten
  ○ GPU for parallel execution across rows
  ○ Code generation changes (minor refactoring)
  ○ Batched execution, columnar layout (major refactoring of DataFrame)
● Results, Demo
● Future work, competing products
GPUs are Omnipresent
● Jetson TK1: 192-core GPU, 5”×5”, 20 W
● GPU servers: up to 5760 cores
● AWS g2 instance: 1536 cores, $0.65/hour
● Nexus 9: 192 cores
Hardware Architecture: Latency vs Throughput
(Diagram: a thread block is divided into warps of 32 threads (t1 t2 ... t32); within a warp, every thread executes the same instruction stream — ins 1, ins 2, ins 3, ins 4 — in lockstep over time, and the scheduler switches between Warp 1, Warp 2, ... to keep the cores busy.)
SIMT: Single Instruction Multiple Thread
GPU Programming Model
(Diagram: with a discrete GPU, data is copied between CPU RAM and GPU RAM over the PCIe bus; the CPU runs the serial parts while many GPU cores run the parallel parts, and a machine may host several GPUs — GPU 1, GPU 2 — each with its own GPU RAM. On a Heterogeneous System Architecture based SoC, CPU and GPU share the same RAM, so no PCIe copy is needed.)
● CUDA C/C++ (NVidia GPUs)
● OpenCL C/C++ (all GPUs)
● JavaCL/ScalaCL, Aparapi, Rootbeer
● JDK 1.9 Lambdas
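To make the model concrete, here is a minimal sketch: the device side is an OpenCL kernel in which each work item computes one element, and the host side compiles the kernel, copies buffers over PCIe, launches one work item per element and reads the result back. The kernel name and the doubling operation are made up for illustration, and the host-side calls are only described in comments because they depend on the binding used (JavaCL, JOCL, etc.).

// Device side: one OpenCL work item per array element, indexed by get_global_id(0).
// (Kernel name and operation are illustrative only.)
val kernelSrc =
  """__kernel void scaleByTwo(__global const float* in,
    |                         __global float* out,
    |                         const int n) {
    |  int i = get_global_id(0);
    |  if (i < n) out[i] = 2.0f * in[i];
    |}""".stripMargin

// Host side (steps only; actual calls depend on the OpenCL binding used):
// 1. create a context and command queue on the chosen GPU
// 2. build kernelSrc into a program and look up the "scaleByTwo" kernel
// 3. copy the input array from CPU RAM to a GPU buffer over the PCIe bus
// 4. enqueue the kernel with a global work size of n
// 5. read the output buffer back into CPU RAM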
GPUs in the world of Big Data
● LHC: CERN's ROOT processes ~30 PB per day; GPU-based ML packages
● Analytic DBs (gpudb, sqream, mapd): up to 12 GPUs — 60,000 cores — on a node
● Deep learning: image classification, speech recognition, NLP
● Genomics, DNA
SparkCL:
● Aparapi-based APIs to develop Spark closures.
● Aparapi converts the Java code to OpenCL and runs it on GPUs.
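For a flavor of the Aparapi style that SparkCL builds on, here is a rough sketch; the method names are from Aparapi's Kernel API as best I recall, and Scala-generated bytecode may make Aparapi fall back to its thread-pool mode, so treat this as illustrative rather than guaranteed GPU execution.

import com.amd.aparapi.Kernel  // package name differs across Aparapi releases

object AparapiSketch {
  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val a = Array.fill(n)(1.0f)
    val b = Array.fill(n)(2.0f)
    val out = new Array[Float](n)

    // Each global id plays the role of one OpenCL work item.
    val kernel = new Kernel {
      override def run(): Unit = {
        val i = getGlobalId
        out(i) = 2.0f * a(i) + 4.0f * b(i)
      }
    }
    kernel.execute(n)  // Aparapi translates run()'s bytecode to OpenCL when it can
    kernel.dispose()
    println(out.take(3).mkString(", "))
  }
}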
Natural progression into computing dimension
(Timeline: Hadoop — CPU and co-located disks — from 2008 onwards; Spark — adding the RAM dimension — from 2012 onwards; GPUs — adding the computing dimension — from 2015 onwards.)
Spark SQL Architecture
(Diagram: Tungsten Row — used instead of an array of Java objects — packs each of row 1, row 2, row 3 as a null bit set (1 bit/col), fixed-length values (8 bytes/col) and variable-length data (length, data). Columnar Cache — each of column 1, column 2, column 3, column 4 is stored on its own, with a typeId, null markers and the values for row 1, row 2, row 3.)
CPU the new bottleneck
(Hardware callouts: 10 Gbps Ethernet, InfiniBand; SSDs, striped HDD arrays)
• Higher IO throughput: from the Project Tungsten blog and Reynold Xin's talk, slide 21
– Hardware advances in last 5 years: 10x improvements
– Software advances:
• Spark Optimizer: Prune input data to avoid unnecessary disk IO
• Improved file formats: Binary, compressed, columnar (Parquet, ORC)
• Less memory pressure:
– Hardware: High memory bandwidths
– Software: Taking over memory allocation
⇒ More data available to process. CPU the new bottleneck.
Project Tungsten
• Taking over memory management and bypassing GC
  – Avoid large Java object overhead and GC overhead
  – Replace Java object allocation with sun.misc.Unsafe based explicit allocation and freeing
  – Replace general-purpose data structures like java.util.HashMap with an explicit binary map
• Cache Aware computation
– Change internal data structures to make them cache friendly
• Co-locate key and value reference in one record for sorting
• Code Generation
– Expressions of columns for selecting and filtering executed through generated Java code ⇒ Avoids expensive expression tree evaluation for each row
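As a rough illustration of the first bullet (this is not Spark's actual code), explicit off-heap allocation through sun.misc.Unsafe looks roughly like this — the bytes live outside the Java heap, so there is no per-object overhead and the GC never scans them:

import sun.misc.Unsafe

object UnsafeSketch {
  // sun.misc.Unsafe is not a public API; it is usually obtained via reflection.
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  def main(args: Array[String]): Unit = {
    val numLongs = 1000
    val addr = unsafe.allocateMemory(numLongs * 8L)  // raw off-heap bytes, invisible to the GC
    try {
      var i = 0
      while (i < numLongs) { unsafe.putLong(addr + i * 8L, i.toLong * i); i += 1 }
      println(unsafe.getLong(addr + 5 * 8L))         // prints 25
    } finally {
      unsafe.freeMemory(addr)                        // explicit freeing, as Tungsten does
    }
  }
}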
Proposal: Execution on GPUs
Goal: Change execution within a partition from serial row-by-row to batched/vectorized parallel execution
• Change code generation to generate OpenCL code
• Change executor code (Project, TungstenProject in basicOperators.scala) to execute OpenCL code through JavaCL
• Columnar layout of input data for GPU execution
  – BatchRow/CacheBatch: References to required columnar arrays instead of creating and processing InternalRow objects
– UnsafeColumn/ByteBuffer: Columnar structure to be used for GPU execution
(Diagram: the same table with columns A, B, C in two layouts — row-wise: a0 b0 c0, a1 b1 c1, a2 b2 c2 stored row after row; columnar: a0 a1 a2, b0 b1 b2, c0 c1 c2 stored as separate per-column arrays.)
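As a small sketch of the difference (toy types, not the proposed BatchRow/UnsafeColumn classes): with a row-wise layout every expression walks whole row objects, while a columnar batch is just one primitive array per column, so an expression such as D = 3*A + 2*C touches only the arrays it needs.

object LayoutSketch {
  // Row-wise: an array of row objects; column B is dragged along even though unused.
  final case class Row(a: Float, b: Float, c: Float)

  def main(args: Array[String]): Unit = {
    val rows = Array(Row(1f, 10f, 100f), Row(2f, 20f, 200f), Row(3f, 30f, 300f))
    val dRowWise = rows.map(r => 3 * r.a + 2 * r.c)

    // Columnar: one primitive array per column; B is never read or transferred.
    val colA = Array(1f, 2f, 3f)
    val colC = Array(100f, 200f, 300f)
    val dColumnar = Array.tabulate(colA.length)(i => 3 * colA(i) + 2 * colC(i))

    println(dRowWise.sameElements(dColumnar))  // true: same result, different layout
  }
}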
Code Generation Changes
// Existing generated Java code
class SpecificUnsafeProjection extends
UnsafeProjection {
private UnsafeRow row = new UnsafeRow();
// buffer for 2 cols, null bits
private byte[] buffer11 = new byte[24];
private int cursor12 = 24; // size of buffer for 2 cols
// initialization code, constructor etc.
public UnsafeRow apply(InternalRow i) {
double primitive3 = -1.0;
int fixedOffset = Platform.BYTE_ARRAY_OFFSET;
row.pointTo(buffer11, fixedOffset, 2, cursor12);
if (!i.isNullAt(0) && !i.isNullAt(1)) {
primitive3 = 2*i.getInt(0) + 4*i.getDouble(1);
row.setDouble(1, primitive3);
}
else
row.setNull(1);
return row;
}
}
// New OpenCL sample code: Columnar
__kernel void computeExpression(
    __global const int* a, __global const char* aNulls,
    __global const int* b, __global const char* bNulls,
    __global int* output, __global char* outNulls,
    const int dataSize)
{
    int i = get_global_id(0);
    if(i < dataSize) {
        if(!aNulls[i] && !bNulls[i]) {
            output[i] = 2*a[i] + 4*b[i];
            outNulls[i] = 0;
        } else {
            outNulls[i] = 1;
        }
    }
}
// Scala code to drive the OpenCL code
1. rowIterator ⇒ ByteBuffers with a, b, aNulls, bNulls (takes ~20x the time of steps 2-4)
2. Transfer the ByteBuffers to the GPU
3. Execute computeExpression
4. Read output, outNulls back into ByteBuffers ⇒ Cache
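A minimal sketch of step 1, the dominant cost above: draining a row iterator into per-column direct ByteBuffers that an OpenCL binding can hand to the GPU. The two-int schema and Option-based null handling are assumptions made for the example, not Spark's InternalRow API.

import java.nio.{ByteBuffer, ByteOrder}

object BuildColumnBuffers {
  // Assumed row shape: (a: Int, b: Int), either value possibly null.
  def buildBuffers(rows: Iterator[(Option[Int], Option[Int])], n: Int)
      : (ByteBuffer, ByteBuffer, ByteBuffer, ByteBuffer) = {
    def direct(bytes: Int) = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder())
    val a = direct(n * 4); val aNulls = direct(n)
    val b = direct(n * 4); val bNulls = direct(n)
    rows.foreach { case (ao, bo) =>
      a.putInt(ao.getOrElse(0)); aNulls.put(if (ao.isEmpty) 1.toByte else 0.toByte)
      b.putInt(bo.getOrElse(0)); bNulls.put(if (bo.isEmpty) 1.toByte else 0.toByte)
    }
    Seq(a, b, aNulls, bNulls).foreach(_.flip())  // ready for the transfer in step 2
    (a, b, aNulls, bNulls)
  }
}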
Row wise execution
(Diagram: input data laid out row-wise in CPU RAM is copied row-wise over PCIe into GPU RAM.)
// Only A and C are needed to compute D;
// B is not needed, but with a row-wise layout it is fetched anyway
typedef struct { float a; float b; float c; } row;
__kernel void expr(__global const row *r, __global float *d, const int n) {
    int id = get_global_id(0);
    if(id < n)
        d[id] = 3*r[id].a + 2*r[id].c;
}
(Diagram: with the row-wise layout, whole rows — a0 b0 c0, a1 b1 c1, a2 b2 c2, including the unused column B — flow from the table (columns A, B, C) into GPU RAM and into the streaming multiprocessor's cache; threads t1, t2, t3 each process one row r0, r1, r2.)
Columnar execution
(Diagram: input data laid out column-wise in CPU RAM; only the needed columns are copied over PCIe into GPU RAM.)
// Only A and C are needed to compute D,
// so only A and C are transferred
__kernel void expr(__global const float *a, __global const float *c,
                   __global float *d, const int n) {
    int id = get_global_id(0);
    if(id < n)
        d[id] = 3*a[id] + 2*c[id];
}
(Diagram: with the columnar layout, only the column arrays A — a0 a1 a2 — and C — c0 c1 c2 — are transferred from the table (columns A, B, C) into GPU RAM and into the streaming multiprocessor's cache; threads t1, t2, t3 read adjacent elements a0, a1, a2.)
JVM Considerations
• Row-wise representation: array of Java objects
  – Java objects are not like C structs: fields are not laid out contiguously
  – Serialization is needed before transfer to GPU RAM
• Columnar representation: arrays of individual members
  – Already serialized
  – Saves host-to-GPU and GPU-RAM-to-SMP-cache data transfer
  – Avoids copying from input rows into projected InternalRow objects
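A small sketch of the "already serialized" point: a primitive column array can be bulk-copied into a direct (off-heap) ByteBuffer in a single call — or handed over with no copy at all if the column already lives off-heap — whereas an array of Java row objects must be walked field by field before anything can be sent to the GPU.

import java.nio.{ByteBuffer, ByteOrder}

object ColumnTransferSketch {
  def main(args: Array[String]): Unit = {
    val colA = Array.tabulate(1024)(_.toFloat)  // a whole column as one primitive array

    // One bulk copy into an off-heap buffer that an OpenCL/CUDA binding can consume.
    val buf = ByteBuffer.allocateDirect(colA.length * 4).order(ByteOrder.nativeOrder())
    buf.asFloatBuffer().put(colA)

    println(buf.getFloat(4 * 4))  // 4.0: element 4 of the column, read back by byte offset
  }
}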
DataFrame Execution: Current
val data = sc.parallelize(1 to size, 5).map {
x => (x, x*x)
}.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: Trigger execution
(Diagram: rows are processed one by one — the input row (1, 1*1) becomes the projected row (1, 1, 1*2+1*4), (2, 2*2) becomes (2, 4, 2*2+4*4), (3, 3*3) becomes (3, 9, 2*3+4*9). buildBuffers then turns the projected rows into the columnar cache — columns [1, 2, 3], [1, 4, 9], [6, 20, 42] — and the columnar cache is converted back to rows (1, 1, 6), (2, 4, 20), (3, 9, 42) for show().)
DataFrame Execution: Proposed
val data = sc.parallelize(1 to size, 5).map {
x => (x, x*x)
}.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: Trigger execution
(Diagram: in the proposed flow, buildBuffers builds the columnar cache for the input columns — [1, 2, 3] and [1, 4, 9] — only these column buffers are transferred to the GPU, which computes the projected column [6, 20, 42] in parallel; the columnar cache is then converted back to rows (1, 1, 6), (2, 4, 20), (3, 9, 42) for show().)
Proposal: Batched Execution
● Feeding the columnar cache:
  – In-memory RDDs ⇒ DataFrames: bytecode modification through Javassist to build BatchRow + UnsafeColumn
  – Input: Parquet, ORC, relational DBs
● Columnar Cache, with pipelined operations on both sides: Filter, Join, Union, Sort, Group By, ...
● Consuming the columnar cache:
  – In-memory RDDs ⇒ DataFrames: bytecode modification through Javassist to consume BatchRow + UnsafeColumn
  – Output: Parquet, ORC, relational DBs
(A minimal Javassist sketch of the bytecode-modification mechanism follows below.)
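The sketch below only shows the Javassist mechanism referenced above. The class and method names are hypothetical; the real change would rewrite the generated classes so that they build or consume BatchRow + UnsafeColumn instead of per-row InternalRow objects.

import javassist.ClassPool

object JavassistSketch {
  def main(args: Array[String]): Unit = {
    val pool = ClassPool.getDefault()
    // Hypothetical generated class whose per-row method we want to intercept.
    val cc = pool.get("org.example.GeneratedRowConsumer")
    val m  = cc.getDeclaredMethod("consumeRow")
    // Inject code at the start of the method; a real implementation would replace
    // the body so that rows are appended to a columnar batch instead.
    m.insertBefore("{ System.out.println(\"row intercepted for batching\"); }")
    val patched = cc.toClass()  // load the modified class into the JVM
    println(patched.getName)
  }
}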
Results
(Chart: performance comparison of GPU vs CPU execution.)
Roadmap for future changes
• Spark
– Multi-GPU
– Sorting: GPU based TimSort
– Aggregations (groupBy)
– Union
– Join
• Other projects capable of competing with Spark
– Impala (C++, easier to adapt than Scala/JVM for GPU)
– CERN Root (C++ REPL, multi-node)
– Flink
– Thrust (CUDA C++, single node, single GPU)
– Boost Compute (OpenCL, C++, single node, single GPU)
– VexCL (C++, OpenCL, CUDA, multi-GPU, multi node)
Q&A
Contact Info
○ Twitter: @KiranLonikar
○ https://www.linkedin.com/in/kiranlonikar
○ lonikar@gmail.com
Editor's Notes
  • #2: Hi, I am Kiran Lonikar, and I am going to talk on exploiting GPUs for columnar dataframes.
  • #3: Let me first introduce myself. I work for Informatica in Bangalore, India. I spend quite some time exploring new technologies and how they can be made relevant to my organization's products. I have been exploring GPUs and Spark for quite some time now, and thought Spark could benefit from GPUs. Hence the talk!
  • #4: And obviously spark
  • #5: The talk is structured as follows. We will begin with a short introduction to GPUs, during which we will also look at some examples of GPU usage in the big data world. Then we will look at how CPU has become the new bottleneck in spark computations, and how project tungsten started a new direction on tackling it. Then we will look at how GPUs can extend the gains and what changes are needed to be able to achieve those. We will then take a closer look at the code generation changes, and importance of batched execution and columnar layout through a simple animation. Results of some simpler changes will then be presented. To achieve more serious results, some major refactoring of spark code is needed. A roadmap for future changes will be discussed. A set of products competing with spark will be discussed in the context of how much easier it will be for them to exploit GPUs for distributed computing.
  • #6: To begin with, let me ask how many of you have heard about GPUs. How many of you know of GPUs being used for non-graphical or general-purpose computing, and specifically big data applications? Finally, how many of you have actually played with GPUs for general-purpose computing? All laptops, desktops, mobiles and tablets today come equipped with GPUs. Jetson TK1: personal supercomputer; Linux on a Tegra K1 SoC with a quad-core ARM CPU and a 192-core Kepler GPU; price $192; size 5” x 5”. Nexus 9: tablet with the same Tegra K1 SoC. Several Linux workstations and servers with server-class GPUs: several thousands of cores — Tesla K40 2880 cores, TITAN X 3072 cores, TITAN Z 5760 cores, AMD R9-295X2 5632 cores (and 12GB GPU RAM)! Much lower energy consumption and price/TFlop: lower operational costs due to low power consumption and cooling requirements. AWS GPU instances, g2 (Tesla K10): 1536 CUDA cores, $0.65/hour → $475/month. Compare to c3 and i2 instances: num_of_nodes*$1200+/month (you may need only 1 or 2 g2 instances against 10s of c3s or i2s).
  • #7: CPUs are designed to be low latency so that serial code executes fast. GPUs are designed for multiple higher latency cores. So serial code executes slower, but many parallel cores give much higher throughput. CPUs for sequential parts where latency matters – CPUs can be 10+X faster than GPUs for sequential code GPUs for parallel parts where throughput wins – GPUs can be 10+X faster than CPUs for parallel code SIMD Vs SIMT: http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf https://groups.google.com/forum/#!topic/comp.parallel/nQtFcL7N2rA
  • #9: http://on-demand.gputechconf.com/gtc/2015/presentation/S5818-Keynote-Andrew-Ng.pdf LHC: CERN's ROOT software, a C++ REPL (like Scala's) based on LLVM, used to process about 30 PB of data per day; uses Parquet/ORC-like columnar data files (its own ROOT format); Toolkit for Multivariate Analysis (TMVA), with some algorithms rewritten to use GPUs: https://www.inf.ed.ac.uk/publications/thesis/online/IM111037.pdf The deep learning community already uses GPUs: papers by Hinton, Andrew Ng, Yann LeCun on how GPUs are used in their setups/work; popular libraries: Caffe, Theano, Torch, BIDMach, Kaldi. Slide 21 of http://on-demand.gputechconf.com/gtc/2015/webinar/deep-learning-course/intro-to-deep-learning.pdf shows how GPUs enabled replacing a 1000-node data center with a few GPU cards. SQL DB startups: columnar analytic databases capable of aggregating 2 TB per second, using up to 12 GPUs == 60,000 cores on a single node: http://www.gpudb.com/ http://www.mapd.com/ http://sqream.com/ YARN JIRA for managing GPU resources: https://issues.apache.org/jira/browse/YARN-4122
  • #10: Hadoop, which started the big data revolution was largely based on CPUs and storage disks co-located. Spark used RAM backed by disks, thus extending the memory dimension. GPU is the natural extension to take advantage of available computing power. Extend the gains to execute some processing on GPUs...
  • #11: Moving on, let's take a look at the Spark SQL architecture and its components. The dataflow computation expressed in the form of Spark SQL or DataFrames is first converted to a logical plan, then it is optimized, and a number of physical plans are generated. A cost-based optimizer then selects the best physical plan. The physical plan itself is expressed as a sequence of operations on different RDDs, which is executed when an action on a parent DataFrame is invoked. The main components of relevance to us are rows and the columnar cache.
  • #12: With this introduction, let's step back a little and look at what the bottleneck of today's big data computing is. In software advances, improved file formats like Parquet and ORC have become commonplace and enable faster bulk reading of only the needed data from disk. All of this means much more data being pumped at the CPU for processing, but the improvements in hardware and software have outpaced the ones in CPUs.
  • #13: Now let's see what Project Tungsten does to solve it.
  • #14: The CPU bottleneck can be further eased with the help of GPUs. Coming to the main theme of this talk, let us see what is needed to perform some of Spark's computations on GPUs. It turns out that the most important change is moving to a columnar layout of the input data. This picture illustrates the difference between row-wise and columnar layouts.
  • #16: Row wise and columnar formats further explained
  • #17: Row wise and columnar formats further explained
  • #18: If the data is contiguous, the row-wise or columnar layout makes little difference. But when dealing with Java objects, and even a row as an array of Java objects, this is not the case. Serialization is needed before the data is transferred to GPU RAM, which is itself CPU-consuming.
  • #19: In this and the next slide, we will take a closer look at the proposal. Let's see how the execution happens today.
  • #20: Let's see how we would like it to happen in a batched manner and on the GPU.