An Intelligent Storage
~SQL Execution on GPU Closely Connected with SSD~
PG-Strom Development Team
KaiGai Kohei <kaigai@kaigai.gr.jp>
The PG-Strom Project
Who am I?
▌KaiGai Kohei
 PG-Strom Development Team
▌Background
 Linux kernel (SELinux, …)
 PostgreSQL database
 Enhancement of security features
 Writable FDW, Remote Join
 Custom Scan Interface
 GPU and CUDA
▌PG-Strom
 SQL Acceleration Technology using GPU
 Solution for data analytics in-database
Open Source Summit Japan 2017 - An Intelligent Storage
PG-Strom (1/3) – GPU as accelerator of SQL
                        GPU                     CPU
Model                   NVIDIA Tesla P40        Intel Xeon E5-2699v4
Code name               Pascal                  Broadwell
Release                 Q3-2016                 Q1-2016
# of transistors        12 billion              7.2 billion
# of cores              3840 (simple)           22 (functional)
Core clocks             1.306GHz ~ 1.53GHz      2.20GHz ~ 3.60GHz
Performance (FP32)      12 TFLOPS               1.2 TFLOPS (with AVX2)
RAM size                24GB (GDDR5)            max 1.5TB (DDR4)
Memory bandwidth        347GB/s                 76.8GB/s
Power consumption       250W                    145W
Manufacturing process   16nm                    14nm
The application issues the same SQL statements against the same data structures; WHERE, JOIN and GROUP BY are off-loaded to the GPU, with automatic generation of GPU binaries from SQL.

PG-Strom: an extension module of PostgreSQL that accelerates SQL workloads using thousands of GPU cores and high-bandwidth memory.
PG-Strom (2/3) – Transparent GPU code generation on demand
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    size_t kds_index)
{
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3,
                               pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
}
:
Example: CUDA C code generated from the expression in the WHERE-clause. The KVAR_* variables reference the input data, and the SQL expression is rendered as CUDA source code. The generated source is compiled just-in-time by the run-time compiler (nvrtc), then executed in parallel on the GPU.
PG-Strom (3/3) – An example of SQL acceleration
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
 t0 contains 100M rows, t1...t8 contain 100K rows each (like a star schema)
PG-Strom microbenchmark with JOIN/GROUP BY: query response time [sec] by number of tables joined

Tables joined        2      3      4      5      6      7      8      9
PostgreSQL v9.6    8.48  13.23  18.28  23.42  28.88  34.50  40.77  47.16
PG-Strom 2.0devel  5.00   5.46   5.91   6.45   7.17   8.07   9.22  10.21

Environment: CPU: Xeon E5-2640v4, GPU: Tesla P40, RAM: 128GB, OS: CentOS 7.3, DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
In-database Analytics
▌Computing close to data
 Cost to move data
 Cost to transform / validate data
 Waste of time for DBAs
▌Utilization of SQL’s flexibility
 Pre-/Post-process of analytic
algorithms
 Conjunction with other data (JOIN)
 Summarization of results
(GROUP BY/Window Functions)
What we want to develop (1/2)
Analytics lifecycle in-database: Data Generation → Data Collection → Summarize / Pre-process → Analytics & ML → Visualization. (Data size: large → small; amount of computing: small → large.)

Key elements along the pipeline:
 SSD-to-GPU Direct SQL Execution (NVMe-SSD → GPU)
 Transparent GPU code generation & asynchronous execution
 Statistical analytics & machine learning libraries (DNN, k-means, SVM, Random Forest) via PL/CUDA
 Advanced algorithms as a part of SQL statements
 Row→Column hardware transform
 Visualization with BI tools

Goal: execution of the entire lifecycle of data analytics in-database.
What we want to develop (2/2) – elemental technology
(Same lifecycle diagram as the previous slide.) The elemental technologies map as follows:
 SSD-to-GPU Direct SQL Execution – for I/O
 PG-Strom – for SQL
 PL/CUDA and libraries – for computing
I/O Acceleration with GPU?
Re-definition of GPU’s role
I/O workloads Computing workloads
Brief Architecture of GPU
GPU device memory (GDDR5 or HBM2) and host RAM exchange data via DMA over the PCIe bus, driven by the CPU. Storage can likewise transfer data via DMA: from particular blocks, to a destination physical address.
Technology Basis (1/2) – GPUDirect RDMA
▌Feature for peer-to-peer DMA between GPU and PCIe devices
▌Originally designed for MPI over Infiniband
▌Available for any kind of PCIe device, given a custom kernel module
Copyright (c) NVIDIA corporation, 2015
Technology Basis (2/2) – GPUDirect RDMA
GPUDirect RDMA is a feature to map the GPU's device memory (through the PCIe BAR1 area) onto the physical address space of the host system, alongside host RAM and other PCIe devices such as NVMe-SSDs and Infiniband HBAs. Once GPU device memory has a physical address in the host system, a driver can use that address as the source/destination of DMA operations.
Idea of SSD-to-GPU Direct SQL Execution
SSD-to-GPU Direct SQL Execution: load PostgreSQL's data blocks onto the GPU directly, then reduce unnecessary rows with the thousands of cores of the GPU.

Large PostgreSQL tables are read from NVMe SSD straight into the GPU over the PCIe bus via SSD-to-GPU P2P DMA (NVMe-Strom driver). The GPU evaluates WHERE / JOIN / GROUP BY on the PostgreSQL data blocks, so unnecessary rows are dropped before they reach host RAM. Compared with the existing data flow (large I/O size), this means a much smaller amount of I/O.
Linux kernel APIs of GPUDirect RDMA
int nvidia_p2p_get_pages(uint64_t p2p_token,     /* deprecated */
                         uint32_t va_space,      /* deprecated */
                         uint64_t virtual_address,
                         uint64_t length,
                         struct nvidia_p2p_page_table **page_table,
                         void (*free_callback)(void *private_data),
                         void *private_data);

typedef struct nvidia_p2p_page_table {
    uint32_t version;
    uint32_t page_size;
    struct nvidia_p2p_page **pages;  /* array of I/O-mapped physical memory
                                      * addresses of the GPU device memory,
                                      * usable as the destination of
                                      * SSD-to-GPU DMA commands!! */
    uint32_t entries;
    uint8_t *gpu_uuid;
} nvidia_p2p_page_table_t;

The GPU device memory itself is allocated in userspace with:
CUresult cuMemAlloc(CUdeviceptr *daddr,
size_t bytesize);
NVMe-Strom Software Stack
User space: PostgreSQL / pg-strom allocates GPU device memory with cuMemAlloc(), then issues DMA requests to the NVMe-Strom kernel module via ioctl(2) on /proc/nvme-strom, translating a file offset into a block number. Buffered access still goes through read(2), the VFS and the page cache.
Kernel space: NVMe-Strom drives the NVMe SSD driver and cooperates with the nvidia driver to run the SSD-to-GPU peer-to-peer DMA directly into GPU device memory.
Benchmark (1/2) – Star Schema Benchmark
[Chart: query execution throughput [MB/s], 0 to 4500, for star schema benchmark queries Q1-1 ... Q4-3; four series: PostgreSQL SSDx1, PostgreSQL SSDx2, PG-Strom SSDx1, PG-Strom SSDx2]
 SQL with WHERE-clause, JOIN and aggregation/GROUP BY against a 353GB database.
 Usually, the entire workload is dominated by I/O.
 PostgreSQL achieves up to 1.6GB/s with regular filesystem-based access.
 PG-Strom with SSD-to-GPU Direct SQL Execution achieved up to 3.8GB/s at peak performance.
※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec])
Chart annotations: theoretical limitation by 1x SSD [2.2GB/s]; theoretical limitation by 2x SSD [4.4GB/s]; one result is marked "under investigation".
Benchmark (2/2) – Star Schema Benchmark
 Example of Query
SELECT sum(lo_revenue), d_year, p_brand1
  FROM lineorder, date1, part, supplier
 WHERE lo_orderdate = d_datekey
   AND lo_partkey = p_partkey
   AND lo_suppkey = s_suppkey
   AND p_category = 'MFGR#12'
   AND s_region = 'AMERICA'
 GROUP BY d_year, p_brand1
 ORDER BY d_year, p_brand1;
Schema sizes: lineorder 2.4B rows (351GB); customer 12M rows (1.6GB); date1 2500 rows (400KB); part 1.8M rows (206MB); supplier 4M rows (528MB).

Typical summary queries, consisting of a WHERE-clause, JOIN and GROUP BY. Since the lineorder table is larger than the physical RAM size, the majority of the workload is dominated by I/O.
Hardware configuration on the benchmark
▌Hardware/Software configuration
 Model: DELL PowerEdge R730
 CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
 RAM: 128GB
 HDD: SAS 300GB x8 (RAID5)
 OS: CentOS7.2 (3.10.0-327.el7)
 SW: CUDA 7.5.17
NVIDIA driver 352.99
PostgreSQL 9.6.1
PG-Strom v2.0devel
▌GPU Specifications
 NVIDIA Tesla K80
 2x 2496 CUDA cores (560MHz)
 2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
 Intel SSD 750 (400GB) x2
 Interface: PCIe 3.0 x4 (NVMe 1.2)
 Max SeqRead: 2.2GB/s
[Block diagram: CPU1 and CPU2; the 2x SSD and the Tesla K80 GPU used in the benchmark; a Tesla K20 GPU (unused)]
(OT) Consistency with disk cache
 Problem: ① storage blocks may not be up-to-date if the target blocks are kept in page caches; ② short DMAs are not efficient.
 Solution: write back cached PostgreSQL blocks (= 8KB/block) to a userspace buffer once, then kick a bulk transfer from RAM to GPU with the CUDA APIs.
 Almost equivalent cost to read(2) + cuMemcpyHtoDAsync().
 Although PostgreSQL blocks are reordered in GPU memory, the results are still correct.
Example with blocks BLK-100 ... BLK-109 (BLCKSZ = 8KB; transfer size per request = BLCKSZ * NChunks): the cached blocks (BLK-101, BLK-104, BLK-105, BLK-108) are copied from the file's page cache into a userspace DMA buffer in RAM and sent to GPU device memory with cuMemcpyHtoDAsync(); the uncached blocks (BLK-100, BLK-102, BLK-103, BLK-106, BLK-107, BLK-109) are transferred directly by SSD-to-GPU P2P DMA.
Not only Scan, but GpuJoin also...
HashJoin by CPU: the hash table is built from the inner relation; hash table search over the outer relation and generation of the result rows are both done by the CPU before the next step.
GpuHashJoin: the hash table is built the same way, but the hash table search runs in parallel on the GPU and the results are generated by the GPU; the CPU side just references the results. The scan of the outer relation is where SSD-to-GPU P2P DMA applies.
Not only Scan, but GpuPreAgg also...
Aggregation by CPU: read, say, 1M rows from the relation, then process all 1M rows on the CPU.
Aggregation with GPU (GpuPreAgg): 1st step, parallel hash reduction on the GPU; 2nd step, atomic merge into interim results. Only a much smaller number of rows has to be processed by the CPU for the final aggregation. Here again, SSD-to-GPU P2P DMA feeds the GPU.
The Future
Single Node Solution
Hardware block diagram: one CPU with RAM; 2x enterprise-grade SSD (PCIe x8 based) and an NVIDIA Tesla P40 GPU on the PCIe bus. Max 10GB/s capability of query processing per node.
Multi Node Solution (?)
Joint solution with PostgreSQL's scale-out technologies: each node carries 2x enterprise-grade SSD (PCIe x8 based) and an NVIDIA Tesla P40.
Scale Up Solution (?)
Deployment on a GPU monster server: 5x NVIDIA Tesla P40 and 5x enterprise-grade SSD (PCIe x16 based). CPU1 and CPU2 (connected by 2x QPI, 96 PCIe lanes each) feed PCIe switches (PCI-SW) with PCIe x16 x2 slots, behind which GPU-1 ... GPU-5 are paired with SSD-1 ... SSD-5.
Towards columnar structure (1/3) – On-memory Columnar Cache
▌On-memory Columnar Cache
 Caches PostgreSQL's data blocks (row-data) in memory as column-data
 Reduces the amount of data transferred between RAM and GPU
 Makes better use of the GPU's memory bandwidth
 The GPU kernel to be invoked is switched according to the existence of the column-cache
A background worker constructs columnar caches from data blocks whose tuples are all visible (e.g. BlockNum = 102 with all_visible = true, but not BlockNum = 101 with all_visible = false); an UPDATE invalidates the affected cache entry. For each block, PG-Strom then invokes either the GPU kernel for row-data or the GPU kernel for column-data, depending on whether a columnar cache exists for it.
Towards columnar structure (2/3) – Background
▌Row data format – random memory access
 Lower usage ratio of the memory bus, larger number of memory transactions
▌Column data format – coalesced memory access
 Optimal memory bus usage, least number of memory transactions
Why a columnar structure is preferable for the GPU architecture: with a memory transaction width of 256bit, row format means only 32bit x1 = 32bit of each 256bit transaction is valid data (usage ratio: 12.5%), while column format lets the GPU cores consume 32bit x8 = 256bit per transaction (usage ratio: 100.0%).
Towards columnar structure (3/3) – FPGA engine on SSD
FPGA logic on the SSD enables "write by row, then read as column": transactional data is written in row format (the world of OLTP); the Row→Column transformer on the SSD converts the data format; analytic data is then read in column format (the world of OLAP), with only the required columns fetched. SQL execution runs on the pre-processed, columnar-format data, which is best for the GPU.
Towards the “true” intelligent storage system
(The same NVMe SSD → GPU P2P DMA data path as before: large PostgreSQL tables are loaded over the PCIe bus by the NVMe-Strom driver, and WHERE / JOIN / GROUP BY reduce unnecessary rows on the GPU for a smaller amount of I/O.)

Step-1: A pair of NVMe-SSD and GPU looks like an intelligent storage device that understands SQL workloads, filtering out unnecessary rows and pre-processing JOIN/GROUP BY prior to loading the data onto main memory.
Step-2: A special NVMe-SSD transforms the data format into one more suitable for the requester to process (PostgreSQL's block format in this case).
Any Questions?

More Related Content

PDF
pgconfasia2016 plcuda en
PDF
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PDF
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
PDF
PG-Strom
PDF
20160407_GTC2016_PgSQL_In_Place
PDF
GPGPU Accelerates PostgreSQL (English)
PDF
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
PDF
20150318-SFPUG-Meetup-PGStrom
pgconfasia2016 plcuda en
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
PG-Strom
20160407_GTC2016_PgSQL_In_Place
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
20150318-SFPUG-Meetup-PGStrom

What's hot (20)

PDF
SQL+GPU+SSD=∞ (English)
PDF
PG-Strom - GPU Accelerated Asyncr
PDF
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PDF
20171206 PGconf.ASIA LT gstore_fdw
PDF
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PDF
20201006_PGconf_Online_Large_Data_Processing
PDF
20181212 - PGconfASIA - LT - English
PDF
20201128_OSC_Fukuoka_Online_GPUPostGIS
PDF
PostgreSQL with OpenCL
PDF
Let's turn your PostgreSQL into columnar store with cstore_fdw
PDF
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
PDF
PG-Strom - A FDW module utilizing GPU device
PDF
20181025_pgconfeu_lt_gstorefdw
PDF
20181210 - PGconf.ASIA Unconference
PPTX
GPGPU programming with CUDA
PDF
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PPTX
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
PDF
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
PPTX
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PDF
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
SQL+GPU+SSD=∞ (English)
PG-Strom - GPU Accelerated Asyncr
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
20171206 PGconf.ASIA LT gstore_fdw
PG-Strom v2.0 Technical Brief (17-Apr-2018)
20201006_PGconf_Online_Large_Data_Processing
20181212 - PGconfASIA - LT - English
20201128_OSC_Fukuoka_Online_GPUPostGIS
PostgreSQL with OpenCL
Let's turn your PostgreSQL into columnar store with cstore_fdw
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
PG-Strom - A FDW module utilizing GPU device
20181025_pgconfeu_lt_gstorefdw
20181210 - PGconf.ASIA Unconference
GPGPU programming with CUDA
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
Ad

Similar to 20170602_OSSummit_an_intelligent_storage (20)

PDF
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PDF
20190909_PGconf.ASIA_KaiGai
PDF
20181116 Massive Log Processing using I/O optimized PostgreSQL
PDF
20180920_DBTS_PGStrom_EN
PDF
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
PDF
20181016_pgconfeu_ssd2gpu_multi
PDF
PDF
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
KEY
Grabbing the PostgreSQL Elephant by the Trunk
PDF
Sqream DB on OpenPOWER performance
PDF
Pg no sql_beatemjoinem_v10
ODP
Bogdan Kecman Advanced Databasing
PDF
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC...
PDF
Dumb Simple PostgreSQL Performance (NYCPUG)
PDF
An evening with Postgresql
PDF
Don't panic! - Postgres introduction
PDF
[EPPG] Oracle to PostgreSQL, Challenges to Opportunity
PDF
🐬 The future of MySQL is Postgres 🐘
PDF
Lessons PostgreSQL learned from commercial databases, and didn’t
PDF
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
20190909_PGconf.ASIA_KaiGai
20181116 Massive Log Processing using I/O optimized PostgreSQL
20180920_DBTS_PGStrom_EN
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
20181016_pgconfeu_ssd2gpu_multi
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Grabbing the PostgreSQL Elephant by the Trunk
Sqream DB on OpenPOWER performance
Pg no sql_beatemjoinem_v10
Bogdan Kecman Advanced Databasing
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC...
Dumb Simple PostgreSQL Performance (NYCPUG)
An evening with Postgresql
Don't panic! - Postgres introduction
[EPPG] Oracle to PostgreSQL, Challenges to Opportunity
🐬 The future of MySQL is Postgres 🐘
Lessons PostgreSQL learned from commercial databases, and didn’t
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
Ad

More from Kohei KaiGai (20)

PDF
20221116_DBTS_PGStrom_History
PDF
20221111_JPUG_CustomScan_API
PDF
20211112_jpugcon_gpu_and_arrow
PDF
20210928_pgunconf_hll_count
PDF
20210731_OSC_Kyoto_PGStrom3.0
PDF
20210511_PGStrom_GpuCache
PDF
20201113_PGconf_Japan_GPU_PostGIS
PDF
20200828_OSCKyoto_Online
PDF
20200806_PGStrom_PostGIS_GstoreFdw
PDF
20200424_Writable_Arrow_Fdw
PDF
20191211_Apache_Arrow_Meetup_Tokyo
PDF
20191115-PGconf.Japan
PDF
20190926_Try_RHEL8_NVMEoF_Beta
PDF
20190925_DBTS_PGStrom
PDF
20190516_DLC10_PGStrom
PDF
20190418_PGStrom_on_ArrowFdw
PDF
20190314 PGStrom Arrow_Fdw
PDF
20181212 - PGconf.ASIA - LT
PDF
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
PDF
20180920_DBTS_PGStrom_JP
20221116_DBTS_PGStrom_History
20221111_JPUG_CustomScan_API
20211112_jpugcon_gpu_and_arrow
20210928_pgunconf_hll_count
20210731_OSC_Kyoto_PGStrom3.0
20210511_PGStrom_GpuCache
20201113_PGconf_Japan_GPU_PostGIS
20200828_OSCKyoto_Online
20200806_PGStrom_PostGIS_GstoreFdw
20200424_Writable_Arrow_Fdw
20191211_Apache_Arrow_Meetup_Tokyo
20191115-PGconf.Japan
20190926_Try_RHEL8_NVMEoF_Beta
20190925_DBTS_PGStrom
20190516_DLC10_PGStrom
20190418_PGStrom_on_ArrowFdw
20190314 PGStrom Arrow_Fdw
20181212 - PGconf.ASIA - LT
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20180920_DBTS_PGStrom_JP

Recently uploaded (20)

PDF
Connector Corner: Transform Unstructured Documents with Agentic Automation
PDF
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
Advancing precision in air quality forecasting through machine learning integ...
PPTX
Internet of Everything -Basic concepts details
DOCX
Basics of Cloud Computing - Cloud Ecosystem
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PPTX
MuleSoft-Compete-Deck for midddleware integrations
PDF
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
PDF
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
PPTX
Microsoft User Copilot Training Slide Deck
PDF
Enhancing plagiarism detection using data pre-processing and machine learning...
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PPTX
SGT Report The Beast Plan and Cyberphysical Systems of Control
PDF
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PPTX
future_of_ai_comprehensive_20250822032121.pptx
PDF
Data Virtualization in Action: Scaling APIs and Apps with FME
PDF
SaaS reusability assessment using machine learning techniques
Connector Corner: Transform Unstructured Documents with Agentic Automation
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
4 layer Arch & Reference Arch of IoT.pdf
Advancing precision in air quality forecasting through machine learning integ...
Internet of Everything -Basic concepts details
Basics of Cloud Computing - Cloud Ecosystem
NewMind AI Weekly Chronicles – August ’25 Week IV
MuleSoft-Compete-Deck for midddleware integrations
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
Microsoft User Copilot Training Slide Deck
Enhancing plagiarism detection using data pre-processing and machine learning...
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
SGT Report The Beast Plan and Cyberphysical Systems of Control
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
future_of_ai_comprehensive_20250822032121.pptx
Data Virtualization in Action: Scaling APIs and Apps with FME
SaaS reusability assessment using machine learning techniques

20170602_OSSummit_an_intelligent_storage

  • 1. An Intelligent Storage ~SQL Execution on GPU Closely Connected with SSD~ PG-Strom Development Team KaiGai Kohei <[email protected]>
  • 2. The PG-Strom Project Who am I? ▌KaiGai Kohei  PG-Strom Development Team ▌Background  Linux kernel (SELinux, …)  PostgreSQL database  Enhancement of security features  Writable FDW, Remote Join  Custom Scan Interface  GPU and CUDA ▌PG-Strom  SQL Acceleration Technology using GPU  Solution for data analytics in-database 2 Open Source Summit Japan 2017 - An Intelligent Storage
  • 3. The PG-Strom Project PG-Strom PG-Strom (1/3) – GPU as accelerator of SQL 3 GPU CPU Model NVIDIA Tesla P40 Intel Xeon E5-2699v4 Code name Pascal Broadwell Release Q3-2016 Q1-2016 # of transistors 12billion 7.2billion # of cores 3840 (simple) 22 (functional) Core clocks 1.306GHz ~1.53GHz 2.20GHz ~3.60GHz Performance (FP32) 12 TFLOPS 1.2 TFLOPS (with AVX2) RAM Size 24GB (GDDR5) max1.5TB (DDR4) Memory Band 347GB/s 76.8GB/s Power consumption 250W 145W Manufacturing process 16nm 14nm Application GPU Off-load WHERE, JOIN, GROUP BY Auto generation of GPU binary from SQL PG-Strom: Extension module of PostgreSQL, to accelerate SQL workloads using multi-thousands cores and high bandwidth memory. Same SQL Statement Same Data Structure Open Source Summit Japan 2017 - An Intelligent Storage
  • 4. The PG-Strom Project PG-Strom (2/3) – Transparent GPU code generation on demand Open Source Summit Japan 2017 - An Intelligent Storage4 QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Generate CUDA C code from calculation formula in WHERE-clause Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution
  • 5. The PG-Strom Project PG-Strom (3/3) – An example of SQL acceleration 5 ▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;  t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema) Open Source Summit Japan 2017 - An Intelligent Storage 8.48 13.23 18.28 23.42 28.88 34.50 40.77 47.16 5.00 5.46 5.91 6.45 7.17 8.07 9.22 10.21 0.0 5.0 10.0 15.0 20.0 25.0 30.0 35.0 40.0 45.0 50.0 2 3 4 5 6 7 8 9 QueryResponseTime[sec] Number of tables joined PG-Strom microbenchmark with JOIN/GROUP BY PostgreSQL v9.6 PG-Strom 2.0devel CPU: Xeon E5-2640v4 GPU: Tesla P40 RAM: 128GB OS: CentOS 7.3 DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
  • 6. The PG-Strom Project In-database Analytics ▌Computing close to data  Cost to move data  Cost to transform / validate data  Waste of time for DBAs ▌Utilization of SQL’s flexibility  Pre-/Post-process of analytic algorithms  Conjunction with other data (JOIN)  Summarization of results (GROUP BY/Window Functions) Open Source Summit Japan 2017 - An Intelligent Storage6
  • 7. The PG-Strom Project What we want to develop (1/2) Open Source Summit Japan 2017 - An Intelligent Storage7 Data Collection Summarize Pre-process Data Generation Analytics & ML Visualization Data Size: Large  Small Amount of computing: Small  Large SSD-to-GPU Direct SQL Exec Transparent GPU code generation & Asynchronous Exec Statistical Analytics & Machine Learning Libraries Advanced algorithms as a part of SQL statement RowColum Hardware Transform BI Tools Execution of entire lifecycle for data analytics in-database DNN k-means SVM Random Forest PL/CUDA NVMe-SSD GPU
  • 8. The PG-Strom Project What we want to develop (2/2) – elemental technology Open Source Summit Japan 2017 - An Intelligent Storage8 Data Collection Summarize Pre-process Data Generation Analytics & ML Visualization Data Size: Large  Small Amount of computing: Small  Large SSD-to-GPU Direct SQL Exec Transparent GPU code generation & Asynchronous Exec Statistical Analytics & Machine Learning Libraries Advanced algorithms as a part of SQL statement RowColum Hardware Transform BI Tools Execution of entire lifecycle for data analytics in-database DNN k-means SVM Random Forest PL/CUDA NVMe-SSD GPU SSD-to-GPU Direct SQL Execution (for I/O) PG-Strom (for SQL) PL/CUDA and Library (for computing)
  • 9. The PG-Strom Project I/O Acceleration with GPU? Open Source Summit Japan 2017 - An Intelligent Storage9
  • 10. The PG-Strom Project Re-definition of GPU’s role I/O workloads Computing workloads Open Source Summit Japan 2017 - An Intelligent Storage10
  • 11. The PG-Strom Project Brief Architecture of GPU GPU Device Memory (GDDR5 or HBM2) Host RAM Data Transfer with DMA over the PCIe Bus CPU PCIe Bus Also data transfer via DMA from: particular blocks to: destination physical address Storage GPU Device Open Source Summit Japan 2017 - An Intelligent Storage11
  • 12. The PG-Strom Project Technology Basis (1/2) – GPUDirect RDMA ▌Feature for peer-to-peer DMA between GPU and PCIe devices ▌Originally, designed for MPI over Infiniband ▌But available for any kind of PCIe devices, if custom kernel module Copyright (c) NVIDIA corporation, 2015 Open Source Summit Japan 2017 - An Intelligent Storage12
  • 13. The PG-Strom Project Technology Basis (2/2) – GPUDirect RDMA Physical Address Space PCIe BAR1 AreaGPU RAM Host RAM NVMe-SSD Infiniband HBA PCIe devices GPUDirect RDMA A feature to map GPU’s device memory on physical address space of the host system Once GPU device memory gets physical address of the host system, driver can use the address as source/destination of DMA operations. 13 Open Source Summit Japan 2017 - An Intelligent Storage
  • 14. The PG-Strom Project Idea of SSD-to-GPU Direct SQL Execution Open Source Summit Japan 2017 - An Intelligent Storage14 Large PostgreSQL Tables PCIe Bus NVMe SSD GPUSSD-to-GPU P2P DMA (NVMe-Strom driver) WHERE JOIN GROUP BYPostgreSQL Data Block Reduction of unnecessary rows according to the SQL  Smaller amount of I/O Existing Data Flow (Large I/O Size) SSD-to-GPU Direct SQL Execution – Load PostgreSQL’s data block to GPU directly, then reduce unnecessary rows with multi-thousands cores of GPUs.
  • 15. The PG-Strom Project As destination of SSD to GPU DMA commands!! Linux kernel APIs of GPUDirect RDMA int nvidia_p2p_get_pages(uint64_t p2p_token, /* deprecated */ uint32_t va_space, /* deprecated */ uint64_t virtual_address, uint64_t length, struct nvidia_p2p_page_table **page_table, void (*free_callback)(void *private_data), void *private_data); struct nvidia_p2p_page_table { uint32_t version; uint32_t page_size; struct nvidia_p2p_page **pages; uint32_t entries; uint8_t *gpu_uuid; } nvidia_p2p_page_table_t; Array of I/O mapped physical memory address to the GPU device memory CUresult cuMemAlloc(CUdeviceptr *daddr, size_t bytesize); Open Source Summit Japan 2017 - An Intelligent Storage15
  • 16. The PG-Strom Project NVMe-Strom Software Stack Open Source Summit Japan 2017 - An Intelligent Storage16 pg-strom NVMe-Strom VFS Page Cache NVMe SSD Driver nvidia driver GPU device memory GPU device memory PostgreSQL file offset DMA request block number SSD-to-GPU Peer-to-Peer DMA cuMemAlloc() /proc/nvme-strom ioctl(2) read(2) User Space Kernel Space
  • 17. The PG-Strom Project Benchmark (1/2) – Star Schema Benchmark Open Source Summit Japan 2017 - An Intelligent Storage17 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Q1-1 Q1-2 Q1-3 Q2-1 Q2-2 Q2-3 Q3-1 Q3-2 Q3-3 Q3-4 Q4-1 Q4-2 Q4-3 QueryExecutionThroughput[MB/s] PostgreSQL SSDx1 PostgreSQL SSDx2 PG-Strom SSDx1 PG-Strom SSDx2  SQL with WHERE-clause, JOIN, Aggregation/GROUP BY towards 353GB database.  Usually, entire workloads are dominated by I/O.  PostgreSQL achieves up to 1.6GB/s with regular filesystem based access.  PG-Strom with SSD-to-GPU direct SQL execution achieved up to 3.8GB/s at the peak performance. ※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec]) Theoretical limitation by 2x SSD [4.4GB/s] Under Investigation Theoretical limitation by 1x SSD [4.4GB/s]
  • 18. The PG-Strom Project Benchmark (2/2) – Star Schema Benchmark Open Source Summit Japan 2017 - An Intelligent Storage18  Example of Query SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date1, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_category = 'MFGR#12‘ AND s_region = 'AMERICA‘ GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1; customer 12M Rows (1.6GB) date1 2500 Rows (400KB) part 1.8M Rows (206MB) supplier 4M Rows (528MB) lineorder 2.4B Rows (351GB) Typical queries for summary; which consists of WHERE-clause, JOIN and GROUP BY. Size of lineorder table is larger than physical RAM size, majority of the workload is dominated by I/O.
• 19. The PG-Strom Project Hardware configuration on the benchmark
▌Hardware/Software configuration
 Model: DELL PowerEdge R730
 CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
 RAM: 128GB
 HDD: SAS 300GB x8 (RAID5)
 OS: CentOS7.2 (3.10.0-327.el7)
 SW: CUDA 7.5.17, NVIDIA driver 352.99, PostgreSQL 9.6.1, PG-Strom v2.0devel
▌GPU Specifications
 NVIDIA Tesla K80
 2x 2496 CUDA cores (560MHz)
 2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
 Intel SSD 750 (400GB) x2
 Interface: PCIe 3.0 x4 (NVMe 1.2)
 Max SeqRead: 2.2GB/s
[diagram: CPU1/CPU2 topology with the 2x SSD and the Tesla K80 GPU; a Tesla K20 GPU is installed but unused]
• 20. The PG-Strom Project (OT) Consistency with disk cache

 Problem
① A storage block may not be up to date if the target block is held in the page cache.
② Short DMA transfers are inefficient.
 Solution – Write cached PostgreSQL blocks (8KB per block) back to a userspace DMA buffer once, then kick a bulk transfer from RAM to GPU with the CUDA API (cuMemcpyHtoDAsync).
 The cost is almost equivalent to read(2) + cuMemcpyHtoDAsync().
 Although the PostgreSQL blocks end up reordered in GPU memory, the results remain correct.

[diagram: of blocks BLK-100 … BLK-109, the uncached ones go straight to GPU device memory by SSD-to-GPU P2P DMA, while the cached ones (BLK-101, 104, 105, 108) are gathered into the userspace DMA buffer and copied with cuMemcpyHtoDAsync; transfer size per request is BLCKSZ (=8KB) * NChunks]
• 21. The PG-Strom Project Not only Scan, but GpuJoin also...

[diagram: HashJoin by CPU vs. GpuHashJoin] In a CPU HashJoin, the hash table built over the inner relation is searched by the CPU, and the CPU generates the result rows before the next step. In GpuHashJoin, the hash-table search runs in parallel on the GPU and the GPU generates the results; the CPU side merely references them. Here the outer relation is fed by SSD-to-GPU P2P DMA.
• 22. The PG-Strom Project Not only Scan, but GpuPreAgg also...

[diagram: aggregation by CPU vs. aggregation with GPU] A plain CPU aggregation reads, say, 1M rows and processes all of them on the CPU. GpuPreAgg reduces the number of rows the CPU must process: a 1st step performs a parallel hash reduction on the GPU, and a 2nd step performs an atomic merge into interim results, so the CPU only aggregates a much smaller number of rows. Here, too, the input is fed by SSD-to-GPU P2P DMA.
• 23. The PG-Strom Project The Future
• 24. The PG-Strom Project Single Node Solution

[hardware block diagram: CPU with RAM; SSD-1, SSD-2 and the GPU attached on the PCI-E bus; DMI link to the chipset] 2x enterprise-grade SSDs (PCIe x8 based) plus one NVIDIA Tesla P40 give a query-processing capability of up to 10GB/s per node.
• 25. The PG-Strom Project Multi Node Solution (?)

A joint solution with PostgreSQL's scale-out technologies: each node carries 2x enterprise-grade SSDs (PCIe x8 based) and an NVIDIA Tesla P40.
• 26. The PG-Strom Project Scale Up Solution (?)

[block diagram: CPU1 and CPU2 linked by QPI x2, each with RAM and 96 PCIe lanes; PCIe switches fan out PCIe x16 x2 slots hosting GPU-1 … GPU-5 and SSD-1 … SSD-5] Deployment on a GPU monster node: 5x NVIDIA Tesla P40 and 5x enterprise-grade SSDs (PCIe x16 based).
• 27. The PG-Strom Project Towards columnar structure (1/3) – On-memory Columnar Cache

▌On-memory Columnar Cache
 Caches PostgreSQL's data blocks (row data) in memory as column data
 Reduces data transfer between RAM and GPU
 Optimizes use of the GPU's memory bandwidth
 Switches which GPU kernel is invoked according to whether a columnar cache exists

[diagram: a background worker constructs columnar caches from heap blocks with all_visible=true (e.g. BlockNum=102); an UPDATE invalidates the cache; PG-Strom runs the row-format GPU kernel on row data and the column-format GPU kernel on cached column data]
• 28. The PG-Strom Project Towards columnar structure (2/3) – Background

Why the columnar structure is preferable for the GPU architecture:
▌Row data format – random memory access
 Low memory-bus utilization, larger number of memory transactions: only one 32-bit value is useful within a 256-bit memory transaction (usage ratio: 12.5%)
▌Column data format – coalesced memory access
 Optimal memory-bus usage, least number of memory transactions: 32bit x8 = 256bit, so the entire 256-bit memory transaction is useful (usage ratio: 100.0%)
• 29. The PG-Strom Project Towards columnar structure (3/3) – FPGA engine on SSD

FPGA logic on the SSD would let data be written by row, then read as column, bridging the world of OLTP and the world of OLAP: transactional data is written in row format; a Row→Column transformer implemented in the FPGA converts the data format on the SSD itself; analytic reads then fetch pre-processed data in column format, and only the required columns, which is the best layout for SQL execution on the GPU.
• 30. The PG-Strom Project Towards the "true" intelligent storage system

Step-1: A pair of NVMe SSD and GPU on the PCIe bus behaves like an intelligent storage that understands SQL workloads: WHERE, JOIN and GROUP BY run on the GPU over large PostgreSQL tables' data blocks delivered by SSD-to-GPU P2P DMA (NVMe-Strom driver), filtering out unnecessary rows before data reaches main memory, hence a smaller amount of I/O.
Step-2: A special NVMe SSD transforms the data format (PostgreSQL's block format in this case) into one more suitable for processing, based on knowledge of the requester.