An Intelligent Storage
~SQL Execution on GPU Closely Connected with SSD~
PG-Strom Development Team
KaiGai Kohei <kaigai@kaigai.gr.jp>
The PG-Strom Project
Who am I?
▌KaiGai Kohei
 PG-Strom Development Team
▌Background
 Linux kernel (SELinux, …)
 PostgreSQL database
 Enhancement of security features
 Writable FDW, Remote Join
 Custom Scan Interface
 GPU and CUDA
▌PG-Strom
 SQL Acceleration Technology using GPU
 Solution for data analytics in-database
Open Source Summit Japan 2017 - An Intelligent Storage
PG-Strom (1/3) – GPU as accelerator of SQL
                        GPU                     CPU
Model                   NVIDIA Tesla P40        Intel Xeon E5-2699v4
Code name               Pascal                  Broadwell
Release                 Q3-2016                 Q1-2016
# of transistors        12 billion              7.2 billion
# of cores              3840 (simple)           22 (functional)
Core clocks             1.306GHz ~ 1.53GHz      2.20GHz ~ 3.60GHz
Performance (FP32)      12 TFLOPS               1.2 TFLOPS (with AVX2)
RAM size                24GB (GDDR5)            max 1.5TB (DDR4)
Memory bandwidth        347GB/s                 76.8GB/s
Power consumption       250W                    145W
Manufacturing process   16nm                    14nm
The application issues the same SQL statements against the same data structures; WHERE, JOIN and GROUP BY are off-loaded to the GPU, with automatic generation of GPU binaries from SQL.

PG-Strom: an extension module of PostgreSQL that accelerates SQL workloads using thousands of GPU cores and high-bandwidth memory.
PG-Strom (2/3) – Transparent GPU code generation on demand
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    size_t kds_index)
{
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3,
                               pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
}
:
Example: CUDA C code generated from the expression in the WHERE-clause. The KVAR_* variables reference the input data, and the SQL expression is rendered as CUDA source code. The generated source is compiled just-in-time by the run-time compiler (nvrtc), then executed in parallel on the GPU.
PG-Strom (3/3) – An example of SQL acceleration
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
 t0 contains 100M rows, t1...t8 contain 100K rows each (like a star schema)
PG-Strom microbenchmark with JOIN/GROUP BY: query response time [sec] by number of tables joined

Tables joined        2      3      4      5      6      7      8      9
PostgreSQL v9.6    8.48  13.23  18.28  23.42  28.88  34.50  40.77  47.16
PG-Strom 2.0devel  5.00   5.46   5.91   6.45   7.17   8.07   9.22  10.21

Environment: CPU: Xeon E5-2640v4, GPU: Tesla P40, RAM: 128GB, OS: CentOS 7.3, DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
In-database Analytics
▌Computing close to data
 Cost to move data
 Cost to transform / validate data
 Waste of time for DBAs
▌Utilization of SQL’s flexibility
 Pre-/Post-process of analytic
algorithms
 Conjunction with other data (JOIN)
 Summarization of results
(GROUP BY/Window Functions)
What we want to develop (1/2)
Analytics lifecycle in-database: Data Generation → Data Collection → Summarize / Pre-process → Analytics & ML → Visualization. (Data size: large → small; amount of computing: small → large.)

Key elements along the pipeline:
 SSD-to-GPU Direct SQL Execution (NVMe-SSD → GPU)
 Transparent GPU code generation & asynchronous execution
 Statistical analytics & machine learning libraries (DNN, k-means, SVM, Random Forest) via PL/CUDA
 Advanced algorithms as a part of SQL statements
 Row→Column hardware transform
 Visualization with BI tools

Goal: execution of the entire lifecycle of data analytics in-database.
What we want to develop (2/2) – elemental technology
(Same lifecycle diagram as the previous slide.) The elemental technologies map as follows:
 SSD-to-GPU Direct SQL Execution – for I/O
 PG-Strom – for SQL
 PL/CUDA and libraries – for computing
I/O Acceleration with GPU?
Re-definition of GPU’s role
I/O workloads Computing workloads
Brief Architecture of GPU
GPU device memory (GDDR5 or HBM2) and host RAM exchange data via DMA over the PCIe bus, driven by the CPU. Storage can likewise transfer data via DMA: from particular blocks, to a destination physical address.
Technology Basis (1/2) – GPUDirect RDMA
▌Feature for peer-to-peer DMA between GPU and PCIe devices
▌Originally designed for MPI over Infiniband
▌Available for any kind of PCIe device, given a custom kernel module
Copyright (c) NVIDIA corporation, 2015
Technology Basis (2/2) – GPUDirect RDMA
GPUDirect RDMA is a feature to map the GPU's device memory (through the PCIe BAR1 area) onto the physical address space of the host system, alongside host RAM and other PCIe devices such as NVMe-SSDs and Infiniband HBAs. Once GPU device memory has a physical address in the host system, a driver can use that address as the source/destination of DMA operations.
Idea of SSD-to-GPU Direct SQL Execution
SSD-to-GPU Direct SQL Execution: load PostgreSQL's data blocks onto the GPU directly, then reduce unnecessary rows with the thousands of cores of the GPU.

Large PostgreSQL tables are read from NVMe SSD straight into the GPU over the PCIe bus via SSD-to-GPU P2P DMA (NVMe-Strom driver). The GPU evaluates WHERE / JOIN / GROUP BY on the PostgreSQL data blocks, so unnecessary rows are dropped before they reach host RAM. Compared with the existing data flow (large I/O size), this means a much smaller amount of I/O.
Linux kernel APIs of GPUDirect RDMA
int nvidia_p2p_get_pages(uint64_t p2p_token,     /* deprecated */
                         uint32_t va_space,      /* deprecated */
                         uint64_t virtual_address,
                         uint64_t length,
                         struct nvidia_p2p_page_table **page_table,
                         void (*free_callback)(void *private_data),
                         void *private_data);

typedef struct nvidia_p2p_page_table {
    uint32_t version;
    uint32_t page_size;
    struct nvidia_p2p_page **pages;  /* array of I/O-mapped physical memory
                                      * addresses of the GPU device memory,
                                      * usable as the destination of
                                      * SSD-to-GPU DMA commands!! */
    uint32_t entries;
    uint8_t *gpu_uuid;
} nvidia_p2p_page_table_t;

The GPU device memory itself is allocated in userspace with:
CUresult cuMemAlloc(CUdeviceptr *daddr,
size_t bytesize);
NVMe-Strom Software Stack
User space: PostgreSQL / pg-strom allocates GPU device memory with cuMemAlloc(), then issues DMA requests to the NVMe-Strom kernel module via ioctl(2) on /proc/nvme-strom, translating a file offset into a block number. Buffered access still goes through read(2), the VFS and the page cache.
Kernel space: NVMe-Strom drives the NVMe SSD driver and cooperates with the nvidia driver to run the SSD-to-GPU peer-to-peer DMA directly into GPU device memory.
Benchmark (1/2) – Star Schema Benchmark
[Chart: query execution throughput [MB/s], 0 to 4500, for star schema benchmark queries Q1-1 ... Q4-3; four series: PostgreSQL SSDx1, PostgreSQL SSDx2, PG-Strom SSDx1, PG-Strom SSDx2]
 SQL with WHERE-clause, JOIN and aggregation/GROUP BY against a 353GB database.
 Usually, the entire workload is dominated by I/O.
 PostgreSQL achieves up to 1.6GB/s with regular filesystem-based access.
 PG-Strom with SSD-to-GPU Direct SQL Execution achieved up to 3.8GB/s at peak performance.
※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec])
Chart annotations: theoretical limitation by 1x SSD [2.2GB/s]; theoretical limitation by 2x SSD [4.4GB/s]; one result is marked "under investigation".
Benchmark (2/2) – Star Schema Benchmark
 Example of Query
SELECT sum(lo_revenue), d_year, p_brand1
  FROM lineorder, date1, part, supplier
 WHERE lo_orderdate = d_datekey
   AND lo_partkey = p_partkey
   AND lo_suppkey = s_suppkey
   AND p_category = 'MFGR#12'
   AND s_region = 'AMERICA'
 GROUP BY d_year, p_brand1
 ORDER BY d_year, p_brand1;
Schema sizes: lineorder 2.4B rows (351GB); customer 12M rows (1.6GB); date1 2500 rows (400KB); part 1.8M rows (206MB); supplier 4M rows (528MB).

Typical summary queries, consisting of a WHERE-clause, JOIN and GROUP BY. Since the lineorder table is larger than the physical RAM size, the majority of the workload is dominated by I/O.
Hardware configuration on the benchmark
▌Hardware/Software configuration
 Model: DELL PowerEdge R730
 CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
 RAM: 128GB
 HDD: SAS 300GB x8 (RAID5)
 OS: CentOS7.2 (3.10.0-327.el7)
 SW: CUDA 7.5.17
NVIDIA driver 352.99
PostgreSQL 9.6.1
PG-Strom v2.0devel
▌GPU Specifications
 NVIDIA Tesla K80
 2x 2496 CUDA cores (560MHz)
 2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
 Intel SSD 750 (400GB) x2
 Interface: PCIe 3.0 x4 (NVMe 1.2)
 Max SeqRead: 2.2GB/s
[Block diagram: CPU1 and CPU2; the 2x SSD and the Tesla K80 GPU used in the benchmark; a Tesla K20 GPU (unused)]
(OT) Consistency with disk cache
 Problem: ① storage blocks may not be up-to-date if the target blocks are kept in page caches; ② short DMAs are not efficient.
 Solution: write back cached PostgreSQL blocks (= 8KB/block) to a userspace buffer once, then kick a bulk transfer from RAM to GPU with the CUDA APIs.
 Almost equivalent cost to read(2) + cuMemcpyHtoDAsync().
 Although PostgreSQL blocks are reordered in GPU memory, the results are still correct.
Example with blocks BLK-100 ... BLK-109 (BLCKSZ = 8KB; transfer size per request = BLCKSZ * NChunks): the cached blocks (BLK-101, BLK-104, BLK-105, BLK-108) are copied from the file's page cache into a userspace DMA buffer in RAM and sent to GPU device memory with cuMemcpyHtoDAsync(); the uncached blocks (BLK-100, BLK-102, BLK-103, BLK-106, BLK-107, BLK-109) are transferred directly by SSD-to-GPU P2P DMA.
Not only Scan, but GpuJoin also...
HashJoin by CPU: the hash table is built from the inner relation; hash table search over the outer relation and generation of the result rows are both done by the CPU before the next step.
GpuHashJoin: the hash table is built the same way, but the hash table search runs in parallel on the GPU and the results are generated by the GPU; the CPU side just references the results. The scan of the outer relation is where SSD-to-GPU P2P DMA applies.
Not only Scan, but GpuPreAgg also...
Aggregation by CPU: read, say, 1M rows from the relation, then process all 1M rows on the CPU.
Aggregation with GPU (GpuPreAgg): 1st step, parallel hash reduction on the GPU; 2nd step, atomic merge into interim results. Only a much smaller number of rows has to be processed by the CPU for the final aggregation. Here again, SSD-to-GPU P2P DMA feeds the GPU.
The Future
Single Node Solution
Hardware block diagram: one CPU with RAM; 2x enterprise-grade SSD (PCIe x8 based) and an NVIDIA Tesla P40 GPU on the PCIe bus. Max 10GB/s capability of query processing per node.
Multi Node Solution (?)
Joint solution with PostgreSQL's scale-out technologies: each node carries 2x enterprise-grade SSD (PCIe x8 based) and an NVIDIA Tesla P40.
Scale Up Solution (?)
Deployment on a GPU monster server: 5x NVIDIA Tesla P40 and 5x enterprise-grade SSD (PCIe x16 based). CPU1 and CPU2 (connected by 2x QPI, 96 PCIe lanes each) feed PCIe switches (PCI-SW) with PCIe x16 x2 slots, behind which GPU-1 ... GPU-5 are paired with SSD-1 ... SSD-5.
Towards columnar structure (1/3) – On-memory Columnar Cache
▌On-memory Columnar Cache
 Caches PostgreSQL's data blocks (row-data) in memory as column-data
 Reduces the amount of data transferred between RAM and GPU
 Makes better use of the GPU's memory bandwidth
 The GPU kernel to be invoked is switched according to the existence of the column-cache
A background worker constructs columnar caches from data blocks whose tuples are all visible (e.g. BlockNum = 102 with all_visible = true, but not BlockNum = 101 with all_visible = false); an UPDATE invalidates the affected cache entry. For each block, PG-Strom then invokes either the GPU kernel for row-data or the GPU kernel for column-data, depending on whether a columnar cache exists for it.
Towards columnar structure (2/3) – Background
▌Row data format – random memory access
 Lower usage ratio of the memory bus, larger number of memory transactions
▌Column data format – coalesced memory access
 Optimal memory bus usage, least number of memory transactions
Why a columnar structure is preferable for the GPU architecture: with a memory transaction width of 256bit, row format means only 32bit x1 = 32bit of each 256bit transaction is valid data (usage ratio: 12.5%), while column format lets the GPU cores consume 32bit x8 = 256bit per transaction (usage ratio: 100.0%).
Towards columnar structure (3/3) – FPGA engine on SSD
FPGA logic on the SSD enables "write by row, then read as column": transactional data is written in row format (the world of OLTP); the Row→Column transformer on the SSD converts the data format; analytic data is then read in column format (the world of OLAP), with only the required columns fetched. SQL execution runs on the pre-processed, columnar-format data, which is best for the GPU.
Towards the “true” intelligent storage system
(The same NVMe SSD → GPU P2P DMA data path as before: large PostgreSQL tables are loaded over the PCIe bus by the NVMe-Strom driver, and WHERE / JOIN / GROUP BY reduce unnecessary rows on the GPU for a smaller amount of I/O.)

Step-1: A pair of NVMe-SSD and GPU looks like an intelligent storage device that understands SQL workloads, filtering out unnecessary rows and pre-processing JOIN/GROUP BY prior to loading the data onto main memory.
Step-2: A special NVMe-SSD transforms the data format into one more suitable for the requester to process (PostgreSQL's block format in this case).
Any Questions?

More Related Content

PDF
pgconfasia2016 plcuda en
PDF
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PDF
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
PDF
PG-Strom
PDF
20160407_GTC2016_PgSQL_In_Place
PDF
GPGPU Accelerates PostgreSQL (English)
PDF
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
PDF
20150318-SFPUG-Meetup-PGStrom
pgconfasia2016 plcuda en
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
PG-Strom
20160407_GTC2016_PgSQL_In_Place
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
20150318-SFPUG-Meetup-PGStrom

What's hot (20)

PDF
SQL+GPU+SSD=∞ (English)
PDF
PG-Strom - GPU Accelerated Asyncr
PDF
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PDF
20171206 PGconf.ASIA LT gstore_fdw
PDF
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PDF
20201006_PGconf_Online_Large_Data_Processing
PDF
20181212 - PGconfASIA - LT - English
PDF
20201128_OSC_Fukuoka_Online_GPUPostGIS
PDF
PostgreSQL with OpenCL
PDF
Let's turn your PostgreSQL into columnar store with cstore_fdw
PDF
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
PDF
PG-Strom - A FDW module utilizing GPU device
PDF
20181025_pgconfeu_lt_gstorefdw
PDF
20181210 - PGconf.ASIA Unconference
PPTX
GPGPU programming with CUDA
PDF
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PPTX
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
PDF
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
PPTX
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PDF
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
SQL+GPU+SSD=∞ (English)
PG-Strom - GPU Accelerated Asyncr
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
20171206 PGconf.ASIA LT gstore_fdw
PG-Strom v2.0 Technical Brief (17-Apr-2018)
20201006_PGconf_Online_Large_Data_Processing
20181212 - PGconfASIA - LT - English
20201128_OSC_Fukuoka_Online_GPUPostGIS
PostgreSQL with OpenCL
Let's turn your PostgreSQL into columnar store with cstore_fdw
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
PG-Strom - A FDW module utilizing GPU device
20181025_pgconfeu_lt_gstorefdw
20181210 - PGconf.ASIA Unconference
GPGPU programming with CUDA
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
Ad

Similar to 20170602_OSSummit_an_intelligent_storage (20)

PDF
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PDF
20190909_PGconf.ASIA_KaiGai
PDF
20181116 Massive Log Processing using I/O optimized PostgreSQL
PDF
20180920_DBTS_PGStrom_EN
PDF
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
PDF
20181016_pgconfeu_ssd2gpu_multi
PDF
PDF
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
KEY
Grabbing the PostgreSQL Elephant by the Trunk
PDF
Sqream DB on OpenPOWER performance
PDF
Pg no sql_beatemjoinem_v10
ODP
Bogdan Kecman Advanced Databasing
PDF
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC...
PDF
Dumb Simple PostgreSQL Performance (NYCPUG)
PDF
An evening with Postgresql
PDF
Don't panic! - Postgres introduction
PDF
[EPPG] Oracle to PostgreSQL, Challenges to Opportunity
PDF
🐬 The future of MySQL is Postgres 🐘
PDF
Lessons PostgreSQL learned from commercial databases, and didn’t
PDF
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
20190909_PGconf.ASIA_KaiGai
20181116 Massive Log Processing using I/O optimized PostgreSQL
20180920_DBTS_PGStrom_EN
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
20181016_pgconfeu_ssd2gpu_multi
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Grabbing the PostgreSQL Elephant by the Trunk
Sqream DB on OpenPOWER performance
Pg no sql_beatemjoinem_v10
Bogdan Kecman Advanced Databasing
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC...
Dumb Simple PostgreSQL Performance (NYCPUG)
An evening with Postgresql
Don't panic! - Postgres introduction
[EPPG] Oracle to PostgreSQL, Challenges to Opportunity
🐬 The future of MySQL is Postgres 🐘
Lessons PostgreSQL learned from commercial databases, and didn’t
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
Ad

More from Kohei KaiGai (20)

PDF
20221116_DBTS_PGStrom_History
PDF
20221111_JPUG_CustomScan_API
PDF
20211112_jpugcon_gpu_and_arrow
PDF
20210928_pgunconf_hll_count
PDF
20210731_OSC_Kyoto_PGStrom3.0
PDF
20210511_PGStrom_GpuCache
PDF
20201113_PGconf_Japan_GPU_PostGIS
PDF
20200828_OSCKyoto_Online
PDF
20200806_PGStrom_PostGIS_GstoreFdw
PDF
20200424_Writable_Arrow_Fdw
PDF
20191211_Apache_Arrow_Meetup_Tokyo
PDF
20191115-PGconf.Japan
PDF
20190926_Try_RHEL8_NVMEoF_Beta
PDF
20190925_DBTS_PGStrom
PDF
20190516_DLC10_PGStrom
PDF
20190418_PGStrom_on_ArrowFdw
PDF
20190314 PGStrom Arrow_Fdw
PDF
20181212 - PGconf.ASIA - LT
PDF
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
PDF
20180920_DBTS_PGStrom_JP
20221116_DBTS_PGStrom_History
20221111_JPUG_CustomScan_API
20211112_jpugcon_gpu_and_arrow
20210928_pgunconf_hll_count
20210731_OSC_Kyoto_PGStrom3.0
20210511_PGStrom_GpuCache
20201113_PGconf_Japan_GPU_PostGIS
20200828_OSCKyoto_Online
20200806_PGStrom_PostGIS_GstoreFdw
20200424_Writable_Arrow_Fdw
20191211_Apache_Arrow_Meetup_Tokyo
20191115-PGconf.Japan
20190926_Try_RHEL8_NVMEoF_Beta
20190925_DBTS_PGStrom
20190516_DLC10_PGStrom
20190418_PGStrom_on_ArrowFdw
20190314 PGStrom Arrow_Fdw
20181212 - PGconf.ASIA - LT
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20180920_DBTS_PGStrom_JP

Recently uploaded (20)

PDF
Connector Corner: Transform Unstructured Documents with Agentic Automation
PDF
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
Advancing precision in air quality forecasting through machine learning integ...
PPTX
Internet of Everything -Basic concepts details
DOCX
Basics of Cloud Computing - Cloud Ecosystem
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PPTX
MuleSoft-Compete-Deck for midddleware integrations
PDF
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
PDF
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
PPTX
Microsoft User Copilot Training Slide Deck
PDF
Enhancing plagiarism detection using data pre-processing and machine learning...
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PPTX
SGT Report The Beast Plan and Cyberphysical Systems of Control
PDF
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PPTX
future_of_ai_comprehensive_20250822032121.pptx
PDF
Data Virtualization in Action: Scaling APIs and Apps with FME
PDF
SaaS reusability assessment using machine learning techniques
Connector Corner: Transform Unstructured Documents with Agentic Automation
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
4 layer Arch & Reference Arch of IoT.pdf
Advancing precision in air quality forecasting through machine learning integ...
Internet of Everything -Basic concepts details
Basics of Cloud Computing - Cloud Ecosystem
NewMind AI Weekly Chronicles – August ’25 Week IV
MuleSoft-Compete-Deck for midddleware integrations
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
Microsoft User Copilot Training Slide Deck
Enhancing plagiarism detection using data pre-processing and machine learning...
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
SGT Report The Beast Plan and Cyberphysical Systems of Control
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
future_of_ai_comprehensive_20250822032121.pptx
Data Virtualization in Action: Scaling APIs and Apps with FME
SaaS reusability assessment using machine learning techniques

20170602_OSSummit_an_intelligent_storage

  • 1. An Intelligent Storage ~SQL Execution on GPU Closely Connected with SSD~ PG-Strom Development Team KaiGai Kohei <[email protected]>
  • 2. The PG-Strom Project Who am I? ▌KaiGai Kohei  PG-Strom Development Team ▌Background  Linux kernel (SELinux, …)  PostgreSQL database  Enhancement of security features  Writable FDW, Remote Join  Custom Scan Interface  GPU and CUDA ▌PG-Strom  SQL Acceleration Technology using GPU  Solution for data analytics in-database 2 Open Source Summit Japan 2017 - An Intelligent Storage
  • 3. The PG-Strom Project PG-Strom PG-Strom (1/3) – GPU as accelerator of SQL 3 GPU CPU Model NVIDIA Tesla P40 Intel Xeon E5-2699v4 Code name Pascal Broadwell Release Q3-2016 Q1-2016 # of transistors 12billion 7.2billion # of cores 3840 (simple) 22 (functional) Core clocks 1.306GHz ~1.53GHz 2.20GHz ~3.60GHz Performance (FP32) 12 TFLOPS 1.2 TFLOPS (with AVX2) RAM Size 24GB (GDDR5) max1.5TB (DDR4) Memory Band 347GB/s 76.8GB/s Power consumption 250W 145W Manufacturing process 16nm 14nm Application GPU Off-load WHERE, JOIN, GROUP BY Auto generation of GPU binary from SQL PG-Strom: Extension module of PostgreSQL, to accelerate SQL workloads using multi-thousands cores and high bandwidth memory. Same SQL Statement Same Data Structure Open Source Summit Japan 2017 - An Intelligent Storage
  • 4. The PG-Strom Project PG-Strom (2/3) – Transparent GPU code generation on demand Open Source Summit Japan 2017 - An Intelligent Storage4 QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Generate CUDA C code from calculation formula in WHERE-clause Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution
  • 5. The PG-Strom Project PG-Strom (3/3) – An example of SQL acceleration 5 ▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;  t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema) Open Source Summit Japan 2017 - An Intelligent Storage 8.48 13.23 18.28 23.42 28.88 34.50 40.77 47.16 5.00 5.46 5.91 6.45 7.17 8.07 9.22 10.21 0.0 5.0 10.0 15.0 20.0 25.0 30.0 35.0 40.0 45.0 50.0 2 3 4 5 6 7 8 9 QueryResponseTime[sec] Number of tables joined PG-Strom microbenchmark with JOIN/GROUP BY PostgreSQL v9.6 PG-Strom 2.0devel CPU: Xeon E5-2640v4 GPU: Tesla P40 RAM: 128GB OS: CentOS 7.3 DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
  • 6. The PG-Strom Project In-database Analytics ▌Computing close to data  Cost to move data  Cost to transform / validate data  Waste of time for DBAs ▌Utilization of SQL’s flexibility  Pre-/Post-process of analytic algorithms  Conjunction with other data (JOIN)  Summarization of results (GROUP BY/Window Functions) Open Source Summit Japan 2017 - An Intelligent Storage6
  • 7. The PG-Strom Project What we want to develop (1/2) Open Source Summit Japan 2017 - An Intelligent Storage7 Data Collection Summarize Pre-process Data Generation Analytics & ML Visualization Data Size: Large  Small Amount of computing: Small  Large SSD-to-GPU Direct SQL Exec Transparent GPU code generation & Asynchronous Exec Statistical Analytics & Machine Learning Libraries Advanced algorithms as a part of SQL statement RowColum Hardware Transform BI Tools Execution of entire lifecycle for data analytics in-database DNN k-means SVM Random Forest PL/CUDA NVMe-SSD GPU
  • 8. The PG-Strom Project What we want to develop (2/2) – elemental technology Open Source Summit Japan 2017 - An Intelligent Storage8 Data Collection Summarize Pre-process Data Generation Analytics & ML Visualization Data Size: Large  Small Amount of computing: Small  Large SSD-to-GPU Direct SQL Exec Transparent GPU code generation & Asynchronous Exec Statistical Analytics & Machine Learning Libraries Advanced algorithms as a part of SQL statement RowColum Hardware Transform BI Tools Execution of entire lifecycle for data analytics in-database DNN k-means SVM Random Forest PL/CUDA NVMe-SSD GPU SSD-to-GPU Direct SQL Execution (for I/O) PG-Strom (for SQL) PL/CUDA and Library (for computing)
  • 9. The PG-Strom Project I/O Acceleration with GPU? Open Source Summit Japan 2017 - An Intelligent Storage9
  • 10. The PG-Strom Project Re-definition of GPU’s role I/O workloads Computing workloads Open Source Summit Japan 2017 - An Intelligent Storage10
  • 11. The PG-Strom Project Brief Architecture of GPU GPU Device Memory (GDDR5 or HBM2) Host RAM Data Transfer with DMA over the PCIe Bus CPU PCIe Bus Also data transfer via DMA from: particular blocks to: destination physical address Storage GPU Device Open Source Summit Japan 2017 - An Intelligent Storage11
  • 12. The PG-Strom Project Technology Basis (1/2) – GPUDirect RDMA ▌Feature for peer-to-peer DMA between GPU and PCIe devices ▌Originally, designed for MPI over Infiniband ▌But available for any kind of PCIe devices, if custom kernel module Copyright (c) NVIDIA corporation, 2015 Open Source Summit Japan 2017 - An Intelligent Storage12
  • 13. The PG-Strom Project Technology Basis (2/2) – GPUDirect RDMA Physical Address Space PCIe BAR1 AreaGPU RAM Host RAM NVMe-SSD Infiniband HBA PCIe devices GPUDirect RDMA A feature to map GPU’s device memory on physical address space of the host system Once GPU device memory gets physical address of the host system, driver can use the address as source/destination of DMA operations. 13 Open Source Summit Japan 2017 - An Intelligent Storage
  • 14. The PG-Strom Project Idea of SSD-to-GPU Direct SQL Execution Open Source Summit Japan 2017 - An Intelligent Storage14 Large PostgreSQL Tables PCIe Bus NVMe SSD GPUSSD-to-GPU P2P DMA (NVMe-Strom driver) WHERE JOIN GROUP BYPostgreSQL Data Block Reduction of unnecessary rows according to the SQL  Smaller amount of I/O Existing Data Flow (Large I/O Size) SSD-to-GPU Direct SQL Execution – Load PostgreSQL’s data block to GPU directly, then reduce unnecessary rows with multi-thousands cores of GPUs.
  • 15. The PG-Strom Project As destination of SSD to GPU DMA commands!! Linux kernel APIs of GPUDirect RDMA int nvidia_p2p_get_pages(uint64_t p2p_token, /* deprecated */ uint32_t va_space, /* deprecated */ uint64_t virtual_address, uint64_t length, struct nvidia_p2p_page_table **page_table, void (*free_callback)(void *private_data), void *private_data); struct nvidia_p2p_page_table { uint32_t version; uint32_t page_size; struct nvidia_p2p_page **pages; uint32_t entries; uint8_t *gpu_uuid; } nvidia_p2p_page_table_t; Array of I/O mapped physical memory address to the GPU device memory CUresult cuMemAlloc(CUdeviceptr *daddr, size_t bytesize); Open Source Summit Japan 2017 - An Intelligent Storage15
  • 16. The PG-Strom Project NVMe-Strom Software Stack Open Source Summit Japan 2017 - An Intelligent Storage16 pg-strom NVMe-Strom VFS Page Cache NVMe SSD Driver nvidia driver GPU device memory GPU device memory PostgreSQL file offset DMA request block number SSD-to-GPU Peer-to-Peer DMA cuMemAlloc() /proc/nvme-strom ioctl(2) read(2) User Space Kernel Space
  • 17. The PG-Strom Project Benchmark (1/2) – Star Schema Benchmark Open Source Summit Japan 2017 - An Intelligent Storage17 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Q1-1 Q1-2 Q1-3 Q2-1 Q2-2 Q2-3 Q3-1 Q3-2 Q3-3 Q3-4 Q4-1 Q4-2 Q4-3 QueryExecutionThroughput[MB/s] PostgreSQL SSDx1 PostgreSQL SSDx2 PG-Strom SSDx1 PG-Strom SSDx2  SQL with WHERE-clause, JOIN, Aggregation/GROUP BY towards 353GB database.  Usually, entire workloads are dominated by I/O.  PostgreSQL achieves up to 1.6GB/s with regular filesystem based access.  PG-Strom with SSD-to-GPU direct SQL execution achieved up to 3.8GB/s at the peak performance. ※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec]) Theoretical limitation by 2x SSD [4.4GB/s] Under Investigation Theoretical limitation by 1x SSD [4.4GB/s]
  • 18. The PG-Strom Project Benchmark (2/2) – Star Schema Benchmark Open Source Summit Japan 2017 - An Intelligent Storage18  Example of Query SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date1, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_category = 'MFGR#12‘ AND s_region = 'AMERICA‘ GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1; customer 12M Rows (1.6GB) date1 2500 Rows (400KB) part 1.8M Rows (206MB) supplier 4M Rows (528MB) lineorder 2.4B Rows (351GB) Typical queries for summary; which consists of WHERE-clause, JOIN and GROUP BY. Size of lineorder table is larger than physical RAM size, majority of the workload is dominated by I/O.
• 19. The PG-Strom Project Hardware configuration on the benchmark
▌Hardware/Software configuration
 Model: DELL PowerEdge R730
 CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
 RAM: 128GB
 HDD: SAS 300GB x8 (RAID5)
 OS: CentOS7.2 (3.10.0-327.el7)
 SW: CUDA 7.5.17, NVIDIA driver 352.99, PostgreSQL 9.6.1, PG-Strom v2.0devel
▌GPU Specifications
 NVIDIA Tesla K80
 2x 2496 CUDA cores (560MHz)
 2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
 Intel SSD 750 (400GB) x2
 Interface: PCIe 3.0 x4 (NVMe 1.2)
 Max SeqRead: 2.2GB/s
[diagram: CPU1/CPU2 topology with the 2x SSD and the Tesla K80 GPU; a Tesla K20 GPU is installed but unused]
• 20. The PG-Strom Project (OT) Consistency with disk cache

 Problem
① A storage block may not be up to date if the target block is held in the page cache.
② Short DMA transfers are inefficient.
 Solution – Write cached PostgreSQL blocks (8KB per block) back to a userspace DMA buffer once, then kick a bulk transfer from RAM to GPU with the CUDA API (cuMemcpyHtoDAsync).
 The cost is almost equivalent to read(2) + cuMemcpyHtoDAsync().
 Although the PostgreSQL blocks end up reordered in GPU memory, the results remain correct.

[diagram: of blocks BLK-100 … BLK-109, the uncached ones go straight to GPU device memory by SSD-to-GPU P2P DMA, while the cached ones (BLK-101, 104, 105, 108) are gathered into the userspace DMA buffer and copied with cuMemcpyHtoDAsync; transfer size per request is BLCKSZ (=8KB) * NChunks]
• 21. The PG-Strom Project Not only Scan, but GpuJoin also...

[diagram: HashJoin by CPU vs. GpuHashJoin] In a CPU HashJoin, the hash table built over the inner relation is searched by the CPU, and the CPU generates the result rows before the next step. In GpuHashJoin, the hash-table search runs in parallel on the GPU and the GPU generates the results; the CPU side merely references them. Here the outer relation is fed by SSD-to-GPU P2P DMA.
• 22. The PG-Strom Project Not only Scan, but GpuPreAgg also...

[diagram: aggregation by CPU vs. aggregation with GPU] A plain CPU aggregation reads, say, 1M rows and processes all of them on the CPU. GpuPreAgg reduces the number of rows the CPU must process: a 1st step performs a parallel hash reduction on the GPU, and a 2nd step performs an atomic merge into interim results, so the CPU only aggregates a much smaller number of rows. Here, too, the input is fed by SSD-to-GPU P2P DMA.
• 23. The PG-Strom Project The Future
• 24. The PG-Strom Project Single Node Solution

[hardware block diagram: CPU with RAM; SSD-1, SSD-2 and the GPU attached on the PCI-E bus; DMI link to the chipset] 2x enterprise-grade SSDs (PCIe x8 based) plus one NVIDIA Tesla P40 give a query-processing capability of up to 10GB/s per node.
• 25. The PG-Strom Project Multi Node Solution (?)

A joint solution with PostgreSQL's scale-out technologies: each node carries 2x enterprise-grade SSDs (PCIe x8 based) and an NVIDIA Tesla P40.
• 26. The PG-Strom Project Scale Up Solution (?)

[block diagram: CPU1 and CPU2 linked by QPI x2, each with RAM and 96 PCIe lanes; PCIe switches fan out PCIe x16 x2 slots hosting GPU-1 … GPU-5 and SSD-1 … SSD-5] Deployment on a GPU monster node: 5x NVIDIA Tesla P40 and 5x enterprise-grade SSDs (PCIe x16 based).
• 27. The PG-Strom Project Towards columnar structure (1/3) – On-memory Columnar Cache

▌On-memory Columnar Cache
 Caches PostgreSQL's data blocks (row data) in memory as column data
 Reduces data transfer between RAM and GPU
 Optimizes use of the GPU's memory bandwidth
 Switches which GPU kernel is invoked according to whether a columnar cache exists

[diagram: a background worker constructs columnar caches from heap blocks with all_visible=true (e.g. BlockNum=102); an UPDATE invalidates the cache; PG-Strom runs the row-format GPU kernel on row data and the column-format GPU kernel on cached column data]
• 28. The PG-Strom Project Towards columnar structure (2/3) – Background

Why the columnar structure is preferable for the GPU architecture:
▌Row data format – random memory access
 Low memory-bus utilization, larger number of memory transactions: only one 32-bit value is useful within a 256-bit memory transaction (usage ratio: 12.5%)
▌Column data format – coalesced memory access
 Optimal memory-bus usage, least number of memory transactions: 32bit x8 = 256bit, so the entire 256-bit memory transaction is useful (usage ratio: 100.0%)
• 29. The PG-Strom Project Towards columnar structure (3/3) – FPGA engine on SSD

FPGA logic on the SSD would let data be written by row, then read as column, bridging the world of OLTP and the world of OLAP: transactional data is written in row format; a Row→Column transformer implemented in the FPGA converts the data format on the SSD itself; analytic reads then fetch pre-processed data in column format, and only the required columns, which is the best layout for SQL execution on the GPU.
• 30. The PG-Strom Project Towards the "true" intelligent storage system

Step-1: A pair of NVMe SSD and GPU on the PCIe bus behaves like an intelligent storage that understands SQL workloads: WHERE, JOIN and GROUP BY run on the GPU over large PostgreSQL tables' data blocks delivered by SSD-to-GPU P2P DMA (NVMe-Strom driver), filtering out unnecessary rows before data reaches main memory, hence a smaller amount of I/O.
Step-2: A special NVMe SSD transforms the data format (PostgreSQL's block format in this case) into one more suitable for processing, based on knowledge of the requester.