SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes

BIGGER DATA ON GPUS:
APPROACHES, CHALLENGES, SUCCESSES
Jake Wheat
Arnon Shimoni

INTRODUCING SQREAM DB
GPU-ACCELERATED DATA WAREHOUSE
• Queries: 100x faster
• Cost: 10% of resources
• Analyze: 20x more data

FAST TO GET LOTS OF DATA IN
• Use GPU for loading
• 900 GB/s Memory Bandwidth
• Compress all the data
• Collect metadata

FAST TO GET LOTS OF DATA OUT
• Access with easy-to-use SQL
• Support standards like ODBC and JDBC
• 900 GB/s Memory Bandwidth for SQL operations
• Access raw data directly, without cubes, indexes
• SQream DB reads less data from disk, with compression

ARE GPUS INTERESTING FOR RUNNING SQL?
• Can they run SQL?
• Can they run SQL faster?
– If a qualified yes, in what situations?
• Are there other issues to consider?

CAN GPUS RUN SQL?
Example SQL | Physical Operator | Implementation
select a+b, c * 5 from t | select (a.k.a. project/extend/rename) | thrust::transform
select a, count(*), sum(b), avg(b) from t group by a | stream aggregate | thrust::reduce_by_key
select a, b from t where a > 0.5 | filter | thrust::remove_if
select distinct a from t | stream distinct | thrust::unique
select a, b, c, d from t order by a,b | sort | thrust::sort
select * from t union all select * from u | union all | -
select * from t inner join u using (a) | sort merge join (smj) | simple implementation: thrust::upper_bound, thrust::lower_bound, unnest, gather
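
The operator mapping above can be illustrated with a small CPU-only sketch. It uses plain Python stand-ins for the Thrust primitives; the function names and sample data are hypothetical and are not SQream DB's code. Columns are stored as separate lists, as in a columnar engine.

```python
from itertools import groupby

# Columnar table t: one Python list per column (illustrative data only).
t = {
    "a": [1, 1, 2, 2, 2, 3],
    "b": [0.5, 1.5, 2.0, 4.0, 6.0, 8.0],
}

def project(col_x, col_y, fn):
    """select fn(x, y) from t: element-wise map, like thrust::transform."""
    return [fn(x, y) for x, y in zip(col_x, col_y)]

def filter_rows(table, keep):
    """select ... where keep(row): like thrust::remove_if with the negated predicate."""
    names = list(table)
    rows = [r for r in zip(*(table[n] for n in names)) if keep(dict(zip(names, r)))]
    return {n: [r[i] for r in rows] for i, n in enumerate(names)}

def stream_aggregate(keys, values):
    """select a, count(*), sum(b), avg(b) ... group by a: like thrust::reduce_by_key.
    Only adjacent equal keys are merged, so the input must already be sorted by key."""
    out = []
    for k, grp in groupby(zip(keys, values), key=lambda kv: kv[0]):
        vals = [v for _, v in grp]
        out.append((k, len(vals), sum(vals), sum(vals) / len(vals)))
    return out

def stream_distinct(sorted_col):
    """select distinct a: collapse adjacent duplicates, like thrust::unique."""
    return [k for k, _ in groupby(sorted_col)]

sums = project(t["a"], t["b"], lambda a, b: a + b)
filtered = filter_rows(t, lambda row: row["b"] > 1.0)
aggregated = stream_aggregate(t["a"], t["b"])
distinct_a = stream_distinct(t["a"])
```

The "stream" operators rely on sorted input, which is why sort is such a central building block in this design.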

MARKETING HURDLES
• PCI-bottleneck means it will never work
• Columnar databases can't do joins
• GPUs can't accelerate SQL operations
• No-one will put a GPU in a server
• GPUs are not actually faster than CPUs
• A startup cannot make a production ready SQL DBMS

OTHER ISSUES
• Can you make a convincing demo?
• Can you turn it into a real product?
• Can you put GPUs in a data centre?
• Are GPUs a safe bet in the medium/long term?

EARLY RESEARCH
• MonetDB/X100 talk: youtu.be/yrLd-3lnZ58
• Relational Joins on Graphics Processors: www.cse.ust.hk/catalac/papers/gpujoin_sigmod08.pdf
• Relational Query Co-Processing on Graphics Processors: dl.acm.org/citation.cfm?id=1620588
• Several Daniel Abadi papers: www.cs.umd.edu/~abadi/

THE EARLY SQREAM DB PROTOTYPES
• Original brief: OpenCL + Erlang + Haskell streaming IoT = World Domination!
• Generate thrust at query time
• SQL server plugin
• A real (but simple) DBMS with storage

OUR FIRST DBMS
• Run on data on disk
• Create and drop table
• Insert, insert select (and truncate)
• A wide range of queries:
e.g. select lists, joins, where, aggregates, order by, distinct
• Lots of external algorithms
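
An "external" algorithm works on data larger than the available memory by processing it in chunks. A minimal sketch of an external sort, assuming in-memory lists stand in for chunks read from disk and for the sorted runs that would normally be spilled back to disk (the function name is hypothetical):

```python
import heapq

def external_sort(chunks):
    """Sort data that arrives as a sequence of chunks.

    Phase 1: sort each chunk on its own (each chunk fits in memory).
    Phase 2: stream a k-way merge over the sorted runs.
    """
    runs = [sorted(chunk) for chunk in chunks]
    # heapq.merge is lazy: it keeps only one element per run in memory.
    return heapq.merge(*runs)

result = list(external_sort([[5, 1], [4, 2, 9], [3]]))
```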

WHY NOT POSTGRES?
Some downsides to Postgres
• Not columnar: neither the engine nor the storage
• No threads, not distributed
• A big complex system
Some non-benefits:
• Parsing, syntax, and similar - Haskell makes this easy
• The storage and execution engine – very row based
Some things we miss:
• Wide range of features, data types, operations
• Extensibility
• Cost based optimiser
• Protocol/client compatibility

STEPS TOWARDS TODAY'S PRODUCT
Haskell Compiler: Parse SQL -> Desugar to Relational Algebra -> Optimize -> Desugar to Statement Plan
Network Server
Runtime
Metadata Database
Columnar Storage
Tree Interpreter, Building Blocks, I/O, Task Runner
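
The compiler stages named on this slide (relational algebra, optimize, statement plan) can be sketched with a toy example. This is an illustration only: the node types, the single optimization rule (merging adjacent filters), and the plan format are all assumptions, the SQL parsing stage is omitted, and Python is used here although the real compiler is written in Haskell.

```python
from dataclasses import dataclass

# Toy relational algebra nodes (hypothetical, for illustration).
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    predicate: str
    child: object

@dataclass(frozen=True)
class Project:
    exprs: tuple
    child: object

def optimize(node):
    """One example rewrite: merge a filter directly above another filter."""
    if isinstance(node, Scan):
        return node
    child = optimize(node.child)
    if isinstance(node, Filter):
        if isinstance(child, Filter):
            merged = f"({child.predicate}) and ({node.predicate})"
            return Filter(merged, child.child)
        return Filter(node.predicate, child)
    return Project(node.exprs, child)

def to_statement_plan(node):
    """Lower the tree to a flat list of steps, children first."""
    if isinstance(node, Scan):
        return [f"read {node.table}"]
    steps = to_statement_plan(node.child)
    if isinstance(node, Filter):
        return steps + [f"filter {node.predicate}"]
    return steps + [f"project {', '.join(node.exprs)}"]

# select a+b from t where b < 10 and a > 0.5, written as two stacked filters.
tree = Project(("a+b",), Filter("a > 0.5", Filter("b < 10", Scan("t"))))
plan = to_statement_plan(optimize(tree))
```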

SQREAM DB ARCHITECTURE
Statement Compiler: SQL Parser, Desugar & Optimize, Relational Algebra, Desugar & Optimize, Low-level stages, Desugar & Optimize
Execution Engine: Statement Tree Interpreter, Task Runners (I/O, CPU, GPU), Building blocks
Storage Layer: Metadata Database + Low-level transactions (server or in-process), Bulk Data Layer (Extent, Extent, Extent, …), Storage Reorganizer
Tasks, Queue & Thread Manager
Profiling Support
Memory Managers: Small Memory Managers, Chunk Memory Managers, Spool Memory Managers
Connection & Session Manager
Concurrency & Admission Control
Linux FS Cache Prodder

SOME ARCHITECTURE DETAILS
• Haskell has the intelligence
• C++/CUDA does the heavy lifting
• Message passing, worker pools
• Bulk data memory centric
• Storage is append-only with background reorganization

STORAGE AND TRANSACTIONS
• Metadata database with relatively conventional transactions
• Append only storage layer with background reorganization
Transactions
• Serializable, with any kind of statement
• Run multiple queries concurrently with anything
• Run multiple inserts to the same table at the same time
• Cannot run multiple statements in a single transaction
• Other operations such as delete, truncate, and DDL use coarse-grained exclusive locking
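
One way to picture how append-only bulk storage and a small metadata database fit together is sketched below. It is an assumption-laden toy, not SQream DB's implementation: the class and method names are hypothetical, extents are in-memory tuples, and "committing" is a single metadata update.

```python
class AppendOnlyTable:
    """Toy append-only table: bulk data is never modified in place."""

    def __init__(self):
        self._extents = []   # bulk data layer: immutable extents
        self._visible = ()   # metadata: ids of committed extents

    def insert(self, rows):
        # Write the bulk data first, then make it visible with one metadata update.
        self._extents.append(tuple(rows))
        extent_id = len(self._extents) - 1
        self._visible = self._visible + (extent_id,)
        return extent_id

    def snapshot(self):
        # A reader captures the committed extent list when it starts.
        return self._visible

    def scan(self, snapshot=None):
        ids = self._visible if snapshot is None else snapshot
        return [row for i in ids for row in self._extents[i]]

    def reorganize(self):
        # Background reorganization: merge visible extents into one new extent.
        # Old extents stay untouched, so readers holding old snapshots still work.
        merged = tuple(row for i in self._visible for row in self._extents[i])
        self._extents.append(merged)
        self._visible = (len(self._extents) - 1,)

table = AppendOnlyTable()
table.insert([1, 2])
old_view = table.snapshot()
table.insert([3])
table.reorganize()
```

Because inserts only append and readers only follow a captured extent list, queries and inserts do not block each other in this sketch; operations that remove data would need something stronger, such as the exclusive locking mentioned above.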

USING GPUS EFFECTIVELY
• Good kernels
• Optimise around GPU memory
• Use large chunks, rechunk where necessary
• Avoid PCI transfers where possible
• Profiling
• Partitioning
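
The "use large chunks, rechunk where necessary" point can be sketched as a generator that regroups variable-sized input chunks into chunks of a target size. The function name and the row-count target are hypothetical; a real engine would size chunks by bytes and by available GPU memory.

```python
def rechunk(chunks, target_rows):
    """Regroup an iterable of row lists into chunks of exactly target_rows rows.

    The final chunk may be smaller if the input does not divide evenly.
    """
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        yield buffer

# After a selective filter, chunks are often small and uneven.
uneven = [[1, 2], [3], [4, 5, 6, 7], [8]]
regrouped = list(rechunk(uneven, 3))
```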

VECTORED BINARY SEARCH
[Figure: vectored binary search of Table A keys against Table B keys]
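
A CPU-only sketch of a join built on vectored binary search follows. For every key in table A it finds the lower and upper bound in the sorted keys of table B, then expands each range into matching row pairs and gathers the payload columns. The slides do not define "unnest"; treating it as this range expansion is an assumption, and all function names and sample data here are hypothetical.

```python
from bisect import bisect_left, bisect_right

def vectored_search_join(a_keys, b_keys):
    """Inner join on key. b_keys must be sorted. Returns (a_index, b_index) pairs."""
    # Vectored binary search: one independent search per row of A.
    lower = [bisect_left(b_keys, k) for k in a_keys]
    upper = [bisect_right(b_keys, k) for k in a_keys]
    # "Unnest": expand each [lower, upper) range into explicit index pairs.
    return [(i, j)
            for i, (lo, hi) in enumerate(zip(lower, upper))
            for j in range(lo, hi)]

def gather(column, indices):
    """Pick column values at the given row indices."""
    return [column[i] for i in indices]

a_keys = [0, 2, 3, 5]
a_vals = ["w", "x", "y", "z"]
b_keys = [0, 0, 1, 2, 2, 3]          # sorted
b_vals = [10, 11, 12, 13, 14, 15]

pairs = vectored_search_join(a_keys, b_keys)
joined = list(zip(gather(a_keys, [i for i, _ in pairs]),
                  gather(a_vals, [i for i, _ in pairs]),
                  gather(b_vals, [j for _, j in pairs])))
```

Each search is independent of the others, which is what makes this step a good fit for thousands of GPU threads.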

HASH JOINS
• Can hashing run fast on the GPU?
• Answer from NVIDIA experts:
– in principle probably yes
– in practice, difficult to compete with sort-based algorithms

COMPRESSION
• GPU compression for typical columnar data
– e.g. Dictionary, RLE, Delta, Pfor + Combos
– Helps speed up IO and PCI transfer times
– In-house code
• CPU compression for general data
– Helps speed up IO, but not PCI transfer times
– We use things like Snappy and LZ4

SOME FINAL THOUGHTS
• SQL analytics and GPUs are a natural fit
• GPUs can be very effective for big data/external algorithms
• Lots of exciting work being done in non-SQL analytics (not just on GPUs)
• Haskell is a big positive
• Building a commercial SQL DBMS is very difficult
• Building a SQL DBMS is a really satisfying thing to do

HIGH THROUGHPUT, CONVERGED
• SQream DB is designed for high-throughput devices
• IBM Power Systems is the only NVLink CPU-to-GPU enabled architecture, unlocking the potential of high-throughput accelerated computing
• The IBM AC922, with POWER9 and NVLink, can transfer data at up to 300GB/s, almost 9.5x faster than the PCIe 3.0 found in x86-based architectures, reducing classic I/O bottlenecks
[Diagram: 2x IBM Power9, each with 2x NVIDIA Tesla V100]

HIGH THROUGHPUT ARCHITECTURE
IT’S NOT JUST CORES
[Diagram: two Power9 CPUs linked by the IBM SMP bus. Each CPU has its own RAM (170GB/s per CPU) and connects over NVLink (300GB/s BiDi) to two Tesla V100 GPUs, each with its own VRAM (900GB/s).]
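
The bandwidth figures on these slides can be turned into a quick back-of-the-envelope calculation. The NVLink and VRAM numbers come from the slides; the PCIe 3.0 x16 figure (about 15.75 GB/s per direction, 31.5 GB/s in both directions) is an assumption added here. All values are theoretical peaks, so real transfers will be slower.

```python
NVLINK_GBPS = 300.0      # from the slide: NVLink, bidirectional
VRAM_GBPS = 900.0        # from the slide: GPU memory bandwidth
PCIE3_X16_GBPS = 31.5    # assumption: PCIe 3.0 x16, both directions combined

def transfer_seconds(gigabytes, gb_per_second):
    """Ideal time to move a given amount of data over a link."""
    return gigabytes / gb_per_second

nvlink_vs_pcie = NVLINK_GBPS / PCIE3_X16_GBPS

# Moving 16 GB (the memory of one Tesla V100) over each path.
for name, rate in [("PCIe 3.0 x16", PCIE3_X16_GBPS),
                   ("NVLink", NVLINK_GBPS),
                   ("VRAM", VRAM_GBPS)]:
    print(f"{name}: {transfer_seconds(16, rate) * 1000:.0f} ms")
print(f"NVLink vs PCIe 3.0: {nvlink_vs_pcie:.1f}x")
```

Under these assumptions the ratio matches the "almost 9.5x" figure quoted earlier, and it shows why the host-to-GPU link, not GPU memory, tends to be the narrowest part of the path.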

UP TO 3.7X FASTER QUERIES
SQream DB performance: IBM Power9 vs Intel Xeon (Skylake)
Query time (seconds). Lower is better.
Query | Dell PowerEdge R740 | IBM Power9 AC922
TPC-H Query 8 | 52.83 | 14.06
TPC-H Query 6 | 10.35 | 2.8
TPC-H Query 19 | 84.5 | 30.29
TPC-H Query 17 | 78.57 | 29.01
IBM Power9 AC922:
2x POWER9 16C @ 3.8GHz | 256 GB DDR4 2666 MHz | SSD storage | 4x NVIDIA Tesla V100 (SXM2 NVLINK - 16GB)
Dell PowerEdge R740:
2x Intel Xeon Silver 4112 CPU @ 2.60GHz | 256GB DDR4 2666MHz | SSD storage | 4x NVIDIA Tesla V100 (PCIe - 16GB)
• In our testing, SQream DB on Power9 is between 150% and 370% faster than comparable x86 architectures, especially on large data sets. For example, in the TPC-H (SF 10,000) dataset, Query 8 ran in a quarter of the time on the IBM Power9, compared to the x86 competitor.
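
The headline speedup can be checked from the chart values. The sketch below assumes the first four numbers on the chart belong to the Dell PowerEdge R740 and the last four to the IBM Power9 AC922, in the query order shown on the axis; that pairing is inferred from the chart legend rather than stated outright.

```python
# (x86 seconds, Power9 seconds) per query, as read from the chart.
timings = {
    "TPC-H Query 8": (52.83, 14.06),
    "TPC-H Query 6": (10.35, 2.8),
    "TPC-H Query 19": (84.5, 30.29),
    "TPC-H Query 17": (78.57, 29.01),
}

# Speedup = x86 time divided by Power9 time.
speedup = {query: x86 / p9 for query, (x86, p9) in timings.items()}

for query, factor in speedup.items():
    print(f"{query}: {factor:.2f}x faster on Power9")

# Fraction of the x86 time that Query 8 needs on Power9.
q8_fraction = timings["TPC-H Query 8"][1] / timings["TPC-H Query 8"][0]
```

With this pairing, Query 8 and Query 6 come out at roughly 3.7x, and Query 8 takes a little over a quarter of the x86 time, in line with the slide's claims.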

UNDERSTAND 40 MILLION CUSTOMERS
TELECOM
• HP DL380g9 with NVIDIA Tesla GPU, 96 GB RAM + 6 TB storage: $200K
• 80 nodes, 5 full racks, 7600 CPU cores: $10,000,000
[Chart: Ingest time, Reporting time, Ownership Cost. Values shown: 20M, 10M, 300M, 120M]

ARCHITECTURE BEFORE SQREAM DB
[Diagram: 3G, 4G, CDRs and other sources -> ETL (1-2 hours) -> Greenplum (GP): daily aggregations, profiles -> GP daily report (3 hours max) -> daily reports #1 #2 #3 #4 ... #31 -> monthly reports #1, #2 ... #N (7 days). Other labels: 5hr, 3hr, 0.5hr; Billing; Pre-aggregations]

SIMPLIFIED WITH SQREAM DB
[Diagram: 3G, 4G, CDRs and other sources -> daily reports #1 #2 #3 #4 ... #31 -> monthly reports #1, #2 ... #N (1 day). Other labels: 10m, 4m, 2m]
