Data Lake to AI on GPUs
@blazingdb@blazingdb
CPUs can no longer handle the growing data demands
of data science workloads
Slow Process: Preparing data and training models can take days or even weeks.
Suboptimal Infrastructure: Hundreds to tens of thousands of CPU servers are needed in data centers.
GPUs are well known for accelerating the training of
machine learning and deep learning models.
Deep Learning (Neural Networks)
Machine Learning
Performance improvements increase at scale.
40x improvement over CPU.
But data preparation still happens on CPUs, and can’t
keep up with GPU accelerated machine learning.
• Apache Spark: Query, ETL, ML Train
Enterprise GPU users find it challenging to “Feed the Beast”.
• Apache Spark + GPU ML: Query, ETL, ML Train
An end-to-end analytics solution on GPUs is the only
way to maximize GPU power.
Expertise:
· GPU DBMS
· GPU Columnar Analytics
· Data Lakes
Expertise:
· CUDA
· Machine Learning
· Deep Learning
Expertise:
· Python
· Data Science
· Machine Learning
RAPIDS (All GPU): Query, ETL, ML Train
RAPIDS, the end-to-end GPU analytics ecosystem
cuDF: Data Preparation
cuML: Machine Learning
cuGraph: Graph Analytics
Model Training, Data Preparation, Visualization
A set of open-source libraries for GPU-accelerated data preparation and machine learning.
In GPU Memory
import cudf
from cuml import KNN
import numpy as np

# Three sample points with three dimensions each
np_float = np.array([
    [1, 2, 3],  # Point 1
    [1, 2, 3],  # Point 2
    [1, 2, 3],  # Point 3
]).astype('float32')

# Copy each dimension into a column of a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])
print('n_samples = 3, n_dims = 3')
print(gdf_float)

knn_float = KNN(n_gpus=1)
knn_float.fit(gdf_float)
# Get 3 nearest neighbors
Distance, Index = knn_float.query(gdf_float, k=3)
print(Index)
print(Distance)
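As a rough CPU reference for what a k-nearest-neighbors query returns, the sketch below computes distances and neighbor indices by brute force with NumPy. It is not cuML's implementation, and cuML's distance convention may differ from the plain Euclidean distance assumed here. The helper name `knn_brute_force` and the sample points are hypothetical; the slide's three points are identical, so every distance there would be zero, and different points are used here so the ordering is visible.

```python
import numpy as np


def knn_brute_force(points, queries, k):
    """Return (distances, indices) of the k nearest points for each query row."""
    # Pairwise Euclidean distances, shape (n_queries, n_points).
    diff = queries[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Stable sort so that ties keep their original order.
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx


# Hypothetical sample points (not from the slide).
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 3, 4]], dtype="float32")
distances, indices = knn_brute_force(pts, pts, k=3)
print(indices)
print(distances)
```

Each row of `indices` starts with the query point itself, since a point is always at distance zero from itself when the query set equals the fitted set.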
BlazingSQL: The GPU SQL Engine on RAPIDS
A SQL engine built on RAPIDS.
Query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack.
BlazingSQL, The GPU SQL Engine for RAPIDS
cuDF: Data Preparation
cuML: Machine Learning
cuGraph: Graph Analytics
A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack.
In GPU Memory
from blazingsql import BlazingContext
bc = BlazingContext()

# Register Filesystem
bc.hdfs('data', host='129.13.0.12', port=54310)

# Create Table
bc.create_table('performance',
                file_type='parquet',
                path='hdfs://data/performance/')

# Execute Query (adjacent string literals keep the SQL in one valid string)
result_gdf = bc.run_query(
    'SELECT * FROM performance '
    'WHERE YEAR(maturity_date) > 2005')
print(result_gdf)
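For comparison, the same filter can be written in dataframe form. The sketch below uses pandas on the CPU as a stand-in, since the deck describes RAPIDS as mirroring the Pandas interface; whether each call is available in a given cuDF version is an assumption. The rows and the `loan_id` column are made up for illustration; only `maturity_date` comes from the query.

```python
import pandas as pd

# Made-up rows standing in for the 'performance' table.
performance = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],  # hypothetical column
    "maturity_date": pd.to_datetime(
        ["2004-06-01", "2005-12-01", "2006-01-01", "2031-03-01"]),
})

# Dataframe form of:
#   SELECT * FROM performance WHERE YEAR(maturity_date) > 2005
result = performance[performance["maturity_date"].dt.year > 2005]
print(result)
```

Only the rows whose maturity year is strictly after 2005 survive the filter, matching the `>` comparison in the SQL.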
Getting Started Demo
BlazingSQL + XGBoost Loan Risk Demo
Train a model to assess risk of new mortgage loans based
on Fannie Mae loan performance data
Pipeline: ETL/Feature Engineering, XGBoost Training
Mortgage Data: 4.22M loans, 148M performance records, CSV files on HDFS
GPU cluster: 1 node, 16 vCPUs per node, 1 Tesla T4 GPU (2560 CUDA cores, 16GB VRAM)
CPU cluster: 4 nodes, 8 vCPUs per node, 30GB RAM
RAPIDS + BlazingSQL outperforms traditional
CPU pipelines
[Chart: Demo Timings (ETL Phase). Axis: time in seconds, 0 to 3000. Bars: 3.8GB (1 x T4), 3.8GB (4 Nodes), 15.6GB (1 x T4), 15.6GB (4 Nodes).]
Scale up the data on a DGX
4 x V100 GPUs
BlazingSQL + Graphistry Netflow Analysis
Visually analyze the VAST netflow data set inside Graphistry in order
to quickly detect anomalous events.
ETL, Visualization
Netflow Data: 65M events, 2 weeks, 1,440 devices
Benchmarks
Netflow Demo Timings (ETL Only)
Benefits of BlazingSQL
Stateless and Simple. Stateless underlying services reduce complexity and increase extensibility.
Blazing Fast. Massive time savings with our GPU-accelerated ETL pipeline.
Data Lake to RAPIDS. Query data from data lakes directly with SQL into GPU memory, and let RAPIDS do the rest.
Minimal Code Changes Required. RAPIDS with BlazingSQL mirrors Pandas and SQL interfaces for seamless onboarding.
Upcoming BlazingSQL Releases
V0.1 Query GDFs: Use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the cuDF API.
V0.2 Direct Query Flat Files: Integrate the FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
V0.3 Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.
V0.4 String Support: String support and string operation support.
V0.5 Physical Plan Optimizer: Partition culling for where clauses and joins.
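The roadmap's partition culling for where clauses can be pictured with a small sketch: if each partition records the minimum and maximum of a column, any partition whose range cannot satisfy the predicate is skipped without being read. This is a generic illustration of the idea, not BlazingSQL's optimizer; the `Partition` class, the file names, and the year ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    path: str      # hypothetical file name
    min_year: int  # smallest year stored in this partition
    max_year: int  # largest year stored in this partition


def cull_partitions(partitions, year_gt):
    """Keep only partitions that could contain rows with year > year_gt."""
    return [p for p in partitions if p.max_year > year_gt]


# Made-up partition statistics for illustration.
parts = [
    Partition("part-0.parquet", 1999, 2003),
    Partition("part-1.parquet", 2004, 2005),
    Partition("part-2.parquet", 2005, 2010),
    Partition("part-3.parquet", 2011, 2030),
]

# Predicate: YEAR(maturity_date) > 2005
to_scan = cull_partitions(parts, year_gt=2005)
print([p.path for p in to_scan])
```

Culling is conservative: a partition that only partly overlaps the predicate range is still scanned, and the row-level filter removes the non-matching rows afterwards.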
Get Started
BlazingSQL is quick to get up and running using either
DockerHub or Conda Install:

BlazingSQL + RAPIDS AI at GTC San Jose 2019
