RAPIDS: GPU-ACCELERATED ETL AND FEATURE ENGINEERING
Keith Kraus, 18-10-2018
2
REALITIES OF DATA
3
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
4
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
5
NEED MORE SPEED
CPU Performance Has Plateaued
[Chart: transistors (thousands) and single-threaded performance, 1980 to 2020, log scale from 10^2 to 10^7. Single-threaded performance grew 1.5X per year, then slowed to 1.1X per year.]
6
WE NEED MORE COMPUTE!
Basic workloads are bottlenecked by the CPU
Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR
• In a simple benchmark consisting of
aggregating data, the CPU is the
bottleneck
• This is after the data is parsed and
cached into memory which is
another common bottleneck
• The CPU bottleneck is even worse
in more complex workloads!
SELECT cab_type, count(*) FROM
trips_orc GROUP BY cab_type;
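For readers who do not work in SQL, the aggregation in the query above can be sketched in plain Python. The rows below are made-up sample data, not the taxi dataset, and the sketch only shows what the query computes, not how the benchmark ran it.

```python
from collections import Counter

# Hypothetical sample rows standing in for the trips_orc table.
trips = [
    {"cab_type": "yellow"},
    {"cab_type": "green"},
    {"cab_type": "yellow"},
    {"cab_type": "yellow"},
]

# Equivalent of:
#   SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
counts = Counter(row["cab_type"] for row in trips)

for cab_type, n in sorted(counts.items()):
    print(cab_type, n)
```

Even this simple count touches every row once, which is why the slide treats it as a CPU-bound workload once the data is already in memory.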
7
HOW CAN WE DO BETTER?
• Focus on the full Data Science workflow
• Data Loading
• Data Transformation
• Data Analytics
• Python
• Provide as close to a drop-in replacement for existing tools as possible
• Performance - Leverage GPUs
8
NEW BEGINNINGS
GPU Performance Grows
[Chart: performance from 1980 to 2020, log scale from 10^2 to 10^7. Single-threaded perf: 1.5X per year, then 1.1X per year. GPU-computing perf: 1.5X per year, 1000X by 2025.]
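The growth rates quoted on this slide can be related to the 1000X figure with a little arithmetic. The sketch below assumes simple compound growth at a constant yearly factor, which is a simplification of whatever data sits behind the chart.

```python
import math

gpu_rate = 1.5   # yearly growth factor quoted for GPU-computing perf
cpu_rate = 1.1   # yearly growth factor quoted for recent single-threaded perf
target = 1000.0  # the "1000X" figure on the slide

# Years of compounding at gpu_rate needed to reach the target.
years_needed = math.log(target) / math.log(gpu_rate)

# Gain from compounding at cpu_rate over the same number of years.
cpu_gain_same_period = cpu_rate ** years_needed

print(f"years to reach {target:.0f}X at {gpu_rate}X/year: {years_needed:.1f}")
print(f"gain at {cpu_rate}X/year over the same period: {cpu_gain_same_period:.1f}X")
```

Under that assumption, 1000X takes roughly 17 years at 1.5X per year, while 1.1X per year yields only about a 5X gain over the same span.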
9
GPU ADOPTION BARRIERS
Yes, GPUs are fast, but …
• Too much data movement
• Too many makeshift data formats
• Writing CUDA C/C++ is hard
• No Python API for data manipulation
10
DATA MOVEMENT AND TRANSFORMATION
The bane of productivity and performance
[Diagram, CPU and GPU sides: data is read (Read Data) and loaded onto the GPU (Load Data); APP A and APP B each hold their own GPU Data, and exchanging it takes repeated Copy & Convert steps.]
11
DATA MOVEMENT AND TRANSFORMATION
What if we could keep data on the GPU?
[Diagram, CPU and GPU sides: Read Data, Load Data, APP A and APP B each holding GPU Data, and the Copy & Convert steps between them.]
12
DATA FORMATS
Plain Text vs Binary, Compressed vs Uncompressed
• Avro
• XML
• JSON
• GML
• ProtoBuf
• HDFS
• Pickle
• CSV
• Parquet
• Pandas
• Numpy
• CSR
• COO
• CSC
* Not a complete list
13
LEARNING FROM APACHE ARROW
From Apache Arrow Home Page - https://arrow.apache.org/
14
RAPIDS OPEN SOURCE SOFTWARE
Data Preparation → Model Training → Visualization
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• PyTorch & Chainer: Deep Learning
• Kepler.GL: Visualization
All components share GPU Memory
15
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
GPU/Spark In-Memory Processing:
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
• 5-10x Improvement
• More code
• Language rigid
• Substantially on GPU
RAPIDS:
Arrow Read → Query → ETL → ML Train
• 50-100x Improvement
• Same code
• Language flexible
• Primarily on GPU
16
RAPIDS
Rapid Accelerated Platform for Integrating Data Science
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE
• Learn what the data science community needs
• Use best practices and standards
• Build scalable systems and algorithms
• Test Applications and workflows
• Iterate
17
RAPIDS
How can I download and use RAPIDS?
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://hub.docker.com/r/rapidsai/rapidsai/
• https://github.com/rapidsai
• WIP: https://anaconda.org/rapidsai/
• WIP:
• https://pypi.org/project/cudf
• https://pypi.org/project/cuml
18
AI LIBRARIES
cuML & cuGraph
Accelerating more of the AI ecosystem
Graph Analytics is fundamental to network analysis.
Machine Learning is fundamental to prediction, classification, clustering, anomaly detection, and recommendations.
Both can be accelerated with NVIDIA GPUs.
8x V100: 20-90x faster than dual-socket CPU
XGBoost, Criteo Dataset: 90x (3 hours to 2 mins on 1 DGX-1)
Machine Learning
• Decision Trees
• Random Forests
• Linear Regression
• Logistic Regression
• K-Means
• K-Nearest Neighbor
• DBSCAN
• Kalman Filtering
• Principal Components
• Singular Value Decomposition
• Bayesian Inferencing
Graph Analytics
• PageRank
• BFS
• Jaccard Similarity
• Single Source Shortest Path
• Triangle Counting
• Louvain Modularity
Time Series
• ARIMA
• Holt-Winters
19
CUDF + XGBOOST
DGX-2 vs Scale Out CPU Cluster
• Full end-to-end pipeline
• Leveraging Dask + PyGDF
• Store each GPU's results in system memory, then read them back in
• Arrow to DMatrix (CSR) for XGBoost
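The last bullet refers to CSR (compressed sparse row), the sparse layout handed to XGBoost here. The sketch below shows only what that layout looks like, using plain Python lists; `dense_to_csr` is a hypothetical helper for illustration and is not the conversion path used in the pipeline.

```python
def dense_to_csr(rows):
    """Convert a dense row-major matrix (list of lists) into CSR arrays.

    Returns (data, indices, indptr):
      data    - the non-zero values, row by row
      indices - the column index of each value in data
      indptr  - row i occupies data[indptr[i]:indptr[i + 1]]
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # Close the current row, even if it had no non-zero values.
        indptr.append(len(data))
    return data, indices, indptr


# Made-up 3x3 matrix with an empty middle row.
dense = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
```

Only the non-zero entries are stored, which is what makes the format attractive for wide, sparse feature matrices.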
20
CUDF + XGBOOST
Scale Out GPU Cluster vs DGX-2
[Bar chart: time in seconds (0 to 350) for 5x DGX-1 vs DGX-2, broken down into ETL+CSV (s), ML Prep (s), and ML (s)]
• Full end-to-end pipeline
• Leveraging Dask for multi-node + PyGDF
• Store each GPU's results in system memory, then read them back in
• Arrow to DMatrix (CSR) for XGBoost
21
CUGRAPH
GPU-Accelerated Graph Analytics Library
22
CUDF
Data Preparation → Model Training → Visualization
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• PyTorch & Chainer: Deep Learning
• Kepler.GL: Visualization
All components share GPU Memory
23
GPU-ACCELERATED ETL
Is GPU-acceleration really needed?
24
GPU-ACCELERATED ETL
The average data scientist spends 90+% of their time in ETL as opposed
to training models
25
CUDF
GPU DataFrame library
• Apache Arrow data format
• Pandas-like API
• Unary and Binary Operations
• Joins / Merges
• GroupBys
• Filters
• User-Defined Functions (UDFs)
• Accelerated file readers
• Etc.
26
CUDF
Today
LibGDF
• Low level library containing function implementations and C/C++ API
• Importing/exporting a GDF using the CUDA IPC mechanism
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, and reduction operations on GPU DataFrames
PyGDF
• A Python library for manipulating GPU DataFrames
• Python interface to LibGDF library with additional functionality
• Creating GDFs from Numpy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions (UDFs) using Numba
27
CUDF
Refactor in Progress
[Diagram: PyGDF + LibGDF → cuDF]
cuDF
• Single repository containing both the low level implementation and high level wrappers and APIs
• Future high level language bindings based on community demand, feedback, and contributions
• Moving from CFFI to Cython for Python bindings to better integrate into the PyData community
28
PANDAS-LIKE API
Python GPU DataFrame library
29
PANDAS-LIKE API
Pandas ↔ PyGDF
30
PANDAS-LIKE API
Built-In Functions
31
DEMO
32
DASK
What is Dask and why does RAPIDS use it for scaling out?
• Dask is a distributed computation scheduler
built to scale Python workloads from laptops to
supercomputer clusters.
• Extremely modular: scheduling, compute, data transfer, and out-of-core handling are all decoupled, allowing us to plug in our own implementations.
• Can easily run multiple Dask workers per node
to allow for an easier development model of
one worker per GPU regardless of single node
or multi node environment.
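The one-worker-per-GPU model in the last bullet can be sketched as follows. This assumes each worker process is pinned to a device through the `CUDA_VISIBLE_DEVICES` environment variable, which the slides do not state; `worker_environments` is a hypothetical helper, the GPU count is made up, and actually launching the workers is left out.

```python
import os


def worker_environments(num_gpus):
    """Build one environment mapping per worker, each exposing a single GPU.

    Assumption: a worker started with CUDA_VISIBLE_DEVICES set to one
    device index will only see that GPU.
    """
    envs = []
    for gpu_id in range(num_gpus):
        env = dict(os.environ)                     # copy the parent environment
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this worker to one GPU
        envs.append(env)
    return envs


# 4 GPUs is a hypothetical node size used only for illustration.
envs = worker_environments(4)
visible = [env["CUDA_VISIBLE_DEVICES"] for env in envs]
print(visible)
```

Because each worker only ever sees one device, the same worker code can run unchanged on a single node or across many nodes.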
33
DASK
Scale up and out with cuDF
• Use cuDF primitives underneath in map-reduce style
operations with the same high level API
• Instead of using typical Dask data movement of
pickling objects and sending via TCP sockets, take
advantage of hardware advancements using a
communications framework called OpenUCX:
• For intranode data movement, utilize NVLink
and PCIe peer-to-peer communications
• For internode data movement, utilize GPU RDMA over InfiniBand and RoCE
https://github.com/rapidsai/dask_gdf
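The map-reduce style mentioned in the first bullet can be sketched with a group-by sum over partitions. This is a plain-Python illustration with made-up data; in the setup described above each partition would be a GPU DataFrame handled by its own worker, and `partial_sum` and `combine` are hypothetical names.

```python
from collections import Counter
from functools import reduce


def partial_sum(partition):
    """Map step: group-by sum within a single partition."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def combine(left, right):
    """Reduce step: merge two partial results by adding matching keys."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds values for existing keys
    return merged


# Made-up (key, value) rows split across three partitions.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]

partials = [partial_sum(p) for p in partitions]  # could run one per worker
totals = reduce(combine, partials, Counter())
print(dict(totals))
```

Only the small partial results need to move between workers, which is where the faster transports listed above would come into play.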
34
DASK
Scale up and out with cuML
• Native integration with Dask + cuDF
• Can easily use Dask workers to initialize NCCL for
optimized gather / scatter operations
• Example: this is how the dask-xgboost included in the container works for multi-GPU and multi-node, multi-GPU
• Provides easy to use, high level primitives for
synchronization of workers which is needed for many
ML algorithms
35
LOOKING TO THE
FUTURE
36
GPU DATAFRAME
Next few months
• Continue improving performance and functionality
• Single GPU
• Single node, multi GPU
• Multi node, multi GPU
• String Support
• Support for specific “string” dtype with GPU-accelerated functionality similar to Pandas
• Accelerated Data Loading
• File formats: CSV, Parquet, ORC – to start
37
CUSTRING
GPU-Accelerated string functions with a Pandas-like API
• API and functionality follow Pandas: https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
• lower()
• ~22x speedup
• find()
• ~40x speedup
• slice()
• ~100x speedup
[Bar chart: time in milliseconds (0 to 800) for lower(), find(#), and slice(1,15), Pandas vs cudastrings]
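For reference, the three benchmarked operations behave as follows on ordinary Python strings. This is only a semantic sketch on made-up values using built-in `str` methods; it does not call the GPU string library or reproduce the benchmark.

```python
# Made-up sample column of strings.
values = ["Hello#World", "RAPIDS", "GPU#DataFrame"]

# lower(): lowercase every element.
lowered = [s.lower() for s in values]

# find("#"): index of the first "#", or -1 when it is absent.
positions = [s.find("#") for s in values]

# slice(1, 15): characters from position 1 up to (not including) 15.
sliced = [s[1:15] for s in values]

print(lowered)
print(positions)
print(sliced)
```

Each operation is applied independently to every element, which is the kind of work that parallelizes well across GPU threads.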
38
ACCELERATED DATA LOADING
CPUs bottleneck data loading in high throughput systems
• CSV Reader
• Follows API of pandas.read_csv
• Current implementation is >10x faster than pandas
• Parquet Reader
• Work in progress: https://github.com/gpuopenanalytics/libgdf/pull/85
• Will follow API of pandas.read_parquet
• ORC Reader
• Additionally looking towards GPU-accelerating decompression for common compression schemes
Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
39
PYTHON CUDA ARRAY INTERFACE
Interoperability for Python GPU Array Libraries
• The CUDA array interface is a standard format
that describes a GPU array to allow sharing
GPU arrays between different libraries without
needing to copy or convert data
• Numba, CuPy, and PyTorch are the first libraries to adopt the interface:
• https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
• https://github.com/cupy/cupy/releases/tag/v5.0.0b4
• https://github.com/pytorch/pytorch/pull/11984
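The idea of describing a GPU array so that another library can use it without copying can be sketched as below. `FakeDeviceArray` and `describe` are hypothetical names, the pointer is a placeholder, and no GPU memory is involved; the field names (`shape`, `typestr`, `data`, `version`) follow my reading of the Numba documentation linked above, and the version number is illustrative only.

```python
class FakeDeviceArray:
    """Host-only stand-in that exposes a __cuda_array_interface__ dict.

    A real producer would report the address of actual device memory;
    here the pointer is a placeholder and nothing is allocated.
    """

    def __init__(self, shape, typestr, pointer):
        self._shape = shape
        self._typestr = typestr
        self._pointer = pointer

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._shape,            # tuple of dimension sizes
            "typestr": self._typestr,        # e.g. "<f4" = little-endian float32
            "data": (self._pointer, False),  # (device pointer, read-only flag)
            "version": 2,                    # illustrative version number
        }


def describe(obj):
    """Consumer side: read the description without touching the data."""
    iface = obj.__cuda_array_interface__
    itemsize = int(iface["typestr"][2:])  # byte width taken from the typestr
    count = 1
    for dim in iface["shape"]:
        count *= dim
    pointer, read_only = iface["data"]
    return {"nbytes": count * itemsize, "pointer": pointer, "read_only": read_only}


arr = FakeDeviceArray(shape=(4, 8), typestr="<f4", pointer=0)
info = describe(arr)
print(info)
```

The consumer learns the shape, element type, and location of the buffer from the dictionary alone, which is what lets libraries share one allocation instead of copying or converting it.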
40
JOIN THE REVOLUTION
Everyone Can Help!
Integrations, feedback, documentation support, pull requests, new issues, or donations welcome!
APACHE ARROW
https://arrow.apache.org/
@ApacheArrow
GPU Open Analytics Initiative
http://gpuopenanalytics.com/
@GPUOAI
RAPIDS
https://rapids.ai
@RAPIDSAI
41
WE’RE HIRING
Help us build the future!
• Junior/Mid/Senior Data Scientists
• Junior/Mid/Senior Data Engineers
• CUDA
• Internships
THANK YOU
Keith Kraus @keithjkraus
