Pavel Klemenkov, Chief Data Scientist @ NVIDIA
RAPIDS: SPEEDING UP
PANDAS AND SCIKIT-LEARN
2
TYPICAL DS PIPELINE
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Can we test more hypotheses per unit of time?
Hyperparameter optimization
4
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack Python
CUDA
PYTHON
APACHE ARROW on GPU Memory
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
CUDF CUML CUGRAPH
5
GETTING STARTED
rapids.ai getting started
10 minutes to cuDF
6
“GROUP BY” BENCHMARK
7
import numpy as np
import pandas as pd
import cudf

def randChar(f, numGrp, N):
    # Draw N string keys uniformly from numGrp distinct formatted values
    things = [f.format(x) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    # Draw N floats uniformly from numGrp distinct random values
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)  # number of rows
K = 100       # number of large groups

pdf = pd.DataFrame({
    'id1': randChar("id{0:0=3d}", K, N),       # large groups (char)
    'id2': randChar("id{0:0=3d}", K, N),       # large groups (char)
    'id3': randChar("id{0:0=3d}", N // K, N),  # small groups (char)
    'id4': np.random.choice(K, N),             # large groups (int)
    'id5': np.random.choice(K, N),             # large groups (int)
    'id6': np.random.choice(N // K, N),        # small groups (int)
    'v1': np.random.choice(5, N),              # int in range [0, 4]
    'v2': np.random.choice(5, N),              # int in range [0, 4]
    'v3': randFloat(100, N)                    # numeric e.g. 23.5749
})

# Copy the pandas DataFrame into GPU memory
cdf = cudf.DataFrame.from_pandas(pdf)
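To make the group sizes concrete, here is a small sketch in plain Python (no GPU needed). It shows what the key format string produces and how many groups, and roughly how many rows per group, the "large" and "small" id columns have. The helper name `group_stats` is an assumption made for illustration and is not part of the benchmark notebook.

```python
# Illustrative sketch; group_stats is a hypothetical helper, not from the notebook.
N = int(1e7)  # number of rows, as on the slide
K = 100       # number of large groups, as on the slide

def group_stats(num_groups, num_rows):
    """Return (number of groups, expected rows per group) under uniform sampling."""
    return num_groups, num_rows // num_groups

# The format spec "0=3d" zero-pads the integer to a width of 3 characters.
example_key = "id{0:0=3d}".format(7)

large = group_stats(K, N)       # id1, id2, id4, id5
small = group_stats(N // K, N)  # id3, id6

print(example_key)  # id007
print(large)        # (100, 100000)
print(small)        # (100000, 100)
```

So the "large groups" columns have 100 groups of about 1e5 rows each, while the "small groups" columns have 1e5 groups of about 100 rows each.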
8
BENCHMARK #1
%%timeit -r 3 -n 3
# pandas (CPU)
pdf.groupby(['id1']).agg({'v1':'sum'})
776 ms ± 4.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
# cuDF (GPU)
cdf.groupby(['id1']).agg({'v1':'sum'})
21.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Small number of large groups
9
BENCHMARK #2
%%timeit -r 3 -n 3
# pandas (CPU)
pdf.groupby(['id1','id2']).agg({'v1':'sum'})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
# cuDF (GPU)
cdf.groupby(['id1','id2']).agg({'v1':'sum'})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Multiple groups
10
BENCHMARK #3
%%timeit -r 3 -n 3
# pandas (CPU)
pdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})
1.36 s ± 21.9 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
# cuDF (GPU)
cdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})
53 ms ± 2.42 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Large number (1e5) of small groups, multiple aggregates
GroupBy benchmark notebook
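The three benchmarks can be summarized as speedup factors. The sketch below only does arithmetic on the timings reported on the slides; the dictionary layout and the `speedup` helper are illustrative assumptions.

```python
# Mean timings copied from the three benchmark slides, in milliseconds.
timings_ms = {
    "benchmark 1 (id1)": {"pandas": 776.0, "cudf": 21.5},
    "benchmark 2 (id1, id2)": {"pandas": 1790.0, "cudf": 37.5},
    "benchmark 3 (id3)": {"pandas": 1360.0, "cudf": 53.0},
}

def speedup(cpu_ms, gpu_ms):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_ms / gpu_ms

speedups = {
    name: speedup(t["pandas"], t["cudf"]) for name, t in timings_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# benchmark 1 (id1): 36.1x
# benchmark 2 (id1, id2): 47.7x
# benchmark 3 (id3): 25.7x
```

On this data cuDF is roughly 25x to 50x faster than pandas, with the smallest gain for the case with many small groups.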
11
WAIT A MINUTE…
• Pandas is single-threaded, but there is Dask
• cuDF is a single GPU solution
12
WAIT A MINUTE…
import dask.dataframe
# Split the pandas DataFrame into 8 partitions that Dask can process in parallel
ddf = dask.dataframe.from_pandas(pdf, npartitions=8)
%%timeit -r 3 -n 3
# pandas (CPU, single thread)
pdf.groupby(['id1','id2']).agg({'v1':'sum'})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
# Dask DataFrame (CPU, 8 partitions)
ddf.groupby(["id1", "id2"]).agg({'v1': 'sum'}).compute()
1.34 s ± 33.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
# cuDF (single GPU)
cdf.groupby(['id1','id2']).agg({'v1':'sum'})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
DASK DataFrame execution
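With 8 partitions Dask is only about 1.3x faster than pandas here (1.79 s vs 1.34 s). One way to think about a partitioned group-by sum is split, aggregate each partition, then combine the partial results. The sketch below illustrates that pattern in plain Python; it is a conceptual model under that assumption, not Dask's actual implementation, and the function names are hypothetical.

```python
from collections import Counter

def partial_sums(rows):
    """Aggregate one partition: sum v1 per (id1, id2) key."""
    acc = Counter()
    for id1, id2, v1 in rows:
        acc[(id1, id2)] += v1
    return acc

def partitioned_groupby_sum(rows, npartitions):
    """Split rows into chunks, aggregate each chunk, then merge the partials."""
    size = max(1, -(-len(rows) // npartitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    partials = [partial_sums(chunk) for chunk in chunks]  # parallelizable step
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)

rows = [
    ("id001", "id002", 3),
    ("id001", "id002", 1),
    ("id003", "id001", 4),
    ("id001", "id002", 2),
    ("id003", "id001", 0),
]
result = partitioned_groupby_sum(rows, npartitions=2)
print(result)
# {('id001', 'id002'): 6, ('id003', 'id001'): 4}
```

The combine step is sequential and the partials have to be moved between workers, which is one reason why splitting into partitions does not automatically give a proportional speedup.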
13
14
CUML
15
| Category | Algorithm | Notes |
| --- | --- | --- |
| Clustering | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | |
| | K-Means | Multi-node multi-GPU via Dask |
| Dimensionality Reduction | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask |
| | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask |
| | Uniform Manifold Approximation and Projection (UMAP) | |
| | Random Projection | |
| | t-Distributed Stochastic Neighbor Embedding (TSNE) | |
| Linear Models for Regression or Classification | Linear Regression (OLS) | |
| | Linear Regression with Lasso or Ridge Regularization | |
| | ElasticNet Regression | |
| | Logistic Regression | |
| | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models | |
16
| Category | Algorithm | Notes |
| --- | --- | --- |
| Nonlinear Models for Regression or Classification | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask |
| | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask |
| | Inference for decision tree-based models | Forest Inference Library (FIL) |
| | K-Nearest Neighbors (KNN) | Multi-node multi-GPU via Dask, uses Faiss for Nearest Neighbors Query |
| | K-Nearest Neighbors (KNN) Classification | |
| | K-Nearest Neighbors (KNN) Regression | |
| | Support Vector Machine Classifier (SVC) | |
| | Epsilon-Support Vector Regression (SVR) | |
| Time Series | Linear Kalman Filter | |
| | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | |
17
RANDOM FOREST SNMG
18
START DASK CLUSTER
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
19
GENERATE DATA
import numpy as np
from sklearn import datasets, model_selection

# Data parameters
train_size = int(1e6)
test_size = int(1e3)
n_samples = train_size + test_size
n_features = 20
# Synthetic 5-class classification problem
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_clusters_per_class=1, n_informative=int(n_features / 3),
                                    random_state=123, n_classes=5)
y = y.astype(np.int32)  # cast labels to int32 before handing them to cuML
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)
20
DISTRIBUTE DATA TO GPUS
import cudf
import dask_cudf
import pandas as pd
from cuml.dask.common import utils as dask_utils

n_partitions = n_workers
# First convert to cudf (with real data, you would likely load in cuDF format to start)
X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
y_train_cudf = cudf.Series(y_train)
# Partition with Dask
# In this case, each worker will train on 1/n_partitions fraction of the data
X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
# Persist to cache the data in active memory
X_train_dask, y_train_dask = dask_utils.persist_across_workers(
    c, [X_train_dask, y_train_dask], workers=workers)
21
22
BUILD A SCIKIT-LEARN MODEL
from sklearn.ensemble import RandomForestClassifier as sklRF

# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
%%time
# Use all available CPU cores
skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
skl_model.fit(X_train, y_train)
CPU times: user 3h 3min 18s, sys: 32.3 s, total: 3h 3min 51s
Wall time: 2min 27s
23
24
BUILD DISTRIBUTED CUML MODEL
from dask.distributed import wait
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF

# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
%%time
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins,
                        n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish
CPU times: user 133 ms, sys: 24.4 ms, total: 157 ms
Wall time: 1.93 s
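To put the two training runs side by side, the sketch below converts the reported `%%time` outputs to seconds and derives the wall-clock speedup, plus the ratio of CPU time to wall time for scikit-learn as a rough indication of how many cores were kept busy. Only numbers from the slides are used; the `to_seconds` helper is an illustrative assumption.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/min/s reading from %%time into seconds."""
    return hours * 3600 + minutes * 60 + seconds

# scikit-learn run: "total: 3h 3min 51s", "Wall time: 2min 27s"
skl_cpu_s = to_seconds(hours=3, minutes=3, seconds=51)
skl_wall_s = to_seconds(minutes=2, seconds=27)

# Distributed cuML run: "Wall time: 1.93 s"
cuml_wall_s = 1.93

wall_speedup = skl_wall_s / cuml_wall_s    # cuML vs scikit-learn, wall clock
effective_cores = skl_cpu_s / skl_wall_s   # CPU time divided by wall time

print(f"wall-clock speedup: {wall_speedup:.1f}x")       # 76.2x
print(f"CPU time / wall time: {effective_cores:.1f}")   # 75.0
```

In other words, scikit-learn already used the equivalent of about 75 fully busy cores, and the multi-GPU run was still about 76x faster in wall-clock time.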
25
PREDICT AND CHECK ACCURACY
from sklearn.metrics import accuracy_score

skl_y_pred = skl_model.predict(X_test)
cuml_y_pred = cuml_model.predict(X_test)
# Due to randomness in the algorithm, you may see slight variation in accuracies
print("SKLearn accuracy: ", accuracy_score(y_test, skl_y_pred))
print("CuML accuracy: ", accuracy_score(y_test, cuml_y_pred))
SKLearn accuracy: 0.899
CuML accuracy: 0.886
Random Forest SNMG demo
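For classification, accuracy is the fraction of test samples whose predicted label equals the true label. The sketch below re-implements that definition in plain Python (the `accuracy` function is a stand-in written for illustration, not scikit-learn's code) and translates the reported gap into a number of test samples, using the test size of 1000 from the data generation step.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: 3 of 4 predictions are correct.
toy = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(toy)  # 0.75

# Reported accuracies on the 1000-sample test set.
test_size = 1000
skl_acc, cuml_acc = 0.899, 0.886
extra_errors = round((skl_acc - cuml_acc) * test_size)
print(extra_errors)  # 13
```

The accuracy gap of 0.013 therefore corresponds to 13 additional misclassified samples out of 1000.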
26
ANY PROBLEMS?
27
YES!
• Still pretty immature and not ready for production
  • Especially Dask
• Porting UDFs is hard [1, 2]
• No CPU version (even for inference)
• No automatic memory management
  • For obvious reasons
1. Apply Operations in cuDF
2. Numba cuDF integration
28
GPU 101
29
| Component | 2010 | 2016 | 2019 | Scale factor |
| --- | --- | --- | --- | --- |
| Storage | 50 MB/s (HDD) | 500 MB/s (SATA SSD) | 2 GB/s (NVMe SSD) | 40x |
| Network | 1 Gbit/s | 10 Gbit/s | 40 Gbit/s | 40x |
| CPU | 500 GFLOPS | 1 200 GFLOPS | 3 000 GFLOPS (18 cores/AVX-512) | 6x |
| CPU mem | 40 GB/s | 80 GB/s | 125 GB/s | 3x |
| GPU | 1 300 GFLOPS | 6 000 GFLOPS | 15 000 GFLOPS | 12x |
| GPU mem | 150 GB/s | 480 GB/s | 900 GB/s | 6x |
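The scale factors in the last column are the 2019 value divided by the 2010 value. A quick check in plain Python, using only the table's numbers (2 GB/s is taken as 2000 MB/s; the dictionary layout is an illustrative assumption):

```python
# (2010 value, 2019 value) per component, in the units named in the key.
trend = {
    "storage, MB/s": (50, 2000),
    "network, Gbit/s": (1, 40),
    "CPU, GFLOPS": (500, 3000),
    "CPU mem, GB/s": (40, 125),
    "GPU, GFLOPS": (1300, 15000),
    "GPU mem, GB/s": (150, 900),
}

scale = {name: v2019 / v2010 for name, (v2010, v2019) in trend.items()}

for name, factor in scale.items():
    print(f"{name}: {factor:.1f}x")
# storage, MB/s: 40.0x
# network, Gbit/s: 40.0x
# CPU, GFLOPS: 6.0x
# CPU mem, GB/s: 3.1x
# GPU, GFLOPS: 11.5x
# GPU mem, GB/s: 6.0x
```

The table rounds 3.1x to 3x and 11.5x to 12x. Storage and network throughput grew far faster than CPU compute and CPU memory bandwidth over the same period.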
30
| Device | Performance, GFLOPS | Memory bandwidth, GB/s | TDP, W | Price, $ |
| --- | --- | --- | --- | --- |
| Nvidia Tesla T4 | 8 100 | 320 | 75 | 3000 |
| Intel Xeon Gold 6140 | 2 500 | 120 | 140 | 3000 |
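Because both devices cost the same and differ mainly in throughput and power draw, it helps to normalize the table per watt and per dollar. The sketch below does that arithmetic on the table's figures; the field names are illustrative assumptions.

```python
# Figures copied from the comparison table.
specs = {
    "Tesla T4": {"gflops": 8100, "tdp_w": 75, "price_usd": 3000},
    "Xeon Gold 6140": {"gflops": 2500, "tdp_w": 140, "price_usd": 3000},
}

def efficiency(device):
    """Return (GFLOPS per watt, GFLOPS per dollar) for one device."""
    return device["gflops"] / device["tdp_w"], device["gflops"] / device["price_usd"]

t4_per_watt, t4_per_usd = efficiency(specs["Tesla T4"])
xeon_per_watt, xeon_per_usd = efficiency(specs["Xeon Gold 6140"])

print(f"T4:   {t4_per_watt:.1f} GFLOPS/W, {t4_per_usd:.2f} GFLOPS/$")
print(f"Xeon: {xeon_per_watt:.1f} GFLOPS/W, {xeon_per_usd:.2f} GFLOPS/$")
print(f"per-watt advantage:   {t4_per_watt / xeon_per_watt:.1f}x")
print(f"per-dollar advantage: {t4_per_usd / xeon_per_usd:.2f}x")
# T4:   108.0 GFLOPS/W, 2.70 GFLOPS/$
# Xeon: 17.9 GFLOPS/W, 0.83 GFLOPS/$
# per-watt advantage:   6.0x
# per-dollar advantage: 3.24x
```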
31
GPU VS CPU ARCHITECTURE
32
GPU TAKEAWAYS
1. GPU memory bandwidth is ~7x higher than CPU memory bandwidth
2. GPU has thousands of “simple” ALUs
3. GPU is a peripheral device
   1. The CPU has to launch CUDA kernels on the GPU
   2. The GPU connects to the CPU via PCI Express
33
DRAM <-> CPU: DDR4 4ch, 60 GB/s
CPU <-> GPU (Tesla V100): PCIe v4 x16, 32 GB/s
GPU <-> GPU DRAM: HBM2, 900 GB/s
CPU TO GPU IS SLOW!
30x performance drop
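The "30x" figure is the ratio between on-device HBM2 bandwidth and the PCIe link. The sketch below estimates how long it takes to move a dataset over each link at the peak bandwidths from the diagram. The 16 GB dataset size is a hypothetical value chosen for illustration, and real transfers rarely reach peak bandwidth.

```python
# Peak bandwidths from the diagram, in GB/s.
links_gbs = {
    "DDR4 4ch": 60,
    "PCIe v4 x16": 32,
    "HBM2": 900,
}

def transfer_seconds(size_gb, bandwidth_gbs):
    """Idealized time to move size_gb gigabytes at the given peak bandwidth."""
    return size_gb / bandwidth_gbs

size_gb = 16  # hypothetical dataset size, not from the slides

times = {name: transfer_seconds(size_gb, bw) for name, bw in links_gbs.items()}
slowdown = links_gbs["HBM2"] / links_gbs["PCIe v4 x16"]

for name, seconds in times.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print(f"HBM2 vs PCIe: {slowdown:.1f}x")
# DDR4 4ch: 266.7 ms
# PCIe v4 x16: 500.0 ms
# HBM2: 17.8 ms
# HBM2 vs PCIe: 28.1x
```

Every round trip between host and device memory pays the PCIe cost, which is why data should stay in GPU memory between operations.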
34
GPU BEST PRACTICE
1. Data must not leave GPU memory!
2. You will get a performance boost if your dataset is big enough to keep the GPU busy
3. Use Apache Arrow-compatible formats (e.g. Parquet)
4. Keep an eye on GPUDirect Storage and similar technologies
5. CUDA is different from what you’re used to. Accept it and make use of it!
35
USEFUL LINKS
RAPIDS
RAPIDS DOCS
rapids-nightly dockerhub (use it except for production)
RAPIDS Notebooks
RAPIDS Contributed Notebooks
kNN 600x speedup on MNIST (Kaggle notebook)
Multi-GPU XGBoost with RAPIDS
Dmitry Ursegov presentation for Moscow Spark #7
Numba for CUDA GPUs
PyCUDA
RAPIDS: Speeding Up Pandas and scikit-learn on GPU, Pavel Klemenkov, NVIDIA
