Domino Data Lab November 10, 2015
Faster data science — without a cluster
Parallel programming in Python
Manojit Nandi
mnandi92@gmail.com
@mnandi92
Who am I?
• Data Scientist at STEALTHBits Technologies
• Data Science Evangelist at Domino Data Lab
• BS in Decision Science
Agenda and Goals
• Motivation
• Conceptual intro to parallelism, general principles and pitfalls
• Machine learning applications
• Demos
Goal: Leave you with principles, and practical concrete tools, that will
help you run your code much faster
Motivation
• Lots of “medium data” problems
• Can fit in memory on one machine
• Lots of naturally parallel problems

• Easy to access large machines
• Clusters are hard
• Not everything fits map-reduce
CPUs with multiple cores have become the standard in the recent
development of modern computer architectures and we can not only find
them in supercomputer facilities but also in our desktop machines at
home, and our laptops; even Apple's iPhone 5S got a 1.3 GHz Dual-core
processor in 2013.
- Sebastian Raschka
Parallel programming 101
• Think about independent tasks (hint: “for” loops are a good place to start!)
• Should be CPU-bound tasks
• Warnings and pitfalls
• Not a substitute for good code
• Overhead
• Shared resource contention
• Thrashing
Source: Blaise Barney, Lawrence Livermore National Laboratory
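As a minimal sketch of the "for loop" hint above, the snippet below runs the same CPU-bound function first serially and then through a process pool. The function `count_primes` and the input sizes are made up for illustration; only `concurrent.futures.ProcessPoolExecutor` comes from the Python standard library.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Hypothetical CPU-bound task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_serial(limits):
    # The plain "for" loop: each iteration is independent of the others.
    return [count_primes(limit) for limit in limits]


def run_parallel(limits, workers=4):
    # Same work, spread over worker processes; results keep the input order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    limits = [20_000, 30_000, 40_000, 50_000]
    # Both versions must agree; only the wall-clock time should differ.
    assert run_parallel(limits) == run_serial(limits)
```

Overhead still applies here: for very small tasks, starting processes and pickling arguments can cost more than the work itself.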
Can parallelize at different “levels”
Will focus on Algorithms, with some brief comments on Experiments
• Math ops: Run against underlying libraries that parallelize low-level operations, e.g., OpenBLAS, ATLAS
• Algorithms: Write your code (or use a package) to parallelize functions or steps within your analysis
• Experiments: Run different analyses at once
Parallelize tasks to match your resources
Computing something (CPU)

Reading from disk/database

Writing to disk/database

Network IO (e.g., web scraping)
Saturating a resource will create a bottleneck
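One way to act on this, sketched under the assumption that the work is network-bound: use a thread pool and cap `max_workers` so the remote resource is not saturated. `fetch_page` is a stand-in that sleeps instead of making a real request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    """Stand-in for a network call: sleeps briefly instead of doing real I/O."""
    time.sleep(0.05)
    return f"contents of {url}"


def fetch_all(urls, max_workers=4):
    # Threads suit I/O-bound work because they spend most of their time waiting.
    # max_workers caps how many requests are in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))


pages = fetch_all([f"example.com/page/{i}" for i in range(8)])
```

For CPU-bound work the same pattern would normally use processes rather than threads, with the worker count matched to the number of cores.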
Don't oversaturate your resources
itemIDs = [1, 2, … , n]
parallel-for-each(i = itemIDs){
    item = fetchData(i)
    result = computeSomething(item)
    saveResult(result)
}
Parallelize tasks to match your resources
items = fetchData([1, 2, … , n])
results = parallel-for-each(item = items){
    computeSomething(item)
}
saveResult(results)
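The pseudocode above could look roughly like this in Python. It is a sketch, not the speaker's demo code: `fetch_data` and `compute_something` are hypothetical placeholders, and the `map_fn` parameter is only there so the same pipeline can be run serially or through a pool.

```python
from concurrent.futures import ProcessPoolExecutor


def fetch_data(item_ids):
    """Hypothetical bulk read: one round trip for all IDs."""
    return [{"id": i, "value": i * 10} for i in item_ids]


def compute_something(item):
    """Hypothetical CPU-bound step applied to a single item."""
    return item["value"] ** 2


def run_pipeline(item_ids, map_fn=map):
    items = fetch_data(item_ids)                       # read once, serially
    results = list(map_fn(compute_something, items))   # only this step is parallel
    return results                                     # caller saves once, serially


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = run_pipeline(range(1, 9), map_fn=pool.map)
    print(results)
```

Keeping the read and the write outside the parallel section means the database or disk sees one request instead of many concurrent ones.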
Avoid modifying global state
A = [0, 0, 0, 0]
parallel-for-each(i = 0:3) {
    A[i] = i
}
A = [0,0,0,0]   Array initialized in process 1
[0,0,0,0] [0,0,0,0] [0,0,0,0] [0,0,0,0]   Array copied to each sub-process
[0,0,0,0] [0,1,0,0] [0,0,2,0] [0,0,0,3]   Each copy is modified
A = [0,0,0,0]   When all parallel tasks finish, the array in the original process remains unchanged
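The behaviour described on this slide can be reproduced with Python's process pools. In this sketch, `set_slot` is the pattern to avoid and `slot_value` with `build_array` is one possible alternative; all three names are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor

A = [0, 0, 0, 0]


def set_slot(i):
    # Runs in a worker process, so this writes to the worker's own copy of A.
    A[i] = i


def slot_value(i):
    # Safer pattern: return the value and let the parent assemble the result.
    return i


def build_array(n, map_fn=map):
    return list(map_fn(slot_value, range(n)))


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(set_slot, range(4)))
        print(A)                                 # parent's A is still [0, 0, 0, 0]
        print(build_array(4, map_fn=pool.map))   # [0, 1, 2, 3]
```

Returning results from each task, rather than mutating shared state, also makes the code behave the same whether it runs serially or in parallel.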
Demo
Many ML tasks are parallelized
• Cross-Validation
• Grid Search Selection
• Random Forest
• Kernel Density Estimation
• K-Means Clustering
• Probabilistic Graphical Models
• Online Learning
• Neural Networks (Backpropagation)
(Ordered from intuitive to parallelize at the top to harder to parallelize at the bottom)
Cross validation
Grid search
[Figure: grid of Kernel (Linear, RBF) by C (1, 10, 100, 1000)]
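Each cell of the kernel-by-C grid can be evaluated independently, which is what makes grid search easy to parallelize. The sketch below shows the idea with the standard library only; `score_params` is an invented placeholder for "train a model and return its validation score" and does not fit a real SVM.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

KERNELS = ["linear", "rbf"]
C_VALUES = [1, 10, 100, 1000]


def score_params(params):
    """Placeholder for 'fit a model and return its validation score'.

    The formula is invented so the example runs without any ML library.
    """
    kernel, c = params
    bonus = 0.05 if kernel == "rbf" else 0.0
    return 0.9 - abs(c - 100) / 10_000 + bonus


def grid_search(map_fn=map):
    grid = list(product(KERNELS, C_VALUES))      # 2 kernels x 4 C values = 8 cells
    scores = list(map_fn(score_params, grid))    # each cell is an independent task
    best_score, best_params = max(zip(scores, grid))
    return best_params, best_score


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(grid_search(map_fn=pool.map))
```

With cross-validation added, every (cell, fold) pair becomes its own task, so the amount of available parallelism multiplies.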
Random forest
Parallel programming in Python
• Joblib: pythonhosted.org/joblib/parallel.html
• scikit-learn (n_jobs): scikit-learn.org
  • GridSearchCV
  • RandomForest
  • KMeans
  • cross_val_score
• IPython Notebook clusters: www.astro.washington.edu/users/vanderplas/Astr599/notebooks/21_IPythonParallel
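A short sketch of the `n_jobs` parameter in scikit-learn, assuming a recent release is installed. The import paths below are the current ones (`sklearn.model_selection`); at the time of this talk the same tools lived in `sklearn.grid_search` and `sklearn.cross_validation`. The iris dataset and the value `n_jobs=2` are arbitrary choices for illustration, and KMeans is left out because recent releases no longer accept `n_jobs` there.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: each (kernel, C) candidate and each CV fold can run in parallel.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100, 1000]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=2)
search.fit(X, y)

# Random forest: the individual trees are built in parallel during fit.
forest = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
forest.fit(X, y)

# Cross-validation: the folds are fitted and scored in parallel.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5, n_jobs=2)
```

Setting `n_jobs=-1` asks for all available cores, which is convenient on a dedicated machine but can oversaturate a shared one.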
Demo
Parallel Programming using the GPU
• GPUs are essential to deep learning because they can yield speed-ups of around 10x when training neural networks.
• Use the PyCUDA library to write Python code that executes on the GPU.
Demo
Can compose layers of parallelism
[Figure: two layers of parallelism]
Machines (experiments): RF | NN | GridSearched SVC
Cores (on each machine): c1, c2, …, cn
Demo
FYI: Parallel programming in R
• General purpose
  • parallel
  • foreach: cran.r-project.org/web/packages/foreach
• More specialized
  • randomForest: cran.r-project.org/web/packages/randomForest
  • caret: topepo.github.io/caret
  • plyr: cran.r-project.org/web/packages/plyr
dominodatalab.com
blog.dominodatalab.com
@dominodatalab
Check us out!

Parallel Programming in Python: Speeding up your analysis
