Zhang Zhang, Victoriya Fedotova
Intel Corporation
November 2016
Agenda
Introduction
– A quick intro to Intel® Data Analytics Acceleration Library and Intel® Distribution for Python
– A brief overview of basic machine learning concepts
Lab activities
– Warm-up exercises: Learn the gist of PyDAAL API
– Linear regression
– Classification with SVM
– K-Means clustering
– PCA
Conclusions
Get Your Hands Dirty with Intel® Distribution for Python*
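Data Analytics Flow Example: Spam Filter
[Figure: analytics pipeline with stages labeled Collect, Store, Load, Pre-process, Train & Validate, Deploy, and Make Decision, plus a Modelling label; sample classifier outputs: not spam, not spam, spam]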
Computational Aspects of Big Data
Volume
• Distributed across different nodes/devices
• Huge data size not fitting into node/device memory
Variety
• Non-homogeneous data
• Sparse/Missing/Noisy data
Velocity
• Data coming in time
[Figure: four panels titled "Converts, Indexing, Repacking", "Data Recovery", "Distributed Computing", and "Online Computing"; labels include data blocks D1 ... DK and Di, partial results P1, Pi, Pi+1, results RK and R, axes Time and Memory capacity, attribute types Numeric, Categorical, Missing, and Outlier, and boxes Recover, Counter, Dense Algorithm, and Sparse Algorithm]
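The "Online Computing" and "Distributed Computing" labels on this slide can be illustrated with a small sketch. It assumes that a partial result is a set of running sums that can be updated block by block or merged across nodes; the class name `PartialResult` and its methods are hypothetical and are not part of the DAAL API. The sum-of-squares formula is chosen for brevity and is not the most numerically stable way to compute a variance.

```python
import numpy as np


class PartialResult:
    """Running count, sum, and sum of squares (hypothetical helper, not DAAL)."""

    def __init__(self, n=0, total=0.0, total_sq=0.0):
        self.n = n
        self.total = total
        self.total_sq = total_sq

    def update(self, block):
        """Online step: fold one new data block into the partial result."""
        block = np.asarray(block, dtype=float)
        return PartialResult(
            self.n + block.size,
            self.total + block.sum(),
            self.total_sq + np.square(block).sum(),
        )

    def merge(self, other):
        """Distributed step: combine partial results computed on two nodes."""
        return PartialResult(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def finalize(self):
        """Turn the partial result into the mean and population variance."""
        mean = self.total / self.n
        variance = self.total_sq / self.n - mean ** 2
        return mean, variance


data = np.arange(10, dtype=float)

# Online computing: blocks arrive one at a time and are never held together.
partial = PartialResult()
for block in np.array_split(data, 3):
    partial = partial.update(block)
online_mean, online_var = partial.finalize()

# Distributed computing: each node summarizes its own block, then results merge.
node_a = PartialResult().update(data[:4])
node_b = PartialResult().update(data[4:])
dist_mean, dist_var = node_a.merge(node_b).finalize()
```

Both routes give the same answer as a single batch pass over `data`, which is the property that lets one algorithm serve batch, online, and distributed modes.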
Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Targets both data centers (Intel® Xeon® and Intel® Xeon Phi™) and edge devices (Intel® Atom™)
• Perform analysis close to data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security
• Offload data to server/cluster for complex and large-scale analytics
[Figure: algorithms and utilities across the data analytics pipeline]
Pipeline stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making
Application domains: Scientific/Engineering, Web/Social, Business
Algorithms and utilities:
• (De-)Compression, (De-)Serialization
• PCA, Statistical moments, Quantiles, Variance matrix
• QR, SVD, Cholesky
• Apriori
• Outlier detection
• Regression: Linear, Ridge
• Classification: Naïve Bayes, SVM, Classifier boosting, kNN
• Clustering: K-Means, EM GMM
• Collaborative filtering: ALS
• Neural Networks
Intel® DAAL Main Features
• Building end-to-end data applications
• Optimized for Intel architectures, from Intel® Atom™, Intel® Core™, Intel® Xeon®, to Intel® Xeon Phi™
• A rich set of widely applicable algorithms for data mining and machine learning
• Batch, online, and distributed processing
• Data connectors to a variety of data sources and formats: KDB*, MySQL*, HDFS, CSV, and user-defined sources/formats
• C++, Java, and Python APIs
*Other names and brands may be claimed as the property of others
http://www.rarewallpapers.com/animals/blue-snake-2029/
Python Landscape
Adoption of Python continues to grow among domain specialists and developers for its productivity benefits
Challenge #1: Domain specialists are not professional software programmers.
Challenge #2: Python performance limits migration to production systems
Intel’s solution is to…
• Accelerate Python performance
• Enable easy access
• Empower the community
Highlights: Intel® Distribution for Python* 2017
Focus on advancing Python performance closer to native speeds
Easy, out-of-the-box access to high performance Python
• Prebuilt, accelerated Distribution for numerical & scientific computing, data analytics, HPC. Optimized for IA
• Drop-in replacement for your existing Python. No code changes required
Drive performance with multiple optimization techniques
• Accelerated NumPy/SciPy/scikit-learn with Intel® Math Kernel Library
• Data analytics with pyDAAL, enhanced thread scheduling with TBB, Jupyter* notebook interface, Numba, Cython
• Scale easily with optimized mpi4py and Jupyter notebooks
Faster access to latest optimizations for Intel architecture
• Distribution and individual optimized packages available through conda and Anaconda Cloud
• Optimizations upstreamed back to main Python trunk
Performance Gain from MKL (Compared to “vanilla” SciPy)
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power, Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
[Figure: benchmark charts annotated "Up to 100x faster", "Up to 10x faster!", "Up to 10x faster!", and "Up to 60x faster!"]
Configuration info: Versions: Intel® Distribution for Python 2017 Beta, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMs of 8 GB @ 2133 MHz; Operating System: Ubuntu 14.04 LTS.
PyDAAL (Python API for Intel® DAAL)
Turbocharged machine learning tool for Python developers
Interoperability and composability with the SciPy ecosystem:
– Work directly with NumPy ndarrays
– Faster than scikit-learn
We’ll see how to use it in this lab
Regression
Problems
– A company wants to determine the impact of pricing changes on the number of product sales
– A biologist wants to determine the relationships between body size, shape, anatomy, and behavior of an organism
Solution: Linear Regression
– A linear model of the relationship between the features and the response
Source: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2014). An Introduction to Statistical Learning. Springer
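As a minimal sketch of the linear model described on this slide, the example below fits an intercept and one slope by ordinary least squares with NumPy. It does not use the pyDAAL API; the function names and the synthetic price and sales figures are assumptions made purely for illustration.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit; returns [intercept, slope_1, ..., slope_p]."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply the fitted model to new feature rows."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]


# Synthetic data (illustrative only): sales fall by 3 units per unit of price.
prices = np.array([[1.0], [2.0], [3.0], [4.0]])
sales = 100.0 - 3.0 * prices[:, 0]

coef = fit_linear(prices, sales)
forecast = predict_linear(coef, [[5.0]])
```

Here `coef[1]` is the estimated impact of a one-unit price change on sales, which is the quantity the first problem statement asks for.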
Classification
Problems
– An email service provider wants to build a spam filter for its customers
– A postal service wants to implement handwritten address interpretation
Solution: Support Vector Machine (SVM)
– Works well for non-linear decision boundaries
– Two kernel functions are provided:
  – Linear kernel
  – Gaussian kernel (RBF)
– Multi-class classifier
  – One-vs-One
Source: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2014). An Introduction to Statistical Learning. Springer
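The two kernels and the one-vs-one scheme named on this slide can be written down in a few lines. This is a sketch in plain NumPy, not the DAAL implementation; the function names are hypothetical, and parameterizing the Gaussian kernel by a width `sigma` is one common convention assumed here.

```python
from itertools import combinations

import numpy as np


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return float(np.dot(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def one_vs_one_pairs(labels):
    """Class pairs for which one-vs-one trains a separate two-class SVM."""
    return list(combinations(sorted(set(labels)), 2))


k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_rbf_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
pairs = one_vs_one_pairs([0, 1, 2, 1, 0])
```

With k classes, one-vs-one trains k(k-1)/2 two-class models and typically predicts by majority vote among them.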
Cluster Analysis
Problems
– A news provider wants to group news items with similar headlines in the same section
– People with similar genetic patterns are grouped together to identify correlation with a specific disease
Solution: K-Means
– Pick k centroids
– Repeat until convergence:
  – Assign data points to the closest centroid
  – Re-calculate centroids as the mean of all points in the current cluster
  – Re-assign data points to the closest centroid
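The K-Means steps listed on this slide can be sketched directly in NumPy. The version below is illustrative only and is not the DAAL implementation: it assumes random selection of the initial centroids from the data, Euclidean distance, and a hypothetical function name `kmeans`.

```python
import numpy as np


def assign(points, centroids):
    """Index of the closest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k centroids: here, k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Re-calculate each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
```

On real data the result depends on the initial centroids, so the algorithm is commonly run from several starting points.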
Dimensionality Reduction
Problems
– A data scientist wants to visualize a multi-dimensional data set
– A classifier built on the whole data set tends to overfit
Solution: Principal Component Analysis
– Compute the eigendecomposition of the correlation matrix
– Use the eigenvectors with the largest eigenvalues to compute the leading principal components, which explain most of the variance in the original data
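The two PCA steps on this slide map onto a short NumPy sketch: eigendecompose the correlation matrix, then project standardized data onto the eigenvectors with the largest eigenvalues. The function name and the synthetic data are assumptions for illustration, and this is not the DAAL code path.

```python
import numpy as np


def pca_correlation(X, n_components):
    """PCA via eigendecomposition of the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # The correlation matrix describes standardized features, so standardize X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    scores = Z @ eigvecs[:, :n_components]
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, eigvecs[:, :n_components], explained


# Synthetic data: the second feature is nearly a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2.0 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

scores, components, explained = pca_correlation(X, n_components=2)
```

`explained` holds the fraction of total variance carried by each retained component, which is a common basis for deciding how many components to keep.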
Setup
• Unpack the archive to the local disk
• Run setup script:
  – Linux, OS X: ./setup.sh
  – Windows: setup.bat
• Set path to conda:
  – Linux, OS X: export PATH=<path_to_idp>/bin:$PATH
  – Windows: set PATH=<path_to_idp>\Scripts;%PATH%
Lab 1: Warm-up Exercise
Learning objectives:
• Understand NumericTable, the main data structure of DAAL
  – Create NumericTable from data sources
  – Interoperability with NumPy, Pandas, scikit-learn
  – Get NumPy ndarray from NumericTable
• Understand the code sequence for using the DAAL API
  – Create an algorithm object
  – Pass in input data
  – Set algorithm-specific parameters
  – Compute
  – Get results
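The five-step code sequence above can be illustrated without the library itself. The sketch below uses a hypothetical stand-in class, `ColumnMoments`, written in plain NumPy; its name, methods, and `ddof` parameter are assumptions and do not correspond to actual pyDAAL classes or signatures. Only the order of the steps mirrors the slide.

```python
import numpy as np


class ColumnMoments:
    """Hypothetical stand-in for an algorithm object (not a DAAL class)."""

    def __init__(self):
        self.ddof = 0        # algorithm-specific parameter
        self._data = None

    def set_input(self, data):
        self._data = np.asarray(data, dtype=float)

    def compute(self):
        if self._data is None:
            raise ValueError("set_input() must be called before compute()")
        return {
            "mean": self._data.mean(axis=0),
            "variance": self._data.var(axis=0, ddof=self.ddof),
        }


table = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

algorithm = ColumnMoments()       # 1. create an algorithm object
algorithm.set_input(table)        # 2. pass in input data
algorithm.ddof = 1                # 3. set algorithm-specific parameters
result = algorithm.compute()      # 4. compute
means = result["mean"]            # 5. get results
variances = result["variance"]
```

The lab notebooks show the real class and method names; consult the DAAL reference manual for the exact signatures.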
Lab 2: Linear Regression
Learning objectives:
• Understand the two regression algorithms currently available in DAAL
  – Linear regression without regularization
  – Ridge regression
• Learn supervised learning workflow
  – Train a model using known data
  – Test the model by making predictions on new data
• Visualize prediction results
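The train-then-test workflow and the difference between the two regression variants can be sketched in NumPy. This is not the DAAL implementation; the closed-form ridge solution, the parameter name `alpha`, the synthetic data, and the 150/50 split are all assumptions chosen for illustration. Setting `alpha=0` gives linear regression without regularization.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit; the intercept is not penalized."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - X_mean @ w
    return w, b


def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


# Synthetic data with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=200)

# Train a model using known data, then test it on data it has not seen.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w_ols, b_ols = fit_ridge(X_train, y_train, alpha=0.0)
w_ridge, b_ridge = fit_ridge(X_train, y_train, alpha=10.0)

test_mse = mean_squared_error(y_test, X_test @ w_ridge + b_ridge)
```

The ridge penalty shrinks the coefficient vector toward zero, which trades a small bias for lower variance when features are correlated or data is scarce.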
Lab 3: Classification with SVM
Learning objectives:
• Understand SVM algorithm usage model
  – Multi-class classification with SVM
  – Two-class classification with SVM
• Understand quality metrics in classification
  – Confusion matrix
  – Metrics computed using the confusion matrix (accuracy, etc.)
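A confusion matrix and the metrics derived from it are easy to compute by hand, which helps when checking a library's output. The sketch below uses plain NumPy with hypothetical function names and made-up labels; rows are true classes and columns are predicted classes, which is one common convention assumed here.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def accuracy(cm):
    """Fraction of all samples on the diagonal (correct predictions)."""
    return np.trace(cm) / cm.sum()


def precision_recall(cm, cls):
    """Precision and recall for one class; undefined if a denominator is zero."""
    true_pos = cm[cls, cls]
    precision = true_pos / cm[:, cls].sum()   # of those predicted as cls
    recall = true_pos / cm[cls, :].sum()      # of those truly cls
    return precision, recall


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
acc = accuracy(cm)
precision, recall = precision_recall(cm, cls=1)
```

For multi-class problems the same matrix applies unchanged; per-class precision and recall are read from its columns and rows.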
Lab 4: Clustering with K-Means
Learning objectives:
• Understand the K-Means algorithm supported in DAAL
• Learn basic clustering workflow
  – Initialize cluster centroids
  – Minimize the goal function
• Visualize clusters
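The goal function that K-Means minimizes is commonly taken to be the sum of squared distances from each point to its nearest centroid. The sketch below assumes that definition and evaluates it for a poor initialization and for a better set of centroids; the function name `kmeans_goal` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_goal(points, centroids):
    """Sum of squared Euclidean distances to the nearest centroid."""
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())


points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# A deliberately poor initialization: both centroids inside the same group.
initial = points[:2]
# A better choice: the mean of each visually obvious group.
better = np.array([points[:3].mean(axis=0), points[3:].mean(axis=0)])

goal_initial = kmeans_goal(points, initial)
goal_better = kmeans_goal(points, better)
```

Comparing the two values shows why initialization matters: each iteration can only lower the goal function from wherever it starts.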
Lab 5: Principal Component Analysis
Learning objectives:
• Understand the PCA algorithms supported in DAAL:
  – Correlation matrix method
  – SVD method
• Evaluate and visualize principal components
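The SVD method can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the DAAL algorithm: the data is only mean-centered, so the result corresponds to the covariance matrix, and the features would need to be standardized first to match a correlation-matrix result. The function name and synthetic data are hypothetical.

```python
import numpy as np


def pca_svd(X, n_components):
    """PCA from the singular value decomposition of mean-centered data."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # S is sorted descending
    components = Vt[:n_components]                     # rows are principal axes
    scores = Xc @ components.T
    variances = S ** 2 / (len(X) - 1)                  # variance along each axis
    ratio = variances[:n_components] / variances.sum()
    return scores, components, variances[:n_components], ratio


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

scores, components, variances, ratio = pca_svd(X, n_components=2)

# Cross-check against the eigenvalues of the covariance matrix.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```

Working from the SVD avoids forming the covariance matrix explicitly, which can be preferable numerically when features are nearly collinear.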
References
Intel DAAL User’s Guide and Reference Manual
– https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/index.htm
Intel Distribution for Python Documentation
– https://software.intel.com/en-us/intel-distribution-for-python-support/documentation
What’s Next - Takeaways
Learn more about Intel® DAAL
– It supports C++ and Java, too!
– We want you to use DAAL in your data projects
Learn more about Intel® Distribution for Python
– Beyond machine learning, many more benefits
Keep an eye on the tutorial repository
– https://github.com/daaltces/pydaal-tutorials
– I’m adding more labs, samples, etc.
Zhang Zhang (zhang.zhang@intel.com)
Victoriya Fedotova (victoriya.s.fedotova@intel.com)
www.intel.com/hpcdevcon