Azure Machine Learning: Deep Learning with Python, R, Spark, and CNTK

Azure Machine Learning – Other Topics
Microsoft Taiwan
Technical Evangelist
吳宏彬
8/25/2016

What is the R language?
• Open source "lingua franca" for analytics, computing, and modeling
• Ecosystem: global community with millions of users; 7000+ algorithms, test data & evaluations
• Scalability: can be scaled to big data and big analytics

Polls of data miners and analytics professionals on their software choices since 2007
Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html

• R is developed and contributed to by the open source community
• CRAN – the Comprehensive R Archive Network
  • Package repository of R
  • 7500+ packages, covering all aspects of statistical analysis, machine learning, natural language processing …
  • Still growing exponentially
• Free!
Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/

1. Seasonal ARIMA
2. Non-seasonal ARIMA
3. Seasonal ETS
4. Non-seasonal ETS
5. Average of seasonal ETS and seasonal ARIMA

Mean Error (ME) - Average forecasting error (an error is the difference between the
predicted value and the actual value) on the test dataset
Root Mean Squared Error (RMSE) - The square root of the average of squared errors of
predictions made on the test dataset.
Mean Absolute Error (MAE) - The average of absolute errors
Mean Percentage Error (MPE) - The average of percentage errors
Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors
Mean Absolute Scaled Error (MASE)
Symmetric Mean Absolute Percentage Error (sMAPE)
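
The first five metrics above can be computed directly from their definitions. The following is a minimal sketch, assuming the slide's sign convention (error = predicted value minus actual value) and percentage errors taken relative to the actual value; the function name `forecast_errors` and the sample numbers are illustrative only. MASE and sMAPE are left out because the slide does not define them and several variants exist.

```python
import math


def forecast_errors(actual, predicted):
    """Return ME, RMSE, MAE, MPE and MAPE for paired actual/predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    # Assumption: error = predicted - actual, as worded on the slide.
    errors = [p - a for a, p in zip(actual, predicted)]
    # Assumption: percentage errors are relative to the actual value,
    # so no actual value may be zero.
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
    }


# Illustrative test-set values, not taken from the presentation.
metrics = forecast_errors(actual=[100.0, 200.0, 400.0],
                          predicted=[110.0, 190.0, 400.0])
```

Note that ME and MPE can be close to zero even when individual forecasts are poor, because positive and negative errors cancel; that is why the absolute and squared variants are usually reported alongside them.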

| | Open source R (CRAN) | Microsoft R Open | Microsoft R Server |
| Data size | In-memory | In-memory | In-memory or disk-based |
| Speed of analysis | Single-threaded | Multi-threaded | Multi-threaded, parallel processing across 1:N servers |
| Support | Community | Community | Community + commercial |
| Analytic breadth & depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions |
| License | Open source | Open source | Commercial license; supported release with indemnity |

• Supports standard Python library types such as pandas data frames and NumPy arrays.
• Python code executes on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn).
• Outputs images generated with Matplotlib.

• Data is growing faster than processing speeds
• The only solution is to parallelize data processing on large clusters
• Example: HDInsight

What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency (up to 100× faster) through:
• In-memory computing primitives
• General computation graphs
Improves usability (often 2-10× less code) through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark was started by Matei Zaharia at UC Berkeley AMPLab in 2009, open sourced in 2010, and donated to Apache in 2013.

Spark for Azure HDInsight
[Diagram: clients (decision makers), a Spark cluster made up of Spark nodes, and a storage layer]

Spark Notebooks
• Using the Spark shell to run interactive queries
• Using the Spark shell to run Spark SQL queries
• Using a standalone Scala program

Spark Notebooks
• Zeppelin – for Scala users
• Jupyter – for Python users

2015 System
Human Error Rate 4%
Speech Recognition could reach human parity in the next 3 years

Microsoft used deep learning to win first place in every ImageNet 2015 competition category

ImageNet classification top-5 error (%)
| Competition | Entry | Top-5 error (%) |
| ILSVRC 2010 | NEC America | 28.2 |
| ILSVRC 2011 | Xerox | 25.8 |
| ILSVRC 2012 | AlexNet | 16.4 |
| ILSVRC 2013 | Clarifai | 11.7 |
| ILSVRC 2014 | VGG | 7.3 |
| ILSVRC 2014 | GoogLeNet | 6.7 |
| ILSVRC 2015 | MSRA ResNet | 3.5 |

All five Microsoft entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

CNTK At the Heart: Computational Networks
• A generalization of machine learning models that can be described as a series of computational steps.
  • E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, log-linear model
• Representation:
  • A list of computational nodes denoted as n = {node name : operation name}
  • The parent-children relationship describing the operands: {n : c1, ···, cKn}
  • Kn is the number of children of node n. For leaf nodes Kn = 0.
  • Order of the children matters: e.g., XY is different from YX
  • Given the inputs (operands), the value of the node can be computed.
• Can flexibly describe deep learning models.
  • Adopted by many other popular tools as well
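
The node list and parent-children relationship described on this slide can be sketched in a few lines. This is only an illustration of the representation, not CNTK code: the operation names (`Times`, `Plus`, `ReLU`, `Input`, `Parameter`), the tiny one-layer network, and the `evaluate` helper are assumptions made for the example, and matrices are plain nested lists to avoid dependencies.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def plus(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def relu(a):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in a]


OPERATIONS = {"Times": matmul, "Plus": plus, "ReLU": relu}

# n = {node name : operation name}
nodes = {"X": "Input", "W": "Parameter", "b": "Parameter",
         "WX": "Times", "Z": "Plus", "H": "ReLU"}

# {n : c1, ..., cKn}; leaf nodes have no entry (Kn = 0).
# Child order matters: ["W", "X"] computes WX, not XW.
children = {"WX": ["W", "X"], "Z": ["WX", "b"], "H": ["Z"]}


def evaluate(name, leaf_values, cache=None):
    """Compute a node's value from its operands, evaluating children first."""
    cache = {} if cache is None else cache
    if name not in cache:
        kids = children.get(name, [])
        if not kids:
            cache[name] = leaf_values[name]  # leaf: input or parameter
        else:
            operands = [evaluate(c, leaf_values, cache) for c in kids]
            cache[name] = OPERATIONS[nodes[name]](*operands)
    return cache[name]


leaf_values = {
    "W": [[1.0, -1.0], [0.5, 2.0]],
    "X": [[2.0], [3.0]],
    "b": [[0.5], [-3.0]],
}
h = evaluate("H", leaf_values)  # H = ReLU(W X + b)
```

Because every node only needs its operands' values, the same two dictionaries are enough to describe feed-forward layers, convolutions, or recurrent steps by swapping the operations.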

“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
[Chart: speed comparison (samples/second, higher = better) of CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); note: December 2015]
• Theano only supports 1 GPU
• Achieved with the 1-bit gradient quantization algorithm
* TensorFlow added distributed compute support in April 2016
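
The slide names 1-bit gradient quantization without describing it. As commonly described, the idea is to send one bit per gradient value between workers, plus a pair of reconstruction values, and to carry the quantization error forward into the next minibatch so that no gradient information is permanently lost. The sketch below follows that description under simplifying assumptions: it quantizes a whole vector with a zero threshold and uses the per-bin mean as the reconstruction value. The function name `quantize_1bit` is hypothetical, and this is not CNTK's implementation.

```python
def quantize_1bit(gradient, residual):
    """Quantize a gradient vector to one bit per value, with error feedback."""
    # Add the quantization error carried over from the previous minibatch.
    corrected = [g + r for g, r in zip(gradient, residual)]
    # One bit per value: 1 for non-negative, 0 for negative.
    bits = [1 if v >= 0.0 else 0 for v in corrected]
    # Reconstruction values: the mean of the values that fell in each bin.
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    neg_mean = sum(neg) / len(neg) if neg else 0.0
    # What a receiver would reconstruct from the bits and the two means.
    decoded = [pos_mean if b else neg_mean for b in bits]
    # The part lost to quantization is kept and added to the next gradient.
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, decoded, new_residual


gradient = [0.5, -0.25, 1.5, -0.75]
bits, decoded, residual = quantize_1bit(gradient, residual=[0.0] * 4)
```

Only `bits` and the two mean values would need to cross the network, which is what reduces the communication cost of data-parallel training across GPUs and servers.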

• Microsoft researcher Slawek Smyl won CIF 2016 using an LSTM neural network
• Powered by CNTK

CIF Competition 2016 – Final Results
• Contestant 1 – Slawek Smyl (LSTM-based NN on deseasonalized data)
• Contestant 2 – Slawek Smyl (weighted average of my 3 methods)
• Contestant 3 – Prof. Sven Crone (Multilayer Perceptron with a thorough feature search)
• Contestant 4 – Mikhail Artyukhov (previous competition winner, ensemble models)
• Contestant 5 – Joerg Wichard, Bayer Healthcare AG (Adaptive Forecasting Strategy with Hybrid Ensemble Models)
• Contestant 6 – Slawek Smyl (LSTM-based NN)

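CNTK Architecture
[Diagram components]
• CN Builder: uses a CN description or lambda to build the CN (computational network)
• IDataReader: task-specific reader; loads data and supplies features & labels
• ILearner: SGD, AdaGrad, etc.
• IExecutionEngine: CPU/GPU; evaluates the network and computes gradients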
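
How the three roles in the architecture diagram could cooperate in a training loop is sketched below. The class names `ListReader`, `LinearEngine`, and `SgdLearner`, their methods, and the toy model y = w * x are all hypothetical stand-ins for the IDataReader, IExecutionEngine, and ILearner roles; they do not reproduce CNTK's actual interfaces.

```python
class ListReader:
    """Plays the IDataReader role: hands out (feature, label) minibatches."""

    def __init__(self, samples, batch_size):
        self.samples = samples
        self.batch_size = batch_size

    def minibatches(self):
        for i in range(0, len(self.samples), self.batch_size):
            yield self.samples[i:i + self.batch_size]


class LinearEngine:
    """Plays the IExecutionEngine role for the toy model y = w * x."""

    def evaluate(self, w, batch):
        # Mean squared error over the minibatch.
        return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

    def compute_gradient(self, w, batch):
        # Derivative of the mean squared error with respect to w.
        return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


class SgdLearner:
    """Plays the ILearner role: plain stochastic gradient descent."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, w, gradient):
        return w - self.learning_rate * gradient


def train(reader, engine, learner, w=0.0, epochs=50):
    """Get data from the reader, compute gradients, let the learner update."""
    for _ in range(epochs):
        for batch in reader.minibatches():
            w = learner.update(w, engine.compute_gradient(w, batch))
    return w


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # labels follow y = 3x
engine = LinearEngine()
w = train(ListReader(data, batch_size=2), engine, SgdLearner(learning_rate=0.05))
```

Keeping the reader, the learner, and the execution engine behind separate interfaces is what lets a new data format or a new update rule (for example AdaGrad in place of SGD) be swapped in without touching the other parts.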

(1) Kai Chen and Qiang Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering”, in International Conference on Acoustics, Speech and Signal Processing, March 2016, Shanghai, China.

• CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
• CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy with a new reader design
• Network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy
• Compared to other tools, CNTK has a great balance between efficiency, performance, and flexibility

| | Mahout | Spark ML | Azure ML | R Server |
| Shared Service | No | No | Yes | No |
| Deployment Model | PaaS | PaaS | PaaS | IaaS |
| Extensibility | High | High | Medium | High |
| Deployment Complexity | Medium | High | Low | Medium |
| Cost | High | High | Low | High |
| Programming Languages | Java/Scala | Scala/Java/Python | Python/R | R |
| Algorithms | Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN) |
| Scalability | High | High | Medium | Medium |

xgboost Vowpal Wabbit
Rattle
CNTK
*Copy

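• On-demand use in the cloud
• All types of data
• Rapid service deployment
• Data sharing and collaboration
• Open support
• Complete data analysis workflow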

https://gallery.cortanaintelligence.com/

The only vendor offering a complete solution, from data ingestion through generating actions to data presentation

ConnectR
• High-speed & direct connectors
• Available for: high-performance XDF; SAS, SPSS, delimited & fixed-format text data files; Hadoop HDFS (text & XDF); Teradata Database & Aster; EDWs and ADWs; ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
• Available on: Windows Servers; Red Hat and SuSE Linux Servers; Teradata Database; Cloudera Hadoop; Hortonworks Hadoop; MapR Hadoop
R+CRAN
• Open source R interpreter
• R 3.2.2
• Freely available, huge range of R algorithms
• Algorithms callable by RevoR
• 100% compatible with existing R scripts, functions, and packages
RevoR
• Performance-enhanced R interpreter
• Based on open source R
• Adds a high-performance math library to speed up linear algebra functions
Microsoft R Open
Microsoft R Server
DeployR
DevelopR

Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Combination
• rxDataStep
• rxExec
• PEMA-R API Custom Algorithms
Additional Resources
• CNTK:
  • https://github.com/Microsoft/CNTK
  • Contains all the source code and example setups
  • Reading the source code may help you understand how CNTK works
  • New features are added constantly
• How to contact:
  • CNTK team: ask a question on CNTK GitHub!
  • Alexey:
    • Email: alexey.kamenev@microsoft.com
    • LinkedIn: https://www.linkedin.com/in/alexeykamenev
