Azure Machine Learning
– Other Topics
Microsoft Taiwan
Technical Evangelist
吳宏彬
8/25/2016
Azure Machine Learning: Using Python, R, Spark, and CNTK Deep Learning
What Is the R Language?
Open Source: the “lingua franca” of analytics, computing, and modeling
Ecosystem: global community; millions of users; 7000+ algorithms, test data & evaluations
Scalability: can be scaled to big data and big analytics
Polls of data miners and analytics professionals on their software
choices since 2007
Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
• R is developed and maintained by the open source community
• CRAN: the Comprehensive R Archive Network
  • The package repository for R
  • 7500+ packages, covering all aspects of statistical analysis, machine learning, natural language processing …
  • Still growing exponentially
• Free!
Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/
1. Seasonal ARIMA
2. Non-seasonal ARIMA
3. Seasonal ETS
4. Non-seasonal ETS
5. Average of seasonal ETS and seasonal ARIMA
Mean Error (ME) - Average forecasting error (an error is the difference between the
predicted value and the actual value) on the test dataset
Root Mean Squared Error (RMSE) - The square root of the average of squared errors of
predictions made on the test dataset.
Mean Absolute Error (MAE) - The average of absolute errors
Mean Percentage Error (MPE) - The average of percentage errors
Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors
Mean Absolute Scaled Error (MASE)
Symmetric Mean Absolute Percentage Error (sMAPE)
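The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the talk. The function name `forecast_metrics` and its arguments are hypothetical, and several conventions are assumptions the slide does not state: the error is taken as actual minus predicted, percentage errors are relative to the actual value, MASE is scaled by the in-sample MAE of a one-step naive forecast, and sMAPE uses the 0-200% form.

```python
import math

def forecast_metrics(actual, predicted, train):
    """Return common forecast accuracy metrics as a dict.

    Assumptions (not stated on the slide): error = actual - predicted,
    percentage errors are relative to the actual value, MASE is scaled
    by the in-sample MAE of a one-step naive forecast on `train`, and
    sMAPE uses the 0-200% form. Actual values must be nonzero.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    pct = [100.0 * e / a for e, a in zip(errors, actual)]
    # In-sample MAE of the naive forecast (previous value predicts the next).
    naive_mae = sum(abs(train[i] - train[i - 1])
                    for i in range(1, len(train))) / (len(train) - 1)
    mae = sum(abs(e) for e in errors) / n
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(p) for p in pct) / n,
        "MASE": mae / naive_mae,
        "sMAPE": sum(200.0 * abs(a - p) / (abs(a) + abs(p))
                     for a, p in zip(actual, predicted)) / n,
    }

metrics = forecast_metrics(actual=[100.0, 200.0],
                           predicted=[110.0, 180.0],
                           train=[90.0, 100.0, 120.0])
```

Note that ME and MPE let positive and negative errors cancel, so they measure bias, while the absolute and squared variants measure overall accuracy.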
Comparison | Open source R | Microsoft R Open | Microsoft R Server
Data size | In-memory | In-memory | In-memory or disk-based
Speed of analysis | Single-threaded | Multi-threaded | Multi-threaded, parallel processing on 1:N servers
Support | Community | Community | Community + commercial
Analytic breadth & depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
License | Open source | Open source | Commercial license; supported release with indemnity
• Supports standard Python library types such as pandas data frames and NumPy arrays.
• Python code executes on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn).
• Outputs images generated with Matplotlib.
KNN
What is Spark?
Data is growing faster than processing
speeds
Only solution is to parallelize data
processing on large clusters
Example: HDInsight
Fast, expressive cluster computing system compatible with Apache
Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
• In-memory computing primitives
• General computation graphs
Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark was initially started by Matei Zaharia at UC Berkeley AMPLab
in 2009, was open sourced in 2010 and donated to Apache in 2013
Up to 100× faster
Often 2-10× less code
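The idea of parallelizing data processing can be sketched without a cluster. The example below is not the Spark API. It is a standard-library stand-in, with hypothetical function names, that mimics the pattern Spark applies at scale: split the data into partitions, process each partition independently, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Map step: count words within one partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, num_partitions=4):
    """Split lines into partitions, count each concurrently, then merge."""
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:  # reduce step: merge the partial results
        total.update(partial)
    return total

word_counts = parallel_word_count(
    ["Spark is fast", "Spark is expressive", "Hadoop is compatible"]
)
```

Threads in one Python process only illustrate the structure. On a real cluster the partitions live on different nodes, and keeping them in memory between steps is where the efficiency claims above come from.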
Spark for Azure HDInsight
[Diagram: a Spark cluster of five Spark nodes over a storage layer, serving three decision-maker clients.]
Programming Spark
• Spark notebooks
• Using the Spark shell to run interactive queries
• Using the Spark shell to run Spark SQL queries
• Using a standalone Scala program
Spark Notebooks
• Zeppelin: for Scala users
• Jupyter: for Python users
2015 System
Human Error Rate 4%
Speech Recognition could reach human parity in the next 3 years
Microsoft won first place in every ImageNet 2015 competition category using deep learning
ImageNet classification top-5 error (%):
• ILSVRC 2010, NEC America: 28.2
• ILSVRC 2011, Xerox: 25.8
• ILSVRC 2012, AlexNet: 16.4
• ILSVRC 2013, Clarifai: 11.7
• ILSVRC 2014, VGG: 7.3
• ILSVRC 2014, GoogLeNet: 6.7
• ILSVRC 2015, MSRA ResNet: 3.5
All five Microsoft entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.
CNTK at the Heart: Computational Networks
• A generalization of machine learning models that can be described as a series of computational steps.
  • E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, log-linear model
• Representation:
  • A list of computational nodes denoted as n = {node name : operation name}
  • The parent-children relationship describing the operands: {n : c1, ···, cKn}
  • Kn is the number of children of node n. For leaf nodes, Kn = 0.
  • Order of the children matters: e.g., XY is different from YX
  • Given the inputs (operands), the value of the node can be computed.
• Can flexibly describe deep learning models.
  • Adopted by many other popular tools as well
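The representation above maps directly onto two dictionaries. The following is a minimal sketch in plain Python, not CNTK code. The operation names, the `evaluate` helper, and the tiny example network are all hypothetical and chosen only to show the node list, the ordered children, and how a node's value follows from its operands.

```python
import math

# Hypothetical operation table; the names are illustrative, not CNTK's.
OPERATIONS = {
    "Plus": lambda a, b: a + b,
    "Minus": lambda a, b: a - b,   # order of children matters here
    "Times": lambda a, b: a * b,
    "Sigmoid": lambda a: 1.0 / (1.0 + math.exp(-a)),
}

def evaluate(node, operations, children, inputs, cache=None):
    """Compute a node's value from its children's values, recursively."""
    cache = {} if cache is None else cache
    if node not in cache:
        if not children.get(node):            # leaf node: Kn = 0
            cache[node] = inputs[node]
        else:
            operands = [evaluate(c, operations, children, inputs, cache)
                        for c in children[node]]
            cache[node] = OPERATIONS[operations[node]](*operands)
    return cache[node]

# n = {node name : operation name}
operations = {"z": "Times", "d": "Minus", "y": "Sigmoid"}
# {n : c1, ..., cKn}, with children listed in operand order
children = {"z": ["w", "x"], "d": ["z", "b"], "y": ["d"]}
inputs = {"w": 2.0, "x": 3.0, "b": 1.0}

output = evaluate("y", operations, children, inputs)       # sigmoid(2*3 - 1)

# Swapping the children of "d" changes the result: b - z instead of z - b.
swapped = dict(children, d=["b", "z"])
d_original = evaluate("d", operations, children, inputs)
d_swapped = evaluate("d", operations, swapped, inputs)
```

Training adds a backward pass over the same graph, which is omitted here.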
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
[Chart: speed comparison (samples/second, higher = better) of CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs). Note: December 2015.]
• Theano only supports 1 GPU
• Achieved with the 1-bit gradient quantization algorithm
* TensorFlow added distributed compute support in April 2016
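The slide names a 1-bit gradient quantization algorithm without describing it. The sketch below shows one common formulation, assumed here rather than taken from the deck or from CNTK's source: each gradient value is sent as a single sign bit, values are reconstructed from two per-tensor means, and the quantization error is carried into the next minibatch so that it is not lost. The function name `quantize_1bit` is hypothetical.

```python
def quantize_1bit(gradient, residual):
    """Quantize a gradient to one bit per value, carrying the error forward.

    Returns (bits, pos_value, neg_value, new_residual). Each value is
    reconstructed as pos_value if its bit is set, else neg_value.
    """
    # Add the error left over from the previous minibatch.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [c >= 0.0 for c in corrected]
    positives = [c for c, b in zip(corrected, bits) if b]
    negatives = [c for c, b in zip(corrected, bits) if not b]
    # Reconstruction values: the mean of each sign group.
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    reconstructed = [pos_value if b else neg_value for b in bits]
    # Whatever the bits could not express is kept for the next step.
    new_residual = [c - q for c, q in zip(corrected, reconstructed)]
    return bits, pos_value, neg_value, new_residual

gradient = [0.5, -0.25, 1.5, -0.75]
bits, pos_value, neg_value, residual = quantize_1bit(gradient, residual=[0.0] * 4)
```

Only the bits and the two reconstruction values need to cross the network, which is why this kind of scheme reduces communication between GPUs.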
• Microsoft researcher Slawek Smyl won CIF 2016 using an LSTM neural network
• Powered by CNTK
CIF Competition 2016 – Final Results
• Contestant 1 – Slawek Smyl (LSTM-based
NN on deseasonalized data)
• Contestant 2 – Slawek Smyl (weighted
average of my 3 methods)
• Contestant 3 – prof. Sven Crone (Multilayer
Perceptron with a thorough feature search)
• Contestant 4 - Mikhail Artyukhov (previous
competition winner, ensemble models)
• Contestant 5 - Joerg Wichard, Bayer
Healthcare AG (Adaptive Forecasting
Strategy with Hybrid Ensemble Models)
• Contestant 6 – Slawek Smyl (LSTM-based
NN)
CNTK Demo
CNTK Architecture
[Diagram] Components: CN Builder, CN Description, Lambda, CN, IDataReader (task-specific reader; features & labels), ILearner (SGD, AdaGrad, etc.), IExecutionEngine (CPU/GPU). Edge labels: Use, Build, Load, Get data, Evaluate, Compute Gradient.
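The learner component is listed with SGD and AdaGrad as examples. As a rough illustration of how the two update rules differ, here is a minimal sketch in plain Python. It is not CNTK's ILearner interface; the function names, learning rate, and toy objective are assumptions made for the example.

```python
import math

def sgd_step(weights, grads, learning_rate=0.1):
    """Plain SGD: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]

def adagrad_step(weights, grads, accum, learning_rate=0.1, eps=1e-8):
    """AdaGrad: scale each step by the root of the accumulated squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_weights = [
        w - learning_rate * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_accum)
    ]
    return new_weights, new_accum

# Minimize f(w) = w^2 (gradient 2w) from w = 1.0 with each learner.
w_sgd = [1.0]
w_ada, accum = [1.0], [0.0]
for _ in range(50):
    w_sgd = sgd_step(w_sgd, [2.0 * w_sgd[0]])
    w_ada, accum = adagrad_step(w_ada, [2.0 * w_ada[0]], accum)
```

AdaGrad keeps per-weight state (`accum`), so weights with a history of large gradients take smaller steps over time.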
(1) Kai Chen and Qiang Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering”, in International Conference on Acoustics, Speech and Signal Processing, March 2016, Shanghai, China.
• CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
• CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy with the new reader design
• The network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy
• Compared to other tools, CNTK strikes a good balance between efficiency, performance, and flexibility
microsoft.com/cognitive
Criterion | Mahout | Spark ML | Azure ML | R Server
Shared Service | No | No | Yes | No
Deployment Model | PaaS | PaaS | PaaS | IaaS
Extensibility | High | High | Medium | High
Deployment Complexity | Medium | High | Low | Medium
Cost | High | High | Low | High
Programming Languages | Java/Scala | Scala/Java/Python | Python/R | R
Algorithms | Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN)
Scalability | High | High | Medium | Medium
xgboost Vowpal Wabbit
Rattle
CNTK
*Copy
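On-demand in the cloud
All kinds of data
Rapid service deployment
Data sharing and collaboration
Open
Supports the complete data analysis workflow
https://gallery.cortanaintelligence.com/
The only vendor offering a complete solution from data import through action and data presentation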
ConnectR
• High-speed & direct
connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed
format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance
big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R
algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
Available on:
• Windows Servers
• Red Hat and SuSE Linux Servers
• Teradata Database
• Cloudera Hadoop
• Hortonworks Hadoop
• MapR Hadoop
R+CRAN
• Open source R interpreter
• R 3.2.2
• Freely-available huge range of R
algorithms
• Algorithms callable by RevoR
• 100% Compatible with existing R scripts,
functions and packages
RevoR
• Performance enhanced R
interpreter
• Based on open source R
• Adds high-performance
math library to speed up
linear algebra functions
R Open, Microsoft R Server
DeployR, DevelopR
Data Step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM). Exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User-defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• rxDataStep
• rxExec
• PEMA-R API Custom Algorithms
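Many of the statistics listed above can be computed over data that does not fit in memory by processing it in chunks and merging partial results. The sketch below illustrates that pattern in plain Python for the mean and standard deviation. It is not the RevoScaleR or PEMA-R API; the function names and the process/combine/finalize split are assumptions made for the example.

```python
import math

def chunk_stats(chunk):
    """Process one chunk: return (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Merge two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(stats):
    """Turn the merged totals into mean and sample standard deviation."""
    n, total, total_sq = stats
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)

def read_chunks(values, chunk_size):
    """Stand-in for reading a large file block by block."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

merged = (0, 0.0, 0.0)
for chunk in read_chunks([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], chunk_size=3):
    merged = combine(merged, chunk_stats(chunk))
mean, std = finalize(merged)
```

Because `combine` is associative, the chunks could equally be processed on different cores or servers and merged afterwards. The sum-of-squares formula is kept for brevity; it can lose precision on large, tightly clustered data.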
Additional Resources
• CNTK:
  • https://github.com/Microsoft/CNTK
  • Contains all the source code and example setups
  • Reading the source code may help you understand how CNTK works
  • New features are added constantly
• How to contact:
  • CNTK team: ask a question on CNTK GitHub!
  • Alexey:
    • Email: alexey.kamenev@microsoft.com
    • LinkedIn: https://www.linkedin.com/in/alexeykamenev