DECEMBER 15
GLOBAL AI BOOTCAMP IS POWERED BY:
The Data Science Process in ML
How to Apply It and When do We Need It?
About me
• Software Architect @
o 16+ years professional experience
• Microsoft Azure MVP
• External Expert Horizon 2020
• External Expert Eurostars-Eureka, InnoFund Denmark
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
AGENDA
Major Tools
The Purpose of ML
AI as a Service
Iterative ML Process
Takeaways
Demo
Machine Learning and Microsoft
• Azure ML: integrated, end-to-end data science and advanced analytics
• Microsoft ML related services/tools
• Highlights
o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker)
o Execute experiments in isolated environments and GPU-enabled VMs
DEPRECATED
• (Azure ML Workbench)
• (Azure ML Experimentation Service)
• (Azure ML Model Management Service)
Now called “Machine Learning Service” (preview)
MAINTAINED AND IMPROVED
• Azure ML Studio
• Data Science VM
• Azure Databricks
• Cognitive Toolkit (CNTK)
• Azure Batch AI Training
• Visual Studio Code Tools for AI
• Microsoft Cognitive Services, LUIS.ai
• Libraries for Apache Spark (MMLSpark)
• ML Services for SQL Server (R, Python)
Azure ML Workbench
Desktop application (Windows, macOS) with
• Built-in Jupyter Notebook services and Git integration
• End-to-end process support
o Model development and experimentation (Python)
o Powerful inspectors for data analysis
o Data transformations by example
o Model history and deployment
• Easy to use, but resource hungry
* Replaced in the September 24, 2018 release to make way for an improved architecture
(see Azure ML SDK for Python, or Azure Databricks for big datasets)
Azure ML Studio
• Visual workspace to build, test and deploy ML solutions
• Highlights
o Cross-browser drag and drop, no programming
o Rich set of modules
o Fits beginners and advanced users
o Unlimited extensibility (R Script, Python Script)
o Enterprise grade cloud service (SLA 99.95%)
o ML REST web services consumption
o Jupyter Notebook
o Azure AI Gallery (9000+ samples)
• At what price?
o Free plan available (10GB storage, 2 web services, 1000 requests/month)
o $10 per seat/month + $1 per experiment hour
Azure Data Science VM
• Pre-configured cloud environment for AI & Data Science
• Highlights
o Fully operational environment
o 50+ tools DEV, ML, BigData, Data management
o Windows and Linux (Ubuntu/CentOS)
o Updated every few months
o On-demand elastic capacity
o GPU optimized VMs for deep learning
o Up to 4 GPUs (NVIDIA K80 or V100)
o Up to 128 vCPUs, up to 6,144 GiB RAM
• At what price?
o From $11.76/month to $14,314/month
Azure ML Service (preview)
• Cloud-based environment to develop, train, test, deploy, manage, and track ML models
• Highlights
o Model management
o Distributed deep learning
o Version control and reproducibility
o Hybrid deployment (Local, Cloud, Edge)
o Automated ML (data prep, algorithm, parameters)
o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker)
o Scale up or out with large GPU-enabled clusters in the cloud
• At what price?
o From $23.51/month to $29,143.94/month
The purpose of ML modelling is:
• Generate predictions
• Understand true relations
Machine Learning Challenges
• Asking the right questions
• Typically 1 Model = 1 Question
• Requires training data
o Real-world data is messy (wrong or missing data)
o Feature engineering transforms to predictive features
o Feature extraction (e.g. IP address -> population density)
o Feature selection for informative features
• Overfitting model
o “Kicks ass” while training, fails badly on real predictions
• Model validation
o “Sense” how well the model works on new data
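The overfitting and validation points above can be illustrated with a deliberately extreme sketch: a "model" that simply memorizes its training rows. The class name, the toy passenger data, and the majority-label fallback are all assumptions made for illustration, not something from the talk.

```python
from collections import Counter


class MemorizingClassifier:
    """Deliberately overfitted model: a lookup table of the training rows."""

    def fit(self, rows, labels):
        self.table = dict(zip(rows, labels))
        # Fallback for inputs never seen in training: the majority label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.table.get(row, self.default) for row in rows]


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical toy data: (age_band, ticket_class) -> survived flag.
train_rows = [("child", 1), ("adult", 3), ("adult", 1), ("senior", 2)]
train_labels = [1, 0, 0, 0]
new_rows = [("child", 3), ("adult", 2), ("senior", 1), ("child", 2)]
new_labels = [1, 1, 1, 0]

model = MemorizingClassifier().fit(train_rows, train_labels)
train_accuracy = accuracy(model.predict(train_rows), train_labels)
validation_accuracy = accuracy(model.predict(new_rows), new_labels)
print(train_accuracy, validation_accuracy)  # 1.0 0.25
```

Perfect accuracy on the training rows says nothing about new data; only the held-out rows reveal how the model generalizes.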
Users’ expectations:
• Engaging experience
• Effortless interaction
• High performance
• Relevant content
Businesses aim:
• Provide high value
• Faster and at low cost
o Data science talent
o Powerful infrastructure
o Continuous improvement
The developer role is to
bridge the gap:
Artificial Intelligence as a Service (AIaaS)
Def: Artificial intelligence off the shelf
• Bots and NLP – commands and guidance
• Cognitive APIs – speech, vision, translation, knowledge
• ML frameworks – build own model w/o infrastructure (e.g. Azure ML Service)
• Fully managed ML – templates, deployment, drag-drop (e.g. Azure ML Studio)
Benefits
• Innovation w/o upfront costs and expertise
• Usability – easy learning curve
• Scalability – start with a PoC, grow big
• Flexi cost – know what you pay for
Concerns
• Share data with vendors
• Data regulations (e.g. GDPR)
• Reduced transparency
• Breaking changes
AIaaS market expected to grow from $1.5 billion (2018) to $10.9 billion (2023)
(Research and Markets, April 2018)
A year ago this was not as achievable as it is now.
Some Key Azure AIaaS
Computer Vision
• Advanced algorithms for processing images for information
Face API
• Detect and analyze facial attributes
Custom Vision API
• Build, deploy, improve custom image classifiers (on tags)
LUIS.ai
• Apply custom ML intelligence to conversational natural language
Custom Decision (experimental)
• Learn behavioural patterns of users
The Data Scientist Job
• Appealing
o 64% believe they are working in this century’s “sexiest” job
• In demand
o 90% are contacted at least once a month with a job offer
o 50% weekly, 30% several times a week; 35% have <2 years of experience
• The dark side…
o All models are wrong, some are useful
o 80% of the time is data preparation
o Real-life, not academic problems
o Non-linear hypothesis testing
o No full automation
• No one cares how you do it
Automated ML (AML)
AML is a recommender system for ML pipelines that reaches good accuracy in less time
• Problem: Complexity scales faster than the time available
• Highlights
o Designed to not look at customer data
o Only the result of each pipeline run is sent to the automated ML service
o Data pre-processing, algorithm experimentation, hyperparameter tuning
• How it works
o Select algorithm type: classification (11), regression (9), forecasting (9)
o Specify labeled data source and format (NumPy array, pandas DataFrame)
o Configure target for training (local, remote VM, AML Compute)
o Set AML configuration
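from azureml.train.automl import AutoMLConfig  # Azure ML SDK for Python

# F_Train and F_Label are the feature matrix and label vector prepared earlier
automl_classifier = AutoMLConfig(
    task='classification',
    primary_metric='AUC_weighted',
    max_time_sec=12000,
    iterations=50,
    X=F_Train,
    y=F_Label,
    n_cross_validations=2)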
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
Iterative ML Process
Data Understanding (Titanic Dataset)
• Mosaic plot
o Categorical distribution
o Visualizes the relation between X and Y
o Strong relation = Y-splits are far apart
o Conclusion: Women have higher survival rate
• Box plot
o Continuous distribution of a numeric variable
o IQR = middle 50%
o Outliers fall outside [Q1 - 1.5·IQR; Q3 + 1.5·IQR]
o Conclusion: Passengers with high fares have a higher survival rate
• Scatter plot
o How much one variable determines another
o Conclusion: Infants and men aged 25-45 have a higher survival rate
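The box-plot outlier rule can be sketched in a few lines of standard-library Python. The fare values below are made up for illustration (they are not the real Titanic data), and the "inclusive" quartile method is one of several conventions; plotting tools may compute quartiles slightly differently.

```python
import statistics


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # spread of the middle 50% of the data
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return outliers, (low, high)


# Hypothetical ticket fares, with one extreme value.
fares = [7.25, 7.9, 8.05, 13.0, 15.5, 26.0, 31.0, 35.5, 512.3]
outliers, fences = iqr_outliers(fares)
print(outliers)  # [512.3]
```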
Data Preprocessing
• Make features usable
o Numerical
o Categorical (e.g. week day)
o PCA dimensionality reduction
o Dummy variables
• Handle missing data
• Normalize data
o Standard range of numerical scale (e.g. from [-1000;1000] -> [0;1] or [-1;1])
o The value range influences the importance of a feature compared to the others
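As a minimal sketch of two of the preprocessing steps, the snippet below rescales a numeric column to a standard range and turns a categorical column into dummy variables. The function names and sample values are assumptions for illustration; in practice a library such as scikit-learn or pandas would usually do this.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def one_hot(values):
    """Encode a categorical column as dummy (0/1) variables."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


raw = [-1000, -500, 0, 250, 1000]
scaled_01 = min_max_scale(raw)              # [0.0, 0.25, 0.5, 0.625, 1.0]
scaled_pm1 = min_max_scale(raw, -1.0, 1.0)  # [-1.0, -0.5, 0.0, 0.25, 1.0]

categories, dummies = one_hot(["Mon", "Tue", "Mon"])
print(scaled_01, scaled_pm1, categories, dummies)
```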
Feature Engineering, Feature Extraction
Increase predictive power by creating features on raw data
• Features closely related to target (predict default –> debt / balance ratio)
• Easier interpretation (Date to Year/Month/Day/Hour)
• Lag features to “look back” before the date (1, 2, … N days ago)
• Categorical features - identify discrete features
• Rolling aggregates - smoothing over a time window
• Check the Azure Team Data Science Process
https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
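Lag features and rolling aggregates can be sketched as follows. The helper names and the daily sales series are hypothetical; positions without enough history are filled with None rather than guessed.

```python
def lag_feature(series, lag):
    """Value observed `lag` steps earlier; None where no history exists."""
    return ([None] * lag + list(series))[:len(series)]


def rolling_mean(series, window):
    """Mean over the current and previous (window - 1) observations."""
    result = []
    for i in range(len(series)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = series[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


daily_sales = [10, 12, 9, 15, 14, 20]  # hypothetical daily values
lag_1 = lag_feature(daily_sales, 1)    # [None, 10, 12, 9, 15, 14]
roll_3 = rolling_mean(daily_sales, 3)
print(lag_1)
print(roll_3)
```

Each row of the resulting feature table can then carry "yesterday's value" and "the 3-day average" next to the raw observation.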
Digital Media Feature Engineering
Note: All information is encoded in the digital media
• Images
o Step 1: Colour statistics, EXIF metadata, edges, shapes
o Step 2: Extract knowledge in a fixed set of numeric characteristics
• Text
o Step 1:
• Bag of words, N-grams, term frequency, topic modelling, stemming
• Named entity recognition (e.g. Wikipedia)
o Step 2: Extract knowledge in a fixed set of numeric characteristics
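For text, "a fixed set of numeric characteristics" can be as simple as term counts over a shared vocabulary. The sketch below assumes a naive lowercase tokenizer and raw counts; the function names and sample sentences are illustrative only, and real pipelines typically add stemming, stop-word removal, and TF-IDF weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(documents, n=1):
    """Term-frequency vectors over a vocabulary shared by all documents."""
    counts = [Counter(ngrams(tokenize(doc), n)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["The ship sank", "The ship sailed and the crew sailed"]
vocab, matrix = vectorize(docs)
print(vocab)   # ['and', 'crew', 'sailed', 'sank', 'ship', 'the']
print(matrix)  # [[0, 0, 0, 1, 1, 1], [1, 1, 2, 0, 1, 2]]

bigram_vocab, bigram_matrix = vectorize(docs, n=2)
```

Every document becomes a vector of the same length, which is what most ML algorithms expect as input.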
Feature Selection - select the
most predictive features
For many ML problems, having
a lot of data is a good thing;
but it can sometimes be a curse
Selecting Good Features
• Motivation
o Not only prediction but identification of predictive features
o Computational costs are related to number of features
o Limit external sensors and data sources
• Approach
o Trying all combinations of features? (That would be infeasible)
• Methods
o Forward selection & backward elimination
o Filter – independent of the ML algorithm
o Embedded – built-in search for predictive features in the ML algorithm
o Wrapper – measure feature usefulness during ML training
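A filter method can be sketched without any ML algorithm at all: score each feature on its own and keep the best ones. The example below uses absolute Pearson correlation with the target as the score, which is only one possible choice; the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Filter method: rank features by |correlation| with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature columns and a binary target.
features = {
    "ticket_parity": [0, 1, 0, 1, 0, 1],
    "constant": [1, 1, 1, 1, 1, 1],
    "fare": [7, 8, 13, 26, 52, 80],
}
target = [0, 0, 0, 1, 1, 1]

ranking = rank_features(features, target)
print([name for name, _score in ranking])  # ['fare', 'ticket_parity', 'constant']
```

Scoring features one at a time is cheap, whereas trying every subset of n features means 2^n - 1 candidate models; for 30 features that is already 1,073,741,823 combinations.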
Tuning Model Parameters
• Model parameters control inner behaviour
o More sophisticated algorithm, more parameters
o e.g. Locally Deep SVM with kernel
o Kernel type, kernel coefficient
• How does parameter tuning work?
1. Choose a metric for evaluation (AUC for classification, R² for regression, etc.)
2. Select parameters for optimization
3. Define a grid as the Cartesian product of the parameter value arrays
4. For each combination, cross-validate on the training set
5. Select the parameters with the best evaluation
Note: Expected improvement is 3%-8%
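The five tuning steps can be sketched end to end with a toy model. Everything here is an assumption for illustration: a one-dimensional k-nearest-neighbour classifier stands in for a real algorithm, accuracy is the chosen metric, and the data points are made up.

```python
from itertools import product


def knn_predict(train, x, k, weighted):
    """Classify x by a vote among its k nearest 1-D training points."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for value, label in neighbours:
        weight = 1.0 / (abs(value - x) + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


def cross_val_accuracy(data, k_folds, **params):
    """Step 4: mean accuracy over k_folds held-out slices of the data."""
    scores = []
    for fold in range(k_folds):
        held_out = data[fold::k_folds]
        training = [p for i, p in enumerate(data) if i % k_folds != fold]
        hits = sum(knn_predict(training, x, **params) == y for x, y in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


def grid_search(data, grid, k_folds=3):
    """Steps 3-5: evaluate the Cartesian product of values, keep the best."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cross_val_accuracy(data, k_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical labelled points (value, class) with one noisy label at 2.6.
data = [(0.5, 0), (1.0, 0), (1.4, 0), (2.1, 0), (2.6, 1), (3.0, 0),
        (3.3, 1), (3.9, 1), (4.4, 1), (5.0, 1), (5.2, 1), (6.1, 1)]
grid = {"k": [1, 3, 5], "weighted": [False, True]}  # step 2: 3 x 2 = 6 combos

best_params, best_score = grid_search(data, grid)
print(best_params, best_score)
```

The cost grows with the product of the array lengths, so each extra parameter multiplies the number of cross-validation runs.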
Appropriate Algorithms are
Determined by Data
Types of Algorithms
• Linear Algorithms
• Classification - classes separated by straight line
• Support Vector Machine – wide gap from line
• Regression – linear relation variables-label
• Non-Linear Algorithms
• Decision Trees and Jungles - divide space into regions
• Neural Networks – complex and irregular boundaries
• Special Algorithms
• Ordinal Regression – ranked values (e.g. place in a race)
• Poisson Regression - discrete distribution (e.g. number of events)
• Bayesian – normal distribution of errors (bell curve)
False Alarms
False alarms have serious impact
• Degraded confidence in the system
• Loss of revenue
• Loss of brand image
Performance Metrics
• Regression model
o Root Mean Squared Error (RMSE)
o Coefficient of Determination, R² (normally in [0;1])
• Multi-class classification model
o Confusion matrix
• Binary classification model
o Accuracy based on correct answers
o Area under ROC curve (AUC)
o Threshold
o Precision = TP / (TP + FP)
o Recall = TP / (TP + FN)
o F1 score (balances precision and recall)
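The metrics listed above follow directly from their definitions. The sketch below computes them from confusion-matrix counts and from actual/predicted values; the counts and numbers are made up for illustration, and F1 is taken as the harmonic mean of precision and recall.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def rmse(actual, predicted):
    """Root mean squared error of a regression model."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


balanced = classification_metrics(tp=8, fp=2, fn=4, tn=86)
# A model that never predicts the rare class: high accuracy, zero recall.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

error = rmse([3, 5, 7], [2, 5, 9])
fit = r_squared([3, 5, 7], [2, 5, 9])  # 0.375
print(balanced, always_negative, error, fit)
```

The second example scores 99% accuracy while missing every positive case, which is why accuracy alone is a poor guide when classes are very unequal in size.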
Handling Imbalanced Data
• Imbalanced: far more examples of one class than of the others (the minority can be as rare as 0.001%)
• Errors are not the same
o Prediction of minority class (failures) is more important
o Asymmetric cost (false negative can cost more than false positive)
• Compromised performance of standard ML algorithms
o For a 1% minority class, an accuracy of 99% does not mean a useful model
o PR-curve is better for imbalanced data
• Oversampling
o SMOTE – allows better learning
o Generate examples combining features of target with features of neighbours
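The core idea behind SMOTE can be sketched as interpolation between a minority example and one of its nearest minority neighbours. This is a simplified illustration, not the reference algorithm or a library API: the function name, the neighbour count, the fixed seed, and the sample points are all assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority-class neighbours."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority neighbours by squared Euclidean distance.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class examples with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority examples, the new data stays inside the region the minority class already occupies instead of merely duplicating rows.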
Takeaways
• Team Data Science Process
o https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/
• ML in the Microsoft World
o https://docs.microsoft.com/en-us/azure/machine-learning/
• Python for AI
o https://wiki.python.org/moin/PythonForArtificialIntelligence
• Data Science Blog
o https://data-flair.training/blogs/category/machine-learning/
• Starter Books
o Free e-books download link: https://www.manning.com/books/exploring-data-science
Azure ML Studio
Azure ML Workbench
