Guiding through a typical Machine Learning Pipeline
ML Pipeline
The Standard Machine Learning Pipeline is derived from the CRISP-DM Model.
[Diagram: pipeline flow, starting from Datasets]
1 Data Retrieval
2 Data Preparation & Feature Engineering
  • Data Processing & Wrangling
  • Feature Extraction & Engineering
  • Feature Scaling & Selection
3 Modeling (ML Algorithm)
4 Model Evaluation & Tuning
  • Satisfactory Performance? (No: iterate; Yes: proceed)
5 Deployment & Monitoring
Source: Practical Machine Learning with Python
Data Retrieval
[Figure: Raw Data Set]
Data Retrieval is mainly data collection,
extraction and acquisition from various data
sources and data stores.
Data Sources or Formats, e.g.:
• CSV
• JSON
• XML
• SQL
• SQLite
• Web Scraping (DOM, HTML)
Data Descriptions:
• Numeric
• Text
• Categorical (Nominal, Ordinal)
“More data beats clever algorithms, but better data beats more data.”
Peter Norvig
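As an illustration of the retrieval step, the following minimal sketch collects records from three of the formats listed above (CSV, JSON, SQLite) using only the Python standard library. The sample data, the table name `weather` and the helper names are hypothetical and chosen only to keep the example self-contained; a real project would read from files or a production database instead.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample records, standing in for real files or databases.
CSV_TEXT = "id,city,temp_c\n1,Cologne,18.5\n2,Berlin,16.0\n"
JSON_TEXT = '[{"id": 3, "city": "Munich", "temp_c": 14.2}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def rows_from_sqlite(conn, table):
    """Read every row of a table from an open SQLite connection."""
    conn.row_factory = sqlite3.Row
    # Table names cannot be bound as parameters; 'table' is trusted here.
    cursor = conn.execute(f"SELECT * FROM {table}")
    return [dict(row) for row in cursor]


# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (id INTEGER, city TEXT, temp_c REAL)")
conn.execute("INSERT INTO weather VALUES (4, 'Hamburg', 12.9)")

raw_rows = (
    rows_from_csv(CSV_TEXT)
    + rows_from_json(JSON_TEXT)
    + rows_from_sqlite(conn, "weather")
)
conn.close()
print(len(raw_rows), "rows retrieved")
```

Note that the CSV reader returns every value as a string, while JSON and SQLite preserve numeric types. This mismatch is one reason the typecasting step of data preparation exists.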
Data Preparation & Feature Engineering
[Figure: dataset features and data outcome labels; feature set with categorical variables]
Purpose
• In this step the data is pre-processed by cleaning, wrangling (munging) and manipulation as needed.
• Initial exploratory data analysis is also carried out.
Methods
• Data Wrangling
  • Data Understanding
  • Filtering
  • Typecasting
  • Data Transformation
  • Imputing Missing Values
  • Handling Duplicates
  • Handling Categorical Data
  • Normalizing Values
  • String Manipulations
• Data Summarization
• Data Visualization
• Feature Engineering, Scaling, Selection
• Dimensionality Reduction
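Several of the wrangling methods listed above can be sketched in a few lines of plain Python. The example below walks through handling duplicates, typecasting, imputing missing values, normalizing values and handling categorical data on a tiny, hypothetical record set. Mean imputation, min-max scaling and one-hot encoding are assumed here as one possible choice for each step, not as the only one.

```python
# Hypothetical raw records: strings everywhere, one missing value, one duplicate.
raw = [
    {"city": "Cologne", "temp_c": "18.0"},
    {"city": "Berlin", "temp_c": ""},
    {"city": "Cologne", "temp_c": "18.0"},
    {"city": "Munich", "temp_c": "14.0"},
]

# Handling duplicates: keep the first occurrence of each record.
seen, rows = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(rec))

# Typecasting: turn numeric strings into floats, empty strings into None.
for rec in rows:
    rec["temp_c"] = float(rec["temp_c"]) if rec["temp_c"] else None

# Imputing missing values: fill gaps with the mean of the observed values.
observed = [rec["temp_c"] for rec in rows if rec["temp_c"] is not None]
mean_temp = sum(observed) / len(observed)
for rec in rows:
    if rec["temp_c"] is None:
        rec["temp_c"] = mean_temp

# Normalizing values: min-max scaling to the range [0, 1].
lo, hi = min(observed), max(observed)
for rec in rows:
    rec["temp_scaled"] = (rec["temp_c"] - lo) / (hi - lo)

# Handling categorical data: one-hot encode the nominal 'city' column.
cities = sorted({rec["city"] for rec in rows})
features = [
    [rec["temp_scaled"]] + [1.0 if rec["city"] == c else 0.0 for c in cities]
    for rec in rows
]
print(features)
```

In a real pipeline the imputation mean and the scaling bounds would be computed on the training split only and then reused for validation and test data, so that no information leaks from held-out data into the features.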
Modeling
In the process of modeling, data features are usually fed to an ML method or algorithm to train the model, typically by optimizing a specific cost function, with the objective of reducing errors and generalizing the representations learned from the data.
Model Types
• Linear models
• Logistic Regression
• Naïve Bayes
• Support Vector Machines
• Non-parametric models
• K-Nearest Neighbors
• Tree-based models
• Decision tree
• Ensemble methods
• Random forests
• Gradient Boosted Machines
• Neural Networks
• Densely connected Neural Networks (DNN)
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)
Regression models
• Simple linear regression
• Multiple linear regression
• Non-linear regression
Clustering models
• Partition-based clustering
• Hierarchical clustering
• Density-based clustering
Classification models
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
Modeling Procedure
• Activation Function
• Initializing Parameters
• Cost function, Metric definition
• Train for a number of epochs
• Evaluate model with test data
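The modeling procedure above can be made concrete with a small logistic regression trained by batch gradient descent. This is a minimal sketch in plain Python, not the implementation used in the source book: the toy dataset, the learning rate `lr`, the epoch count and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, X, y):
    """Cost function: mean binary cross-entropy over the dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, epochs=200, lr=0.5):
    """Batch gradient descent for logistic regression."""
    weights, bias = [0.0] * len(X[0]), 0.0  # initializing parameters
    for _ in range(epochs):                 # train for a number of epochs
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            grad_w = [g + err * v for g, v in zip(grad_w, xi)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights, bias, xi):
    """Classify as 1 when the predicted probability is at least 0.5."""
    z = sum(w * v for w, v in zip(weights, xi)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical toy data: the label is 1 when the single feature is positive.
random.seed(0)
X = [[random.uniform(-1, 1)] for _ in range(200)]
y = [1 if xi[0] > 0 else 0 for xi in X]
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

weights, bias = train(X_train, y_train)
initial_cost = log_loss([0.0], 0.0, X_train, y_train)
final_cost = log_loss(weights, bias, X_train, y_train)

# Evaluate the model with test data that was not used for training.
hits = sum(predict(weights, bias, xi) == yi for xi, yi in zip(X_test, y_test))
test_accuracy = hits / len(X_test)
print(f"cost {initial_cost:.3f} -> {final_cost:.3f}, test accuracy {test_accuracy:.2f}")
```

Each step of the procedure maps to one piece of the code: `sigmoid` is the activation function, the zero-initialized `weights` and `bias` are the parameters, `log_loss` is the cost function, the loop in `train` runs the epochs, and the final lines evaluate on held-out test data.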
Evaluation & Tuning Methods [1]
Purpose
Models have various parameters that are tuned in a process called hyperparameter optimization to get models with the best results.
[Figures: 3-fold cross-validation; ROC curve for binary and multi-class model evaluation]
Methods
Classification models can be evaluated and tested on validation datasets (k-fold cross-validation) based on metrics like:
• Accuracy
• Confusion matrix, ROC
Regression models can be evaluated by:
• Coefficient of Determination, R²
• Mean Squared Error
Clustering models can be validated by:
• Homogeneity
• Completeness
• V-measure (combination of homogeneity and completeness)
• Silhouette Coefficient
• Calinski-Harabasz Index
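To show what these evaluation tools compute, here is a minimal plain-Python sketch of k-fold index splitting together with accuracy, a binary confusion matrix, mean squared error and R². The function names and the small label and target lists are hypothetical; libraries such as scikit-learn ship tested equivalents that would normally be preferred.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred):
    """Binary confusion matrix as [[TN, FP], [FN, TP]]."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and estimates."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# 3-fold split of seven samples: every sample is used for validation once.
folds = list(k_fold_indices(n_samples=7, k=3))

# Hypothetical classification and regression results.
labels, predicted = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
targets, estimates = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]

print([val for _, val in folds])
print(accuracy(labels, predicted), confusion_matrix(labels, predicted))
print(mean_squared_error(targets, estimates), r2_score(targets, estimates))
```

In a full cross-validation run, a model would be trained on each `train_idx` and scored on the matching `val_idx`, and the k scores would then be averaged.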
Evaluation & Tuning Methods [2]
Bias Variance Trade-Off
• Finding the best balance between bias and variance errors.
• Bias error is the difference between the expected prediction of the model estimator and the true value. It arises when the model makes wrong or overly simple assumptions about the underlying data and patterns.
• Variance error arises from the model's sensitivity to outliers and random noise in the training data.
Underfitting
• Underfitting is a model setup that results in low variance and high bias.
Overfitting
• Overfitting is a model setup that results in high variance and low bias.
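The trade-off can be observed with a k-nearest-neighbor regressor, where the neighbor count k acts as the complexity knob. The sketch below is an illustration under assumptions: the noisy sine data, the noise level and the three values of k are hypothetical choices. With k=1 the model memorizes the training set (low bias, high variance); with k equal to the training set size it predicts the global mean everywhere (high bias, low variance).

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN regressor on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical data: a sine curve plus Gaussian noise.
random.seed(42)
train_x = [i / 40 for i in range(40)]
test_x = [(i + 0.5) / 40 for i in range(40)]
train_y = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in train_x]
test_y = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in test_x]

results = {}
for k in (1, 5, 40):
    results[k] = (
        mse(train_x, train_y, train_x, train_y, k),  # training error
        mse(test_x, test_y, train_x, train_y, k),    # test error
    )
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The training error for k=1 is exactly zero because every training point is its own nearest neighbor, which is the signature of overfitting. An intermediate k often gives the lowest test error, although that outcome depends on the data and is not guaranteed.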
Grid Search
The simplest hyperparameter optimization method. It tries out every combination in a predefined grid of hyperparameter settings to find the best one.
Randomized Search
A modification of Grid Search that tries randomly sampled hyperparameter settings instead of the full grid to find the best one.
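Both search strategies can be sketched in a few lines. In the example below, `validation_score` is a hypothetical stand-in for "train a model with these settings and score it on validation data", and the hyperparameter names `learning_rate` and `depth` are assumptions chosen for illustration; a real pipeline would plug in cross-validation at that point.

```python
import itertools
import random


def validation_score(params):
    """Hypothetical stand-in for training a model and scoring it.

    The score peaks at learning_rate=0.1 and depth=4; a real pipeline
    would run cross-validation here instead.
    """
    lr_penalty = (params["learning_rate"] - 0.1) ** 2
    depth_penalty = 0.01 * (params["depth"] - 4) ** 2
    return -lr_penalty - depth_penalty


def grid_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    ]
    return max(candidates, key=score_fn), len(candidates)


def randomized_search(grid, score_fn, n_iter, seed=0):
    """Score n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score_fn), len(candidates)


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

best_grid, n_grid = grid_search(grid, validation_score)
best_random, n_random = randomized_search(grid, validation_score, n_iter=4)
print("grid search:", best_grid, "after", n_grid, "evaluations")
print("randomized search:", best_random, "after", n_random, "evaluations")
```

Grid search evaluates all nine combinations and is therefore certain to find the best point on the grid. Randomized search spends a fixed budget of four evaluations here, which is cheaper but may miss the optimum.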
Deployment & Monitoring
Selected models are deployed in production and are constantly monitored based on their predictions and results.
Model Persistence
Model persistence is the simplest way of deploying a model. The final model is persisted on permanent media such as a hard drive. A new program must then route real-life data to the persisted model, which creates the predicted output.
Custom Development
Another option for deploying a model is to implement the model's prediction method separately. The output of training is then just the values of the learned parameters. This method belongs to the software development domain.
In-House Model Deployment
For data protection reasons, many enterprises do not want to expose the data on which models need to be built and deployed. Models can be integrated internally with web development frameworks, APIs or micro-services on top of the prediction models.
Model Deployment as a Service
The model is openly accessible and can be integrated via a cloud-based API request.
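The first two deployment options can be contrasted in a short sketch. The toy model below is a hypothetical dictionary of learned parameters standing in for a fitted estimator. Persistence is shown with Python's `pickle` module and parameter export with JSON; both are assumptions about tooling rather than something the slides prescribe.

```python
import json
import os
import pickle
import tempfile


def predict(model, x):
    """Prediction rule of the toy model: 1 when weight * x + bias > 0."""
    return 1 if model["weight"] * x + model["bias"] > 0 else 0


# Hypothetical trained model, standing in for a fitted estimator object.
model = {"weight": 2.0, "bias": -1.0}

with tempfile.TemporaryDirectory() as workdir:
    # Model persistence: serialize the fitted model to permanent storage.
    model_path = os.path.join(workdir, "model.pkl")
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)

    # A separate serving program would load the file and route new data to it.
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)
    persisted_prediction = predict(restored, 0.8)

    # Custom development: export only the learned parameter values ...
    params_path = os.path.join(workdir, "params.json")
    with open(params_path, "w") as fh:
        json.dump(model, fh)

    # ... and re-implement the prediction rule wherever it is needed.
    with open(params_path) as fh:
        params = json.load(fh)
    custom_prediction = 1 if params["weight"] * 0.8 + params["bias"] > 0 else 0

print(persisted_prediction, custom_prediction)
```

Pickle files can execute arbitrary code when loaded, so a serving program should only unpickle models from trusted sources. The JSON route avoids that risk and is language-neutral, at the cost of re-implementing the prediction logic.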
Contact
Michael Gerke
Detecon International GmbH
Sternengasse 14-16
50676 Cologne (Germany)
Phone: +49 221 91611138
Mobile: +49 160 6907433
Email: Michael.Gerke@detecon.com
Special Thanks to the author team:
• Dipanjan Sarkar
• Raghav Bali
• Tushar Sharma
