Machine Learning
Methods and Analysis
Smita Agrawal
Popular AI Methodologies
What is Machine Learning?
• Machine learning is a set of computer programs that can teach themselves to grow and adapt when exposed to new data
• Machine learning is broadly categorised into Supervised, Unsupervised and Reinforcement Learning techniques
Machine Learning
• Supervised Learning
    - Classification: Churn Prediction, Fraud Detection, Image Classification
    - Regression: Market Mix Modelling, ARPU Forecasting, Life/Age Expectancy, Advertising Popularity Prediction
• Unsupervised Learning
    - Dimensionality Reduction: Meaningful Compression, Feature Selection
    - Clustering: Customer Profiling, Targeted Marketing, Recommender Systems
• Reinforcement Learning
    - Real-time Decisions, Robot Navigation, Self-Learning Models
The crux of Machine Learning lies in “History Repeats Itself!”
How to Become Machine Learning Enabled?
A High-Level View of the Steps
1. Data Curation
Data curation involves gathering data across relevant attributes that can distinguish the characteristics and thereby help us learn more about why something is happening
2. Processing Data
Leverage measures like the mean, median or mode to process the curated data so that any outliers or anomalies can be addressed
3. Resampling
Resample the data so that no characteristic is biased or over-represented and all the characteristics are balanced
4. Variable Selection
Not all the variables in the curated data really distinguish the characteristics. Variable selection is a critical step that retains only the variables that do
5. Predictive Model
A suitable machine learning model can now be applied to the data obtained from step 4 to store the patterns observed
6. Generate Predictions
This is the most exciting step, as it is the stage where we can predict with a certain confidence whether the data points belong to a characteristic
Data is the oil for our machine learning engine. However, oil alone will not ignite the engine!
Data oil with the right levers will ignite the engine and enable us to achieve our imperfectly perfect predictions!
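Steps 2 and 3 can be illustrated with a minimal sketch. It assumes missing values are marked as None and filled with the median, and that balancing is done by random oversampling of the smaller classes. The slides do not prescribe either choice, and the function names and sample data are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def impute_with_median(values):
    """Step 2 (assumed rule): replace missing entries (None) with the median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def oversample_to_balance(rows, label_key, seed=0):
    """Step 3 (assumed rule): randomly repeat rows of smaller classes
    until every label has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical data: one missing spend value and an imbalanced churn label.
spend = impute_with_median([20.0, None, 35.0, 25.0])
rows = [{"churn": "no"}, {"churn": "no"}, {"churn": "no"}, {"churn": "yes"}]
balanced = oversample_to_balance(rows, "churn")
label_counts = Counter(row["churn"] for row in balanced)
```

Oversampling is only one way to balance characteristics; undersampling the larger class is an equally valid reading of the step.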
Data Drift
• The emergence of big data has created tremendous opportunities for businesses to gain real-time insights and make more informed decisions by leveraging data from the exploding number of digital systems
• However, as is often the case with disruptive technologies, the innovations behind big data have created a critical problem, one that we call Data Drift
• Data drift creates serious challenges to fully harnessing the insights available from big data
Data drift is defined as:
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Some Severe Impacts:
• Erodes data fidelity, operational reliability and ultimately the productivity of your data scientists and engineers
• Increases your costs
• Delays time to analysis
• Decreases the productivity and agility of your data engineers
• Leads to poor decision-making by data scientists and the line of business
Source: StreamSets
Fail-Safe Mechanism - Prepping Validation for Automated Predictions
1. Ensure the data types of variables in the validation data match those in the historical data
2. Match all the unique values of categorical variables against the historical data to check for any new values that have come in with the daily data
3. Measure the mean and most frequent value of numerical variables in the historical and validation data
4. Any new values observed in categorical variables in the validation data are handled as part of the fail-safe mechanism
5. No predictions are generated for these records: the predictive model does not know the new value, because there were no instances of it in the history when the model was built, and scoring it would cause a fatal error leading to no predictions
6. Generate a report of all the excluded records, by variable, to further analyse the trend drifts and enhance the model
7. Ongoing Data Quality Check: check whether there is a behavioural shift of more than ±5% in the modal value across variables, to ensure that no data curation stage was broken, which could have a consequent impact on the predictions
8. The variation should be compared to the historical data's modal values
9. This should be followed as a real-time (daily) practice even while automatically updating the model
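The checks above can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: records are dictionaries, all function and column names are hypothetical, and the "±5% shift in the modal value" is interpreted here as a change of more than 5 percentage points in the share of records that take the historical modal value.

```python
from collections import Counter
from statistics import mean


def column(rows, name):
    return [row[name] for row in rows]


def type_mismatches(history, validation, columns):
    """Step 1: columns whose value types differ between the two datasets."""
    return [
        c for c in columns
        if {type(v) for v in column(history, c)}
        != {type(v) for v in column(validation, c)}
    ]


def numeric_summary(rows, name):
    """Step 3: mean and most frequent value of a numerical variable."""
    values = column(rows, name)
    return mean(values), Counter(values).most_common(1)[0][0]


def split_unseen(history, validation, categorical):
    """Steps 2, 4, 5: set aside records with a category value absent from history."""
    known = {c: set(column(history, c)) for c in categorical}
    scorable, excluded = [], []
    for row in validation:
        unseen = [c for c in categorical if row[c] not in known[c]]
        if unseen:
            excluded.append((row, unseen))
        else:
            scorable.append(row)
    return scorable, excluded


def exclusion_report(excluded):
    """Step 6: number of excluded records per variable that caused the exclusion."""
    return Counter(c for _, unseen in excluded for c in unseen)


def modal_share_shift(history, validation, name):
    """Steps 7-8 (assumed reading): change, in percentage points, in the
    share of records that take the historical modal value."""
    mode_value = Counter(column(history, name)).most_common(1)[0][0]
    hist_share = column(history, name).count(mode_value) / len(history)
    val_share = column(validation, name).count(mode_value) / len(validation)
    return 100 * (val_share - hist_share)


# Hypothetical historical and daily validation data.
history = [
    {"plan": "prepaid", "arpu": 10.0},
    {"plan": "prepaid", "arpu": 12.0},
    {"plan": "postpaid", "arpu": 30.0},
    {"plan": "prepaid", "arpu": 11.0},
]
validation = [
    {"plan": "prepaid", "arpu": 10.5},
    {"plan": "hybrid", "arpu": 20.0},   # "hybrid" never appeared in history
    {"plan": "postpaid", "arpu": 31.0},
    {"plan": "postpaid", "arpu": 29.0},
]

mismatched = type_mismatches(history, validation, ["plan", "arpu"])
scorable, excluded = split_unseen(history, validation, ["plan"])
report = exclusion_report(excluded)
shift = modal_share_shift(history, validation, "plan")
needs_review = abs(shift) > 5  # threshold taken from step 7
```

Only the records in `scorable` would be passed to the model. The excluded records and the report feed the drift analysis described in step 6.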
Some Common Data Drift Measures
1. Population Stability Index
2. Kolmogorov-Smirnov Statistic
3. Kullback-Leibler Divergence
4. Histogram Intersection
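The slides name these four measures without defining them. The sketch below uses their standard textbook definitions, computed on one numerical variable from a historical and a current sample. The bin edges, the small epsilon that guards against empty bins, and the sample data are illustrative assumptions.

```python
import math
from bisect import bisect_right


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin, given sorted inner edges.
    eps (an assumed safeguard) keeps empty bins from producing log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum(p * ln(p / q))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def histogram_intersection(p, q):
    """Overlap of two normalised histograms; 1.0 means identical."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical samples of one numerical variable.
historical = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
current = [2, 3, 3, 4, 4, 5, 5, 5, 6, 7]
edges = [2.5, 4.5]  # assumed bin boundaries

p = bin_proportions(historical, edges)
q = bin_proportions(current, edges)

psi_value = psi(p, q)
kl_value = kl_divergence(p, q)
overlap = histogram_intersection(p, q)
ks_value = ks_statistic(historical, current)
```

PSI, KL divergence and histogram intersection all depend on how the variable is binned, whereas the Kolmogorov-Smirnov statistic works on the raw samples. Any alerting threshold is a project-specific choice.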
Smita Agrawal
Thank you!!!
