Machine Learning
Methods and Analysis
Smita Agrawal
Data Scientist – Infosys
Popular AI Methodologies
What is Machine Learning?
• Machine learning is a set of computer programs that can teach themselves to grow and adapt when exposed to new data
• Machine learning is broadly categorised into Supervised, Unsupervised and Reinforcement Learning techniques
Machine Learning
• Supervised Learning
  - Classification: Churn Prediction, Fraud Detection, Image Classification
  - Regression: Market Mix Modelling, ARPU Forecasting, Life/Age Expectancy, Advertising Popularity Prediction
• Unsupervised Learning
  - Dimensionality Reduction: Meaningful Compression, Feature Selection
  - Clustering: Customer Profiling, Targeted Marketing, Recommender Systems
• Reinforcement Learning: Real-time Decisions, Robot Navigations, Self-Learning Models
The crux of Machine Learning lies in “History Repeats itself!”
How to be Machine Learning enabled?
A High-Level View of the Steps
1. Data Curation
Data curation involves gathering data across relevant attributes so that they can distinguish the characteristics of interest and thereby help us learn more about why something is happening
2. Processing Data
Use measures such as the mean, median or mode to process the curated data so that any outliers or anomalies can be addressed
3. Resampling
Resample the data so that no characteristic is biased or over-represented and all the characteristics are balanced
4. Variable Selection
Not all the variables in the curated data really distinguish the characteristics. Variable selection is a critical step that retains only the variables that do
5. Predictive Model
A suitable machine learning model can now be applied to the data obtained from step 4 to learn and store the patterns observed
6. Generate Predictions
This is the most exciting step, as it is the stage where we can predict, with a certain confidence, whether the data points belong to a characteristic
Data is the oil for our Machine Learning engine. However, oil alone will not ignite the engine!
Data oil with the right levers will ignite the engine and enable us to achieve our imperfectly perfect predictions!
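Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the method used in the deck: the function names `impute_and_cap` and `oversample`, the 1.5 × IQR capping rule and the random oversampling strategy are all assumptions chosen for the example.

```python
import random
import statistics
from collections import defaultdict


def impute_and_cap(values, cap_factor=1.5):
    """Step 2 (assumed approach): fill gaps with the median, then cap outliers."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    # Quartiles of the observed values; the fences sit cap_factor * IQR away.
    q1, _, q3 = statistics.quantiles(present, n=4)
    low = q1 - cap_factor * (q3 - q1)
    high = q3 + cap_factor * (q3 - q1)
    filled = [median if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]


def oversample(rows, label_key, seed=0):
    """Step 3 (assumed approach): duplicate minority-class rows at random
    until every class is as frequent as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

For instance, `impute_and_cap(list(range(1, 21)) + [None, 1000])` replaces the missing entry with the median and pulls the value 1000 down to the upper fence, and `oversample(rows, "churn")` returns a list in which each value of the hypothetical `churn` label occurs equally often.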
Data Drift
• The emergence of big data has created tremendous opportunities for businesses to gain real-time insights and make more informed decisions by leveraging data from the exploding number of digital systems
• However, as is often the case with disruptive technologies, the innovations behind big data have created a critical problem, one that we call Data Drift
• Data drift creates serious challenges to fully harnessing the insights available from big data
Data drift is defined as:
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Some Severe Impacts:
• Erodes data fidelity, operational reliability and, ultimately, the productivity of your data scientists and engineers
• Increases your costs
• Delays time to analysis
• Decreases the productivity and agility of your data engineers
• Leads to poor decision-making by data scientists and the line of business
Source: StreamSets
Fail Safe Mechanism - Prepping Validation for Automated Predictions
1. Ensure that variable data types in the validation data match those in the historical data
2. Match all the unique values of categorical variables against the historical data to check for any new values that have come in with the daily data
3. Measure the mean and most frequent value of numerical variables in both the historical and the validation data
4. Any new values observed in categorical variables in the validation data are handled as part of the fail-safe mechanism
5. No predictions are generated for these records: the predictive model does not know the new value, because there were no instances of it in the history when the model was built, and scoring it would cause a fatal error leading to no predictions at all
6. Generate a report of all the excluded records, by variable, to further analyse the drift trends and enhance the model
7. Ongoing Data Quality Check: check whether there is a behavioural shift of more than ±5% in the modal value across variables, to ensure that no part of the data curation stage was broken, which could have a consequent impact on the predictions
8. The variation should be compared to the historical data’s modal values
9. This should be followed as a real-time (daily) practice, even while automatically updating the model
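The checks listed above could be wired together along these lines. This is a sketch under stated assumptions rather than the deck's implementation: records are plain dictionaries, the function name `fail_safe_split` and its parameters are hypothetical, the type check is a strict `isinstance` comparison against the first historical record, and only the modal-value shift (not the mean) is compared against the 5% tolerance.

```python
from collections import Counter


def mode(values):
    """Most frequent value in a list (first one seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


def fail_safe_split(history, incoming, categorical, numerical, tolerance=0.05):
    """Split incoming records into scorable and excluded ones, and flag
    numerical variables whose modal value moved by more than `tolerance`."""
    # Steps 1-2: reference types and known category levels from the history.
    types = {col: type(history[0][col]) for col in categorical + numerical}
    known = {col: {row[col] for row in history} for col in categorical}

    scorable, excluded = [], []
    for row in incoming:
        bad_type = [c for c in types if not isinstance(row[c], types[c])]
        unseen = [c for c in categorical if row[c] not in known[c]]
        # Steps 4-5: set aside records the model cannot score.
        if bad_type or unseen:
            excluded.append(
                {"row": row, "type_mismatch": bad_type, "unseen_category": unseen}
            )
        else:
            scorable.append(row)

    # Step 6: number of excluded records per offending variable.
    report = Counter(
        col
        for item in excluded
        for col in set(item["type_mismatch"]) | set(item["unseen_category"])
    )

    # Steps 7-8: relative shift of the mode against the historical mode.
    shifted = {}
    for col in numerical:
        old = mode([row[col] for row in history])
        new_values = [row[col] for row in scorable]
        if not new_values or old == 0:
            continue  # nothing to compare, or relative change undefined
        change = (mode(new_values) - old) / abs(old)
        if abs(change) > tolerance:
            shifted[col] = change
    return scorable, excluded, report, shifted
```

Only the `scorable` records would be passed to the model; `report` and `shifted` feed the daily review described in steps 6 to 9.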
Some Common Data Drift Measures
01 Population Stability Index
02 Kolmogorov-Smirnov Statistic
03 Kullback-Leibler Divergence
04 Histogram Intersection
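The deck names these four measures without defining them. As a rough guide, they might be computed as below, using their commonly cited textbook definitions and only the Python standard library. The helper `histogram`, the fixed bin edges and the small smoothing constant `eps` are assumptions made for this sketch.

```python
import bisect
import math


def histogram(values, edges):
    """Share of values per bin [edges[i], edges[i+1]); values outside the
    outer edges are counted in the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[sum(v >= e for e in edges[1:-1])] += 1
    return [c / len(values) for c in counts]


def psi(p, q, eps=1e-6):
    """Population Stability Index between baseline shares p and current shares q."""
    return sum((b - a) * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-6):
    """Kullback-Leibler divergence D(p || q) in nats; eps guards empty bins."""
    return sum((a + eps) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def histogram_intersection(p, q):
    """Overlap of two normalised histograms: 1.0 identical, 0.0 disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        cdf_x = bisect.bisect_right(xs, v) / len(xs)
        cdf_y = bisect.bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap
```

The first three histogram-based measures compare binned shares of a variable in the historical and the new data, so both samples should be binned with the same `edges`; the Kolmogorov-Smirnov statistic works on the raw values directly.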
Smita Agrawal
Data Scientist – Infosys
Thank you!!!

Data drift and machine learning
