Data drift and machine learning

Machine Learning
Methods and Analysis
Smita Agrawal
Data Scientist – Infosys

What is Machine Learning?
• Machine learning is a set of computer programs that can teach themselves to grow and adapt when exposed to new data
• Machine learning is broadly categorised into Supervised, Unsupervised and Reinforcement Learning techniques
Machine Learning
• Supervised Learning
  - Classification: Churn Prediction, Fraud Detection, Image Classification
  - Regression: Market Mix Modelling, ARPU Forecasting, Life/Age Expectancy, Advertising Popularity Prediction
• Unsupervised Learning
  - Dimensionality Reduction: Meaningful Compression, Feature Selection
  - Clustering: Customer Profiling, Targeted Marketing, Recommender Systems
• Reinforcement Learning: Real-time Decisions, Robot Navigation, Self-Learning Models
The crux of Machine Learning lies in “History repeats itself!”

How to be Machine Learning enabled?
A high-level view of the steps
1. Data Curation
Data curation involves gathering data across relevant attributes so that they can distinguish between outcomes and help us learn more about why something is happening
2. Processing Data
Leveraging measures like mean, median or mode to process the curated data so that any outliers or anomalies can be addressed
3. Resampling
Resampling the data so that it is free of any biased or overpowering characteristics, i.e. records are selected such that all the characteristics are balanced
4. Variable Selection
Not all the variables in the curated data really distinguish the characteristics. Variable selection is a critical step that retains only the variables that do
5. Predictive Model
A suitable machine learning algorithm can now be applied to the data obtained from step 4 to learn and store the patterns observed
6. Generate Predictions
This is the most exciting step, as it is the stage where we can predict with a certain confidence whether the data points belong to a characteristic
Data is the oil for our Machine Learning engine. However, oil alone will not ignite the engine!
Data oil with the right levers will ignite the engine and enable us to achieve our imperfectly perfect predictions!
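Steps 2 and 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the deck: the function names are hypothetical, outliers are assumed to be values outside the 1.5 x IQR fences, and "resampling" is assumed to mean random oversampling of the smaller classes.

```python
import random
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Step 2 (assumed rule): replace values outside the IQR fences with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]


def oversample_minority(rows, label_key, seed=0):
    """Step 3 (assumed rule): duplicate random rows until all classes are equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows (possibly zero) with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical monthly spend values with one anomaly, and an imbalanced churn label.
spend = replace_outliers_with_median([10, 12, 11, 13, 12, 500])
rows = [{"churn": "no"}] * 4 + [{"churn": "yes"}]
balanced_rows = oversample_minority(rows, "churn")
```

Mean or mode could be substituted for the median, and undersampling the majority class is an equally valid reading of step 3; the choice depends on how much data is available.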

Data Drift
• The emergence of big data has created tremendous opportunities for businesses to gain real-time insights and make more informed decisions by leveraging data from the exploding number of digital systems
• However, as is often the case with disruptive technologies, the innovations behind big data have created a critical problem, one that we call Data Drift
• Data drift creates serious challenges to fully harnessing the insights available from big data
Data drift is defined as:
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Some Severe Impacts -
• Erodes data fidelity, operational reliability and, ultimately, the productivity of your data scientists and engineers
• Increases your costs
• Delays time to analysis
• Decreases the productivity and agility of your data engineers
• Leads to poor decision-making by data scientists and the line of business
Source: StreamSets

Fail Safe Mechanism - Prepping Validation for Automated Predictions
1. Ensure that variable data types in the validation data match those in the historical data
2. Match all the unique values of categorical variables against the historical data to check for any new values that have come in with the daily data
3. Measure the mean and the most frequent value of numerical variables in both the historical and the validation data
4. Any new values observed in categorical variables in the validation data are handled as part of the fail-safe mechanism
5. No predictions are generated for these records: the predictive model does not know the new value, because there were no instances of it in the history when the model was built, and scoring it would cause a fatal error leading to no predictions
6. Generate a report of all the excluded records by variable to further analyse the trend drifts and enhance the model
7. Ongoing Data Quality Check: check whether there is a behavioural shift of more than ±5% in the modal value across variables, to ensure that no part of the data curation stage was broken, which could have a consequent impact on the predictions
8. The variation should be compared to the historical data's modal values
9. This should be followed as a real-time (daily) practice, even while the model is being updated automatically
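The checks above could be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: records are assumed to be plain dictionaries, all function and column names are hypothetical, and the ±5% rule in step 7 is interpreted (as an assumption) as the relative change in the share of records that take the historical modal value.

```python
from collections import Counter


def type_mismatches(history, validation, cols):
    """Step 1: columns where validation holds a type never seen in history."""
    mismatched = []
    for col in cols:
        hist_types = {type(r[col]) for r in history}
        val_types = {type(r[col]) for r in validation}
        if not val_types <= hist_types:
            mismatched.append(col)
    return mismatched


def split_scorable(history, validation, categorical_cols):
    """Steps 2 and 4-6: hold back records with unseen categories and count them by variable."""
    known = {c: {r[c] for r in history} for c in categorical_cols}
    scorable, excluded_by_variable = [], Counter()
    for row in validation:
        bad_cols = [c for c in categorical_cols if row[c] not in known[c]]
        if bad_cols:
            excluded_by_variable.update(bad_cols)  # report input, no prediction
        else:
            scorable.append(row)
    return scorable, dict(excluded_by_variable)


def modal_share_shift(history, validation, col):
    """Steps 7-8 (assumed reading): relative change in the share of the historical mode."""
    mode_value, hist_count = Counter(r[col] for r in history).most_common(1)[0]
    hist_share = hist_count / len(history)
    val_share = sum(r[col] == mode_value for r in validation) / len(validation)
    return (val_share - hist_share) / hist_share


# Hypothetical historical and daily validation records.
history = [
    {"plan": "prepaid", "spend": 10}, {"plan": "prepaid", "spend": 20},
    {"plan": "postpaid", "spend": 20}, {"plan": "prepaid", "spend": 30},
]
validation = [
    {"plan": "prepaid", "spend": 20}, {"plan": "hybrid", "spend": 25},
    {"plan": "postpaid", "spend": "n/a"}, {"plan": "prepaid", "spend": 15},
]

bad_types = type_mismatches(history, validation, ["plan", "spend"])
scorable, report = split_scorable(history, validation, ["plan"])
shift = modal_share_shift(history, validation, "plan")
needs_review = abs(shift) > 0.05  # the ±5% threshold from step 7
```

Only the `scorable` records would be passed to the model; the `report` dictionary is the per-variable exclusion count described in step 6.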

Some Common Data Drift Measures
1. Population Stability Index
2. Kolmogorov-Smirnov Statistic
3. Kullback-Leibler Divergence
4. Histogram Intersection
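For reference, the four measures can be written with the standard library alone. The sketch below uses the textbook definitions; the bin edges, the `eps` floor used to avoid taking the log of zero, and all function names are assumptions made for illustration.

```python
import bisect
import math


def bin_shares(values, edges):
    """Share of values falling into each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins (assumption)
        total += (a - e) * math.log(a / e)
    return total


def kl_divergence(p, q, eps=1e-6):
    """Kullback-Leibler divergence KL(p || q): sum of p * ln(p / q) over bins."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def histogram_intersection(p, q):
    """Overlap of two normalised histograms: 1.0 means identical, 0.0 disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect.bisect_right(xs, v) / len(xs) - bisect.bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


# Hypothetical historical sample and a shifted recent sample.
hist = [1, 2, 3, 4, 5, 6, 7, 8]
recent = [5, 6, 7, 8, 9, 10, 11, 12]
edges = [2.5, 4.5, 6.5]  # assumed bin boundaries
p, q = bin_shares(hist, edges), bin_shares(recent, edges)

drift = {
    "psi": psi(p, q),
    "ks": ks_statistic(hist, recent),
    "kl": kl_divergence(p, q),
    "intersection": histogram_intersection(p, q),
}
```

PSI, KL divergence and histogram intersection operate on binned shares, so their values depend on the chosen bins; the KS statistic works on the raw samples and needs no binning.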

Smita Agrawal
Data Scientist – Infosys
Thank you!!!
