2nd edition
#MLSEV 2
Anomaly Detectors
Practical Examples with BigML
Guillem Vidal
Machine Learning Engineer, BigML
Outline
1 Anomaly Detection Recap
2 Demo 1: Removing Outliers
3 Demo 2: Fraud Detection
4 Demo 3: Novel Categories Discovery
Anomaly Detection Recap
Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
An unsupervised algorithm that looks for unusual instances in a dataset. Anomaly
detectors assign an anomaly score to each instance: the higher the score, the
more unusual the instance. Example:
• Amount $2,459 is higher than all other transactions
• It is the only transaction in zip 21350
• It is the only transaction for the purchase class “tech”
Graphical Example
Which object appears more unusual within this group?
[Figure: the objects grouped by shared traits (“Round”, “Skinny”, “Corners”). Annotations: “Skinny” but not “smooth”; no “Corners”; not “Round”; most unusual.]
Isolation Forest
Isolation Forest: grow random decision trees, with random features and random
splits, until each instance is in its own leaf.
• Anomalies are “easy” to isolate: they end up at a shallow depth
• Usual instances are “hard” to isolate: they end up at a greater depth
Now repeat the process several times and use the average depth to compute the
anomaly score, from 0 (similar) to 1 (dissimilar).
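As a rough illustration of the procedure described above, the following Python sketch grows random trees and turns the average isolation depth into a score between 0 and 1. It is a minimal, standard-library-only sketch and not BigML's implementation: the helper names (`c_factor`, `build_tree`, `path_length`, `fit_forest`, `anomaly_score`) are hypothetical, the score normalization follows the Isolation Forest paper (Liu, Ting and Zhou, ICDM 2008), and treating zip and amount from the transactions table as plain numeric features is an assumption made only for illustration.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used in the paper to normalize depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(rows, rng):
    """Split rows on a random feature at a random value until isolated."""
    if len(rows) <= 1:
        return {"size": len(rows)}
    # Only features that still vary can separate the remaining rows.
    varying = [f for f in range(len(rows[0]))
               if len({r[f] for r in rows}) > 1]
    if not varying:
        return {"size": len(rows)}  # exact duplicates cannot be isolated
    feature = rng.choice(varying)
    values = [r[feature] for r in rows]
    split = rng.uniform(min(values), max(values))
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, rng), "right": build_tree(right, rng)}


def path_length(node, row, depth=0):
    """Depth at which row lands in a leaf, adjusted for unsplit leaves."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    side = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(node[side], row, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Grow n_trees random trees, each on a random sample (needs >= 2 rows)."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    trees = [build_tree(rng.sample(rows, size), rng) for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, row):
    """Score in (0, 1]: shallow average depth gives a score closer to 1."""
    trees, size = forest
    mean_depth = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(size))


# (zip, amount) for the seven transactions in the table above.
rows = [(46140, 135), (46140, 401), (12222, 234), (26339, 94),
        (21350, 2459), (46140, 83), (26339, 51)]
forest = fit_forest(rows)
scores = [anomaly_score(forest, r) for r in rows]
for row, score in zip(rows, scores):
    print(row, round(score, 3))
```

Under these assumptions, the $2,459 “tech” purchase should tend to receive the highest score, because a single random split on amount is usually enough to isolate it.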
Isolation Forest Splits
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
[Figure legend: Anomaly / Usual data point]
Removing Outliers
https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
Outliers
• Data points that differ significantly from other observations
• Outliers can cause serious problems in statistical analyses
• Examples:
[Figure: two panels plotting Price (100k €) against Square Meters. Regression: price as a function of square meters. Classification: points labeled Sold or Unsold.]
Removing Outliers
• Anomaly detectors can be used to remove outliers
• With this methodology, the effect of outlier removal can be tested
[Workflow diagram]
ORIGINAL DATASET → TRAIN SET and TEST SET
TRAIN SET → ALL MODEL → ALL EVALUATION (on TEST SET)
TRAIN SET → ANOMALY DETECTOR → REJECT MOST ANOMALOUS → CLEAN DATASET → CLEAN MODEL → CLEAN EVALUATION (on TEST SET)
ALL EVALUATION and CLEAN EVALUATION → COMPARE EVALUATIONS
Outliers Demo
Diabetes dataset (BigML Gallery)
• Predict whether patients are diabetic or not
pregnancies  plasma glucose  blood pressure  triceps skin thickness  insulin  bmi   diabetes pedigree  age  diabetes
6            148             72              35                      0        33.6  627                50   TRUE
1            85              66              29                      0        26.6  351                31   FALSE
8            183             64              0                       0        23.3  672                32   TRUE
1            89              66              23                      94       28.1  167                21   FALSE
0            137             40              35                      168      43.1  2.288              33   TRUE
5            116             74              0                       0        25.6  201                30   FALSE
3            78              50              32                      88       31.0  248                26   TRUE
10           115             0               0                       0        35.3  134                29   FALSE
2            197             70              45                      543      30.5  158                53   TRUE
8            125             96              0                       0        0.0   232                54   TRUE
4            110             92              0                       0        37.6  191                30   FALSE
10           168             74              0                       0        38.0  537                34   TRUE
Summary
• An anomaly detector improved a classifier's performance by removing the top
10 anomalies as outliers
• Usually, removing anomalies with a score over 60% works
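The two rejection rules mentioned in this summary (drop the top N anomalies, or drop everything scoring above 60%) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `drop_top_anomalies` and `drop_above_threshold` are hypothetical, the scores are made-up values on a 0 to 1 scale, and in practice they would come from an anomaly detector trained on the training set.

```python
def drop_top_anomalies(rows, scores, top_n=10):
    """Return rows with the top_n highest-scoring rows removed."""
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    rejected = set(ranked[:top_n])
    return [row for i, row in enumerate(rows) if i not in rejected]


def drop_above_threshold(rows, scores, threshold=0.6):
    """Return rows whose anomaly score does not exceed the threshold."""
    return [row for row, score in zip(rows, scores) if score <= threshold]


# Hypothetical training rows and anomaly scores, for illustration only.
train_rows = ["a", "b", "c", "d", "e"]
train_scores = [0.41, 0.72, 0.55, 0.63, 0.38]

clean_top1 = drop_top_anomalies(train_rows, train_scores, top_n=1)
clean_60 = drop_above_threshold(train_rows, train_scores, threshold=0.6)
print(clean_top1)
print(clean_60)
```

Either clean dataset would then be used to train the "clean" model, which is evaluated on the same test set as the model trained on all rows.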
Fraud Detection
• Use Machine Learning to detect fraudulent financial transactions
• Fraudulent transactions, being unusual, can be detected with an anomaly
detector
[Workflow diagram]
HISTORIC NON-FRAUD TRANSACTIONS → ANOMALY DETECTOR
NEW TRANSACTION(S) → ANOMALY DETECTOR → ANOMALY SCORE → KEEP HIGH SCORES → SUSPICIOUS TRANSACTION(S) → FRAUD ANALYST
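The workflow on this slide (learn from historic non-fraud transactions, score new ones, keep the high scores for an analyst) can be sketched as follows. To stay self-contained, the sketch swaps in a deliberately simple stand-in detector based on the median and median absolute deviation of the transaction amount; it is not an isolation forest and not BigML's detector. The function names, the transaction ids, the new transactions and the 0.9 threshold are all hypothetical; the historic amounts are the non-fraud rows of the credit card demo table.

```python
import statistics


def fit_amount_detector(historic_amounts):
    """Stand-in detector (assumption): learn what a usual amount looks like
    from non-fraud history and return a scoring function in [0, 1)."""
    median = statistics.median(historic_amounts)
    mad = statistics.median(abs(a - median) for a in historic_amounts) or 1.0

    def score(amount):
        deviation = abs(amount - median) / mad
        return deviation / (1.0 + deviation)  # squash into [0, 1)

    return score


def flag_suspicious(transactions, score, threshold=0.9):
    """Keep transactions scoring at or above threshold, highest first."""
    scored = [(score(t["amount"]), t) for t in transactions]
    flagged = [(s, t) for s, t in scored if s >= threshold]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


# Amounts of the rows labeled Class 0 in the demo table.
historic_amounts = [149.62, 2.69, 378.66, 123.5, 69.99,
                    3.67, 4.99, 93.2, 3.68, 7.8]
score = fit_amount_detector(historic_amounts)

# Hypothetical new transactions.
new_transactions = [{"id": "t1", "amount": 45.0},
                    {"id": "t2", "amount": 5000.0}]
flagged = flag_suspicious(new_transactions, score)
for s, t in flagged:
    print(t["id"], round(s, 3))
```

An amount-only detector like this one would give a low score to the fraud row in the demo table (Amount 40.8), which is a reminder that the choice of features matters far more than the thresholding step.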
Fraud Detection Demo
Credit card transactions dataset
https://www.kaggle.com/mlg-ulb/creditcardfraud
• Anonymized credit card transactions with a fraud label
• Very unbalanced
Time  V1       V2       V3       V4       …  V27      V28      Amount  Class
0     -1.3598  -0.0727  2.5363   1.3781   …  0.1335   -0.0210  149.62  0
0     1.1918   0.2661   0.1664   0.4481   …  -0.0089  0.0147   2.69    0
1     -1.3583  -1.3401  1.7732   0.3797   …  -0.0553  -0.0597  378.66  0
1     -0.9662  -0.1852  1.7929   -0.8632  …  0.0627   0.0614   123.5   0
2     -1.1582  0.8777   1.5487   0.4030   …  0.2194   0.2151   69.99   0
2     -0.4259  0.9605   1.1411   -0.1682  …  0.2538   0.0810   3.67    0
4     1.2296   0.1410   0.0453   1.2026   …  0.0345   0.0051   4.99    0
7     -0.6442  1.4179   1.0743   -0.4921  …  -1.2069  -1.0853  40.8    1
7     -0.8942  0.2861   -0.1131  -0.2715  …  0.0117   0.1424   93.2    0
9     -0.3382  1.1195   1.0443   -0.2221  …  0.2462   0.0830   3.68    0
10    1.4490   -1.1763  0.9138   -1.3756  …  0.0428   0.0162   7.8     0
Summary
• Anomaly detectors can be an unsupervised alternative to classifiers
in extremely unbalanced datasets
• Fraud detection is an example. A similar approach can be used for other
use cases such as predictive maintenance or network intrusion
detection
• With this approach, the most challenging aspect is finding the features
that work
Novel Categories Discovery
Novel Categories
• A classification model's performance can degrade over time in
production as real-world data evolves
• Model degradation can be addressed by retraining with new data
• What if new data is not labeled?
• What if new data contains novel categories?
• Anomaly detectors can be used to spot model degradation and to
discover novel categories
Novel Categories Discovery
[Workflow diagram]
ORIGINAL DATASET → CLASSIFICATION MODEL
ORIGINAL DATASET → ANOMALY DETECTOR
NEW INSTANCES → ANOMALY DETECTOR → ANOMALY SCORE
REJECT HIGH ANOMALY SCORES → SIMILAR INSTANCES → CLASSIFICATION MODEL → PREDICTION
HIGH SCORED INSTANCES, POTENTIAL NOVEL CATEGORIES → DATA ANALYST → LABEL/RETRAIN
MODEL ALERT WHEN CUMULATED ANOMALY SCORE
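One possible reading of this workflow is sketched below in Python: instances with a low anomaly score go to the classifier, high-scored instances are set aside for the data analyst as potential novel categories, and an alert is raised based on the accumulated anomaly score. Everything here is an assumption for illustration: the function name `route_instances`, both threshold values, the rule that the alert fires when the sum of scores exceeds a limit, and the stand-in scorer and classifier passed in as plain functions.

```python
def route_instances(instances, anomaly_score, classify,
                    score_threshold=0.6, alert_threshold=1.5):
    """Split new instances between the classifier and the analyst.

    Returns (predictions, for_analyst, alert), where alert is True when
    the accumulated anomaly score exceeds alert_threshold (assumed rule).
    """
    predictions, for_analyst = [], []
    cumulative = 0.0
    for instance in instances:
        score = anomaly_score(instance)
        cumulative += score
        if score > score_threshold:
            # Too unlike the training data: possible novel category.
            for_analyst.append((instance, score))
        else:
            # Similar to the training data: safe to predict.
            predictions.append((instance, classify(instance)))
    return predictions, for_analyst, cumulative > alert_threshold


# Hypothetical instances carrying precomputed anomaly scores.
new_instances = [{"id": 1, "score": 0.35},
                 {"id": 2, "score": 0.82},
                 {"id": 3, "score": 0.41}]

predictions, for_analyst, alert = route_instances(
    new_instances,
    anomaly_score=lambda inst: inst["score"],  # stand-in for a detector
    classify=lambda inst: "Bumps",             # stand-in for a classifier
)
print([inst["id"] for inst, _ in predictions])
print([inst["id"] for inst, _ in for_analyst])
print(alert)
```

In a real deployment the analyst would label the set-aside instances and the classifier would be retrained with them, which is the LABEL/RETRAIN step of the diagram.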
Novel Categories Demo
Steel plates faults dataset (BigML Gallery)
• Each instance represents a faulty steel plate with fault type label
• Objective: predict fault type given a faulty steel plate
• 28 fields total
X_Min  X_Max  Y_Min    Y_Max    Pixels_Areas  X_Perim  Y_Perim  …  Orientation_Index  Luminosity_Index  SigmoidOfAreas  Fault
42     50     270900   270944   267           17       44       …  0.8182             -0.2913           0.5822          Pastry
645    651    2538079  2538108  108           10       30       …  0.7931             -0.1756           0.2984          Bumps
829    835    1553913  1553931  71            8        19       …  0.6667             -0.1228           215             Bumps
853    860    369370   369415   176           13       45       …  0.8444             -0.1568           0.5212          Dirty
1289   1306   498078   498335   2409          60       260      …  0.9338             -0.1992           1.0             Stains
430    441    100250   100337   630           20       87       …  0.8736             -0.2267           0.9874          Pastry
413    446    138468   138883   9052          230      432      …  0.9205             0.2791            1.0             Stains
190    200    210936   210956   132           11       20       …  0.5                0.1841            0.3359          Bumps
330    343    429227   429253   264           15       26       …  0.5                -0.1197           0.5593          Pastry
74     90     779144   779308   1506          46       167      …  0.9024             -0.0651           1.0             Pastry
106    118    813452   813500   442           13       48       …  0.75               -0.1093           0.8612          Pastry
Summary
• Novel categories of steel plate faults could be spotted with this method
• Model degradation in general can be monitored with anomaly detectors
MLSEV Virtual. Anomaly Detection Examples
