November 29, 2017
BigML, Inc 2
Anomaly Detection
Finding the Unusual
Poul Petersen
CIO, BigML, Inc
What is Anomaly Detection?
• An unsupervised learning technique
• No labels necessary
• Useful for finding unusual instances
• Filtering, finding mistakes, 1-class classifiers
• Finds instances that do not match
• Customer: big or small spender for profile
• Medical: healthy patient despite indicative diagnostics
• Defines each unusual instance by an “anomaly score”
• in BigML: 0 = normal, 1 = unusual, and 0.7 ≫ 0.6 > 0.5

• Standard deviation, distributions, etc
Clusters
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
[Same table as above, with a group of similar transactions highlighted]
Anomaly Detection
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
anomaly: Wed Bob 3421 pin tech 21350 2459
• Amount $2,459 is higher than all other transactions
• It is the only transaction in zip 21350
• It is the only transaction for the purchase class "tech"
Use Cases
• Unusual instance discovery - "exploration"
• Intrusion Detection - "looking for unusual usage patterns"
• Fraud - "looking for unusual behavior"
• Identify Incorrect Data - "looking for mistakes"
• Remove Outliers - "improve model quality"
• Model Competence / Input Data Drift
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous
points and then remove them before modeling.
[Workflow: DATASET -> ANOMALY DETECTOR -> FILTERED DATASET -> CLEAN MODEL]
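The filtering step in the GOAL above can be sketched as follows. This is a minimal illustration that assumes some detector has already produced one anomaly score per row; the function name `drop_most_anomalous`, the `top_n` parameter, and the example scores are hypothetical and are not part of the BigML API.

```python
def drop_most_anomalous(rows, scores, top_n):
    """Return rows with the top_n highest-scoring rows removed.

    rows and scores are parallel lists; scores follow the convention
    0 = normal, 1 = unusual.
    """
    if len(rows) != len(scores):
        raise ValueError("rows and scores must have the same length")
    # Rank row indices from most to least anomalous.
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    to_drop = set(ranked[:top_n])
    # Keep the original row order in the cleaned dataset.
    return [row for i, row in enumerate(rows) if i not in to_drop]


# Transaction amounts from the earlier table.
amounts = [135, 401, 234, 94, 2459, 83, 51]
# Hypothetical scores, made up for illustration only.
scores = [0.41, 0.48, 0.45, 0.40, 0.79, 0.39, 0.42]
clean = drop_most_anomalous(amounts, scores, top_n=1)
```

The cleaned list would then be used to train the model in place of the full dataset.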
Diabetes Anomalies
[Workflow diagram with nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
Anomaly Demo #1
Intrusion Detection
GOAL: Identify unusual command line behavior per user and
across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands,
flags, working directories, etc.
• Assumption: Users typically issue the
same flag patterns and work in certain
directories
[Diagram labels: Per User Per Dir All User All Dir]
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
[Diagram labels: Card Level, User Level, Similar User Level]
Model Competence
• After putting a model into production, the data being
predicted can become statistically different from the
training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the
anomaly score is high, then the model may not be competent
and should not be trusted.
At Training Time: DATASET -> MODEL, DATASET -> ANOMALY DETECTOR
At Prediction Time:
Prediction      T       T
Confidence      86%     84%
Anomaly Score   0.5367  0.7124
Competent?      Y       N
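A minimal sketch of the competence check, assuming the anomaly score for each input is already available. The cutoff `max_score=0.6` is an assumption chosen only because it separates the two example columns on the slide; the slide does not state a threshold, and the names `GuardedPrediction` and `guard_prediction` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GuardedPrediction:
    prediction: str
    confidence: float
    anomaly_score: float
    competent: bool


def guard_prediction(prediction, confidence, anomaly_score, max_score=0.6):
    """Attach a competence flag to a prediction.

    The model is treated as competent only when the input's anomaly
    score is at or below max_score (an assumed cutoff).
    """
    return GuardedPrediction(
        prediction=prediction,
        confidence=confidence,
        anomaly_score=anomaly_score,
        competent=anomaly_score <= max_score,
    )


# The two example columns from the slide.
first = guard_prediction("T", 0.86, 0.5367)
second = guard_prediction("T", 0.84, 0.7124)
```

Note that both predictions have similar confidence; only the anomaly score distinguishes the input the model has not seen anything like.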
Benford’s Law
• In real-life numeric sets the small digits occur
disproportionately often as leading significant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API
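As a rough illustration of how a Benford check might work: under Benford's law the expected share of leading digit d is log10(1 + 1/d). The sketch below compares observed leading-digit shares against that expectation. The deviation measure (mean absolute gap) and all function names are assumptions for illustration; this is not how the BigML API computes it.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a non-zero number."""
    if x == 0:
        raise ValueError("zero has no leading significant digit")
    # Scientific notation puts the first significant digit up front.
    return int(format(abs(x), ".15e")[0])


def benford_deviation(values):
    """Mean absolute gap between observed and expected digit shares."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero value")
    expected = benford_expected()
    gaps = (abs(counts[d] / total - expected[d]) for d in range(1, 10))
    return sum(gaps) / 9


# Powers of two are a standard example of a Benford-like sequence.
powers_of_two = [2 ** n for n in range(1, 101)]
deviation = benford_deviation(powers_of_two)
```

A dataset whose deviation is much larger than that of comparable, trusted data may deserve a closer look.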
Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (the average of the squared
differences from the mean)
• Depending on the number of instances, choose a “multiple”
of standard deviations to indicate an anomaly. For normally
distributed data, a multiple of 3 flags about 3 of every
1,000 instances as outliers.
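The univariate rule can be sketched in a few lines. This is a minimal version assuming the values are roughly normal; it uses the population standard deviation to match the definition above, and the function name `sigma_outliers` and the simulated heights are hypothetical.

```python
import random
import statistics


def sigma_outliers(values, multiple=3.0):
    """Return the values farther than `multiple` standard deviations
    from the mean."""
    mean = statistics.mean(values)
    # Population standard deviation: square root of the average of the
    # squared differences from the mean.
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) > multiple * stdev]


# Simulated heights in cm, for illustration only.
rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(1000)]
flagged = sigma_outliers(heights)
```

With 1,000 simulated normal values, `flagged` should typically hold only a handful of entries, in line with the estimate above.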
Univariate Approach
[Plot: frequency vs. measurement, with outliers marked in both tails]
• Available in BigML API
Multivariate Matters
Human Expert
Most Unusual?
[Diagram: objects grouped under the labels “Round”, “Skinny”, and “Corners”. Annotations: “Skinny” but not “smooth”; no “Corners”; not “Round”. One object is marked “Most unusual”.]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
Human Expert
• Human used prior knowledge to select possible features
that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit
the least well in its cluster and did not fit any other cluster
Human Expert
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences
Anomaly Features
Object    Length/Width  Num Surfaces  Smooth
penny     1             3             TRUE
dime      1             3             TRUE
knob      1             4             TRUE
eraser    2.75          6             TRUE
box       1             6             TRUE
block     1.6           6             TRUE
screw     8             3             FALSE
battery   5             3             TRUE
key       4.25          3             FALSE
bead      1             2             TRUE
Random Splits
[Decision-tree diagram with True/False branches. Split conditions: length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2. Leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw.]
Know that “splits” matter - don’t know the order
Isolation Forest
Grow a random decision tree until each instance from a sample is in its own leaf.
[Diagram: leaf depth marks instances as “easy” to isolate or “hard” to isolate]
Now repeat the process several times and use the average depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
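The idea of isolating an instance with random splits can be sketched as below. This is a simplified illustration, not BigML's implementation: it handles numeric features only, splits on the full data rather than a subsample, and follows only the branch containing the instance of interest. The names `isolation_depth` and `average_depth` are hypothetical.

```python
import random


def isolation_depth(point, sample, rng, depth=0):
    """Number of random splits needed to separate `point` from `sample`."""
    if len(sample) <= 1:
        return depth
    # Only features that still vary within the sample can be split on.
    dims = [d for d in range(len(point))
            if min(r[d] for r in sample) < max(r[d] for r in sample)]
    if not dims:
        return depth  # remaining rows are identical
    d = rng.choice(dims)
    lo = min(r[d] for r in sample)
    hi = max(r[d] for r in sample)
    split = rng.uniform(lo, hi)
    # Follow the side of the split that contains the point.
    if point[d] < split:
        side = [r for r in sample if r[d] < split]
    else:
        side = [r for r in sample if r[d] >= split]
    return isolation_depth(point, side, rng, depth + 1)


def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth of `point` over n_trees random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Transaction amounts from the earlier table, as one-feature rows.
data = [(135,), (401,), (234,), (94,), (2459,), (83,), (51,)]
easy = average_depth((2459,), data)  # far from the rest: few splits
hard = average_depth((94,), data)    # surrounded by neighbors: more splits
```

The $2,459 transaction should need fewer splits on average than the $94 one, which is what makes it "easy" to isolate.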
Isolation Forest Scoring
     f1    f2   f3
i1   red   cat  ball
i2   red   cat  ball
i3   red   cat  box
i4   blue  dog  pen
For the instance i2, find the depth in each tree: D = 3, D = 6, D = 2
Map avg depth to final score: S = 0.45
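One common way to map average depth to a score is the normalization from the original Isolation Forest formulation, s = 2^(-E[h] / c(n)), where c(n) is the expected path length for a sample of n instances. The slide does not say which mapping BigML uses, so the sketch below is an assumption and is not expected to reproduce S = 0.45; the sample size of 256 is also assumed.

```python
import math

EULER_GAMMA = 0.5772156649


def expected_path_length(n):
    """c(n): average path length of an unsuccessful binary search tree
    lookup over n instances, used to normalize depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(depths, sample_size):
    """Map per-tree depths to a score between 0 and 1.

    Shallow average depth gives a score near 1 (dissimilar); deep
    average depth gives a score near 0 (similar).
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    avg_depth = sum(depths) / len(depths)
    return 2.0 ** (-avg_depth / expected_path_length(sample_size))


# Depths for instance i2 from the slide; sample size is assumed.
score = anomaly_score([3, 6, 2], sample_size=256)
```

Under this mapping an instance whose average depth equals c(n) scores exactly 0.5, and shallower instances score higher.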
Model Competence
• A low anomaly score means the loan is similar to the
modeled loans.
• A high anomaly score means you should not trust the
model.
Prediction      T       T
Confidence      86%     84%
Anomaly Score   0.5367  0.7124
Competent?      Y       N
[Diagram: OPEN LOANS are passed to the CLOSED LOAN MODEL for a PREDICTION and to the CLOSED LOAN ANOMALY DETECTOR for an ANOMALY SCORE]
Anomaly Demo #2
1-Class Classifier?
• You place an advertisement in a local newspaper
• You collect demographic information about all responders
• Now you want to market in a new locality with direct letters
• To optimize mailing costs, you need to predict who will respond
• But you cannot distinguish “not interested” from “didn’t see the ad”
• Train an anomaly detector on the 1-class data
• Pick the households with the lowest scores for mailing:
• If a household has a low anomaly score, then it is
“similar” to enough of your positive responders and
therefore may respond as well
• If a household has a high anomaly score, then it is
dissimilar from all previous responders and therefore is
less likely to respond.
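The selection step might look like the sketch below, assuming a detector trained on the responders has already scored each household in the new locality. The function name `pick_mailing_list`, the `budget` parameter, and the household IDs and scores are all hypothetical.

```python
def pick_mailing_list(households, scores, budget):
    """Return the `budget` households with the lowest anomaly scores,
    i.e. those most similar to previous responders."""
    if len(households) != len(scores):
        raise ValueError("households and scores must have the same length")
    ranked = sorted(zip(scores, households), key=lambda pair: pair[0])
    return [household for _, household in ranked[:budget]]


# Hypothetical households and scores, for illustration only.
households = ["H1", "H2", "H3", "H4"]
scores = [0.62, 0.38, 0.71, 0.44]
mailing_list = pick_mailing_list(households, scores, budget=2)
```

Here the budget caps the number of letters sent, so the cutoff adapts to mailing cost rather than to a fixed score threshold.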
Summary
• Anomaly detection is the process of finding unusual instances
• Some techniques and how they work:
• Univariate: standard deviation
• Benford’s law
• Isolation Forest
• Applications
• Filtering to improve models
• Finding mistakes, fraud, and intruders
• Knowing when to retrain a model (competence)
• 1-class classifiers
• In general… unsupervised learning techniques:
• Require more finesse and interpretation
• Are more commonly part of a multistep workflow
BSSML17 - Anomaly Detection
