1st edition
March 7-8, 2019
BigML, Inc
Clusters
Finding Similarities
Poul Petersen
CIO, BigML, Inc
BigML, Inc #MLSEV: Cluster Analysis
What is Clustering?
• An unsupervised learning technique
• No labels necessary
• Useful for finding similar instances
• Smart sampling/labelling
• Finds “self-similar” groups of instances
• Customer: groups with similar behavior
• Medical: patients with similar diagnostic measurements
• Defines each group by a “centroid”
• Geometric center of the group
• Represents the “average” member
• Number of centroids (k) can be specified or determined
Cluster Centroids
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
Similar rows: Mon/Bob/clothes (135), Wed/Sally/gas (94), Wed/Bob/gas (83)
Same:
• auth = pin
• amount ~ $100
Different:
• date: Mon != Wed
• customer: Sally != Bob
• account: 6788 != 3421
• class: clothes != gas
• zip: 26339 != 46140
Centroid:
• date = Wed (2 out of 3)
• customer = Bob
• account = 3421
• auth = pin
• class = gas
• zip = 46140
• amount = $104
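The centroid values on this slide can be reproduced with a small sketch. It assumes, consistently with the later note on categorical fields, that a numeric field takes the mean of the members and a categorical field takes the most common value; the `centroid` helper is a hypothetical name, not a BigML function.

```python
from collections import Counter
from statistics import mean

# The three similar transactions from the table (pin, amount around $100).
rows = [
    {"date": "Mon", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Wed", "customer": "Sally", "account": "6788", "auth": "pin",
     "class": "gas", "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "gas", "zip": "46140", "amount": 83},
]

def centroid(instances):
    """Mean for numeric fields, most common value for categorical fields."""
    result = {}
    for field in instances[0]:
        values = [row[field] for row in instances]
        if all(isinstance(v, (int, float)) for v in values):
            result[field] = mean(values)
        else:
            result[field] = Counter(values).most_common(1)[0][0]
    return result

center = centroid(rows)
print(center)
```

Identifiers such as account and zip are kept as strings here so that they are treated as categories rather than averaged.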
Use Cases
• Customer segmentation
• Which customers are similar?
• How many natural groups are there?
• Item discovery
• What other items are similar to this one?
• Similarity
• What other instances share a specific property?
• Recommender (almost)
• If you like this item, what other items might you like?
• Active learning
• Labelling unlabelled data efficiently
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases
• Assumption: Usage correlates to LTV
[Figure: clusters annotated with 0%, 3%, and 1%]
Similarity
GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population
• Dataset of Lending Club Loans
• Mark any loan that is currently or has ever been late as “trouble”
[Figure: clusters annotated with 0%, 3%, 7%, and 1%]
Active Learning
GOAL:
Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each
cluster to label the data.
• Dataset of diagnostic measurements
of 768 patients.
• Want to test each patient for
diabetes and label the dataset to
build a model but the test is
expensive*.
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labelled as cat/not-cat.
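A minimal sketch of the sampling step, assuming the instances have already been assigned to clusters. The helper `sample_per_cluster` and the example cluster labels are hypothetical; only the idea of testing a sample from each cluster comes from the slide.

```python
import random
from collections import defaultdict

def sample_per_cluster(cluster_labels, n_per_cluster, seed=0):
    """Pick up to n_per_cluster instance indices from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)
    return {
        label: sorted(rng.sample(indices, min(n_per_cluster, len(indices))))
        for label, indices in members.items()
    }

# Hypothetical cluster assignments for 10 patients (not from the slides).
labels = [0, 0, 1, 2, 1, 0, 2, 2, 1, 0]
to_test = sample_per_cluster(labels, n_per_cluster=2)
print(to_test)  # indices of the patients to send for the expensive test
```

Only the sampled instances need the expensive test; how their labels are then spread to the rest of each cluster is a separate design decision.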
Item Discovery
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
• Dataset of 86 whiskies
• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
[Figure: clusters labelled “Smoky” and “Fruity”]
Clusters Demo #1
Human Expert
Cluster into 3 groups…
Human Expert
• Jesa used prior knowledge to select possible features that
separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
• met criteria of K=3
• groups were sufficiently “distant”
• no crossover
Human Expert
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences
Clustering Features
Object   Length / Width  Num Surfaces
penny    1               3
dime     1               3
knob     1               4
eraser   2.75            6
box      1               6
block    1.6             6
screw    8               3
battery  5               3
key      4.25            3
bead     1               2
Plot by Features
[Figure: objects plotted by Num Surfaces against Length / Width. Point labels: box, block, eraser; knob, penny, dime, bead; key, battery, screw]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
K=3
Plot by Features
K-Means: Find “best” (minimum distance) circles that include all points
K-Means Algorithm
K=3
Repeat until centroids stop moving
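The loop can be sketched in plain Python on the object features from the Clustering Features table. The slides only state the stopping rule; the two alternating steps below (assign each point to its nearest centroid, then move each centroid to the mean of its points) are the usual K-means iteration, and the starting centroids (penny, eraser, screw) are an arbitrary choice made for this illustration.

```python
from math import dist

# (length / width, number of surfaces) from the Clustering Features table.
objects = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

names = list(objects)
points = [objects[name] for name in names]
start = [objects["penny"], objects["eraser"], objects["screw"]]  # arbitrary start
labels, centroids = kmeans(points, start)
clusters = {k: sorted(n for n, label in zip(names, labels) if label == k)
            for k in range(3)}
print(clusters)
```

With this start the key is first assigned to the "round" group and only moves to the "skinny" group after the centroids have been updated once, which shows why the loop has to repeat.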
Features Matter
[Figure: objects grouped as Metal, Wood, and Other]
Convergence
Convergence guaranteed but not necessarily unique
Starting points important (K++)
Starting Points
• Random points or instances in n-dimensional space
• Might start "too close"
• Risk of sub-optimal convergence
Sub-Optimal Convergence
[Figure: two clusterings of groups that are arbitrarily far apart, one labelled Sub-Optimal and one labelled Optimal]
Starting Points
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• K++
• the first point is chosen randomly from instances
• each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center
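The K++ selection rule above can be sketched as follows. This is an illustration under stated assumptions rather than BigML's implementation: `kmeans_pp_init` is a hypothetical helper, and the points are the (length / width, number of surfaces) pairs of the example objects.

```python
import random
from math import dist

def kmeans_pp_init(points, k, seed=0):
    """Choose k starting centers with the K++ rule."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random from the instances.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight = squared distance to the closest center chosen so far.
        # Points that coincide with a center get weight 0 and cannot be re-chosen.
        weights = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # Note: random.choices rejects all-zero weights, so this sketch
        # needs at least k distinct points.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(1.0, 3), (1.0, 3), (1.0, 4), (2.75, 6), (1.0, 6),
          (1.6, 6), (8.0, 3), (5.0, 3), (4.25, 3), (1.0, 2)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Because the choice is probabilistic, far-away points are favoured but not guaranteed, which makes the rule less sensitive to a single outlier than always taking the farthest point.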
K++ Initial Centers
K=3
[Figure sequence: candidate points are labelled with a low, high, or highest probability of being picked as the next center]
Scaling Matters
[Figure: instances plotted by price and number of bedrooms; the distance along price is d = 160,000 while the distance along number of bedrooms is d = 1]
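A small sketch of why scaling matters. The house values below are made up for illustration, and z-score standardization is one common choice of scaling; the slides do not say which scaling BigML applies.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical houses as (price, number of bedrooms); illustrative values only.
houses = [(200_000, 3), (360_000, 4), (250_000, 2), (410_000, 5)]

def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

scaled = standardize(houses)
raw_distance = dist(houses[0], houses[1])      # dominated by the price difference
scaled_distance = dist(scaled[0], scaled[1])   # both fields now contribute
print(raw_distance, scaled_distance)
```

Without scaling, the one-bedroom difference is invisible next to the 160,000 price difference, so clusters would be driven by price alone.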
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• How far is “red” from “green”?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown ideal number of clusters, “K”?
Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
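The replacement options listed above can be sketched with the standard library. Missing values are represented as `None` here, and `impute` and `drop_missing` are hypothetical helper names used only for this illustration.

```python
from statistics import mean, median

# One fill rule per option on the slide.
STRATEGIES = {
    "maximum": max,
    "mean": mean,
    "median": median,
    "minimum": min,
    "zero": lambda known: 0,
}

def impute(values, strategy):
    """Replace None entries of one numeric field using the chosen strategy."""
    known = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](known)
    return [fill if v is None else v for v in values]

def drop_missing(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]

print(impute([1, None, 3, 8], "mean"))
print(drop_missing([(1, 2), (None, 3)]))
```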
Distance to Categorical?
• Define special distance function: For two instances $x$ and $y$ and the categorical field $a$:
$$\mathrm{distance}(x, y) = \begin{cases} 0 \text{ (or field scaling value)} & \text{if } x_a = y_a \\ 1 & \text{otherwise} \end{cases}$$
Approach: similar to “k-prototypes”
Distance to Categorical?
animal  favorite toy  toy color
cat     ball          red
cat     ball          green
d=0     d=0           d=1

cat     laser         red
dog     squeaky       red
d=1     d=1           d=0
Then compute Euclidean distance between vectors: D = 1 for the first pair, D = √2 for the second pair
Note: the centroid is assigned the most common category of the member instances
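The two worked distances can be checked with a short sketch. It uses the plain 0/1 rule per field with no field scaling, and `categorical_distance` is a hypothetical helper name.

```python
from math import sqrt

def categorical_distance(x, y):
    """Per-field distance is 0 when the categories match and 1 otherwise;
    the per-field distances are then combined as a Euclidean distance."""
    per_field = [0 if a == b else 1 for a, b in zip(x, y)]
    return sqrt(sum(d * d for d in per_field))

d_first = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d_second = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
print(d_first, d_second)
```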
Text Vectors
               "hippo"  "safari"  "zebra"  …   (features: thousands)
Text Field #1  1        0         1        …
Text Field #2  1        1         0        …
               0        1         1        …
[Figure: cosine similarity scale running from 1 through 0 to -1]
• Cosine Similarity
• cosine of the angle between two vectors
• 1 if collinear, 0 if orthogonal
• only positive vectors: 0 ≤ CS ≤ 1
• Cosine Distance = 1 - Cosine Similarity
• CD(TF1, TF2) = 0.5
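The value CD(TF1, TF2) = 0.5 can be verified with a short sketch that uses only the three term columns shown ("hippo", "safari", "zebra") and ignores the elided features.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cosine_distance(u, v):
    return 1 - cosine_similarity(u, v)

tf1 = [1, 0, 1]  # Text Field #1: "hippo" and "zebra" present
tf2 = [1, 1, 0]  # Text Field #2: "hippo" and "safari" present
cd = cosine_distance(tf1, tf2)
print(cd)  # approximately 0.5, up to floating-point rounding
```

The two fields share one of their two terms each, so the dot product is 1 and both lengths are √2, giving a similarity of 1/2.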
Finding K: G-Means
[Figure sequence: clusters are repeatedly kept or split]
• Let K=2: Keep 1, Split 1, New K=3
• Let K=3: Keep 1, Split 2, New K=5
• Let K=5: K=5
Clusters Demo #2
Summary
• Cluster Purpose
• Unsupervised technique for finding self-similar groups of instances
• Number of centroids (k) can be specified or computed
• Outputs list of centroids
• Configuration:
• Algorithm: K-means / G-means
• Cluster Parameter: k or critical value
• Default missing / Summary fields / Scales / Weights
• Model Clusters
• Centroid / Batch centroids
BigML, Inc
Anomaly Detection
Finding the Unusual
Poul Petersen
CIO, BigML, Inc
BigML, Inc #MLSEV: Anomaly Detection
What is Anomaly Detection?
• An unsupervised learning technique
• No labels necessary
• Useful for finding unusual instances
• Filtering, finding mistakes, 1-class classifiers
• Finds instances that do not match
• Customer: big or small spender for profile
• Medical: healthy patient despite indicative diagnostics
• Defines each unusual instance by an “anomaly score”
• in BigML: 0=normal, 1=unusual, and 0.7 ≫ 0.6 > 0.5
• Standard deviation, distributions, etc
Clusters
[The same transaction table as on the Cluster Centroids slide, with a group of rows marked “similar”]
Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
Anomaly: Wed Bob 3421 pin tech 21350 2459
• Amount $2,459 is higher than all other transactions
• It is the only transaction
• in zip 21350
• for the purchase class "tech"
Use Cases
• Unusual instance discovery - "exploration"
• Intrusion Detection - "looking for unusual usage patterns"
• Fraud - "looking for unusual behavior"
• Identify Incorrect Data - "looking for mistakes"
• Remove Outliers - "improve model quality"
• Model Competence / Input Data Drift
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling.
[Workflow diagram: DATASET, ANOMALY DETECTOR, FILTERED DATASET, CLEAN MODEL]
Diabetes Anomalies
[Workflow diagram with nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL (shown twice), ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
Anomaly Demo #1
Intrusion Detection
GOAL: Identify unusual command line behavior per user and
across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands,
flags, working directories, etc.
• Assumption: Users typically issue the
same flag patterns and work in certain
directories
[Figure labels: Per User, Per Dir, All User, All Dir]
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
[Figure labels: Card Level, User Level, Similar User Level]
Model Competence
• After putting a model into production, data that is being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted.
[Diagram, At Training Time: DATASET, MODEL, ANOMALY DETECTOR]
At Prediction Time:
Prediction     T       T
Confidence     0.86    0.84
Anomaly Score  0.5367  0.7124
Competent?     Y       N
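The competence check in the table can be sketched as a simple threshold on the anomaly score. The threshold of 0.6 is an assumption chosen only because it separates the two example scores; the slides do not state which cut-off is used.

```python
def is_competent(anomaly_score, threshold=0.6):
    """Trust the prediction only when the input looks like the training data.

    The default threshold is a hypothetical value, not one given in the slides.
    """
    return anomaly_score < threshold

# The two predictions from the table: (prediction, confidence, anomaly score).
predictions = [("T", 0.86, 0.5367), ("T", 0.84, 0.7124)]
verdicts = ["Y" if is_competent(score) else "N" for _, _, score in predictions]
print(verdicts)
```

Note that the confidences of the two predictions are almost identical, so confidence alone would not flag the second one.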
Benford’s Law
• In real-life numeric sets the small digits occur
disproportionately often as leading significant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API
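As a sketch of what a Benford check compares: the expected share of leading digit d is log10(1 + 1/d), which is the standard statement of the law. The powers of two below are only a convenient example set that is known to follow it, and `leading_digit` is a hypothetical helper, not the BigML API.

```python
from collections import Counter
from math import log10

def leading_digit(x):
    """First significant digit of a non-zero number."""
    return int(str(abs(x)).lstrip("0.")[0])

# Expected share of each leading digit under Benford's law.
expected = {d: log10(1 + 1 / d) for d in range(1, 10)}

# Example data: the first 100 powers of two.
values = [2 ** n for n in range(100)]
counts = Counter(leading_digit(v) for v in values)
observed = {d: counts[d] / len(values) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(expected[d], 3), observed[d])
```

A large gap between the observed and expected shares is the signal that a numeric field may have been altered or fabricated.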
Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (the average of the squared differences from the mean)
• Depending on the number of instances, choose a “multiple”
of standard deviations to indicate an anomaly. A multiple of 3
for 1000 instances removes ~ 3 outliers.
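A minimal sketch of the rule above using the standard library. The height values are made up for illustration, and `univariate_outliers` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def univariate_outliers(values, multiple=3):
    """Values farther than `multiple` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)  # square root of the average squared difference
    return [v for v in values if abs(v - center) > multiple * spread]

# Hypothetical heights in cm with one implausible entry.
heights = [168, 170, 172, 169, 171, 170, 173, 167, 170, 171,
           169, 172, 170, 168, 171, 170, 169, 172, 171, 170, 250]
outliers = univariate_outliers(heights)
print(outliers)
```

For normally distributed data about 99.7% of values fall within 3 standard deviations, which is where the figure of roughly 3 outliers per 1000 instances comes from. With very small samples a single extreme value inflates the standard deviation, so a multiple of 3 may flag nothing.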
Univariate Approach
[Figure: distribution of frequency against measurement, with outliers marked in both tails]
• Available in BigML API
Multivariate Matters
Human Expert
Most Unusual?
Human Expert
[Figure: objects partitioned as “Round”, “Skinny”, and “Corners”; annotations: “Skinny” but not “smooth”, No “Corners”, Not “Round”; one object marked Most unusual]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
Human Expert
• Human used prior knowledge to select possible features
that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit
the least well in its cluster and did not fit any other cluster
Human Expert
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences
Anomaly Features
Object   Length / Width  Num Surfaces  Smooth
penny    1               3             TRUE
dime     1               3             TRUE
knob     1               4             TRUE
eraser   2.75            6             TRUE
box      1               6             TRUE
block    1.6             6             TRUE
screw    8               3             FALSE
battery  5               3             TRUE
key      4.25            3             FALSE
bead     1               2             TRUE
Random Splits
[Figure: decision tree with True/False branches on the splits length/width > 5, smooth?, num surfaces = 6, length/width = 1, and length/width < 2; leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw]
Know that “splits” matter - don’t know the order
Isolation Forest
Grow a random decision tree until each instance from a sample is in its own leaf
[Figure: tree annotated with Depth; shallow instances are “easy” to isolate, deep instances are “hard” to isolate]
Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
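The idea can be sketched in plain Python on the Anomaly Features table (Smooth encoded as 1/0). Two assumptions are made here and are not taken from the slides: instead of growing each whole tree, the sketch follows only the branch that contains the instance, which gives the same depth for that instance; and the mapping from average depth to a score uses the normalization 2^(-depth / c(n)) from the original Isolation Forest paper, which may differ from BigML's mapping.

```python
import math
import random

# (length / width, number of surfaces, smooth as 1/0) from the Anomaly Features table.
objects = {
    "penny": (1.0, 3, 1), "dime": (1.0, 3, 1), "knob": (1.0, 4, 1),
    "eraser": (2.75, 6, 1), "box": (1.0, 6, 1), "block": (1.6, 6, 1),
    "screw": (8.0, 3, 0), "battery": (5.0, 3, 1), "key": (4.25, 3, 0),
    "bead": (1.0, 2, 1),
}

def isolation_depth(instance, sample, rng, depth=0, max_depth=50):
    """Number of random splits needed before `instance` sits alone."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    # Only fields that still vary within the sample can split it.
    splittable = [f for f in range(len(instance))
                  if min(r[f] for r in sample) < max(r[f] for r in sample)]
    if not splittable:
        return depth  # remaining rows are identical (e.g. penny and dime)
    field = rng.choice(splittable)
    low = min(r[field] for r in sample)
    high = max(r[field] for r in sample)
    split = rng.uniform(low, high)
    # Follow only the branch that contains the instance.
    if instance[field] < split:
        branch = [r for r in sample if r[field] < split]
    else:
        branch = [r for r in sample if r[field] >= split]
    return isolation_depth(instance, branch, rng, depth + 1, max_depth)

def average_path_length(n):
    """Normalizing constant c(n) for a sample of n > 1 instances."""
    return 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n

def anomaly_score(instance, sample, n_trees=200, seed=0):
    """Average the depth over many random trees and map it into (0, 1)."""
    rng = random.Random(seed)
    depths = [isolation_depth(instance, sample, rng) for _ in range(n_trees)]
    avg_depth = sum(depths) / n_trees
    return 2 ** (-avg_depth / average_path_length(len(sample)))

sample = list(objects.values())
scores = {name: anomaly_score(features, sample) for name, features in objects.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:8s} {score:.3f}")
```

Objects that are easy to isolate, such as the screw with its extreme length/width, end up at a shallow average depth and therefore receive a higher score than objects that sit in a dense group.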
Isolation Forest Scoring
    f1    f2   f3
i1  red   cat  ball
i2  red   cat  ball
i3  red   cat  box
i4  blue  dog  pen
For the instance, i2
Find the depth in each tree: D = 3, D = 6, D = 2
Map avg depth to final score: S=0.45
Model Competence
• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means you should not trust the model.
Prediction     T       T
Confidence     0.86    0.84
Anomaly Score  0.5367  0.7124
Competent?     Y       N
[Workflow diagram: CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR, OPEN LOANS, PREDICTION, ANOMALY SCORE]
Anomaly Demo #2
1-Class Classifier?
• You place an advertisement in a local newspaper
• You collect demographic information about all responders
• Now you want to market in a new locality with direct letters
• To optimize mailing costs, need to predict who will respond
• But you cannot distinguish “not interested” from “didn’t see the ad”
• Train an anomaly detector on the 1-class data
• Pick the households with the lowest scores for mailing:
• If a household has a low anomaly score, then they are
“similar” to enough of your positive responders and
therefore may respond as well
• If an individual has a high anomaly score, then they are
dissimilar from all previous responders and therefore are
less likely to respond.
Summary
• Anomaly detection is the process of finding unusual instances
• Some techniques and how they work:
• Univariate: standard deviation
• Benford’s law
• Isolation Forest
• Applications
• Filtering to improve models
• Finding mistakes, fraud, and intruders
• Knowing when to retrain a model (competence)
• 1-class classifiers
• In general… unsupervised learning techniques:
• Require more finesse and interpretation
• Are more commonly part of a multistep workflow
MLSEV. Cluster Analysis and Anomaly Detection

More Related Content

PDF
MLSEV. Association Discovery and Topic Modeling
BigML, Inc
 
PDF
MLSEV. Models, Evaluations and Ensembles
BigML, Inc
 
PDF
MLSEV. Use Case: Online and Offline World in the Retail Sector
BigML, Inc
 
PDF
MLSEV. Machine Learning: Business Perspective
BigML, Inc
 
PDF
DutchMLSchool. Clusters and Anomalies
BigML, Inc
 
PDF
MLSEV. Anatomy of an ML Application
BigML, Inc
 
PDF
DutchMLSchool. Machine Learning End-to-End
BigML, Inc
 
PDF
DutchMLSchool. Supervised vs Unsupervised Learning
BigML, Inc
 
MLSEV. Association Discovery and Topic Modeling
BigML, Inc
 
MLSEV. Models, Evaluations and Ensembles
BigML, Inc
 
MLSEV. Use Case: Online and Offline World in the Retail Sector
BigML, Inc
 
MLSEV. Machine Learning: Business Perspective
BigML, Inc
 
DutchMLSchool. Clusters and Anomalies
BigML, Inc
 
MLSEV. Anatomy of an ML Application
BigML, Inc
 
DutchMLSchool. Machine Learning End-to-End
BigML, Inc
 
DutchMLSchool. Supervised vs Unsupervised Learning
BigML, Inc
 

What's hot (20)

PDF
BSSML17 - Anomaly Detection
BigML, Inc
 
PDF
BSSML16 L4. Association Discovery and Topic Modeling
BigML, Inc
 
PDF
BSSML16 L3. Clusters and Anomaly Detection
BigML, Inc
 
PDF
DutchMLSchool. ML: A Technical Perspective
BigML, Inc
 
PDF
BSSML17 - Clusters
BigML, Inc
 
PDF
VSSML17 L3. Clusters and Anomaly Detection
BigML, Inc
 
PDF
BSSML17 - Association Discovery
BigML, Inc
 
PDF
BSSML16 L1. Introduction, Models, and Evaluations
BigML, Inc
 
PDF
BSSML17 - Basic Data Transformations
BigML, Inc
 
PDF
BSSML16 L5. Summary Day 1 Sessions
BigML, Inc
 
PDF
BSSML16 L2. Ensembles and Logistic Regressions
BigML, Inc
 
PDF
VSSML16 L5. Basic Data Transformations
BigML, Inc
 
PDF
BSSML17 - Ensembles
BigML, Inc
 
PDF
VSSML17 L6. Time Series and Deepnets
BigML, Inc
 
PDF
Explainability and bias in AI
Bill Liu
 
PDF
Penguin, SEO and the Apocalypse
Ian Lurie
 
PDF
When recommendation go bad
IntoTheMinds
 
PDF
BSSML17 - Topic Models
BigML, Inc
 
PDF
MLSD18. Supervised Workshop
BigML, Inc
 
PPTX
Explaining Black-Box Machine Learning Predictions - Sameer Singh, Assistant P...
Sri Ambati
 
BSSML17 - Anomaly Detection
BigML, Inc
 
BSSML16 L4. Association Discovery and Topic Modeling
BigML, Inc
 
BSSML16 L3. Clusters and Anomaly Detection
BigML, Inc
 
DutchMLSchool. ML: A Technical Perspective
BigML, Inc
 
BSSML17 - Clusters
BigML, Inc
 
VSSML17 L3. Clusters and Anomaly Detection
BigML, Inc
 
BSSML17 - Association Discovery
BigML, Inc
 
BSSML16 L1. Introduction, Models, and Evaluations
BigML, Inc
 
BSSML17 - Basic Data Transformations
BigML, Inc
 
BSSML16 L5. Summary Day 1 Sessions
BigML, Inc
 
BSSML16 L2. Ensembles and Logistic Regressions
BigML, Inc
 
VSSML16 L5. Basic Data Transformations
BigML, Inc
 
BSSML17 - Ensembles
BigML, Inc
 
VSSML17 L6. Time Series and Deepnets
BigML, Inc
 
Explainability and bias in AI
Bill Liu
 
Penguin, SEO and the Apocalypse
Ian Lurie
 
When recommendation go bad
IntoTheMinds
 
BSSML17 - Topic Models
BigML, Inc
 
MLSD18. Supervised Workshop
BigML, Inc
 
Explaining Black-Box Machine Learning Predictions - Sameer Singh, Assistant P...
Sri Ambati
 
Ad

Similar to MLSEV. Cluster Analysis and Anomaly Detection (20)

PDF
VSSML18. Clustering and Latent Dirichlet Allocation
BigML, Inc
 
PDF
VSSML16 L3. Clusters and Anomaly Detection
BigML, Inc
 
PDF
BigML Education - Clusters
BigML, Inc
 
PPTX
07 learning
ankit_ppt
 
PDF
L13. Cluster Analysis
Machine Learning Valencia
 
PDF
DutchMLSchool. Logistic Regression, Deepnets, Time Series
BigML, Inc
 
PPTX
Customer segmentation.pptx
Addalashashikumar
 
PDF
L14. Anomaly Detection
Machine Learning Valencia
 
PPTX
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
ankit_ppt
 
PPTX
MODULE 4_ CLUSTERING.pptx
nikshaikh786
 
PDF
DutchMLSchool. Automating Decision Making
BigML, Inc
 
PDF
BSSML17 - Introduction, Models, Evaluations
BigML, Inc
 
PDF
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
BigML, Inc
 
PDF
MLSEV. Automating Decision Making
BigML, Inc
 
PPTX
CS194Lec0hbh6EDA.pptx
PrudhvirajEluri1
 
PPTX
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
PPTX
Detailed_KMeans_Unsupervised_Learning_Presentation.pptx
Mansi Sharma
 
PPTX
Clustering as a unsupervised learning method inin machine learning
tanishqgujari
 
PDF
How to do Predictive Analytics with Limited Data
Datameer
 
PDF
Machine learning by using python By: Professor Lili Saghafi
Professor Lili Saghafi
 
VSSML18. Clustering and Latent Dirichlet Allocation
BigML, Inc
 
VSSML16 L3. Clusters and Anomaly Detection
BigML, Inc
 
BigML Education - Clusters
BigML, Inc
 
07 learning
ankit_ppt
 
L13. Cluster Analysis
Machine Learning Valencia
 
DutchMLSchool. Logistic Regression, Deepnets, Time Series
BigML, Inc
 
Customer segmentation.pptx
Addalashashikumar
 
L14. Anomaly Detection
Machine Learning Valencia
 
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
ankit_ppt
 
MODULE 4_ CLUSTERING.pptx
nikshaikh786
 
DutchMLSchool. Automating Decision Making
BigML, Inc
 
BSSML17 - Introduction, Models, Evaluations
BigML, Inc
 
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
BigML, Inc
 
MLSEV. Automating Decision Making
BigML, Inc
 
CS194Lec0hbh6EDA.pptx
PrudhvirajEluri1
 
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
Detailed_KMeans_Unsupervised_Learning_Presentation.pptx
Mansi Sharma
 
Clustering as a unsupervised learning method inin machine learning
tanishqgujari
 
How to do Predictive Analytics with Limited Data
Datameer
 
Machine learning by using python By: Professor Lili Saghafi
Professor Lili Saghafi
 
Ad

More from BigML, Inc (20)

PDF
Digital Transformation and Process Optimization in Manufacturing
BigML, Inc
 
PDF
DutchMLSchool 2022 - Automation
BigML, Inc
 
PDF
DutchMLSchool 2022 - ML for AML Compliance
BigML, Inc
 
PDF
DutchMLSchool 2022 - Multi Perspective Anomalies
BigML, Inc
 
PDF
DutchMLSchool 2022 - My First Anomaly Detector
BigML, Inc
 
PDF
DutchMLSchool 2022 - Anomaly Detection
BigML, Inc
 
PDF
DutchMLSchool 2022 - History and Developments in ML
BigML, Inc
 
PDF
DutchMLSchool 2022 - End-to-End ML
BigML, Inc
 
PDF
DutchMLSchool 2022 - A Data-Driven Company
BigML, Inc
 
PDF
DutchMLSchool 2022 - ML in the Legal Sector
BigML, Inc
 
PDF
DutchMLSchool 2022 - Smart Safe Stadiums
BigML, Inc
 
PDF
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
BigML, Inc
 
PDF
DutchMLSchool 2022 - Anomaly Detection at Scale
BigML, Inc
 
PDF
DutchMLSchool 2022 - Citizen Development in AI
BigML, Inc
 
PDF
Democratizing Object Detection
BigML, Inc
 
PDF
BigML Release: Image Processing
BigML, Inc
 
PDF
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
BigML, Inc
 
PDF
Machine Learning in Retail: ML in the Retail Sector
BigML, Inc
 
PDF
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
BigML, Inc
 
PDF
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
BigML, Inc
 
Digital Transformation and Process Optimization in Manufacturing
BigML, Inc
 
DutchMLSchool 2022 - Automation
BigML, Inc
 
DutchMLSchool 2022 - ML for AML Compliance
BigML, Inc
 
DutchMLSchool 2022 - Multi Perspective Anomalies
BigML, Inc
 
DutchMLSchool 2022 - My First Anomaly Detector
BigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection
BigML, Inc
 
DutchMLSchool 2022 - History and Developments in ML
BigML, Inc
 
DutchMLSchool 2022 - End-to-End ML
BigML, Inc
 
DutchMLSchool 2022 - A Data-Driven Company
BigML, Inc
 
DutchMLSchool 2022 - ML in the Legal Sector
BigML, Inc
 
DutchMLSchool 2022 - Smart Safe Stadiums
BigML, Inc
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
BigML, Inc
 
DutchMLSchool 2022 - Anomaly Detection at Scale
BigML, Inc
 
DutchMLSchool 2022 - Citizen Development in AI
BigML, Inc
 
Democratizing Object Detection
BigML, Inc
 
BigML Release: Image Processing
BigML, Inc
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
BigML, Inc
 
Machine Learning in Retail: ML in the Retail Sector
BigML, Inc
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
BigML, Inc
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
BigML, Inc
 

Recently uploaded (20)

PDF
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PPTX
Power BI in Business Intelligence with AI
KPR Institute of Engineering and Technology
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
PPTX
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
PPTX
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PPTX
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
PPTX
INFO8116 - Week 10 - Slides.pptx big data architecture
guddipatel10
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PDF
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
PPTX
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
The_Future_of_Data_Analytics_by_CA_Suvidha_Chaplot_UPDATED.pdf
CA Suvidha Chaplot
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
Power BI in Business Intelligence with AI
KPR Institute of Engineering and Technology
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
Introduction to Biostatistics Presentation.pptx
AtemJoshua
 
INFO8116 - Week 10 - Slides.pptx big data architecture
guddipatel10
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 

MLSEV. Cluster Analysis and Anomaly Detection

  • 2. BigML, Inc Clusters Finding Similarities Poul Petersen CIO, BigML, Inc !2
  • 3. BigML, Inc #MLSEV: Cluster Analysis What is Clustering? !3 • An unsupervised learning technique • No labels necessary • Useful for finding similar instances • Smart sampling/labelling • Finds “self-similar" groups of instances • Customer: groups with similar behavior • Medical: patients with similar diagnostic measurements • Defines each group by a “centroid” • Geometric center of the group • Represents the “average” member • Number of centroids (k) can be specified or determined
  • 4. BigML, Inc #MLSEV: Cluster Analysis Cluster Centroids !4 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  • 5. BigML, Inc #MLSEV: Cluster Analysis Cluster Centroids !5 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 auth = pin amount ~ $100 Same: date: Mon != Wed customer: Sally != Bob account: 6788 != 3421 class: clothes != gas zip: 26339 != 46140 Different: date = Wed (2 out of 3) customer = Bob account = 3421 auth = pin class = gas zip = 46140 amount = $104 Centroid: similar
  • 6. BigML, Inc #MLSEV: Cluster Analysis Use Cases !6 • Customer segmentation • Which customers are similar? • How many natural groups are there? • Item discovery • What other items are similar to this one? • Similarity • What other instances share a specific property? • Recommender (almost) • If you like this item, what other items might you like? • Active learning • Labelling unlabelled data efficiently
  • 7. BigML, Inc #MLSEV: Cluster Analysis Customer Segmentation !7 GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and a LTV based on in- game purchases • Assumption: Usage correlates to LTV 0% 3% 1%
  • 8. BigML, Inc #MLSEV: Cluster Analysis Similarity !8 GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population • Dataset of Lending Club Loans • Mark any loan that is currently or has even been late as “trouble” 0% 3% 7% 1%
  • 9. BigML, Inc #MLSEV: Cluster Analysis Active Learning !9 GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
  • 10. BigML, Inc #MLSEV: Cluster Analysis Active Learning !10 *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labelled as cat/not-cat.
  • 11. BigML, Inc #MLSEV: Cluster Analysis Item Discovery !11 GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste. • Dataset of 86 whiskies • Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics. [Figure labels: Smoky, Fruity]
  • 13. BigML, Inc #MLSEV: Cluster Analysis Human Expert !13 Cluster into 3 groups…
  • 14. BigML, Inc #MLSEV: Cluster Analysis Human Expert !14
  • 15. BigML, Inc #MLSEV: Cluster Analysis Human Expert !15 • Jesa used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “edges”, “hard”, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • met criteria of K=3 • groups were sufficiently “distant” • no crossover
  • 16. BigML, Inc #MLSEV: Cluster Analysis Human Expert !16 • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count Create features that capture these object differences
  • 17. BigML, Inc #MLSEV: Cluster Analysis Clustering Features !17 Object (Length / Width, Num Surfaces): penny (1, 3); dime (1, 3); knob (1, 4); eraser (2.75, 6); box (1, 6); block (1.6, 6); screw (8, 3); battery (5, 3); key (4.25, 3); bead (1, 2)
  • 18. BigML, Inc #MLSEV: Cluster Analysis Plot by Features !18 [Scatter plot: Num Surfaces vs. Length / Width for box, block, eraser, knob, penny, dime, bead, key, battery, screw; K=3] K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
  • 19. BigML, Inc #MLSEV: Cluster Analysis Plot by Features !19 [Scatter plot: Num Surfaces vs. Length / Width for box, block, eraser, knob, penny, dime, bead, key, battery, screw] K-Means: Find “best” (minimum distance) circles that include all points
  • 20. BigML, Inc #MLSEV: Cluster Analysis K-Means Algorithm !20 K=3
  • 21. BigML, Inc #MLSEV: Cluster Analysis K-Means Algorithm !21 K=3 Repeat until centroids stop moving
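As a rough illustration of "repeat until centroids stop moving", here is a minimal k-means sketch over the ten objects from the Clustering Features slide. The starting centroids (the penny, box, and screw positions) are picked by hand for this example, and the features are left unscaled; neither choice comes from the slides.

```python
import math

# (length / width, number of surfaces) per object, from the Clustering Features slide.
objects = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4), "eraser": (2.75, 6),
    "box": (1.0, 6), "block": (1.6, 6), "screw": (8.0, 3), "battery": (5.0, 3),
    "key": (4.25, 3), "bead": (1.0, 2),
}

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids, repeat."""
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            )
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

names = list(objects)
points = [objects[n] for n in names]
start = [objects["penny"], objects["box"], objects["screw"]]  # hand-picked, K=3
centroids, labels = kmeans(points, start)
for name, label in zip(names, labels):
    print(label, name)
```

With these starting points the loop settles on a round group (penny, dime, knob, bead), a six-surface group (eraser, box, block), and a skinny group (screw, battery, key). Other starting points can converge to different groupings.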
  • 22. BigML, Inc #MLSEV: Cluster Analysis Features Matter !22 Metal Other Wood
  • 23. BigML, Inc #MLSEV: Cluster Analysis Convergence !23 Convergence is guaranteed, but the result is not necessarily unique. Starting points are important (K++).
  • 24. BigML, Inc #MLSEV: Cluster Analysis Starting Points !24 • Random points or instances in n-dimensional space • Might start "too close" • Risk of sub-optimal convergence
  • 25. BigML, Inc #MLSEV: Cluster Analysis Sub-Optimal Convergence !25 [Figure panels: Sub-Optimal, Optimal; annotation: Arbitrarily Far Apart]
  • 26. BigML, Inc #MLSEV: Cluster Analysis Starting Points !26 • Random points or instances in n-dimensional space • Might start "too close" • Risk of sub-optimal convergence • Choose points “farthest” away from each other • but this is sensitive to outliers • k++ (k-means++) • the first point is chosen randomly from instances • each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center
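The k++ seeding rule described above can be sketched in a few lines. This is an illustrative version, not BigML's implementation; the function name `kmeans_pp_init` and the sample points (the object features from the earlier slide) are chosen for this example.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers from `points` using k-means++ style seeding."""
    centers = [rng.choice(points)]  # first center: chosen uniformly at random
    while len(centers) < k:
        # Squared distance from each point to its closest existing center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Next center: sampled with probability proportional to that weight.
        # Already-chosen points have weight 0, so they cannot be picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(1.0, 3), (1.0, 3), (1.0, 4), (2.75, 6), (1.0, 6),
          (1.6, 6), (8.0, 3), (5.0, 3), (4.25, 3), (1.0, 2)]
centers = kmeans_pp_init(points, 3, random.Random(0))
print(centers)
```

Far-away points are the most likely picks but not certain ones, which is what makes this less sensitive to outliers than always taking the farthest point.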
  • 27. BigML, Inc #MLSEV: Cluster Analysis K++ Initial Centers !27 [Figure annotations: Low Probability, High Probability, Highest Probability; K=3]
  • 28. BigML, Inc #MLSEV: Cluster Analysis K++ Initial Centers !28 [Figure annotations: Low Probability, Low Probability; K=3]
  • 29. BigML, Inc #MLSEV: Cluster Analysis K++ Initial Centers !29 K=3
  • 30. BigML, Inc #MLSEV: Cluster Analysis Scaling Matters !30 [Figure: price vs. number of bedrooms; distances annotated d = 160,000 (price) and d = 1 (bedrooms)]
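A small sketch of why scaling matters: with raw values, a price gap of 160,000 swamps a one-bedroom gap in the Euclidean distance. The house data below is hypothetical, and z-score standardization is just one common scaling choice, assumed here for illustration.

```python
import math
from statistics import mean, pstdev

# Hypothetical houses: (price in dollars, number of bedrooms).
houses = [(240_000, 2), (400_000, 3), (310_000, 4), (520_000, 5)]

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Distance between the first two houses, before and after scaling.
raw = math.dist(houses[0], houses[1])
scaled_rows = standardize(houses)
scaled = math.dist(scaled_rows[0], scaled_rows[1])

print(f"raw distance:    {raw:,.1f}")
print(f"scaled distance: {scaled:.3f}")
```

Before scaling, the distance is essentially the price difference alone. After scaling, both fields contribute on a comparable footing.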
  • 31. BigML, Inc #MLSEV: Cluster Analysis Other Tricks !31 • What is the distance to a “missing value”? • What is the distance between categorical values? • How far is “red” from “green”? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown ideal number of clusters, “K”?
  • 32. BigML, Inc #MLSEV: Cluster Analysis Distance to Missing? !32 • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
  • 33. BigML, Inc #MLSEV: Cluster Analysis Distance to Categorical? !33 • Define special distance function: For two instances $x$ and $y$ and the categorical field $a$: • if $x_a = y_a$ then $\mathrm{distance}(x, y) = 0$, else $\mathrm{distance}(x, y) = 1$ (or the field scaling value) • Approach: similar to “k-prototypes”
  • 34. BigML, Inc #MLSEV: Cluster Analysis Distance to Categorical? !34 Fields: animal, favorite toy, toy color. • (cat, ball, red) vs. (cat, ball, green): d=0, d=0, d=1, so D = 1 • (cat, laser, red) vs. (dog, squeaky, red): d=1, d=1, d=0, so D = √2 • Then compute Euclidean distance between vectors • Note: the centroid is assigned the most common category of the member instances
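The two worked distances on this slide can be checked with a minimal sketch. It assumes every field is categorical and unscaled (mismatch counts as 1), and the function name `categorical_distance` is made up for this example.

```python
import math

def categorical_distance(x, y):
    """Euclidean distance where each field contributes 0 (match) or 1 (mismatch)."""
    per_field = [0 if a == b else 1 for a, b in zip(x, y)]
    return math.sqrt(sum(d * d for d in per_field))

# Fields: animal, favorite toy, toy color.
d1 = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d2 = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
print(d1, d2)
```

One mismatch out of three fields gives D = 1, and two mismatches give D = √2, as on the slide.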
  • 35. BigML, Inc #MLSEV: Cluster Analysis Text Vectors !35 Each text field becomes a vector over thousands of term features, e.g. for the terms "hippo", "safari", "zebra", …: (1, 0, 1, …), (1, 1, 0, …), (0, 1, 1, …). • Cosine Similarity • cos() of the angle between two vectors • 1 if collinear, 0 if orthogonal • for vectors with only non-negative components: 0 ≤ CS ≤ 1 • Cosine Distance = 1 - Cosine Similarity • CD(TF1, TF2) = 0.5
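The CD(TF1, TF2) = 0.5 figure can be verified with a short sketch. It assumes the two text fields are the term vectors (1, 0, 1) and (1, 1, 0) over "hippo", "safari", "zebra", ignoring the elided remaining terms.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

tf1 = (1, 0, 1)  # contains "hippo" and "zebra"
tf2 = (1, 1, 0)  # contains "hippo" and "safari"
cd = cosine_distance(tf1, tf2)
print(round(cd, 6))
```

The vectors share one of their two terms, so the dot product is 1, each norm is √2, and the similarity and distance are both 0.5.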
  • 36. BigML, Inc #MLSEV: Cluster Analysis Finding K: G-Means !36
  • 37. BigML, Inc #MLSEV: Cluster Analysis Finding K: G-Means !37
  • 38. BigML, Inc #MLSEV: Cluster Analysis Finding K: G-Means !38 Let K=2 Keep 1, Split 1 New K=3
  • 39. BigML, Inc #MLSEV: Cluster Analysis Finding K: G-Means !39 Let K=3 Keep 1, Split 2 New K=5
  • 40. BigML, Inc #MLSEV: Cluster Analysis Finding K: G-Means !40 Let K=5 K=5
  • 42. BigML, Inc #MLSEV: Cluster Analysis Summary !42 • Cluster Purpose • Unsupervised technique for finding self-similar groups of instances • Number of centroids (k) can be specified or computed • Outputs list of centroids • Configuration: • Algorithm: K-means / G-means • Cluster Parameter: k or critical value • Default missing / Summary fields / Scales / Weights • Model Clusters • Centroid / Batch centroids
  • 43. BigML, Inc Anomaly Detection Finding the Unusual Poul Petersen CIO, BigML, Inc !43
  • 44. BigML, Inc #MLSEV: Anomaly Detection What is Anomaly Detection? !44 • An unsupervised learning technique • No labels necessary • Useful for finding unusual instances • Filtering, finding mistakes, 1-class classifiers • Finds instances that do not match • Customer: big or small spender for profile • Medical: healthy patient despite indicative diagnostics • Defines each unusual instance by an “anomaly score” • in BigML: 0=normal, 1=unusual, and 0.7 ≫ 0.6 > 0.5 • Standard deviation, distributions, etc
  • 45. BigML, Inc #MLSEV: Anomaly Detection Clusters !45 Columns: date, customer, account, auth, class, zip, amount. Rows: Mon, Bob, 3421, pin, clothes, 46140, 135; Tue, Bob, 3421, sign, food, 46140, 401; Tue, Alice, 2456, pin, food, 12222, 234; Wed, Sally, 6788, pin, gas, 26339, 94; Wed, Bob, 3421, pin, tech, 21350, 2459; Wed, Bob, 3421, pin, gas, 46140, 83; Thr, Sally, 6788, sign, food, 26339, 51.
  • 46. BigML, Inc #MLSEV: Anomaly Detection Clusters !46 Columns: date, customer, account, auth, class, zip, amount. Rows: Mon, Bob, 3421, pin, clothes, 46140, 135; Tue, Bob, 3421, sign, food, 46140, 401; Tue, Alice, 2456, pin, food, 12222, 234; Wed, Sally, 6788, pin, gas, 26339, 94; Wed, Bob, 3421, pin, tech, 21350, 2459; Wed, Bob, 3421, pin, gas, 46140, 83; Thr, Sally, 6788, sign, food, 26339, 51. [Highlighted rows labelled “similar”]
  • 47. BigML, Inc #MLSEV: Anomaly Detection Anomaly Detection !47 Columns: date, customer, account, auth, class, zip, amount. Rows: Mon, Bob, 3421, pin, clothes, 46140, 135; Tue, Bob, 3421, sign, food, 46140, 401; Tue, Alice, 2456, pin, food, 12222, 234; Wed, Sally, 6788, pin, gas, 26339, 94; Wed, Bob, 3421, pin, tech, 21350, 2459; Wed, Bob, 3421, pin, gas, 46140, 83; Thr, Sally, 6788, sign, food, 26339, 51.
  • 48. BigML, Inc #MLSEV: Anomaly Detection Anomaly Detection !48 Columns: date, customer, account, auth, class, zip, amount. Rows: Mon, Bob, 3421, pin, clothes, 46140, 135; Tue, Bob, 3421, sign, food, 46140, 401; Tue, Alice, 2456, pin, food, 12222, 234; Wed, Sally, 6788, pin, gas, 26339, 94; Wed, Bob, 3421, pin, tech, 21350, 2459; Wed, Bob, 3421, pin, gas, 46140, 83; Thr, Sally, 6788, sign, food, 26339, 51. [Highlighted row labelled “anomaly”] • Amount $2,459 is higher than all other transactions • It is the only transaction: • in zip 21350 • for the purchase class "tech"
  • 49. BigML, Inc #MLSEV: Anomaly Detection Use Cases !49 • Unusual instance discovery - "exploration" • Intrusion Detection - "looking for unusual usage patterns" • Fraud - "looking for unusual behavior" • Identify Incorrect Data - "looking for mistakes" • Remove Outliers - "improve model quality" • Model Competence / Input Data Drift
  • 50. BigML, Inc #MLSEV: Anomaly Detection Removing Outliers !50 • Models need to generalize • Outliers negatively impact generalization GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling. DATASET FILTERED DATASET ANOMALY DETECTOR CLEAN MODEL
  • 51. BigML, Inc #MLSEV: Anomaly Detection Diabetes Anomalies !51 DIABETES SOURCE DIABETES DATASET TRAIN SET TEST SET ALL MODEL CLEAN DATASET FILTER ALL MODEL ALL EVALUATION CLEAN EVALUATION COMPARE EVALUATIONS ANOMALY DETECTOR
  • 53. BigML, Inc #MLSEV: Anomaly Detection Intrusion Detection !53 GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion. • Dataset of command line history for users • Data for each user consists of commands, flags, working directories, etc. • Assumption: Users typically issue the same flag patterns and work in certain directories Per User Per Dir All User All Dir
  • 54. BigML, Inc #MLSEV: Anomaly Detection Fraud !54 • Dataset of credit card transactions • Additional user profile information GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels. Card Level User Level Similar User Level
  • 55. BigML, Inc #MLSEV: Anomaly Detection Model Competence !55 • After putting a model into production, the data being predicted can become statistically different from the training data. • Train an anomaly detector at the same time as the model. GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted. At Prediction Time: Prediction T, Confidence 0.86, Anomaly Score 0.5367, Competent? Y; Prediction T, Confidence 0.84, Anomaly Score 0.7124, Competent? N. At Training Time: DATASET, MODEL, ANOMALY DETECTOR
  • 56. BigML, Inc #MLSEV: Anomaly Detection Benford’s Law !56 • In real-life numeric sets the small digits occur disproportionately often as leading significant digits. • Applications include: • accounting records • electricity bills • street addresses • stock prices • population numbers • death rates • lengths of rivers • Available in BigML API
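Benford's law gives the expected share of each leading digit d as log10(1 + 1/d), so about 30% of values start with 1. The sketch below compares that expectation with observed leading digits. The powers of two are used only as a convenient dataset known to follow the law closely, and the helper names are invented for this example (this is not the BigML API).

```python
import math
from collections import Counter

def benford_expected():
    """Expected frequency of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """First significant digit of a non-zero number."""
    v = abs(value)
    while v >= 10:
        v /= 10
    while v < 1:
        v *= 10
    return int(v)

def observed_frequencies(values):
    """Share of values starting with each digit 1..9 (zeros are skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

expected = benford_expected()
observed = observed_frequencies([2 ** n for n in range(100)])
for d in range(1, 10):
    print(d, f"expected={expected[d]:.3f}", f"observed={observed[d]:.3f}")
```

A dataset whose observed frequencies deviate strongly from the expected ones (for example, too many leading 9s in accounting records) may deserve a closer look.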
  • 57. BigML, Inc #MLSEV: Anomaly Detection Univariate Approach !57 • Single variable: heights, test scores, etc • Assume the value is normally distributed • Compute standard deviation • a measure of how “spread out” the numbers are • the square root of the variance (the average of the squared differences from the mean) • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances flags ~3 outliers, since about 0.3% of a normal distribution lies more than 3 standard deviations from the mean.
  • 58. BigML, Inc #MLSEV: Anomaly Detection Univariate Approach !58 [Figure: frequency vs. measurement distribution, with outliers marked on both sides] • Available in BigML API
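A minimal sketch of the standard-deviation rule, applied to the amount column from the transaction table earlier in the deck. The function name and the choice of multiples are illustrative, not taken from BigML.

```python
from statistics import mean, pstdev

def stddev_outliers(values, multiple):
    """Values lying more than `multiple` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > multiple * sigma]

# Transaction amounts from the earlier table.
amounts = [135, 401, 234, 94, 2459, 83, 51]

flagged_at_2 = stddev_outliers(amounts, 2)
flagged_at_3 = stddev_outliers(amounts, 3)
print(flagged_at_2, flagged_at_3)
```

With only seven instances, a multiple of 3 flags nothing because the large amount inflates the standard deviation itself, while a multiple of 2 flags the $2,459 transaction. This is why the multiple has to be chosen with the number of instances in mind.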
  • 59. BigML, Inc #MLSEV: Anomaly Detection Multivariate Matters !59
  • 60. BigML, Inc #MLSEV: Anomaly Detection Multivariate Matters !60
  • 61. BigML, Inc #MLSEV: Anomaly Detection Human Expert !61 Most Unusual?
  • 62. BigML, Inc #MLSEV: Anomaly Detection Human Expert !62 [Figure labels: “Round”; “Skinny”; “Corners”; “Skinny” but not “smooth”; No “Corners”; Not “Round”; Most unusual] Key Insight: The “most unusual” object is different in some way from every partition of the features.
  • 63. BigML, Inc #MLSEV: Anomaly Detection Human Expert !63 • Human used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “smooth”, “corners” • Items were then separated based on the chosen features • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  • 64. BigML, Inc #MLSEV: Anomaly Detection Human Expert !64 • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count • Smooth - true or false Create features that capture these object differences
  • 65. BigML, Inc #MLSEV: Anomaly Detection Anomaly Features !65 Object (Length / Width, Num Surfaces, Smooth): penny (1, 3, TRUE); dime (1, 3, TRUE); knob (1, 4, TRUE); eraser (2.75, 6, TRUE); box (1, 6, TRUE); block (1.6, 6, TRUE); screw (8, 3, FALSE); battery (5, 3, TRUE); key (4.25, 3, FALSE); bead (1, 2, TRUE)
  • 66. BigML, Inc #MLSEV: Anomaly Detection Random Splits !66 [Figure: decision tree with True/False splits on length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2; leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw] Know that “splits” matter - don’t know the order
  • 67. BigML, Inc #MLSEV: Anomaly Detection Isolation Forest !67 Grow a random decision tree until each instance from a sample is in its own leaf “easy” to isolate “hard” to isolate Depth Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
  • 68. BigML, Inc #MLSEV: Anomaly Detection Isolation Forest Scoring !68 Instances (f1, f2, f3): i1 (red, cat, ball); i2 (red, cat, ball); i3 (red, cat, box); i4 (blue, dog, pen). For the instance i2, find the depth in each tree: D = 3, D = 6, D = 2. Map avg depth to final score: S=0.45
  • 69. BigML, Inc #MLSEV: Anomaly Detection Model Competence !69 • A low anomaly score means the loan is similar to the modeled loans. • A high anomaly score means you should not trust the model. Prediction T, Confidence 0.86, Anomaly Score 0.5367, Competent? Y; Prediction T, Confidence 0.84, Anomaly Score 0.7124, Competent? N. [Diagram labels: OPEN LOANS, PREDICTION, ANOMALY SCORE, CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR]
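The slide does not say how average depth is mapped to a score. One common choice, assumed here, is the normalization from the original isolation forest formulation: score = 2^(-avg_depth / c(n)), where c(n) is the expected path length for a sample of n instances. The sample size of 256 is also an assumption, and this sketch is not expected to reproduce the slide's S=0.45, which may come from a different mapping.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + Euler's constant."""
    if n <= 2:
        raise ValueError("this sketch assumes a sample size greater than 2")
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(depths, sample_size):
    """Map the average isolation depth across trees to a score in (0, 1)."""
    avg_depth = sum(depths) / len(depths)
    return 2.0 ** (-avg_depth / average_path_length(sample_size))

# Depths of one instance in three trees, as on the slide; sample size is assumed.
score = anomaly_score([3, 6, 2], sample_size=256)
print(round(score, 3))
```

Under this mapping, instances isolated after only a few splits score close to 1 (dissimilar), and an instance whose average depth equals c(n) scores exactly 0.5.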
  • 71. BigML, Inc #MLSEV: Anomaly Detection 1-Class Classifier? !71 • You place an advertisement in a local newspaper • You collect demographic information about all responders • Now you want to market in a new locality with direct letters • To optimize mailing costs, you need to predict who will respond • But you cannot distinguish “not interested” from “didn’t see the ad” • Train an anomaly detector on the 1-class data • Pick the households with the lowest scores for mailing: • If a household has a low anomaly score, then it is “similar” to enough of your positive responders and therefore may respond as well • If a household has a high anomaly score, then it is dissimilar from all previous responders and is therefore less likely to respond.
  • 72. BigML, Inc #MLSEV: Anomaly Detection Summary !72 • Anomaly detection is the process of finding unusual instances • Some techniques and how they work: • Univariate: standard deviation • Benford’s law • Isolation Forest • Applications • Filtering to improve models • Finding mistakes, fraud, and intruders • Knowing when to retrain a model (competence) • 1-class classifiers • In general… unsupervised learning techniques: • Require more finesse and interpretation • Are more commonly part of a multistep workflow