Anomaly / Outlier Detection
Introduction
• An anomaly is a pattern in the data that does not conform to the expected behaviour. Anomalies are also called outliers, exceptions, peculiarities, surprises, etc.
• Anomalies are data points that are considerably different from the rest of the data.
• Outliers are data points that are considered out of the ordinary or abnormal. This includes noise.
• Anomalies are a special kind of outlier that carries significant, critical, or actionable information which could be of interest to analysts.
• Anomalous events occur infrequently.
• Anomalies translate to significant (often critical) real life entities
• Cyber intrusions
• Credit card fraud
• Applications:
• Credit card fraud detection
• Telecommunication fraud detection
• Network intrusion detection
• Fault detection
Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
[Figure: scatter plot in the X-Y plane showing normal regions N1 and N2, points o1 and o2, and region O3]
Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, collective
• Output of anomaly detection
• Evaluation of anomaly detection techniques
Input Data
• The most common form of data handled by anomaly detection techniques is record data
• Univariate
• Multivariate
Tid  SrcIP          Start time  Dest IP         Dest Port  Number of bytes  Attack
1    206.135.38.95  11:07:20    160.94.179.223  139        192              No
2    206.163.37.95  11:13:56    160.94.179.219  139        195              No
3    206.163.37.95  11:14:29    160.94.179.217  139        180              No
4    206.163.37.95  11:14:30    160.94.179.255  139        199              No
5    206.163.37.95  11:14:32    160.94.179.254  139        19               Yes
6    206.163.37.95  11:14:35    160.94.179.253  139        177              No
7    206.163.37.95  11:14:36    160.94.179.252  139        172              No
8    206.163.37.95  11:14:38    160.94.179.251  139        285              Yes
9    206.163.37.95  11:14:41    160.94.179.250  139        195              No
10   206.163.37.95  11:14:44    160.94.179.249  139        163              Yes
Input Data – Nature of Attributes
• Nature of attributes
• Binary
• Categorical
• Continuous
Tid  SrcIP          Duration  Dest IP         Number of bytes  Internal
1    206.163.37.81  0.10      160.94.179.208  150              No
2    206.163.37.99  0.27      160.94.179.235  208              No
3    160.94.123.45  1.23      160.94.179.221  195              Yes
4    206.163.37.37  112.03    160.94.179.253  199              No
5    206.163.37.41  0.32      160.94.179.244  181              No
Input Data – Complex Data Types
• Relationship among data instances
• Sequential
• Temporal
• Spatial
• Spatio-temporal
• Graph
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Labels
• Supervised Anomaly Detection
• Labels available for both normal data and anomalies
• Semi-supervised Anomaly Detection
• Labels available only for normal data
• Unsupervised Anomaly Detection
• No labels assumed
• Based on the assumption that anomalies are very rare compared to normal data
Type of Anomaly
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous with respect to the data
[Figure: scatter plot in the X-Y plane showing normal regions N1 and N2, points o1 and o2, and region O3]
Contextual Anomalies
• An individual data instance is anomalous within a context
• Requires a notion of context
• Also called conditional anomalies
[Figure: example in which one value is labelled Normal and another is labelled Anomaly]
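A minimal sketch of the idea of a contextual anomaly, assuming the context is a season and the behaviour is a temperature reading. The readings, the `history` dictionary, the function name, and the threshold of 3 standard deviations are all hypothetical choices made for illustration; the slides do not prescribe a method.

```python
import statistics

# Hypothetical temperature readings (degrees C), grouped by context (season).
history = {
    "winter": [1.0, 3.0, 2.0, 0.0, 4.0],
    "summer": [27.0, 29.0, 31.0, 28.0, 30.0],
}


def is_contextual_anomaly(value, context, history, threshold=3.0):
    """Flag a value that is unusual for its own context only.

    The value is compared against the mean and sample standard deviation
    of past readings that share the same context.
    """
    values = history[context]
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return abs(value - mean) / std > threshold


# The same reading is normal in one context and anomalous in another.
winter_flag = is_contextual_anomaly(3.0, "winter", history)
summer_flag = is_contextual_anomaly(3.0, "summer", history)
```

Here a reading of 3.0 is ordinary in winter but far from anything seen in summer, so only the summer case is flagged.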
Collective Anomalies
• A collection of related data instances is anomalous.
• Requires a relationship among data instances
• Sequential Data
• Spatial Data
• Graph Data
• The individual instances within a collective anomaly are not anomalous by themselves.
[Figure: sequence with an anomalous subsequence highlighted]
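A toy sketch of a collective anomaly in sequential data, assuming a signal that normally alternates between 0 and 1. Every individual value in the made-up `series` is normal on its own; only a run of identical values is unusual. The function name and the "flat window" rule are illustrative assumptions, not a method taken from the slides.

```python
def flat_windows(series, width):
    """Return the start indices of windows whose values do not vary at all."""
    starts = []
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        if max(window) == min(window):
            starts.append(i)
    return starts


# Hypothetical signal: alternates 0/1, except for a run of 1s in the middle.
series = [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0]
starts = flat_windows(series, 3)  # windows of three consecutive values
```

The values 0 and 1 both occur throughout the series, so a point-wise detector would see nothing unusual; the anomaly only appears when related instances are examined together.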
Output of Anomaly Detection
• Label
• Each test instance is given a normal or anomaly label
• Score
• Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter
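The two output forms can be sketched as follows. The scores are made-up values and the threshold of 0.5 is arbitrary; both are assumptions for illustration only.

```python
def rank_by_score(scores):
    """Return instance indices ordered from most to least anomalous."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def label_by_threshold(scores, threshold):
    """Turn anomaly scores into labels: 'anomaly' if score > threshold."""
    return ["anomaly" if s > threshold else "normal" for s in scores]


# Hypothetical anomaly scores for five test instances.
scores = [0.12, 0.95, 0.30, 0.07, 0.81]
ranking = rank_by_score(scores)            # most anomalous instance first
labels = label_by_threshold(scores, 0.5)   # 0.5 is an arbitrary threshold
```

A score keeps more information than a label: the analyst can inspect the top few ranked instances, or move the threshold to trade false alarms against missed anomalies.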
Anomaly Detection Schemes
• General Steps
• Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
• Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
• Graphical & Statistical-based
• Distance-based
• Model-based
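A minimal statistical sketch of the two general steps, assuming the profile is the mean and sample standard deviation of a single attribute. The byte counts are taken from the record data table earlier in the deck (rows marked "No" under Attack); the z-score rule and the threshold of 3 are assumptions chosen for illustration.

```python
import statistics


def build_profile(normal_values):
    """Step 1: summarize 'normal' behaviour as (mean, sample std deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def z_score(value, profile):
    """Distance of a value from the profile mean, in standard deviations."""
    mean, std = profile
    return abs(value - mean) / std


# Number of bytes for the records labelled as non-attacks in the table.
normal_bytes = [192, 195, 180, 199, 177, 172, 195]
profile = build_profile(normal_bytes)

# Step 2: flag observations that differ significantly from the profile.
threshold = 3.0  # arbitrary cut-off, an assumption
flags = {v: z_score(v, profile) > threshold for v in (19, 285, 163, 192)}
```

With this single attribute, the records with 19 and 285 bytes are flagged but the record with 163 bytes is not, even though the table marks it as an attack. A profile built on one summary statistic will miss anomalies that only show up in other attributes.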
Clustering Based Anomaly Detection
• Key assumption: Normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
• Categorization according to labels
• Semi-supervised – Cluster normal data to create modes of normal behavior. A new instance that does not belong to any of the clusters, or is not close to any cluster, is called an anomaly
• Unsupervised – After the clustering step, post-processing is needed to determine the size of each cluster and the distance of each point from the clusters; these decide whether a point is an anomaly.
• Anomalies detected using clustering based methods can be:
1. Data records that do not fit into any cluster (residuals from clustering)
2. Small clusters
3. Low density clusters or local anomalies (far from other points within the same cluster)
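The "small clusters" case can be sketched as a post-processing step over cluster assignments. The `labels` list and the minimum size of 3 are hypothetical; any clustering algorithm could have produced the assignments.

```python
from collections import Counter


def points_in_small_clusters(labels, min_size):
    """Return indices of points whose cluster has fewer than min_size members."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] < min_size]


# Hypothetical cluster assignments for ten data records.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2]
suspects = points_in_small_clusters(labels, min_size=3)
```

Only the record in the single-member cluster is returned. Records that fit no cluster at all, and records far from the other members of their own cluster, need a distance-based score in addition to this size check.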
Anomaly Detection: Clustering Approach
Anomaly score function:
• Given a data point x from a dataset D, the anomaly score f(x) can be defined in several ways.
Alternate definitions:
1. f(x) = distance between x and its closest centroid
2. f(x) = relative distance: the ratio of the distance between x and its closest centroid to the median distance of all data points in that cluster from the centroid
3. f(x) = improvement in the goodness of a cluster (as measured by an objective function) when x is removed
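Definitions 1 and 2 can be sketched as below, assuming Euclidean distance and clusters that have already been computed. The centroids, cluster members, and query points are made-up values, and the function names are illustrative.

```python
import math
import statistics


def closest_centroid(x, centroids):
    """Index of the centroid nearest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))


def distance_score(x, centroids):
    """Definition 1: distance between x and its closest centroid."""
    return min(math.dist(x, c) for c in centroids)


def relative_distance_score(x, centroids, clusters):
    """Definition 2: distance to the closest centroid divided by the median
    distance of that cluster's members from the same centroid."""
    q = closest_centroid(x, centroids)
    member_dists = [math.dist(p, centroids[q]) for p in clusters[q]]
    return math.dist(x, centroids[q]) / statistics.median(member_dists)


# Hypothetical clustering: a tight cluster and a loose cluster.
centroids = [(0, 0), (20, 0)]
clusters = [
    [(0, 1), (1, 0), (0, -1), (-1, 0)],      # members about 1 unit from (0, 0)
    [(20, 5), (25, 0), (20, -5), (15, 0)],   # members about 5 units from (20, 0)
]
a, b = (3, 0), (26, 0)
plain = (distance_score(a, centroids), distance_score(b, centroids))
relative = (relative_distance_score(a, centroids, clusters),
            relative_distance_score(b, centroids, clusters))
```

By plain distance, b (6 units from its centroid) looks more anomalous than a (3 units). By relative distance the order flips: a is three times further out than a typical member of its tight cluster, while b is only slightly further out than a typical member of its loose cluster.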
Anomaly Detection: K-means Clustering Approach
Step 1: Select k random data points from the training dataset as the initial centroids of the clusters C1, C2, ..., Ck.
Step 2: For each training data point x:
a. Compute the Euclidean distance D(Ci, x) for i = 1, ..., k.
b. Find the cluster Cq whose centroid is closest to x.
c. Assign x to Cq and update the centroid of Cq. (The centroid of a cluster is the arithmetic mean of the data points in the cluster.)
Step 3: Repeat Step 2 until the centroids of clusters C1, C2, ..., Ck stabilize according to a convergence criterion.
Step 4: For each test (new) data point y:
a. Compute the Euclidean distance D(Ci, y) for i = 1, ..., k and find the cluster Cr that is closest to y. The distance D(Cr, y) is the anomaly score of y.
b. Apply a threshold t to this score: y is an outlier if and only if score > t. Otherwise, y is a normal data point.
• Points that fall in small clusters can also be treated as anomalies.
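The steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: it recomputes all centroids once per pass (the common batch variant) instead of updating a centroid after every single assignment, it accepts an optional `init` argument so the example is reproducible, and the training points and threshold are made up.

```python
import math
import random


def mean_point(points):
    """Arithmetic mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return k centroids for the given training points (list of tuples)."""
    # Step 1: choose k data points as initial centroids (random unless given).
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            q = min(range(k), key=lambda i: math.dist(x, centroids[i]))
            clusters[q].append(x)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 3: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def anomaly_score(y, centroids):
    """Step 4a: distance from y to its closest centroid."""
    return min(math.dist(y, c) for c in centroids)


def is_outlier(y, centroids, t):
    """Step 4b: y is an outlier if and only if its score exceeds t."""
    return anomaly_score(y, centroids) > t


# Hypothetical training data: two well-separated groups of four points.
train = [(0, 0), (0, 2), (2, 0), (2, 2),
         (10, 10), (10, 12), (12, 10), (12, 12)]
centroids = kmeans(train, k=2, init=[(0, 0), (12, 12)])
```

With these points the centroids settle at (1, 1) and (11, 11). A test point such as (1, 2) scores 1.0 and is normal for t = 3.0, while (6, 6), which lies between the two groups, scores about 7.07 and is flagged.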
[Figure: K-means with 2 clusters. Points are scored by their distance from the closest centroid; point D is not considered an outlier.]
[Figure: The same clustering with points scored by their relative distance from the closest centroid, which adjusts for the difference in density among the clusters.]
Clustering Based Anomaly Detection
• Advantages:
• No need to be supervised
• Easily adaptable to on-line anomaly detection from temporal data
• Drawbacks
• Computationally expensive – Time complexity is O(cn), where c is the number of clusters and n is the number of data points
• Using indexing structures (k-d tree, R* tree) may alleviate this problem
• If normal points do not create any clusters the techniques may fail
• In high dimensional spaces, data is sparse and distances between any two data records may
become quite similar.
• Clustering algorithms may not give any meaningful clusters
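The high-dimensional drawback can be illustrated with a small experiment, assuming points drawn uniformly at random from the unit hypercube. The function name, the number of points, and the chosen dimensions are arbitrary assumptions; the sketch only measures how much the nearest and farthest distances from a query point differ.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(2)     # two attributes
high = relative_contrast(500)  # five hundred attributes
```

In two dimensions the farthest point is many times further away than the nearest one, so "close" and "far" are meaningful. In 500 dimensions the contrast shrinks to a small fraction, which is why distance-based cluster assignments and anomaly scores become less informative.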
Visualization Based Techniques
• Use visualization tools to observe the data
• Provide alternate views of data for manual inspection
• Anomalies are detected visually
• Advantages
• Keeps a human in the loop
• Disadvantages
• Works well only for low dimensional data
• Can provide only aggregated or partial views for high dimensional data
Application of Dynamic Graphics
• Apply dynamic graphics to the exploratory analysis of spatial data.
• Visualization tools are used to examine local variability to detect anomalies
• Manual inspection of plots of the data that display its marginal and multivariate distributions
Applications of Anomaly Detection
• Network intrusion detection
• Insurance / Credit card fraud detection
• Healthcare Informatics / Medical diagnostics
• Industrial Damage Detection
• Image Processing / Video surveillance
• Novel Topic Detection in Text Mining





Anomaly detection (Unsupervised Learning) in Machine Learning
