Anomaly Detection

What are anomalies/outliers?
  • The set of data points that are considerably different from the remainder of the data
  • Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

Variants of Anomaly/Outlier Detection Problems
  • Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  • Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  • Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
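
The three problem variants can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the scoring function in the third variant (distance from the mean of D in units of its standard deviation) is an assumed choice of f, not one the slides prescribe.

```python
def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def score_against(database, x):
    """Variant 3: score a test point x against mostly-normal 1-D data D.

    Assumed scoring function: |x - mean(D)| / std(D).
    """
    mean = sum(database) / len(database)
    std = (sum((v - mean) ** 2 for v in database) / len(database)) ** 0.5
    if std == 0:
        # Constant data: any different value is infinitely unusual.
        return 0.0 if x == mean else float("inf")
    return abs(x - mean) / std
```

The first two variants differ only in how the cut-off is chosen: a fixed threshold returns however many points exceed it, while top-n always returns exactly n points (fewer only if the data set is smaller).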

Anomaly Detection
  • Challenges:
      - How many outliers are there in the data?
      - The method is unsupervised, so validation can be quite challenging (just like for clustering)
      - Finding a needle in a haystack
  • Working assumption:
      - There are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data

Anomaly Detection Schemes
  • General steps:
      - Build a profile of the "normal" behavior. The profile can be patterns or summary statistics for the overall population
      - Use the "normal" profile to detect anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile

Types of anomaly detection schemes
  • Statistical-based
  • Distance-based
  • Model-based

Statistical Approaches
  • Assume a parametric model describing the distribution of the data (e.g., normal distribution)
  • Apply a statistical test that depends on:
      - Data distribution
      - Parameters of the distribution (e.g., mean, variance)
      - Number of expected outliers (confidence limit)
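
Grubbs' Test
  • Detects outliers in univariate data
  • Assumes the data come from a normal distribution
  • Detects one outlier at a time; remove the outlier and repeat
  • H0: there is no outlier in the data
  • HA: there is at least one outlier
  • Grubbs' test statistic: $G = \max_i |x_i - \bar{x}| / s$, where $\bar{x}$ is the sample mean and $s$ the sample standard deviation
  • Reject H0 if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2}{N-2+t^2}}$, where $t$ is the critical value of the t-distribution with $N-2$ degrees of freedom at significance level $\alpha/(2N)$ (two-sided test)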
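
A minimal sketch of Grubbs' test in Python, using only the standard library. The statistic is G = max|x_i - mean| / s with s the sample standard deviation, and the rejection threshold is (N-1)/sqrt(N) * sqrt(t^2 / (N-2+t^2)). The standard library has no t-distribution quantile function, so the t critical value is supplied by the caller through `t_crit_for`, a hypothetical callback that maps the current sample size to the critical value looked up from a table or a statistics package.

```python
import math


def grubbs_statistic(data):
    """Return (G, index of the most extreme value)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        return 0.0, 0  # all values identical: nothing stands out
    idx = max(range(n), key=lambda i: abs(data[i] - mean))
    return abs(data[idx] - mean) / s, idx


def grubbs_critical(n, t_crit):
    """Rejection threshold for G given the t critical value (n - 2 d.o.f.)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def grubbs_outliers(data, t_crit_for):
    """Remove one outlier at a time until H0 can no longer be rejected.

    t_crit_for(n) is an assumed caller-supplied lookup of the t critical
    value for a sample of size n.
    """
    remaining = list(data)
    removed = []
    while len(remaining) > 2:
        g, idx = grubbs_statistic(remaining)
        if g <= grubbs_critical(len(remaining), t_crit_for(len(remaining))):
            break
        removed.append(remaining.pop(idx))
    return removed
```

Because G can never exceed (N-1)/sqrt(N), the test has little power on very small samples, which is one reason the loop stops once only two points remain.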

Statistical-based – Likelihood Approach
  • Assume the data set D contains samples from a mixture of two probability distributions:
      - M (majority distribution)
      - A (anomalous distribution)
  • General approach:
      - Initially, assume all the data points belong to M
      - Let L_t(D) be the log likelihood of D at time t
      - For each point x_t that belongs to M, move it to A
      - Let L_{t+1}(D) be the new log likelihood
      - Compute the difference, Δ = L_t(D) – L_{t+1}(D)
      - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A
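
The likelihood approach can be sketched for one-dimensional data as follows. Several choices here are assumptions that the slides do not specify: M is modelled as a Gaussian refitted to its current members, A as a uniform distribution over the data range, and the mixing weight `lam` is the assumed prior probability of an anomaly. The sketch scores each move by the gain in log likelihood, L_{t+1}(D) - L_t(D), and treats a gain larger than c as the signal. That sign convention is also an assumption, chosen because moving a genuine anomaly out of M raises the likelihood under this model.

```python
import math


def gaussian_loglik(points):
    """Log likelihood of points under a Gaussian fitted to them by maximum likelihood."""
    n = len(points)
    mean = sum(points) / n
    var = max(sum((x - mean) ** 2 for x in points) / n, 1e-12)  # avoid log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def total_loglik(m_points, a_count, lam, log_pa):
    """L(D) = |M| log(1 - lam) + sum log P_M + |A| (log lam + log P_A)."""
    return (len(m_points) * math.log(1 - lam)
            + gaussian_loglik(m_points)
            + a_count * (math.log(lam) + log_pa))


def likelihood_anomalies(data, lam=0.1, c=5.0):
    """Return indices moved permanently from M to A (data must not be constant)."""
    log_pa = -math.log(max(data) - min(data))  # uniform density over the data range
    in_m = list(range(len(data)))
    anomalies = []
    current = total_loglik([data[i] for i in in_m], 0, lam, log_pa)
    for i in range(len(data)):
        if len(in_m) <= 2:
            break  # too few points left to fit M
        trial = [j for j in in_m if j != i]
        new = total_loglik([data[j] for j in trial], len(anomalies) + 1, lam, log_pa)
        if new - current > c:  # large gain: keep x_i in A
            in_m, current = trial, new
            anomalies.append(i)
    return anomalies
```

Points whose move does not pass the threshold are simply left in M, which corresponds to moving them back before testing the next point.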

Limitations of Statistical Approaches
  • Most of the tests are for a single attribute
  • In many cases, the data distribution may not be known
  • For high-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
  • Data is represented as a vector of features
  • Three major approaches:
      - Nearest-neighbor based
      - Density based
      - Clustering based

Nearest-Neighbor Based Approach
  • Approach:
      - Compute the distance between every pair of data points
  • There are various ways to define outliers:
      - Data points for which there are fewer than p neighboring points within a distance D
      - The top n data points whose distance to the kth nearest neighbor is greatest
      - The top n data points whose average distance to the k nearest neighbors is greatest
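
The three nearest-neighbor definitions can be sketched with brute-force pairwise distances. The function names are hypothetical, Euclidean distance is an assumed metric, and `math.dist` requires Python 3.8 or later.

```python
import math


def neighbor_distances(points, i):
    """Sorted distances from points[i] to every other point."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)


def few_neighbors(points, p, d):
    """Indices of points with fewer than p neighbors within distance d."""
    return [i for i in range(len(points))
            if sum(1 for dist in neighbor_distances(points, i) if dist <= d) < p]


def kth_nn_scores(points, k):
    """Score of each point = distance to its k-th nearest neighbor."""
    return [neighbor_distances(points, i)[k - 1] for i in range(len(points))]


def avg_knn_scores(points, k):
    """Score of each point = average distance to its k nearest neighbors."""
    return [sum(neighbor_distances(points, i)[:k]) / k for i in range(len(points))]


def top_n(scores, n):
    """Indices of the n highest-scoring points."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

For example, `top_n(kth_nn_scores(points, k=2), n=1)` returns the single point that lies farthest from its second-nearest neighbor. The brute-force search costs O(N^2) distance computations.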

Density-based: LOF approach
  • For each point, compute the density of its local neighborhood
  • Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
  • Outliers are points with the largest LOF value
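
A simplified LOF-style score can be sketched as below. This is not the full published LOF definition, which uses reachability distances. Here the density of a point is assumed to be the inverse of its average distance to its k nearest neighbors, and the score is the average ratio of each neighbor's density to the point's own density, so that larger values indicate outliers. The sketch also assumes there are no duplicate points, since those would give an average distance of zero.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def simplified_lof(points, k):
    """Average of density(neighbor) / density(p) over the k nearest neighbors of p."""
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    density = []
    for i, nbrs in enumerate(neighbors):
        avg_dist = sum(math.dist(points[i], points[j]) for j in nbrs) / k
        density.append(1.0 / avg_dist)  # assumes no duplicate points
    return [sum(density[j] for j in nbrs) / (k * density[i])
            for i, nbrs in enumerate(neighbors)]
```

A point inside a uniformly dense region scores close to 1, because its neighbors are about as dense as it is. A point whose neighbors are much denser than itself scores well above 1.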

Clustering-Based
  • Basic idea:
      - Cluster the data into groups of different density
      - Choose points in small clusters as candidate outliers
      - Compute the distance between candidate points and non-candidate clusters. If candidate points are far from all other non-candidate points, they are outliers
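
The candidate-selection and distance-check steps can be sketched as follows, assuming the cluster labels have already been produced by some clustering algorithm. The parameters `min_size` (the size below which a cluster counts as small) and `far` (the distance beyond which a candidate counts as far) are hypothetical tuning knobs that the slides do not define.

```python
import math
from collections import Counter


def cluster_outliers(points, labels, min_size, far):
    """Indices of points in small clusters that lie far from all non-candidate points."""
    sizes = Counter(labels)
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    regular = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = []
    for i in candidates:
        # Distance to the closest point of any non-candidate cluster;
        # infinite if every cluster is small.
        nearest = min((math.dist(points[i], points[j]) for j in regular),
                      default=math.inf)
        if nearest > far:
            outliers.append(i)
    return outliers
```

The second check matters: a point that ends up in a tiny cluster but sits right next to a large one is not reported.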

Pros and Cons
  • Advantages:
      - No need to be supervised
      - Easily adaptable to on-line / incremental mode, suitable for anomaly detection from temporal data
  • Drawbacks:
      - Computationally expensive. Using indexing structures (k-d tree, R* tree) may alleviate this problem
      - If normal points do not create any clusters, the techniques may fail
      - In high-dimensional spaces, data is sparse and distances between any two data records may become quite similar. Clustering algorithms may not give any meaningful clusters

Conclusion
  • This presentation covers anomaly detection in data mining
  • The types of anomaly detection schemes and their merits and demerits are briefly discussed

Visit more self help tutorials
  • Pick a tutorial of your choice and browse through it at your own pace
  • The tutorials section is free, self-guiding and will not involve any additional support
  • Visit us at www.dataminingtools.net