Anomaly Detection
made easy
Piotr Guzik
$whoami
● Data Engineer @Allegro (Scala, Kafka, Spark, Ansible, ML)
● Trainer @GetInData
● https://twitter.com/guzik_io
● Data Science flavour
Why is anomaly detection interesting?
Anomaly detection on clickstream is all about:
● SLA (data should be a first-class citizen)
● You should be the first to know if something is wrong
“Engineers in the 20th century made mistakes, but never by more than one order of magnitude. In IT it is not that good.”
Motivation and goals
Goal: find out quickly when data is lost
● Losing data is somewhat like losing money
● “You cannot improve what you cannot measure”
● The team responsible for a given service should be alerted when something is wrong
How to start? Important questions
● How do we get the data?
● Real-time detection?
● Delay?
● What is an anomaly?
Discovering the data source and the data itself
● Data source - Druid
● OLAP cube dimensions as domains
● Data aggregated every 15 minutes
● Metric - a simple count
What is the core data?
Data ~= result of the query:
● select count(*) as cnt, category, action, time_window_15_m
from page_views
where category = 'Search' and action = 'ShowItem'
group by category, action, time_window_15_m
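The query above can be mirrored in a few lines of Python to show what one data point is. This is only a sketch: the in-memory list of (timestamp, category, action) tuples and the function names are assumptions made for illustration, not how Druid stores or serves the data.

```python
from collections import Counter
from datetime import datetime

WINDOW_MINUTES = 15

def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % WINDOW_MINUTES,
                      second=0, microsecond=0)

def count_page_views(events, category="Search", action="ShowItem"):
    """Count matching events per (category, action, window), like the query.

    `events` is an iterable of (timestamp, category, action) tuples; this
    shape is an assumption made for the example.
    """
    counts = Counter()
    for ts, cat, act in events:
        if cat == category and act == action:
            counts[(cat, act, window_start(ts))] += 1
    return counts

# Made-up events, for illustration only.
events = [
    (datetime(2016, 5, 9, 18, 1), "Search", "ShowItem"),
    (datetime(2016, 5, 9, 18, 14), "Search", "ShowItem"),
    (datetime(2016, 5, 9, 18, 16), "Search", "ShowItem"),
    (datetime(2016, 5, 9, 18, 20), "Listing", "ShowItem"),
]
counts = count_page_views(events)
```

Each key of `counts` corresponds to one row of the query result, and the value is `cnt`.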
First look at the data
Knowing the data
● Clickstream is periodic
● Week == period
● Days of the week differ a lot
● Web traffic rises rapidly at about 6 PM and starts to fall at about 10 PM
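One way to make a model aware of the weekly period is to compare each 15-minute window only with the same window in previous weeks. A minimal sketch, assuming a hypothetical `slot_of_week` helper that is not taken from the talk:

```python
from datetime import datetime

SLOTS_PER_DAY = 24 * 4              # 96 fifteen-minute windows per day
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY  # 672 windows in one weekly period

def slot_of_week(ts: datetime) -> int:
    """Index of the 15-minute window within the week (Monday 00:00 is slot 0)."""
    return ts.weekday() * SLOTS_PER_DAY + ts.hour * 4 + ts.minute // 15

# 9 May 2016 and 16 May 2016 are both Mondays, so they share a slot.
monday_6pm = slot_of_week(datetime(2016, 5, 9, 18, 0))
next_monday_6pm = slot_of_week(datetime(2016, 5, 16, 18, 5))
tuesday_6pm = slot_of_week(datetime(2016, 5, 10, 18, 0))
```

With this indexing, Monday 6 PM is compared only with earlier Monday 6 PM windows, so the evening peak and the differences between days do not look like anomalies.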
Research
Motto: the solution must be easy, and not only for data scientists.
Available solutions:
● Twitter library - too hard, heavy math, many hyperparameters
● HTM algorithms - way too hard, neural networks, deep learning, very hard to reason about the algorithm and its results
We have to create our own simple model
What should our model look like?
Perfect model:
● Simple
● Time-aware
● Detects in minutes rather than hours
● Adapts to trends (ads, currently popular items)
● Does not report too many false positives
● Uses confidence intervals
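The "confidence intervals" requirement can be illustrated with the simplest possible band: the mean of past counts for a slot, plus or minus a few standard deviations. This is a sketch under assumptions; the multiplier `k`, the function names and the sample counts are invented, and the talk does not say which interval the real model uses.

```python
from statistics import mean, stdev

def confidence_band(history, k=3.0):
    """Return (low, high) = mean -/+ k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return m - k * s, m + k * s

def is_anomaly(value, history, k=3.0):
    """True when the new count falls outside the band built from history."""
    low, high = confidence_band(history, k)
    return value < low or value > high

# Made-up counts for the same slot in five previous weeks.
same_slot_history = [1000, 1040, 980, 1010, 990]

normal = is_anomaly(1020, same_slot_history)
data_loss = is_anomaly(400, same_slot_history)
spike = is_anomaly(1100, same_slot_history)
```

A wider band (larger `k`) reports fewer false positives but reacts later, which is the trade-off the list above describes.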
Best tool for inventing an algorithm
Model draft
F.A.I.L. - First Attempt In Learning
Simple statistical model in R
First results:
● A rapid change of the metric is a problem
● Trend is important but must not lead to overfitting
Experimenting in progress
After model evolution:
● Outliers are problematic (they inflate the standard deviation)
● Outliers == duplicates of data on HDFS (thank you Camus!)
● Percentiles are great for outlier removal
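A small sketch of percentile-based outlier removal, assuming a 5th to 95th percentile cut. The cut-offs, the function name and the sample data are illustrative assumptions, not values from the talk.

```python
from statistics import quantiles, stdev

def trim_by_percentiles(values, lower_pct=5, upper_pct=95):
    """Keep only values between the given percentiles (inclusive)."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if low <= v <= high]

# Made-up counts; the 2000 imitates a window whose data was duplicated.
values = [1000, 1010, 990, 1005, 995, 1002, 998, 1008, 992, 2000]
trimmed = trim_by_percentiles(values)

spread_before = stdev(values)
spread_after = stdev(trimmed)
```

A single duplicated window dominates the standard deviation of the raw values, which would make any band built from it far too wide. After trimming, the spread reflects normal traffic again.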
Problems with R
● Only the data scientist knows R
● There is no easy way to deploy it
● You cannot monitor it easily
● It is hard to maintain
Decision: we have to rewrite it. From scratch. In Scala.
Input from Druid
Model
Some math (EMA!)
Trend (fast-changing world)
Learning is a difficult process
What if we learned something that is no longer valid?
The mean could be bad, but what about the EMA?
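The difference between the plain mean and an exponential moving average (EMA) shows up as soon as the level of the metric shifts. A minimal sketch, assuming a smoothing factor `alpha` of 0.3 and invented data; the slide with the talk's actual formula is not part of this transcript.

```python
def ema(values, alpha=0.3):
    """Exponential moving average: new = alpha * x + (1 - alpha) * previous."""
    it = iter(values)
    current = next(it)  # seed with the first observation
    for x in it:
        current = alpha * x + (1 - alpha) * current
    return current

# Made-up series: traffic sits at 1000, then moves to a new level of 1500.
series = [1000] * 20 + [1500] * 10

plain_mean = sum(series) / len(series)
smoothed = ema(series, alpha=0.3)
```

Here the plain mean stays near 1167 because it weights old observations as much as new ones, while the EMA ends close to 1500. A larger `alpha` forgets outdated knowledge faster but is also more easily pulled by noise.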
Anomaly detection - almost there?
Anomaly detection - did we miss something?
● A long-lasting anomaly is not an anomaly anymore
● Loss of data is crucial
● Output should be easy to understand
Long-lasting anomalies - key concepts
Output: probability (with sign) of anomaly
● Small anomalies should be smoothed out and larger ones amplified (monitoring and alerting)
● We define where obvious anomalies start
● We define after how long we should treat anomalies as the norm (be careful here)
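One possible shape for such a signed output is sketched below. It is an assumption, not the talk's formula: the score is a stand-in for the probability, and the `tolerance` and `obvious` thresholds and the smoothstep curve are chosen only to show small deviations being damped and obvious ones saturating.

```python
def signed_anomaly_score(value, expected, tolerance, obvious):
    """Map a deviation to a score in [-1, 1]; the sign tells drop from spike.

    Deviations within `tolerance` score 0, deviations of `obvious` or more
    score +/-1, and the range in between follows a smoothstep curve.
    """
    deviation = value - expected
    magnitude = abs(deviation)
    if magnitude <= tolerance:
        return 0.0
    t = min((magnitude - tolerance) / (obvious - tolerance), 1.0)
    score = t * t * (3 - 2 * t)  # smoothstep: flat near 0 and near 1
    return score if deviation > 0 else -score

# Made-up numbers: expected 1000, noise up to 50, obvious anomaly from 300.
within_noise = signed_anomaly_score(1020, 1000, 50, 300)
small_spike = signed_anomaly_score(1100, 1000, 50, 300)
medium_spike = signed_anomaly_score(1175, 1000, 50, 300)
data_loss = signed_anomaly_score(400, 1000, 50, 300)
```

A negative score close to -1 signals likely data loss, which a team can alert on directly, while values near zero can be ignored or only charted.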
Long-lasting anomalies - fix
When an anomaly lasts long enough, we multiply all model parameters, as if we had been wrong from the beginning
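The fix can be sketched as a streak counter plus a rescaling step. Everything here is hypothetical: the `Model` class, the `max_streak` threshold and the choice of observed/expected as the multiplier are assumptions used to show the idea, not the talk's implementation.

```python
class Model:
    """Per-slot expected counts and spreads, plus an anomaly streak counter."""

    def __init__(self, expected, spread, max_streak=8):
        self.expected = list(expected)  # expected count per slot (non-zero)
        self.spread = list(spread)      # allowed spread per slot
        self.max_streak = max_streak    # windows before an anomaly is the norm
        self.streak = 0

    def observe(self, slot, value, k=3.0):
        """Return True if `value` is anomalous; rescale after a long streak."""
        expected = self.expected[slot]
        anomalous = abs(value - expected) > k * self.spread[slot]
        self.streak = self.streak + 1 if anomalous else 0
        if self.streak >= self.max_streak:
            # Multiply every parameter by the same factor, as if the model
            # had been off by that factor from the beginning.
            factor = value / expected
            self.expected = [e * factor for e in self.expected]
            self.spread = [s * factor for s in self.spread]
            self.streak = 0
        return anomalous

# Made-up two-slot model; traffic halves and stays there.
model = Model(expected=[1000, 2000], spread=[20, 40], max_streak=3)
flags = [model.observe(0, 500) for _ in range(3)]
after_rescale = model.observe(1, 1000)
```

After three anomalous windows in a row the whole model is scaled by 0.5, so the halved traffic in the next slot is treated as normal. Choosing `max_streak` too small would hide real incidents, which is the "be careful here" warning.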
Deployment
SaaS model
● Multiple deployments with the same codebase
● Different configuration for each deployment
● Clients define how they want to react
Configuration example
Whole team - thank you
More than just me and my team were involved in this process.
Big thanks to:
● My team for motivation and heated discussions :)
● Paweł Zawistowski - initial model in R
● Other teams for real use cases (this is why you want to be in production quickly)
Thank you
Q & A
Piotr Guzik

