Andreu Mora, Adyen
Time series forecasting
and monitoring with
Apache Spark and
ElasticSearch
#UnifiedDataAnalytics #SparkAISummit
Adyen
Payments Processor
Tech company
International customers (aka merchants)
Omnichannel
Back in the day…
The legacy monitor was based on a SQL query that
computed the average volume for each hour of the week
and compared it to a threshold.
Doesn’t quite work:
• Generates loads of false positives.
• It was fairly trimmed down: it only covered the top merchants.
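A minimal sketch of the hour-of-the-week check described above, in plain Python rather than SQL. The function names, the `threshold=0.5` rule and the sample data are illustrative assumptions, not the legacy query itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_of_week(ts: datetime) -> int:
    """Return 0..167, where 0 is Monday 00:00-00:59."""
    return ts.weekday() * 24 + ts.hour


def hourly_baseline(history):
    """Average volume per hour of the week from (timestamp, volume) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, volume in history:
        how = hour_of_week(ts)
        sums[how] += volume
        counts[how] += 1
    return {how: sums[how] / counts[how] for how in sums}


def is_alarm(baseline, ts, volume, threshold=0.5):
    """Hypothetical rule: alarm when volume < threshold * average for that hour."""
    expected = baseline.get(hour_of_week(ts))
    if expected is None:
        return False  # no history for this hour of the week
    return volume < threshold * expected


# Four weeks of flat synthetic traffic, 100 payments per hour (made-up data).
start = datetime(2019, 9, 2)
history = [(start + timedelta(hours=h), 100.0) for h in range(4 * 7 * 24)]
baseline = hourly_baseline(history)
```

A single static threshold per hour of the week ignores trend and events, which is one plausible reason for the false positives listed above.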
Reduce False
Positives
Catch anomalies
Do that at scale
Harness the
detection
performance
Connect to a live
platform
OK, but
What is an anomaly?
No luxury of a labelled dataset, divergence of opinions.
Connecting to a live platform without ML deployment hooks ready.
We were working on MLflow but not there yet.
No standard for timeseries forecasting at scale
With Spark, several choices.
Considerations when dealing with Big Data
Big Technology
Leverage mature tech to
solve the problem (hello Spark).
Big diversity
Many different topologies for
our merchants and yet one
algorithm to track them all.
Big consequences
1000 merchants * 1 check per 10 min * 95%
accuracy = 50,400 false-alarm emails/week
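The arithmetic behind that figure, spelled out. The reading assumed here is one check per merchant every 10 minutes, with 95% accuracy meaning 5 wrong calls per 100 checks; the variable names are illustrative.

```python
merchants = 1000
checks_per_week = 7 * 24 * 60 // 10      # one check every 10 minutes
wrong_per_100_checks = 100 - 95          # 95% accuracy read as 5 wrong calls per 100

emails_per_week = merchants * checks_per_week * wrong_per_100_checks // 100
print(checks_per_week, emails_per_week)
```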
Big Data Platform
Volumes Predictions
TimeSeries Ecosystem
Flint
Spark-ts
FB Prophet
statsmodels
Data size
consideration
1 year @ 1 min @ 64-bit double = 4.2 MB
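A quick check of that size estimate, assuming one 8-byte value per minute, a 365-day year and decimal megabytes:

```python
points_per_year = 365 * 24 * 60          # one sample per minute
bytes_per_point = 8                      # one 64-bit double
size_bytes = points_per_year * bytes_per_point
size_mb = size_bytes / 1e6               # decimal megabytes
print(points_per_year, size_bytes, round(size_mb, 1))
```

At this size a single merchant's series fits comfortably in the memory of one worker.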
Scoring in Java
A stopgap while we work on a fully functional engine to
deploy ML models based on MLflow.
Launch fast and iterate!
Transporting the model
The model transported for tens of thousands of
accounts needs to be lightweight.
Harness the maths
No black-box models: equations need to
be understood and replicated in Java.
Needs to perform fast
Score and decide whether the traffic we see from
ElasticSearch is actually anomalous, on the ms
scale.
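A rough sketch of why a linear model suits these constraints: the payload is a handful of floats, and scoring is a dot product plus a band check. It is written in Python for brevity; the `ModelPayload` fields and the fixed residual offsets are assumptions for illustration, not the production payload.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelPayload:
    """Hypothetical per-merchant payload: a few floats, cheap to transport."""
    intercept: float
    coefficients: Sequence[float]
    residual_low: float    # lower residual quantile, e.g. 2.5%
    residual_high: float   # upper residual quantile, e.g. 97.5%


def predict(payload: ModelPayload, features: Sequence[float]) -> float:
    """Expected volume: intercept plus dot product of coefficients and features."""
    if len(features) != len(payload.coefficients):
        raise ValueError("feature vector and coefficients differ in length")
    return payload.intercept + sum(
        c * x for c, x in zip(payload.coefficients, features)
    )


def is_anomalous(payload: ModelPayload, features: Sequence[float], observed: float) -> bool:
    """True when the observed volume falls outside the predicted band."""
    expected = predict(payload, features)
    low = expected + payload.residual_low
    high = expected + payload.residual_high
    return not (low <= observed <= high)


payload = ModelPayload(intercept=10.0, coefficients=[2.0, -1.0],
                       residual_low=-5.0, residual_high=5.0)
```

The same few lines port directly to Java, since they need nothing beyond a loop and two comparisons.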
Big Data Platform
Volumes
Model
Coefficients
Fourier
components
Would not
optimise the
business cycles
ARIMA
Not perfect for
picking up
seasonality
Isolation Forests
Great for
multidimensional
data, not so much
for time series.
Autoencoders
Good luck
transporting the
model for each
merchant.
XGBM
Noice, but score
that in Java.
Research stage
Understand a problem and build a solution, decide what’s best.
Ridge Regression
Makes scoring in Java nice and
kinda easy.
Residuals
Confidence intervals modelled
through quantile regression of
observed values.
Events
Recurrent or one-off events are
shown to the model.
Piece-wise linear trends
Break the signal down into pieces
and learn the latest trends.
Gaussian Basis Functions
Allow us to teach the model to
understand business cycles.
The model
Discover anomalous behaviour
based on a probability p.
Pre-sampling
Allows us to sample and bucketize
the merchants into adequate
intervals.
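A sketch of how the feature row for such a linear model could be built. The weekly period, the circular distance, and the names `gaussian_basis`, `hinge_features` and `feature_row` are assumptions for illustration; the slides do not show the actual feature code.

```python
import math

HOURS_PER_WEEK = 168


def gaussian_basis(hour_of_week, centers, width):
    """Gaussian bumps over the weekly cycle.

    Uses circular distance so that Sunday night sits next to Monday morning.
    """
    feats = []
    for c in centers:
        d = abs(hour_of_week - c) % HOURS_PER_WEEK
        d = min(d, HOURS_PER_WEEK - d)
        feats.append(math.exp(-0.5 * (d / width) ** 2))
    return feats


def hinge_features(t, hinges):
    """max(0, t - h) per hinge; a weighted sum of these is piece-wise linear in t."""
    return [max(0.0, float(t) - float(h)) for h in hinges]


def feature_row(t_days, hour_of_week, is_event, centers, width, hinges):
    """Concatenate seasonality, trend and event indicator into one feature vector."""
    return (
        gaussian_basis(hour_of_week, centers, width)
        + hinge_features(t_days, hinges)
        + [1.0 if is_event else 0.0]
    )


row = feature_row(t_days=10, hour_of_week=0, is_event=True,
                  centers=[0, 84], width=6.0, hinges=[0, 30, 60])
```

Ridge regression then learns one weight per column, so the fitted model stays a plain vector of coefficients.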
Trendspotting
Estimating hinges and trends and offering them as a
by-product to Account Managers for evaluating
the low variations in volume.
Train set: 90 days
Test set: 7 days
Real volume
Predicted volume
95% confidence
What do the predictions look like?
Missed event
The implementation
on Spark
How did we get there, on the Spark side.
Reusability
Overloading scikit-learn and pandas classes allows us to
ensure reusability.
Cross-validation
Ensure the best fit through hyperparameter
tuning.
Scalability
Using Spark’s map-reduce paradigm we fully
control computational performance.
SeasonalEstimator(BaseEstimator,RegressorMixin)
Input daily time series —> {t:[…], v:[…]}
Collect to list —> [{t:[…], v:[…]}]
Hinges and Hyperparameters
Distribute UDF
Making it happen at scale
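The flow above, mimicked in plain Python as a stand-in for the Spark steps (group by merchant, collect to a list, apply a function per merchant). `collect_series`, `fit_one` and `fit_all` are hypothetical names, and the per-merchant fit is a placeholder mean rather than the actual estimator.

```python
from collections import defaultdict


def collect_series(rows):
    """Group (merchant, t, v) rows into one {t: [...], v: [...]} record per merchant."""
    grouped = defaultdict(list)
    for merchant, t, v in rows:
        grouped[merchant].append((t, v))
    series = {}
    for merchant, points in grouped.items():
        points.sort()  # order each merchant's points by time
        series[merchant] = {
            "t": [t for t, _ in points],
            "v": [v for _, v in points],
        }
    return series


def fit_one(series):
    """Placeholder for the per-merchant fit; here just the mean volume."""
    return {"mean": sum(series["v"]) / len(series["v"]), "n": len(series["v"])}


def fit_all(rows, fit=fit_one):
    # In Spark this map would run as a UDF, one merchant's series per call.
    return {merchant: fit(s) for merchant, s in collect_series(rows).items()}


rows = [("a", 2, 4.0), ("a", 1, 2.0), ("b", 1, 10.0)]
models = fit_all(rows)
```

Because each merchant's series is small and independent, the per-merchant fit parallelises trivially.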
Cross-validation
F4-sampling score: favours higher sampling
considering classical precision and recall.
Custom CV fold split, TimeSeriesWeekSplit, to get the
sense of the business cycle.
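A sketch of the two ingredients, under stated assumptions: "F4" is read here as the standard F-beta score with beta = 4 (the sampling part of the score is not modelled), and `week_splits` is a hypothetical stand-in for TimeSeriesWeekSplit in which every test fold is one whole week preceded by its training data.

```python
def fbeta(precision, recall, beta=4.0):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def week_splits(n_days, n_folds, test_days=7):
    """Yield (train, test) ranges of day indices, oldest fold first.

    Each test fold covers whole weeks so it contains a full business cycle,
    and training data always precedes the test fold.
    """
    for k in range(n_folds, 0, -1):
        test_end = n_days - (k - 1) * test_days
        test_start = test_end - test_days
        if test_start <= 0:
            continue  # not enough history for this fold
        yield range(0, test_start), range(test_start, test_end)


folds = list(week_splits(n_days=97, n_folds=2))
```

With 97 days of data, the last fold trains on 90 days and tests on the final 7.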
The output
Harnessing
the prediction
performance
Enabling canary
roll-out based on
scores
Overcoming
unsupervised
learning
Alarm rate and synthetic recall allow us to
know for each case how many alarms would
have been captured and raised, even without
having a labelled dataset.
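One way such metrics could be computed without labels, as a sketch: the alarm rate is the share of points outside the predicted band, and synthetic recall is measured by injecting fake outages into normal points. The injection rule (scale the volume by `drop`) is an assumption, not necessarily the approach used in the talk.

```python
def alarm_rate(observed, lower, upper):
    """Share of points that fall outside the [lower, upper] band."""
    alarms = sum(
        1 for o, lo, hi in zip(observed, lower, upper) if not lo <= o <= hi
    )
    return alarms / len(observed)


def synthetic_recall(observed, lower, upper, drop=0.5):
    """Scale each in-band point by `drop` to fake an outage.

    Returns the share of fake outages that leave the band, i.e. would alarm.
    """
    injected = 0
    detected = 0
    for o, lo, hi in zip(observed, lower, upper):
        if not lo <= o <= hi:
            continue  # already an alarm; not used for injection
        injected += 1
        if not lo <= o * drop <= hi:
            detected += 1
    return detected / injected if injected else 0.0


# Made-up volumes and band edges for illustration.
observed = [100.0, 100.0, 100.0, 30.0]
lower = [80.0, 40.0, 80.0, 80.0]
upper = [120.0, 120.0, 120.0, 120.0]
```

A wider band lowers the alarm rate but also lowers synthetic recall, which is the trade-off profiled per probability level.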
Trade-off alarm
rates and recall
We provide a number of choices (95%, 97%,
99% probability) and completely profile what to
expect in terms of anomalies.
The model payload
Go Live
Houston? Houston? …
Grafana dashboard
So we saw this on the data
Scalable Time Series Forecasting and Monitoring using Apache Spark and ElasticSearch at Adye
’You don’t call us, we call you’
Post on Medium
https://medium.com/adyen