Scalable Time Series Forecasting and Monitoring using Apache Spark and ElasticSearch at Adye

1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2. Andreu Mora, Adyen Time series forecasting and monitoring with Apache Spark and ElasticSearch #UnifiedDataAnalytics #SparkAISummit

3. Adyen Payments Processor Tech company International customers (aka merchants) Omnichannel

4. Back in the day… The legacy monitor was based on a SQL query that would compute an average for the hour of the week and compare to a threshold.

5. Doesn’t quite work: • Generates loads of False Positives • It was fairly trimmed down: top merchants.

6. Reduce False Positives

7. Catch anomalies

8. Do that at scale

9. Harness the detection performance

10. Connect to a live platform

11. OK, but What is an anomaly? No luxury of a labelled dataset, divergence   of opinions. Connecting to a live platform without   ML deployment hooks ready. We were working on MLﬂow but not there yet. No standard for timeseries forecasting at scale With spark, several choices.

12. Considerations when dealing with Big Data Big Technology Leverage on mature Tech to solve the problem (hello Spark). Big diversity Many different topologies for our merchants and yet one algorithm to track them all. Big consequences 1000 merchants * 10 min * 95% accuracy = 50400 emails/week

13. Big Data Platform Volumes Predictions

15. TimeSeries Ecosystem Flint Spark-ts FB Prophet Stats models

16. TimeSeries Ecosystem Flint Spark-ts FB Prophet Stats models Data size consideration 1 year @ 1 min @ double64 = 4.2 mb

17. Scoring in Java While working on a fully functional engine to deploy ML models based on MLﬂow. Launch fast and iterate! Transporting the model The model transported for tens of thousands of accounts needs to be lightweight. Harness the maths No using blackboxed models, equations need to be understood and replicated in Java. Needs to perform fast Score and decide whether our seen trafﬁc form ElasticSearch is actually anomalous on the ms scale.

19. Big Data Platform Volumes Model Coefﬁcients

20. Fourier components Would not optimise the business cycles ARIMA Not perfect for picking up seasonality Isolation Forests Great for multidimensional data, not so much for time series. Autoencoders Good luck transporting the model for each merchant. XGBM Noice, but score that in Java. Research stage Understand a problem and build a solution, decide what’s best.

21. Ridge Regression Makes scoring in Java nice and kinda easy. Residuals Conﬁdence intervals modelled through quantile regression of observed values. Events Recurrent or one-off events are shown to the model. Piece-wise linear trends Breaks down the signal into pieces and learn the last trends. Gaussian Basis Functions Allow us to teach the model to understand business cycles The model Discover anomalous behaviour based on a probability p. Pre-sampling Allow us to sample and bucketize the merchants to adequate intervals.

28. Ridge Regression Makes scoring in Java nice and kinda easy. easy Residuals Conﬁdence intervals modelled through quantile regression of observed values. Events Recurrent or one-off events are shown to the model. Piece-wise linear trends Breaks down the signal into pieces and learn the last trends. Gaussian Basis Functions Allow us to teach the model to understand business cycles The model Discover anomalous behaviour based on a probability p. Pre-sampling Allow us to sample and bucketize the merchants to adequate intervals.

29. Trendspotting Estimating hinges and trends and offering it as subproduct to Account Managers for evaluating the low variations of volume.

30. Train set: 90 days Test set: 7 days Real volume Predicted volume 95% conﬁdence How do the predictions look like?

31. Missed event

32. The implementation on Spark How did we get there, on the Spark side. Reusability Overloads of scikit-learns and pandas allow us to ensure reusability Cross-validation Ensure the best tuning through tuning of hyperparameters. Scalability Using Spark’s map-reduce paradigm we totally control the computational performances.

33. SeasonalEstimator(BaseEstimator,RegressorMixin)

34. Input daily time series —> {t:[…], v:[…]} Collect to list —> [{t:[…], v:[…]}] Hinges and Hyperparameters Distribute UDF Making it happen at scale

35. Cross-validation F4-sampling score: favours higher sampling considering classical precision and recall. Custom cv folds split TimeSeriesWeekSplit get the sense of the business cycle

36. The output

37. Harnessing   the prediction performance

38. Enabling canary roll-out based on scores

39. Overcoming unsupervised learning Alarm rate and synthetic recall allow us to know for each case how many alarms would have been captured and raised, even without having a labelled dataset.

40. Trade-off alarm rates and recall We provide a number of choices (95%, 97%, 99% probability and completely proﬁle what to expect in terms of anomalies.

41. The model payload

42. Go Live Houston? Houston? …

43. Grafana dashboard

44. So we saw this on the data

48. ’You don’t call us, we call you’

49. Post on Medium https://medium.com/adyen

50. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

Scalable Time Series Forecasting and Monitoring using Apache Spark and ElasticSearch at Adye

More Related Content

What's hot (20)

Similar to Scalable Time Series Forecasting and Monitoring using Apache Spark and ElasticSearch at Adye (20)

More from Databricks (20)

Recently uploaded (20)

Scalable Time Series Forecasting and Monitoring using Apache Spark and ElasticSearch at Adye