WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Patrick Caldon – Director of Quant Research
Taylor Hess – Lead Quant Analyst
Morningstar Inc.
Lessons Learned Replatforming A Large ML Application
#UnifiedDataAnalytics #SparkAISummit
Roadmap
• Our Model
• Lessons Learned
1. Just Get A Bigger Cluster
2. End to End Models
3. Make It Easy To Iterate
4. Focus On Local Runs
Our Model
Why Finance Models Are Different
1. Hard to Validate
2. Probabilistic Outputs
3. More collaborative
4. Heavy compliance issues – models and data need versioning
Large ML models are even more difficult…
• Software installations can be difficult
• Data can’t fit on a single computer
• Desktop/Laptop not powerful enough
What Are We Building?
• Financial Terms: Risk Factor Model
• ML Terms: Cross-sectional regression + more
[Diagram: All Stock Data → Daily Factors. Example – Apple Inc. | 2019-01-01: Momentum = 0.3, Size = 2.1, Health = 1.5. The model produces a time series of each coefficient, forecasted return distributions, and covariance estimates.]
What Are We Building?
• Essentially, we take features of financial securities and estimate
distributions of future returns
• We make millions of these estimates
• Try to understand how stock returns move together
• The feature engineering has been studied extensively in academic financial research (quant hedge funds use it for investing as well)
– Some features are simple, some are complex
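In ML terms this is a cross-sectional regression: for each date, regress security returns on that day's factor exposures, and the fitted coefficients are that day's factor returns. A minimal sketch of one way to express this in PySpark is below; the `exposures` DataFrame, its path, and its column names are hypothetical stand-ins, not our production schema.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical input: one row per (date, security) with a return column
# and factor exposure columns.
exposures = spark.read.parquet("s3://bucket/exposures/")

FACTORS = ["momentum", "size", "health"]

def fit_cross_section(pdf: pd.DataFrame) -> pd.DataFrame:
    # OLS of one day's security returns on that day's factor exposures;
    # the coefficients are that day's factor returns.
    coefs, *_ = np.linalg.lstsq(
        pdf[FACTORS].to_numpy(), pdf["ret"].to_numpy(), rcond=None
    )
    out = pd.DataFrame([coefs], columns=FACTORS)
    out.insert(0, "date", pdf["date"].iloc[0])
    return out

# One regression per date; collected together, the coefficients form the
# factor-return time series the diagram above describes.
factor_returns = exposures.groupBy("date").applyInPandas(
    fit_cross_section,
    schema="date date, momentum double, size double, health double",
)
```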
What Are We Building?
• Outputs (~500GB each run)
• Security and portfolio exposures (daily for
each security/portfolio)
• Security and portfolio forecasted
distributions (daily for each security/portfolio)
• Inputs (~500GB)
• Return data
• Financial information
• Security information (region, sector, etc.)
• Portfolio information
Security | Date      | Size | Momentum | …
Apple    | 10/1/2018 | 2.1  | 0.4      | …
Google   | 10/1/2018 | 2.0  | 0.3      | …
Barclays | 10/1/2018 | 1.6  | -0.1     | …
BP       | 10/1/2018 | 1.8  | 0.7      | …
Risk Model 1.0
Years of research and development to come up with our proprietary model
• Equity only model (~40,000 securities)
• Single server relying on database for many calculations
• 10 hours to run each day
• Producing ~10M datapoints daily
[Diagram: On-Prem Server backed by a Data Warehouse]
Rethinking Our Approach
• Hard to expand code
• Validation is arduous
• New model creation painful
• Long time to regenerate
New Architecture
[Architecture diagram: Morningstar API, Amazon S3, Amazon RDS, Amazon Route 53, Amazon Athena, Amazon EMR (Spark), Amazon Fargate, Airflow]
Risk Model 2.0
Old Model
• 10 million datapoints each model run
• 40,000 securities (equity only)
• 1 model at a time
• Months to refresh all data
• Hard to get validation data

New Model
• 25 billion datapoints each model run
• 1,000,000 securities (equity + fixed income)
• 10+ models at a time
• Hours to refresh all data
• Validation data automated

4,000x faster (full rebuild) | 5,000x output data (each model run) | 50x parallel models
Four Lessons
1. Just Get A Bigger Cluster
What is it?
• Get larger servers and more of
them – then trim down later
Why?
It’s easy to do, and we should do the easy things first.
1. 2x larger >> 2x faster in many cases (so it’s cost-effective)
2. Good joins can’t make up for a poorly sized cluster (sometimes)
Some reasons to scale
• I/O
• Caching
• Parallelization
I/O
• Too small a cluster will cause spills to disk
• Writing to and reading from disk are slow
• Monitor the Spark UI for spills to disk and add more RAM
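As a hedged illustration of “just get a bigger cluster”: executor sizing can be set up front when the session is created, then trimmed once the Spark UI shows no spills. The config keys below are standard Spark settings; the app name and the values are placeholders, not recommendations.

```python
# A sketch of starting with a generously sized cluster, then trimming later.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("risk-model")                     # hypothetical app name
    .config("spark.executor.memory", "32g")    # more RAM per executor -> fewer spills
    .config("spark.executor.instances", "40")  # more executors -> more parallelism
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
```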
Caching
• If you use a large dataset 2+ times, cache it
• Caching requires lots of RAM
• Partial caching is not good enough
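A minimal sketch of the caching pattern above; the path and column names are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features = spark.read.parquet("s3://bucket/features/")  # hypothetical path

# Keep a dataset reused 2+ times entirely in executor RAM; with MEMORY_ONLY,
# partitions that don't fit are recomputed, so size the cluster to hold it all.
features.persist(StorageLevel.MEMORY_ONLY)

n_us = features.filter("region = 'US'").count()         # first action fills the cache
by_sector = features.groupBy("sector").avg("momentum")  # later uses hit RAM
```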
Parallelization
• If data skew is not a problem: 2x larger = 2x faster
• Make sure the cluster is fully utilized
– Executor count / size
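A short sketch of two utilization knobs, with illustrative numbers and a hypothetical dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
exposures = spark.read.parquet("s3://bucket/exposures/")  # hypothetical path

# Match the shuffle partition count to the cluster's total cores so every
# executor stays busy; 800 here is illustrative, not a recommendation.
spark.conf.set("spark.sql.shuffle.partitions", "800")

# Repartitioning by a well-distributed key spreads work evenly -- but check
# for skew first, since one hot key can undo the 2x-larger = 2x-faster math.
balanced = exposures.repartition(800, "security")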
2. Build Models End To End
What is it?
• Ability to rerun models historically
• Always keep source data intact
Why?
• The distribution of data can shift over time – was your model stable?
• Necessary in projects with a time component
• Quicker bugfixes
• Makes it easy to tweak preprocessing steps
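A minimal sketch of what building end to end means in practice, under stated assumptions: the S3 paths and the prepare_data/fit_model helpers are hypothetical stand-ins. The key property is that raw snapshots stay immutable while everything downstream can be deleted and rebuilt for any date.

```python
import datetime as dt

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

RAW = "s3://bucket/raw/{d}/"    # immutable source snapshots, never overwritten
OUT = "s3://bucket/model/{d}/"  # derived outputs, safe to delete and rebuild

def prepare_data(df):           # hypothetical preprocessing stand-in
    return df.dropna()

def fit_model(df):              # hypothetical model-fit stand-in
    return df

def run_day(d: dt.date):
    raw = spark.read.parquet(RAW.format(d=d.isoformat()))
    result = fit_model(prepare_data(raw))
    result.write.mode("overwrite").parquet(OUT.format(d=d.isoformat()))

# Rebuilding history is just rerunning the same code over every date.
start, end = dt.date(2018, 1, 1), dt.date(2019, 1, 1)
for n in range((end - start).days + 1):
    run_day(start + dt.timedelta(days=n))
```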
Model Deployment
• Liebig's law of the minimum (~1850) – plant growth is governed by the least
available nutrient. (cf. Amdahl's law).
• In any software-based environment, an end-to-end system test is a necessity. The test
will likely find bugs, so the slowest process in our development governs release
and bugfix speed.
• Liebig's law of the minimum (rephrased) – model development speed is governed by
the slowest part of the development/deployment environment.
• Conclusion – if you have a multi-day process to rebuild a model, you’re at risk of this
process governing the release and bugfix cadence.
[Diagram: repeated runs over time, each flowing Raw Storage → Prepared Data → Model → Output; labeled "Latest Model"]
[Diagram: the same timeline; labeled "Full Latest Model"]
[Diagram: the same timeline; labeled "End To End Model"]
3. Make It Easy To Iterate
What is it?
• Our team consists of many analysts and a few developers
• Big focus on making it easy for analysts to contribute quickly
– Easy for analysts to setup
– Easy to test
– Easy to deploy full runs
– Easy to run locally
• It’s a simple idea that should be taken seriously
Why?
• The magic of compounding!
• Lower switching costs to onboard people to project
• Less experienced team members can contribute
Individual Focus
Sprint           0    1    2    3    4
Lead Developer  20   20   20   20   20
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Total Work      60   60   60   60   60
Cumulative Work 60  120  180  240  300

Tooling Focus
Sprint           0    1    2    3    4
Lead Developer   5    8   11   15   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Total Work      45   56   67   79   90
Cumulative Work 45  101  168  247  337

[Chart: total cumulative work per sprint – the Tooling Focus line overtakes Individual Focus by sprint 3 and ends at 337 vs 300]
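A quick check of the arithmetic in the tables above: constant individual output versus a tooling investment whose payoff compounds.

```python
# Per-sprint output from the two tables above.
individual = [60, 60, 60, 60, 60]
tooling = [45, 56, 67, 79, 90]

print(sum(individual))  # 300
print(sum(tooling))     # 337 -- the tooling focus pulls ahead by sprint 3
```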
[Diagram: the iteration loop – Explore data → Create features → Build model → Test model]
Some things we did
– Deployment Scripts to AWS
– Clear Documentation
– Containers / VMs
– Allow everything to run locally
– Data exploration tools (Jupyter, Athena)
– Pair analysts and developers
4. Focus On Local Runs
What is it?
• Make it easy to run the full process locally
• That means you may need:
1. Representative data samples
2. No / minor reliance on external data sources
• API calls, long database queries, etc.
3. Process to snapshot data
Why?
• Development is much cheaper / quicker
Representative data samples
• Not easy to do
– We use a process that runs on a large cluster to create smaller datasets
and upload them to the cloud
– This process sits within code that abstracts data connections
• Any external data sources need to be mocked / set in a config file to
pull from another source
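A hedged sketch of both pieces, assuming hypothetical paths, a RUN_MODE environment variable, and illustrative sampling fractions; `returns` stands in for a real dataset.

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
returns = spark.read.parquet("s3://bucket/returns/")  # hypothetical full dataset

# 1) On the cluster: build a representative sample (stratified by sector
#    here) and publish it where local runs can fetch it.
fractions = {"tech": 0.01, "energy": 0.01, "financials": 0.01}  # illustrative
sample = returns.sampleBy("sector", fractions=fractions, seed=42)
sample.write.mode("overwrite").parquet("s3://bucket/samples/returns/")

# 2) In the application: one abstracted reader chooses between the full
#    dataset and the trimmed snapshot based on configuration.
def read_returns():
    if os.environ.get("RUN_MODE") == "local":
        return spark.read.parquet("data/samples/returns/")  # local trimmed copy
    return spark.read.parquet("s3://bucket/returns/")
```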
[Diagram: the snapshot process – the ML Application reads Full Raw Data through a Full Abstracted Data layer; a snapshot step produces Trimmed Raw Data, which local runs read through a Trimmed Abstracted Data layer]
Conclusion
It’s about speed of iteration!
• What barriers can you remove?
• What process can you improve?
• What tools can you create?
• What flexibility can you add?
• What headache can you avoid?
• How can you make it easy to do the right thing?
• How can you accelerate the work of inexperienced analysts/devs?
Contact Us
• Patrick Caldon
– Patrick.Caldon@Morningstar.com
• Taylor Hess
– Taylor.Hess@Morningstar.com
– https://www.linkedin.com/in/taylorwhess/
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT