WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Patrick Caldon – Director of Quant Research
Taylor Hess – Lead Quant Analyst
Morningstar Inc.
Lessons Learned Replatforming A Large ML Application
#UnifiedDataAnalytics #SparkAISummit
Roadmap
• Our Model
• Lessons Learned
1. Just Get A Bigger Cluster
2. End to End Models
3. Make It Easy To Iterate
4. Focus On Local Runs
Our Model
Why Finance Models Are Different
1. Hard to Validate
2. Probabilistic Outputs
3. More collaborative
4. Heavy compliance issues – models and data need versioning
Large ML models are even more difficult…
• Software installations can be difficult
• Data can’t fit on a single computer
• Desktop/Laptop not powerful enough
What Are We Building?
• Financial Terms: Risk Factor Model
• ML Terms: Cross-sectional regression + more
[Diagram: All Stock Data → Daily Factors. Example – Apple Inc. | 2019-01-01: Momentum = 0.3, Size = 2.1, Health = 1.5. The model produces a time series of each coefficient, forecasted return distributions, and covariance estimates.]
What Are We Building?
• Essentially, we take features of financial securities and estimate
distributions of future returns
• We make millions of these estimates
• Try to understand how stock returns move together
• The feature engineering has been studied extensively in academic financial research (quant hedge funds use it for investing as well)
– Some features are simple, some are complex
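In ML terms this is a cross-sectional regression: for each date, regress security returns on that day's factor exposures, and the fitted coefficients are that day's factor returns. A minimal sketch of one way to express this in PySpark is below; the `exposures` DataFrame, its path, and its column names are hypothetical stand-ins, not our production schema.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical input: one row per (date, security) with a return column
# and factor exposure columns.
exposures = spark.read.parquet("s3://bucket/exposures/")

FACTORS = ["momentum", "size", "health"]

def fit_cross_section(pdf: pd.DataFrame) -> pd.DataFrame:
    # OLS of one day's security returns on that day's factor exposures;
    # the coefficients are that day's factor returns.
    coefs, *_ = np.linalg.lstsq(
        pdf[FACTORS].to_numpy(), pdf["ret"].to_numpy(), rcond=None
    )
    out = pd.DataFrame([coefs], columns=FACTORS)
    out.insert(0, "date", pdf["date"].iloc[0])
    return out

# One regression per date; collected together, the coefficients form the
# factor-return time series the diagram above describes.
factor_returns = exposures.groupBy("date").applyInPandas(
    fit_cross_section,
    schema="date date, momentum double, size double, health double",
)
```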
What Are We Building?
• Outputs (~500GB each run)
• Security and portfolio exposures (daily for
each security/portfolio)
• Security and portfolio forecasted
distributions (daily for each security/portfolio)
• Inputs (~500GB)
• Return data
• Financial information
• Security information (region, sector, etc.)
• Portfolio information
Security | Date      | Size | Momentum | …
Apple    | 10/1/2018 | 2.1  | 0.4      | …
Google   | 10/1/2018 | 2.0  | 0.3      | …
Barclays | 10/1/2018 | 1.6  | -0.1     | …
BP       | 10/1/2018 | 1.8  | 0.7      | …
Risk Model 1.0
Years of research and development to come up with our proprietary model
• Equity only model (~40,000 securities)
• Single server relying on database for many calculations
• 10 hours to run each day
• Producing ~10M datapoints daily
[Diagram: On-Prem Server backed by a Data Warehouse]
Rethinking Our Approach
• Hard to expand code
• Validation is arduous
• New model creation painful
• Long time to regenerate
New Architecture
[Architecture diagram: Morningstar API, Amazon S3, Amazon RDS, Amazon Route 53, Amazon Athena, Amazon EMR (Spark), Amazon Fargate, Airflow]
Risk Model 2.0
Old Model
• 10 million datapoints each model run
• 40,000 securities (equity only)
• 1 model at a time
• Months to refresh all data
• Hard to get validation data

New Model
• 25 billion datapoints each model run
• 1,000,000 securities (equity + fixed income)
• 10+ models at a time
• Hours to refresh all data
• Validation data automated

4,000x faster (full rebuild) | 5,000x output data (each model run) | 50x parallel models
Four Lessons
1. Just Get A Bigger Cluster
What is it?
• Get larger servers and more of
them – then trim down later
Why?
It’s easy to do, and we should do the easy things first.
1. 2x larger >> 2x faster in many cases (so it’s cost-effective)
2. Good joins can’t make up for a poorly sized cluster (sometimes)
Some reasons to scale
• I/O
• Caching
• Parallelization
I/O
• Too small a cluster will cause spills to disk
• Writing to and reading from disk are slow
• Monitor the Spark UI for spills to disk and add more RAM
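As a hedged illustration of “just get a bigger cluster”: executor sizing can be set up front when the session is created, then trimmed once the Spark UI shows no spills. The config keys below are standard Spark settings; the app name and the values are placeholders, not recommendations.

```python
# A sketch of starting with a generously sized cluster, then trimming later.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("risk-model")                     # hypothetical app name
    .config("spark.executor.memory", "32g")    # more RAM per executor -> fewer spills
    .config("spark.executor.instances", "40")  # more executors -> more parallelism
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
```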
Caching
• If you use a large dataset 2+ times, cache it
• Caching requires lots of RAM
• Partial caching is not good enough
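A minimal sketch of the caching pattern above; the path and column names are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features = spark.read.parquet("s3://bucket/features/")  # hypothetical path

# Keep a dataset reused 2+ times entirely in executor RAM; with MEMORY_ONLY,
# partitions that don't fit are recomputed, so size the cluster to hold it all.
features.persist(StorageLevel.MEMORY_ONLY)

n_us = features.filter("region = 'US'").count()         # first action fills the cache
by_sector = features.groupBy("sector").avg("momentum")  # later uses hit RAM
```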
Parallelization
• If data skew is not a problem: 2x larger = 2x faster
• Make sure the cluster is fully utilized
– Executor count / size
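A short sketch of two utilization knobs, with illustrative numbers and a hypothetical dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
exposures = spark.read.parquet("s3://bucket/exposures/")  # hypothetical path

# Match the shuffle partition count to the cluster's total cores so every
# executor stays busy; 800 here is illustrative, not a recommendation.
spark.conf.set("spark.sql.shuffle.partitions", "800")

# Repartitioning by a well-distributed key spreads work evenly -- but check
# for skew first, since one hot key can undo the 2x-larger = 2x-faster math.
balanced = exposures.repartition(800, "security")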
2. Build Models End To End
What is it?
• Ability to rerun models historically
• Always keep source data intact
Why?
• The distribution of data can shift over time – was your model stable?
• Necessary in projects with a time component
• Quicker bugfixes
• Makes it easy to tweak preprocessing steps
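A minimal sketch of what building end to end means in practice, under stated assumptions: the S3 paths and the prepare_data/fit_model helpers are hypothetical stand-ins. The key property is that raw snapshots stay immutable while everything downstream can be deleted and rebuilt for any date.

```python
import datetime as dt

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

RAW = "s3://bucket/raw/{d}/"    # immutable source snapshots, never overwritten
OUT = "s3://bucket/model/{d}/"  # derived outputs, safe to delete and rebuild

def prepare_data(df):           # hypothetical preprocessing stand-in
    return df.dropna()

def fit_model(df):              # hypothetical model-fit stand-in
    return df

def run_day(d: dt.date):
    raw = spark.read.parquet(RAW.format(d=d.isoformat()))
    result = fit_model(prepare_data(raw))
    result.write.mode("overwrite").parquet(OUT.format(d=d.isoformat()))

# Rebuilding history is just rerunning the same code over every date.
start, end = dt.date(2018, 1, 1), dt.date(2019, 1, 1)
for n in range((end - start).days + 1):
    run_day(start + dt.timedelta(days=n))
```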
Model Deployment
• Liebig's law of the minimum (~1850) – plant growth is governed by the least
available nutrient. (cf. Amdahl's law).
• In any software-based environment, an end-to-end system test is a necessity. The test
will likely find bugs, so the slowest process in our development governs release
and bugfix speed.
• Liebig's law of the minimum (rephrased) – model development speed is governed by
the slowest part of the development/deployment environment.
• Conclusion – if you have a multi-day process to rebuild a model, you’re at risk of this
process governing the release and bugfix cadence.
[Diagram: repeated runs over time, each flowing Raw Storage → Prepared Data → Model → Output; labeled "Latest Model"]
[Diagram: the same timeline; labeled "Full Latest Model"]
[Diagram: the same timeline; labeled "End To End Model"]
3. Make It Easy To Iterate
What is it?
• Our team consists of many analysts and a few developers
• Big focus on making it easy for analysts to contribute quickly
– Easy for analysts to setup
– Easy to test
– Easy to deploy full runs
– Easy to run locally
• It’s a simple idea that should be taken seriously
Why?
• The magic of compounding!
• Lower switching costs to onboard people to project
• Less experienced team members can contribute
Individual Focus
Sprint           0    1    2    3    4
Lead Developer  20   20   20   20   20
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Developer       10   10   10   10   10
Total Work      60   60   60   60   60
Cumulative Work 60  120  180  240  300

Tooling Focus
Sprint           0    1    2    3    4
Lead Developer   5    8   11   15   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Developer       10   12   14   16   18
Total Work      45   56   67   79   90
Cumulative Work 45  101  168  247  337

[Chart: total cumulative work per sprint – the Tooling Focus line overtakes Individual Focus by sprint 3 and ends at 337 vs 300]
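A quick check of the arithmetic in the tables above: constant individual output versus a tooling investment whose payoff compounds.

```python
# Per-sprint output from the two tables above.
individual = [60, 60, 60, 60, 60]
tooling = [45, 56, 67, 79, 90]

print(sum(individual))  # 300
print(sum(tooling))     # 337 -- the tooling focus pulls ahead by sprint 3
```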
[Diagram: the iteration loop – Explore data → Create features → Build model → Test model]
Some things we did
– Deployment Scripts to AWS
– Clear Documentation
– Containers / VMs
– Allow everything to run locally
– Data exploration tools (Jupyter, Athena)
– Pair analysts and developers
4. Focus On Local Runs
What is it?
• Make it easy to run the full process locally
• That means you may need:
1. Representative data samples
2. No / minor reliance on external data sources
• API calls, long database queries, etc.
3. Process to snapshot data
Why?
• Development is much cheaper / quicker
Representative data samples
• Not easy to do
– We use a process that runs on a large cluster to create smaller datasets
and upload them to the cloud
– This process sits within code that abstracts data connections
• Any external data sources need to be mocked / set in a config file to
pull from another source
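A hedged sketch of both pieces, assuming hypothetical paths, a RUN_MODE environment variable, and illustrative sampling fractions; `returns` stands in for a real dataset.

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
returns = spark.read.parquet("s3://bucket/returns/")  # hypothetical full dataset

# 1) On the cluster: build a representative sample (stratified by sector
#    here) and publish it where local runs can fetch it.
fractions = {"tech": 0.01, "energy": 0.01, "financials": 0.01}  # illustrative
sample = returns.sampleBy("sector", fractions=fractions, seed=42)
sample.write.mode("overwrite").parquet("s3://bucket/samples/returns/")

# 2) In the application: one abstracted reader chooses between the full
#    dataset and the trimmed snapshot based on configuration.
def read_returns():
    if os.environ.get("RUN_MODE") == "local":
        return spark.read.parquet("data/samples/returns/")  # local trimmed copy
    return spark.read.parquet("s3://bucket/returns/")
```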
[Diagram: the snapshot process – the ML Application reads Full Raw Data through a Full Abstracted Data layer; a snapshot step produces Trimmed Raw Data, which local runs read through a Trimmed Abstracted Data layer]
Conclusion
It’s about speed of iteration!
• What barriers can you remove?
• What process can you improve?
• What tools can you create?
• What flexibility can you add?
• What headache can you avoid?
• How can you make it easy to do the right thing?
• How can you accelerate the work of inexperienced analysts/devs?
Contact Us
• Patrick Caldon
– Patrick.Caldon@Morningstar.com
• Taylor Hess
– Taylor.Hess@Morningstar.com
– https://www.linkedin.com/in/taylorwhess/
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT