When we Spark and when we don’t: ML Pipeline Development at Stitch Fix
Talk Flow
● What is Stitch Fix?
● Infrastructure and Tech Stack
● Thoughts on Good Practices for Developing ML Pipelines
● Case Study: Inventory Recommendation Models
● Tooling & Abstractions at Stitch Fix
Stitch Fix
● Share your style, size and price preferences with your personal stylist.
● Get 5 hand-selected pieces of clothing delivered to your door.
● Try your fix on in the comfort of your home.
● Leave feedback and pay for only the items you keep.
● Return the other items in the envelope provided.
There’s an algorithm for that...
● Styling Algorithms
● Client/Stylist Matching
● Demand Modeling
● Human Computation
● Pick Path Optimization
● New Style Development
● Inventory Allocation
● State Machines
● Warehouse Assignment
● Batch Picking
● Replenishment
* Find out more at http://algorithms-tour.stitchfix.com/
Our Infrastructure and Tech Stack
[Diagram: infrastructure overview. Labels: Data Acquisition; Data Processing; Data Storage; Data Management; Workflow Management; Job Execution; Camera: State Snapshots; Flotilla; Bumblebee: Metadata Manager; Metastore; Uhura; AWS S3 (Prod, Dev/Research); AWS ECS Cluster (three instances).]
Some facts
● 1000s of jobs / day
○ Model training, featurization, test analysis, reporting, analytics, ad hoc research
● Production jobs run on
○ Spark: mostly Spark SQL and PySpark
○ Flotilla: Python or R in Docker containers on ECS
● ML pipelines typically consist of several jobs spanning the stack of technologies
● Data scientists own pipelines and implementations end-to-end
Good Practices for Developing ML Pipelines
Pipelines should be designed to support constant iteration
○ Individual pipelines/algorithms/implementations change quickly
○ Tooling and infrastructure should be relatively stable
At scale, failure should be expected
○ Be robust to failure
■ Checkpointing
■ Isolation
■ Automated Retries
■ Alerting
○ Make it easy to debug and diagnose
○ We train 100s of models / day, and expect some number of them to fail.
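The checkpointing and automated-retry ideas above can be illustrated with a small sketch. This is not Stitch Fix's implementation: `run_with_retries`, `checkpointed`, and `flaky_extract` are hypothetical names, the `print` call stands in for a real alerting hook, and a local JSON file stands in for whatever storage a real pipeline would checkpoint to.

```python
import json
import os
import tempfile


def run_with_retries(step, max_attempts=3):
    """Call step() until it succeeds or max_attempts is exhausted."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:  # a real job would catch narrower errors
            last_error = exc
            print(f"alert: attempt {attempt} failed: {exc}")  # stand-in for alerting
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


def checkpointed(path, step):
    """Return the saved result at path if present; otherwise run step and save it."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    result = run_with_retries(step)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(result, fh)
    os.replace(tmp_path, path)  # rename into place so readers never see a partial file
    return result


# Hypothetical flaky step: fails on the first call, then succeeds.
calls = {"n": 0}


def flaky_extract():
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")
    return {"rows": 3}


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "extract.json")
first = checkpointed(ckpt, flaky_extract)   # one failure, one retry, then saved
second = checkpointed(ckpt, flaky_extract)  # served from the checkpoint, no rerun
```

The second call never invokes the step again, which is what lets a downstream job restart without repeating upstream work.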
Pipelines and jobs should be idempotent.
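One common way to make a batch job idempotent, sketched below under assumptions, is to derive the output location only from the job's logical inputs and to overwrite that location in full on every run. The function names, the date-partitioned layout, and the sample rows are illustrative and do not come from the talk.

```python
import json
import os
import tempfile


def output_path(base_dir, table, run_date):
    """Derive the output location only from the job's logical inputs."""
    return os.path.join(base_dir, table, f"date={run_date}", "part.json")


def run_job(base_dir, table, run_date, source_rows):
    """Recompute one date partition and overwrite it in full."""
    rows = [r for r in source_rows if r["date"] == run_date]
    path = output_path(base_dir, table, run_date)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(rows, fh, sort_keys=True)
    os.replace(tmp_path, path)  # overwrite, never append
    return path


# Illustrative input data.
source = [
    {"date": "2018-06-01", "client": 1, "rating": 4},
    {"date": "2018-06-02", "client": 2, "rating": 5},
]
base = tempfile.mkdtemp()

p1 = run_job(base, "ratings", "2018-06-01", source)
with open(p1) as fh:
    after_first = fh.read()

p2 = run_job(base, "ratings", "2018-06-01", source)  # rerun, e.g. after a retry
with open(p2) as fh:
    after_second = fh.read()
```

Running the job twice for the same date leaves exactly the same bytes in the same place, so retries and backfills cannot duplicate rows.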
Make pragmatic choices with respect to technology.
Case Study: Inventory Recommendation Models
[Diagram: Algo_V1_1. Seven copies of the same three-step pipeline: Extract Training Data; Train Model; Upload Model.]
[Diagram (two slides with the same labels): User Item Rating Data; Ingest; Extract “wide” Client Training Data; Extract “wide” Item Training Data; Model A/B/C/D Training Data; Train Model A/B/C/D; Upload Model A/B/C/D.]
client_features: {
  "expanded_colors": {
    "in": ["client_colors"],
    "fn": "dummy_expand"
  },
  "X_Y_ratio": {
    "in": ["X", "Y"],
    "fn": "compute_scaled_ratio"
  }
  …
},
item_features: {
  "expanded_print": {
    "in": ["colors"],
    "fn": "dummy_expand"
  }
},
interaction_features: {
}
Extract Jobs generated from resolution of Model + Feature Definitions
{
  "deptA": {
    "computed_features": [
      "example_feature"
    ],
    "formula": [
      "s ~ 1 + f_a + shiny_material_flag + x_y_ratio"
    ]
  },
  "deptB": {
    "computed_features": [
      "example_feature"
    ],
    "formula": [
      "s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x"
    ]
  }
}
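How model and feature definitions might resolve into an extract plan can be sketched roughly as follows. The talk does not show the resolver, so everything here is an assumption: the helper names (`formula_terms`, `resolve_extract`, `build_row`), the naive formula parsing that splits on `+`, the lowercase feature and column names, and the body of `compute_scaled_ratio`, which is only a placeholder ratio.

```python
# Hypothetical function registry; the talk names the functions but not their bodies.
FEATURE_FUNCTIONS = {
    "compute_scaled_ratio": lambda x, y: x / y if y else 0.0,  # placeholder logic
}

# Simplified definitions modeled on the slide's structure.
feature_defs = {
    "x_y_ratio": {"in": ["x", "y"], "fn": "compute_scaled_ratio"},
}
model_defs = {
    "deptA": {"formula": "s ~ 1 + f_a + shiny_material_flag + x_y_ratio"},
    "deptB": {"formula": "s ~ 1 + f_a + x_y_ratio"},
}


def formula_terms(formula):
    """Split 'target ~ 1 + a + b' into (target, [a, b]), dropping the intercept."""
    target, rhs = formula.split("~")
    terms = [t.strip() for t in rhs.split("+")]
    return target.strip(), [t for t in terms if t != "1"]


def resolve_extract(model_def, feature_defs):
    """Work out which raw columns to read and which features to compute."""
    target, terms = formula_terms(model_def["formula"])
    raw_columns, computed = {target}, []
    for term in terms:
        if term in feature_defs:
            computed.append(term)
            raw_columns.update(feature_defs[term]["in"])
        else:
            raw_columns.add(term)
    return {"raw_columns": sorted(raw_columns), "computed": computed}


def build_row(raw_row, plan, feature_defs):
    """Add each computed feature in the plan to a copy of the raw row."""
    row = dict(raw_row)
    for name in plan["computed"]:
        spec = feature_defs[name]
        fn = FEATURE_FUNCTIONS[spec["fn"]]
        row[name] = fn(*(raw_row[col] for col in spec["in"]))
    return row


plan = resolve_extract(model_defs["deptA"], feature_defs)
raw = {"s": 1, "f_a": 0.2, "shiny_material_flag": 1, "x": 3.0, "y": 4.0}
training_row = build_row(raw, plan, feature_defs)
```

Keeping the definitions declarative means a new model or feature is a configuration change, while the extract logic that interprets it stays fixed.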
Some Observations
1. Spark is utilized heavily for feature engineering.
2. Model fitting occurs in containerized Python and R environments.
3. Individual jobs communicate via data dependencies.
4. Our inventory recommendation algorithms are specified with a high degree of tooling.
5. Pipelines leave behind multiple artifacts (extract, train, load) for analysis, debugging, and checkpointing.
6. Individual models are isolated from one another and can fail without impacting the rest of the group.
7. Data is contextual, e.g. item type or business line.
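The isolation point can be illustrated with a toy runner in which each model's pipeline reports its own outcome, so one failure does not stop the others. This is only a sketch: in the setup described in the talk, isolation comes from separate jobs and containers, whereas the in-process `try`/`except` below is a stand-in, and all names are hypothetical.

```python
def run_model_pipeline(name, train_fn):
    """Run one model's pipeline and report the outcome instead of raising."""
    try:
        artifact = train_fn()
        return {"model": name, "status": "ok", "artifact": artifact}
    except Exception as exc:  # failure stays local to this model
        return {"model": name, "status": "failed", "error": str(exc)}


def train_ok():
    return {"coef": [0.1, 0.2]}


def train_broken():
    raise ValueError("singular design matrix")


models = {"A": train_ok, "B": train_broken, "C": train_ok}
results = [run_model_pipeline(name, fn) for name, fn in models.items()]

failed = [r["model"] for r in results if r["status"] == "failed"]
succeeded = [r["model"] for r in results if r["status"] == "ok"]
```

Model B fails, but A and C still produce artifacts, and the failure record carries enough context to debug B on its own.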
Platform Tooling is Important!
Desirable Properties of Infrastructure & Tooling
● Isolation should be guaranteed by the infrastructure
● It should be obvious what running jobs and services are doing, when, and why
● Access to data should be easy, consistent, and self-service
● Guide rails should enforce, or strongly encourage, idempotent patterns
● Scaling, logging, and security should be baked into infrastructure and tooling
Access to Data
● All data is managed and tracked by the Metastore
○ Hive metastore abstracted by Bumblebee
○ Location, Schema, Format
● Data access from Python and R is a first-class citizen
○ Typically accessed as dataframes
○ df = load_dataframe(namespace, table)
○ store_dataframe(df, namespace, table)
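The two calls above suggest an interface where callers name a table and the metastore supplies location, schema, and format. The sketch below imitates that idea with the standard library only. It is not the Stitch Fix API: `ToyMetastore`, `store_rows`, and `load_rows` are hypothetical, rows are plain lists of dicts rather than dataframes, and local CSV files stand in for S3 storage.

```python
import csv
import os
import tempfile


class ToyMetastore:
    """Tracks location, schema, and format for each (namespace, table)."""

    def __init__(self, root):
        self.root = root
        self.tables = {}

    def register(self, namespace, table, schema):
        location = os.path.join(self.root, namespace, f"{table}.csv")
        os.makedirs(os.path.dirname(location), exist_ok=True)
        entry = {"location": location, "schema": list(schema), "format": "csv"}
        self.tables[(namespace, table)] = entry
        return entry

    def lookup(self, namespace, table):
        return self.tables[(namespace, table)]


def store_rows(metastore, rows, namespace, table):
    """Write rows (a list of dicts) and record where and how they are stored."""
    entry = metastore.register(namespace, table, schema=rows[0].keys())
    with open(entry["location"], "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=entry["schema"])
        writer.writeheader()
        writer.writerows(rows)


def load_rows(metastore, namespace, table):
    """Read a table by name; the caller never handles paths or formats."""
    entry = metastore.lookup(namespace, table)
    with open(entry["location"], newline="") as fh:
        return [dict(row) for row in csv.DictReader(fh)]


ms = ToyMetastore(tempfile.mkdtemp())
store_rows(ms, [{"client": "1", "rating": "4"}], "research", "ratings")
loaded = load_rows(ms, "research", "ratings")
```

Because callers only pass a namespace and table name, the storage location or file format could change without touching pipeline code.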
The cloud. Embrace elasticity.
Containerized Batch Jobs
● Containerized job execution has many benefits
○ Strong isolation
○ High degree of control over resources and environment
● But, needs abstraction over job definition and management
○ So we developed Flotilla
○ And open sourced it!
https://stitchfix.github.io/flotilla-os/
Questions?
Get in touch:
jmagnusson@stitchfix.com
@jeffmagnusson
http://www.linkedin.com/in/jmagnuss
