When We Spark and When We Don’t: Developing Data and ML Pipelines

When We Spark and When We Don’t: ML Pipeline Development at Stitch Fix

Talk Flow
● What is Stitch Fix?
● Infrastructure and Tech Stack
● Thoughts on Good Practices for Developing ML Pipelines
● Case Study: Inventory Recommendation Models
● Tooling & Abstractions at Stitch Fix

Stitch Fix
1. Share your style, size and price preferences with your personal stylist.
2. Get 5 hand-selected pieces of clothing delivered to your door.
3. Try your fix on in the comfort of your home.
4. Leave feedback and pay for only the items you keep.
5. Return the other items in the envelope provided.

There’s an algorithm for that...
● Styling Algorithms
● Client/Stylist Matching
● Demand Modeling
● Human Computation
● Pick Path Optimization
● New Style Development
● Inventory Allocation
● State Machines
● Warehouse Assignment
● Batch Picking
● Replenishment
* Find out more at http://algorithms-tour.stitchfix.com/

Our Infrastructure and Tech Stack

[Diagram: infrastructure overview. Layer labels: Data Acquisition, Data Processing, Data Storage, Data Management, Job Execution, Workflow Management. Component labels: Camera (State Snapshots), Flotilla, Bumblebee (Metadata Manager), Metastore, Uhura, AWS S3, AWS ECS Cluster (shown three times), Prod, Dev/Research.]

Some facts
● 1000s of jobs / day
○ Model training, featurization, test analysis, reporting, analytics, ad hoc research
● Production jobs run on
○ Spark: mostly Spark SQL and PySpark
○ Flotilla: Python or R in Docker containers on ECS
● ML pipelines typically consist of several jobs spanning the stack of technologies
● Data scientists own pipelines and implementations end-to-end

Good Practices for Developing ML Pipelines

Pipelines should be designed to support constant iteration
○ Individual pipelines/algorithms/implementations change quickly
○ Tooling and infrastructure should be relatively stable

At scale, failure should be expected
○ Be robust to failure
■ Checkpointing
■ Isolation
■ Automated Retries
■ Alerting
○ Make it easy to debug and diagnose
○ We train 100s of models / day and expect some number of them to fail.
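
The isolation, automated-retry, and alerting points above can be sketched in a few lines of Python. This is a minimal illustration only, not Stitch Fix's tooling: `run_with_retries`, `run_all`, and the `alert` callback are hypothetical names introduced here.

```python
import logging
import time

log = logging.getLogger("pipeline")


def run_with_retries(job, name, max_attempts=3, backoff_seconds=0.0):
    """Run job() up to max_attempts times; return (ok, result_or_error)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return True, job()
        except Exception as exc:  # contain any failure within this one job
            last_error = exc
            log.warning("job %s failed on attempt %d/%d: %s",
                        name, attempt, max_attempts, exc)
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff
    return False, last_error


def run_all(jobs, alert=log.error, **retry_kwargs):
    """Run every job independently and alert on those that never succeed."""
    results, failed = {}, []
    for name, job in jobs.items():
        ok, value = run_with_retries(job, name, **retry_kwargs)
        if ok:
            results[name] = value
        else:
            failed.append(name)
            alert("job %s exhausted retries: %r" % (name, value))
    return results, failed
```

Because each job is wrapped separately, one model that keeps failing is reported through `alert` while the remaining models still train.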

Pipelines and jobs should be idempotent.
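
One common way to get idempotent jobs is to have each run fully overwrite a deterministic output location instead of appending to it. The sketch below assumes a file-based layout partitioned by run date; `write_partition`, the `run_date=` directory naming, and the JSON format are illustrative assumptions, not the talk's implementation.

```python
import json
import os
import tempfile


def write_partition(base_dir, table, run_date, rows):
    """Idempotently (re)write the output partition for one run date."""
    part_dir = os.path.join(base_dir, table, "run_date=" + run_date)
    os.makedirs(part_dir, exist_ok=True)
    final_path = os.path.join(part_dir, "part-0.json")
    # Write to a temp file in the same directory, then atomically swap it in,
    # so a crash mid-write never leaves a half-written partition behind.
    fd, tmp_path = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f, sort_keys=True)
    os.replace(tmp_path, final_path)
    return final_path
```

Re-running the job for the same `run_date` replaces the partition rather than duplicating rows, so a retry after a failure is safe.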

Make pragmatic choices with respect to technology.

Case Study: Inventory Recommendation Models

[Diagram: Algo_V1_1 pipeline. Stages: Extract Training Data → Train Model → Upload Model; intermediate nodes labeled "Data".]

[Diagram: inventory recommendation pipeline. Labels shown: User Item Rating Data; Ingest; Extract “wide” Client Training Data; Extract “wide” Item Training Data; Model A, B, C, and D Training Data; Train Model A, B, C, and D; Upload Model A, B, C, and D.]


client_features: {
    "expanded_colors": {
        "in": ["client_colors"],
        "fn": "dummy_expand"
    },
    "X_Y_ratio": {
        "in": ["X", "Y"],
        "fn": "compute_scaled_ratio"
    }
    …
},
item_features: {
    "expanded_print": {
        "in": ["colors"],
        "fn": "dummy_expand"
    }
},
interaction_features: {
}
Extract Jobs generated from resolution of Model + Feature Definitions
{
    "deptA": {
        "computed_features": ["example_feature"],
        "formula": ["s ~ 1 + f_a + shiny_material_flag + x_y_ratio"]
    },
    "deptB": {
        "computed_features": ["example_feature"],
        "formula": ["s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x"]
    }
}
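
To make the idea of resolving model and feature definitions into an extract job concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the dictionaries are simplified stand-ins for the definitions above, and the matching rule (a formula term equals a feature name, or starts with the feature name plus an underscore for expanded dummies) is a guess at one plausible convention, not the actual resolution logic.

```python
# Hypothetical, simplified stand-ins for the feature and model definitions.
FEATURE_DEFS = {
    "client_features": {
        "x_y_ratio": {"in": ["x", "y"], "fn": "compute_scaled_ratio"},
    },
    "item_features": {
        "expanded_print": {"in": ["colors"], "fn": "dummy_expand"},
    },
}

MODEL_DEFS = {
    "deptA": {"formula": "s ~ 1 + f_a + shiny_material_flag + x_y_ratio"},
    "deptB": {"formula": "s ~ 1 + f_a + x_y_ratio + expanded_print_x"},
}


def formula_terms(formula):
    """Return the right-hand-side terms of a formula, minus the intercept."""
    rhs = formula.split("~", 1)[1]
    return [t.strip() for t in rhs.split("+") if t.strip() not in ("", "1")]


def resolve_extract_job(model_name):
    """Work out which computed features and raw columns a model needs."""
    computed, raw = {}, []
    for term in formula_terms(MODEL_DEFS[model_name]["formula"]):
        match = None
        for group, features in FEATURE_DEFS.items():
            for name, spec in features.items():
                # Assumed convention: exact name, or "<name>_<level>" dummies.
                if term == name or term.startswith(name + "_"):
                    match = (group, name, spec)
        if match is None:
            raw.append(term)  # not a computed feature: read the column as-is
        else:
            group, name, spec = match
            computed[name] = {"group": group, **spec}
    inputs = set(raw)
    for spec in computed.values():
        inputs.update(spec["in"])
    return {"model": model_name, "computed": computed,
            "input_columns": sorted(inputs)}
```

Under these assumptions, `resolve_extract_job("deptB")` reports that the extract must read `colors`, `f_a`, `x`, and `y`, and must apply `compute_scaled_ratio` and `dummy_expand`.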

Some Observations
1. Spark is utilized heavily for feature engineering.
2. Model fitting occurs in containerized Python and R environments.
3. Individual jobs communicate via data dependencies.
4. Our inventory recommendation algorithms are specified with a high degree of tooling.
5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing (extract, train, load).
6. Individual models are isolated from one another and can fail without impacting the rest of the group.
7. Data is contextual, e.g., by item type or business line.
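
The observation that jobs communicate via data dependencies can be illustrated with a small sketch: if every job declares the tables it reads and writes, a valid execution order falls out of those declarations. The job names, table names, and `execution_order` helper below are hypothetical; the sketch uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job specs: each job declares the tables it reads and writes.
JOBS = {
    "extract_client": {"reads": ["ratings"], "writes": ["client_wide"]},
    "extract_item": {"reads": ["ratings"], "writes": ["item_wide"]},
    "build_model_a_data": {
        "reads": ["client_wide", "item_wide"],
        "writes": ["model_a_train"],
    },
    "train_model_a": {"reads": ["model_a_train"], "writes": ["model_a_artifact"]},
}


def execution_order(jobs):
    """Order jobs so that every table is written before it is read."""
    producer = {
        table: name for name, spec in jobs.items() for table in spec["writes"]
    }
    # Map each job to the set of jobs producing its inputs. Tables with no
    # producer (here "ratings") are treated as external inputs.
    graph = {
        name: {producer[t] for t in spec["reads"] if t in producer}
        for name, spec in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

Since jobs are coupled only through tables, a Spark extract and a containerized Python or R training job can sit in the same pipeline without knowing anything about each other's runtime.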

Platform Tooling is Important!

Desirable Properties of Infrastructure & Tooling
● Isolation should be guaranteed by the infrastructure
● It should be obvious what running jobs and services are doing, when, and why
● Access to data should be easy, consistent, and self-service
● Guide rails should enforce, or strongly encourage, idempotent patterns
● Scaling, logging, and security should be baked into infrastructure and tooling

Access to Data
● All data is managed and tracked by the Metastore
○ Hive metastore abstracted by Bumblebee
○ Location, Schema, Format
● Data access for Python and R is a first-class citizen
○ Typically accessed as dataframes
○ df = load_dataframe(namespace, table)
○ store_dataframe(df, namespace, table)
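
A toy version of this access pattern shows why routing everything through a metastore is convenient: callers name a table and never handle paths or formats. The sketch below is not the real `load_dataframe` / `store_dataframe`. It borrows the two names from the slide, but the in-memory `METASTORE` dict, the CSV format, and the use of a list of dicts in place of a dataframe (to stay dependency-free) are all assumptions.

```python
import csv
import os
import tempfile

# Stand-in for the metastore: (namespace, table) -> location, schema, format.
METASTORE = {}
DATA_ROOT = tempfile.mkdtemp()


def store_dataframe(rows, namespace, table):
    """Write rows (a list of dicts) and register the table in the metastore."""
    columns = list(rows[0].keys())
    location = os.path.join(DATA_ROOT, namespace, table + ".csv")
    os.makedirs(os.path.dirname(location), exist_ok=True)
    with open(location, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    METASTORE[(namespace, table)] = {
        "location": location,
        "schema": columns,
        "format": "csv",
    }


def load_dataframe(namespace, table):
    """Look the table up by name; the caller never sees a path or format."""
    entry = METASTORE[(namespace, table)]
    with open(entry["location"], newline="") as f:
        return list(csv.DictReader(f))
```

Note that this CSV stand-in returns every value as a string; a real implementation would use the recorded schema to restore column types.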

The cloud. Embrace elasticity.

Containerized Batch Jobs
● Containerized job execution has many benefits
○ Strong isolation
○ High degree of control over resources and environment
● But, needs abstraction over job definition and management
○ So we developed Flotilla
○ And open sourced it!
https://stitchfix.github.io/flotilla-os/

Questions?
Get in touch:
jmagnusson@stitchfix.com
@jeffmagnusson
http://www.linkedin.com/in/jmagnuss
