The Function, the Context, and the Data
Building an Abstraction for Simpler ML Ops at Stitch Fix
Elijah ben Izzy
Data Platform Engineer - Model Lifecycle
@elijahbenizzy
linkedin.com/in/elijahbenizzy
Try out Stitch Fix → goo.gl/Q3tCQ3
- Stitch Fix/Data Science (DS) @ Stitch Fix
- Common Workflows/Motivation
- Representing a Model
- Unlocked Capabilities
- Future Musings
Agenda
The right abstraction enables separation of concerns between DS and Platforms
Take Home
DAIS 2021 4
whoami
Stitch Fix?
Stitch Fix is a Personal Styling Service
Shop at your personal curated store. Check out what you like.
Data Science is Behind Everything We Do
algorithms-tour.stitchfix.com
Algorithms Org.
- 145+ Data Scientists and Platform Engineers
- 3 main verticals + platform
Data Platform
Data Science @ Stitch Fix
Common Approaches to Data Science
Typical organization:
● Horizontal teams
● Hand-offs between functions
● Coordination required
DATA SCIENCE / RESEARCH TEAMS
ETL TEAMS
ENGINEERING TEAMS
At Stitch Fix:
● Single organization
● No handoffs
● End to end ownership
● Lots of DS!
● Built on top of data platform tools & abstractions
Data Scientists (DS) are Full Stack
See https://cultivating-algos.stitchfix.com/
DATA SCIENCE
ETL
ENGINEERING
The Problem
The Problem with Verticals
“DS are full stack” != “DS builds stack from the ground up”
Goal: scale without
-> more complex infrastructure
-> more cognitive burden on DS
DS should always be full stack... ...but can we shorten the stack?
ML platform
Examining Workflows
Training (run at a regular cadence): etl.py -> save on s3 -> copy to production
Inference: model microservice, predictions in batch, streaming predictions
Analysis: track metrics, share with other teams
Optimizing the Workflow
Goal: Build abstraction to give DS all these capabilities for free
Caveat: Largely uniform workflows with independent technologies
??????? -> model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...
The Lede
Build or Buy?
We built our own
- Seamless integration with current infrastructure -> leverage
- Model tracking/management data model was not standard
- We have lots of segments/varying ways to slice and dice our models
- Custom build allows for pivoting as needed
- Invest in interface design to allow for plug/play with open-source options
Called it the Model Envelope
Hats off to MLFlow, TFX, modelDB!
What we Built
DS only writes training script -- the rest is configuration-driven
import model_envelope as me
from sklearn import linear_model, metrics

df_X, df_y = load_data_somehow()
model = linear_model.LogisticRegression(multi_class='auto')
model.fit(df_X, df_y)
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X, api_output=df_y,
                            tags={'canonical_name': 'foo-bar'})
# log_loss expects (y_true, y_pred): true labels first, then predicted probabilities
my_envelope.log_metrics(validation_loss=metrics.log_loss(df_y, model.predict_proba(df_X)))
Model Envelope (ctd.)
model envelope registry -> model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...
Representing a Model
Writing a Recipe
The instructions
The cookware
The ingredients
Representing a Model
The function: what the model does
The context: where/how to run the model
The data: data the model needs to run
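The three-part representation above could be modeled as plain data classes. This is only a sketch: the class and field names (`Function`, `Context`, `Data`, `Envelope`, and their attributes) are assumptions made for illustration, not the Model Envelope's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Function:
    """What the model does: the serialized artifact plus its input/output shape."""
    artifact: bytes
    query_function: str
    input_shape: Dict[str, str]
    output_shape: Dict[str, str]


@dataclass
class Context:
    """Where/how to run the model: the environment plus the tags that index it."""
    packages: Dict[str, str]
    language_version: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Data:
    """Data attached to the model: feature pointers, summary stats, metrics."""
    feature_pointers: List[str] = field(default_factory=list)
    summary_stats: Dict[str, Dict[str, float]] = field(default_factory=dict)
    metrics: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Envelope:
    """Hypothetical container grouping the three parts of a model."""
    function: Function
    context: Context
    data: Data


# Placeholder values only; a real artifact would be the serialized model bytes.
example = Envelope(
    function=Function(b"placeholder", "predict", {"x": "float"}, {"y": "int"}),
    context=Context({"scikit-learn": "1.0"}, "3.9", {"canonical_name": "foo-bar"}),
    data=Data(),
)
```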
The Function
Artifact + Shape
- Serialized model (bytes) including state
- Serialization metadata
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'})
DS passes object, platform serializes
Platform derives metadata
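A rough sketch of "DS passes object, platform serializes, platform derives metadata" follows. It assumes `pickle` as the serializer and a dictionary standing in for the model; the talk does not say which serializer the platform uses, and `serialize_with_metadata` and the metadata keys are hypothetical.

```python
import pickle
import platform
from typing import Any, Dict, Tuple


def serialize_with_metadata(model: Any) -> Tuple[bytes, Dict[str, str]]:
    """Serialize a model object and record how it was serialized."""
    protocol = pickle.HIGHEST_PROTOCOL
    artifact = pickle.dumps(model, protocol=protocol)
    metadata = {
        "serializer": "pickle",  # assumption: the real platform may use something else
        "pickle_protocol": str(protocol),
        "python_version": platform.python_version(),
        "model_class": f"{type(model).__module__}.{type(model).__qualname__}",
        "size_bytes": str(len(artifact)),
    }
    return artifact, metadata


# A dictionary of weights stands in for a trained model object.
toy_model = {"coef": [0.5, -1.2], "intercept": 0.1}
artifact, metadata = serialize_with_metadata(toy_model)
restored = pickle.loads(artifact)
```

Keeping the metadata next to the bytes is what would let a later consumer pick a matching deserializer and Python version.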
The Function
Artifact + Shape
- Function inputs
- Function outputs
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'})
DS passes sample dataframe or specifies type-annotations
Platform serializes, represents in custom format
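For the type-annotation route, a function's shape could be read from its signature. The sketch below uses only the standard library; `derive_shape` and the output layout are assumptions, since the platform's custom format is not shown in the slides.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


def _type_name(tp: Any) -> str:
    """Return a short, readable name for an annotation."""
    return getattr(tp, "__name__", str(tp))


def derive_shape(fn: Callable[..., Any]) -> Dict[str, Dict[str, str]]:
    """Describe a function's inputs and output from its type annotations."""
    hints = get_type_hints(fn)
    params = inspect.signature(fn).parameters
    inputs = {name: _type_name(hints.get(name, Any)) for name in params}
    outputs = {"return": _type_name(hints.get("return", Any))}
    return {"inputs": inputs, "outputs": outputs}


def predict(age: float, n_items: int) -> float:
    """Toy query function used only to demonstrate shape derivation."""
    return 0.1 * age + n_items


shape = derive_shape(predict)
```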
The Context
Environment + Index
- Installed packages
- Custom code
- Language + version
import my_custom_fancy_ml_module
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'},
# pip_env=['scikit-learn', 'pandas'], edge case if needed
custom_modules=[my_custom_fancy_ml_module])
Installed packages: Platform automagically derives, or DS passes pointers
Custom code: DS passes in as needed
Language + version: Platform automagically derives
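One way installed packages and the language version might be derived automatically is shown below, using the standard library's `importlib.metadata`. This is a sketch under that assumption; `capture_environment` is a hypothetical helper, not the platform's code.

```python
import platform
from importlib import metadata
from typing import Dict


def capture_environment() -> Dict[str, object]:
    """Snapshot the interpreter version and every installed distribution."""
    packages: Dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
```

Storing exact versions at training time is what makes it possible to rebuild the same environment at inference time.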
The Context
Environment + Index
- Key-value tags
- Spine/index of envelope registry
import my_custom_fancy_ml_module
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'},
custom_modules=[my_custom_fancy_ml_module])
Platform derives base tags
DS passes custom tags as desired
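To illustrate tags acting as the spine/index of a registry, here is a minimal in-memory sketch. The `Registry` class, its methods, and the tag names other than `canonical_name` are hypothetical; the real registry's query interface is not shown in the talk.

```python
from typing import Dict, List


class Registry:
    """Toy in-memory registry indexed by key-value tags."""

    def __init__(self) -> None:
        self._envelopes: List[Dict[str, object]] = []

    def publish(self, name: str, tags: Dict[str, str]) -> None:
        """Store an envelope name together with a copy of its tags."""
        self._envelopes.append({"name": name, "tags": dict(tags)})

    def query(self, **tag_filter: str) -> List[str]:
        """Return names of envelopes whose tags match every filter entry."""
        return [
            str(envelope["name"])
            for envelope in self._envelopes
            if all(envelope["tags"].get(k) == v for k, v in tag_filter.items())
        ]


registry = Registry()
registry.publish("model_a", {"canonical_name": "foo-bar", "region": "US"})
registry.publish("model_b", {"canonical_name": "foo-bar", "region": "UK"})
registry.publish("model_c", {"canonical_name": "baz", "region": "US"})

matches = registry.query(canonical_name="foo-bar", region="US")
```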
The Data
Training Data + Metrics
- Features
- Summary statistics
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
feature_store_pointers=...)
DS (optionally) passes spec for features
Platform derives summary stats from passed data
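Deriving summary statistics from the passed data might look like the sketch below. It uses plain lists per column instead of a dataframe to stay dependency-free; `summarize` and the chosen statistics are assumptions, as the slides do not list which stats are stored.

```python
import statistics
from typing import Dict, List


def summarize(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute per-column summary statistics for numeric training data."""
    stats: Dict[str, Dict[str, float]] = {}
    for name, values in columns.items():
        stats[name] = {
            "count": float(len(values)),
            "mean": statistics.fmean(values),
            "stdev": statistics.pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
        }
    return stats


training = {"age": [20.0, 30.0, 40.0], "n_items": [1.0, 1.0, 4.0]}
stats = summarize(training)
```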
The Data
Training Data + Metrics
- Scalars
- Fancy metrics
my_envelope = me.save_model(instance_name='my_model_instance_name',
instance_description='my_model_instance_description',
model=model,
query_function='predict',
api_input=df_X,
api_output=df_y,
feature_store_pointers=...)
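# predict_proba returns class probabilities; column 1 is the positive class (assumes binary labels)
evaluations = model.predict_proba(df_X)[:, 1]
my_envelope.log_metrics(
    validation_loss=metrics.log_loss(df_y, evaluations),
    roc_curve=metrics.roc_curve(df_y, evaluations))
DS logs metrics using Platform metric-schema library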
Unlocked Capabilities
Online Inference
Approach: Generate and automatically deploy a microservice for model predictions
DS:
1. Generates, tests out service locally
2. Sets up automatic deployment “rule”
3. Publishes model, waits
Platform:
1. Runs cron job to determine models for deployment
2. Generates code to run model microservice
3. Deploys models with config to AWS
4. Monitors/manages model infrastructure
Online Inference
The Function
- Serialized artifact loaded on service instantiation, called by the endpoints
- Function shape used to create OpenAPI spec/validate inputs
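As an illustration of using the function shape for an API spec and input validation, the sketch below turns a stored shape into a JSON-Schema-style request schema (the kind of object an OpenAPI document embeds) and checks payloads against it. The function names and the shape format are assumptions; a real service would more likely rely on an existing validation library.

```python
from typing import Any, Dict, List

# Mapping from hypothetical stored type names to JSON Schema types.
_JSON_TYPES = {"float": "number", "int": "integer", "str": "string", "bool": "boolean"}
_PY_TYPES = {"number": (int, float), "integer": (int,), "string": (str,), "boolean": (bool,)}


def to_request_schema(input_shape: Dict[str, str]) -> Dict[str, Any]:
    """Build a request-body schema from a stored input shape."""
    return {
        "type": "object",
        "required": sorted(input_shape),
        "properties": {n: {"type": _JSON_TYPES[t]} for n, t in input_shape.items()},
    }


def validate(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the payload is valid."""
    errors: List[str] = []
    for name in schema["required"]:
        if name not in payload:
            errors.append(f"missing field: {name}")
    for name, value in payload.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


schema = to_request_schema({"age": "float", "n_items": "int"})
ok_errors = validate({"age": 31.5, "n_items": 2}, schema)
bad_errors = validate({"age": "old"}, schema)
```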
Online Inference
The Context
- Tag spec used to automatically deploy whenever new model is published
- Note: user never has to call deploy()! Done through system-managed CD.
- Stored package versions used to build docker images
- Custom code made accessible to model for deserialization, execution
Docker Image: installed python packages, custom code
CD
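To make "stored package versions used to build docker images" concrete, the sketch below renders Dockerfile text from a stored environment. The base image, the `custom_code/` directory layout, and `render_dockerfile` itself are assumptions for illustration; the talk does not show how images are actually built.

```python
from typing import Dict


def render_dockerfile(language_version: str, packages: Dict[str, str]) -> str:
    """Render Dockerfile text that pins the packages captured at training time."""
    pins = " ".join(f"{name}=={version}" for name, version in sorted(packages.items()))
    lines = [
        f"FROM python:{language_version}-slim",  # assumed base image
        f"RUN pip install --no-cache-dir {pins}",
        "COPY custom_code/ /app/custom_code/",  # assumed location of the DS's custom modules
        "ENV PYTHONPATH=/app/custom_code",
    ]
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile("3.9", {"scikit-learn": "1.0.2", "pandas": "1.3.5"})
```

Pinning exact versions is the mechanism behind keeping the production environment equal to the training environment.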
Online Inference
The Data
- Summary stats used to validate/monitor input (data drift)
- Feature pointer used to load feature data
Feature Store
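A crude version of monitoring input drift with stored summary stats is sketched here: it flags a feature when the mean of incoming values sits several training standard deviations away from the training mean. The threshold, the stats layout, and `drift_report` are assumptions; production monitoring would likely use a proper statistical test.

```python
import statistics
from typing import Dict, List


def drift_report(
    training_stats: Dict[str, Dict[str, float]],
    batch: Dict[str, List[float]],
    threshold: float = 3.0,
) -> Dict[str, bool]:
    """Flag features whose batch mean is far from the training mean."""
    report: Dict[str, bool] = {}
    for name, stats in training_stats.items():
        values = batch.get(name, [])
        if not values or stats["stdev"] == 0:
            report[name] = False  # nothing to compare, or no spread in training data
            continue
        distance = abs(statistics.fmean(values) - stats["mean"]) / stats["stdev"]
        report[name] = distance > threshold
    return report


training_stats = {"age": {"mean": 30.0, "stdev": 8.0}}
drifted = drift_report(training_stats, {"age": [70.0, 72.0]})
stable = drift_report(training_stats, {"age": [29.0, 33.0]})
```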
Batch Inference
Approach: Generate batch job in Stitch Fix workflow system (on top of airflow/flotilla)
DS:
1. Creates config for batch job (local/spark)
a. tag query to choose model
b. input/output tables
2. Executes as part of ETL
Platform:
1. Spins up spark cluster (if specified)
2. Loads input data, optionally joins with features
3. Executes model’s predict function over input
4. Saves to output table
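The batch-job config the DS writes (a tag query to choose the model, input/output tables, and local or spark execution) could be represented as below. The field names, table names, and validation rules are hypothetical; the actual config format is not shown in the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BatchJobConfig:
    """Hypothetical config for a batch inference job."""
    tag_query: Dict[str, str]  # selects which model to run
    input_table: str
    output_table: str
    mode: str = "local"  # "local" or "spark"

    def __post_init__(self) -> None:
        if self.mode not in ("local", "spark"):
            raise ValueError(f"mode must be 'local' or 'spark', got {self.mode!r}")
        if not self.tag_query:
            raise ValueError("tag_query must contain at least one tag")


config = BatchJobConfig(
    tag_query={"canonical_name": "foo-bar"},
    input_table="example_db.inputs",  # placeholder table names
    output_table="example_db.predictions",
    mode="spark",
)
```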
Batch Inference
The function
- Serialized artifact loaded on batch job start
- Function shape used to validate against inputs and outputs
- MapPartitions + Pyarrow used to run models that take in DFs efficiently on spark -- abstracted away from user
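The partition-wise pattern can be sketched in pure Python, with no Spark or PyArrow involved. A function of this form, taking an iterator of rows and yielding predictions, is the kind of callable Spark's `mapPartitions` accepts; the batching lets a model that expects many rows at once be called on chunks. `predict_partition` and the toy model are assumptions for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]


def predict_partition(
    rows: Iterable[Row],
    predict: Callable[[List[Row]], List[float]],
    batch_size: int = 2,
) -> Iterator[float]:
    """Run a batch-oriented predict function over one partition, chunk by chunk."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield from predict(batch)


def toy_predict(batch: List[Row]) -> List[float]:
    """Stand-in model: the prediction is the sum of each row's features."""
    return [float(sum(row)) for row in batch]


predictions = list(predict_partition([(1, 2), (3, 4), (5, 6)], toy_predict))
```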
Batch Inference
The context
- Frozen package, language versions used in installing dependencies
- Custom code made accessible to model for deserialization, execution
- Tags used to determine which model to run
Docker Image: installed python packages, custom code
Batch Inference
The data
- Feature pointer used to load feature data if IDs specified
- Evaluation table pointers stored in the registry
Feature Store
Metrics Tracking
Approach: Allow for metrics tracking with tag-based querying
DS:
1. Logs metrics using python client
2. Explores in the Model Operations Dashboard
3. Saves URL for favorite viz
Platform:
1. Builds/manages dashboard
2. Adds fancy new metric types!
In Summation
Value Added by Separating Concerns
DS concerned with...
Creating the best model
Choosing the best libraries
Determining the right metrics to log
Platform concerned with...
Making deployment easy
Ensuring environment in prod == environment in training
Providing easy metrics analysis
Wrapping up complex systems
Behind-the-scenes best practices
DS focuses on creating the best model [writing the recipe]
Platform focuses on optimal infrastructure [cooking it]
Future Musings
Some Ideas...
More advanced use of the data
- production monitoring: utilize training data/stats to have visibility into prod/training drift
More deployment contexts
- Predictions on streaming/kafka topics
More sophisticated feature tracking/integration
- Feature stores are all the rage…
Lambda-like architecture
- Rather than requiring a deploy, can we query system for a model’s predictions?
- Requires more unified environments…
Attach external capabilities to replace home-built components of our own system...
Questions?
Find me at:
@elijahbenizzy
linkedin.com/in/elijahbenizzy
elijah.benizzy@stitchfix.com
Try out Stitch Fix → goo.gl/Q3tCQ3
