Pachyderm
Reproducible and Compliant Data Science
Nick Harvey - Lead Developer Advocate
Pachyderm Inc.
Nick@pachyderm.com
@nicksharvey
As Data Scientists...
We
Pachyderm.com
As Data Scientists...
We
“Big” Data
[Diagram: Production; ML/AI Model Training; ML/AI Model Inference or Prediction]
[Diagram: Production. Labels: ML/AI Model Training; Data Input; Transforms; Data Ingestion; Data Cleaning; Feature Engineering; Model Selection, Parameter Search; Feature Transforms; Production; Model Testing; Model Export & Optimization; ML/AI Model Inference or Prediction; Post Processing]
To Reach Its Full Potential
Machine Learning Needs
1. Data to have the same production practices as code
2. Empowered developers, not restricted
3. Organization-wide confidence
Obstacles that prevent Effective Data Science
Data Divergence
Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used.
Tooling Constraints
Infra often restricts the tooling options available to data scientists.
Not Reproducible
Data teams can’t reproduce results because they can’t track every version of data and code throughout the system.
For data science to be successful, outputs need to be reproducible
Manage data with the same production practices as code: Version control for Data
Developers need to be empowered with choice, not restricted: Containerized data pipelines
Be able to instantly reconstruct any past output/decision: Data Lineage
General Fusion uses Pachyderm to Power Commercial Fusion Research
“The true tipping point in our decision to use Pachyderm was its version control features for managing our data.”
- Jonathan Fraser, Engineer at General Fusion
General Fusion collects large sets of complex data from thousands of sensors. Managing, scaling, and processing that data is a challenge.
Criteria
1. A data science platform that could scale and adapt with their growth.
2. Augment existing experimental and analysis workflows.
3. Seamless collaboration with external scientific partners.
Business Outcome
1. Data versioning - Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis.
2. Data provenance - Every data transformation is tracked, allowing any result to be 100 percent reproducible and verifiable.
Pachyderm provides reproducibility through Data Versioning
Identify and revert “bad” data changes
Version model binaries and parameters along with the data used to train them
Reproduce specific processes using historical state(s) of data
Commit ID: a5bcc61...1812
Commit ID: 7afad96...680e
Commit ID: b85ea63...e4d4
Commit ID: 7585b4e...0cc5
Commit ID: af4cf48...8840
person.png
stopsign.png
road.png
boat.png
bike.png
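The versioning idea on this slide can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Pachyderm's storage format or API: the `Repo` class, the `commit_id` helper, and the file contents are all hypothetical. It shows a commit ID derived from file contents, and a "bad" change reverted by re-committing an earlier snapshot.

```python
from __future__ import annotations

import hashlib
import json


def commit_id(files: dict[str, bytes], parent: str | None) -> str:
    """Derive a commit ID from the parent commit and a hash of every file."""
    manifest = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
    payload = json.dumps({"parent": parent, "files": manifest}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Repo:
    """Hypothetical toy repo: every commit is an immutable snapshot of the files."""

    def __init__(self) -> None:
        self.commits: dict[str, dict[str, bytes]] = {}
        self.head: str | None = None

    def commit(self, files: dict[str, bytes]) -> str:
        cid = commit_id(files, self.head)
        self.commits[cid] = dict(files)  # copy so later edits cannot alter history
        self.head = cid
        return cid

    def checkout(self, cid: str) -> dict[str, bytes]:
        return dict(self.commits[cid])


repo = Repo()
good = repo.commit({"person.png": b"pixels-v1"})   # placeholder file contents
bad = repo.commit({"person.png": b"corrupted"})
# Revert the "bad" change by committing the earlier snapshot again.
restored = repo.commit(repo.checkout(good))
```

Because every snapshot is kept, any historical state of the data can be checked out again and fed to the same process.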
Pachyderm provides workflow management through Containerized Analyses
Use any languages and frameworks in pipelines
Port your workflows to any infrastructure
Easily transition from local dev to production deploy
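As a rough sketch of what a containerized analysis looks like, the snippet below builds a pipeline description as JSON from Python. The field layout (`pipeline`, `transform`, `input.pfs`) is assumed from Pachyderm's published pipeline specification and may differ between versions. The pipeline name, repo name, image, and script path are hypothetical placeholders, not taken from the talk.

```python
import json

# Hypothetical names: "edges", "images", "example/edges:1.0" and "/edges.py"
# are placeholders. The field layout is an assumption; check the pipeline
# specification for the Pachyderm version in use.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "transform": {
        # Any container image works, so any language or framework can run here.
        "image": "example/edges:1.0",
        "cmd": ["python3", "/edges.py"],
    },
    # Read every top-level file in the "images" repo as a separate unit of work.
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}

spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

Since the analysis code lives inside the image, the same description can move from a local cluster to production without changes to the code itself.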
Pachyderm provides workflow management through Data Pipelines
Use any languages and frameworks in pipelines
Port your workflows to any infrastructure
Easily transition from local dev to production deploy
[Diagram: ETL Pipeline; ML pipeline; CI/CD; Application; Pachyderm]
[Diagram: Versioned Training Data → Pre-Processing → Versioned Pre-Processed Data → Training → Versioned Model → Model Export]
Coming Soon: github.com/kubeflow/examples
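The multi-stage flow in this diagram can be modelled as a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to work out a valid run order. The stage names are adapted from the diagram labels, and the graph itself is an assumption about how the stages connect.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages (or data sets) whose output it reads.
# The edges are an assumed reading of the diagram, not taken from a real config.
stages = {
    "pre-processing": {"training-data"},
    "training": {"pre-processing"},
    "model-export": {"training"},
}

# static_order() yields every node after all of its dependencies.
run_order = list(TopologicalSorter(stages).static_order())
print(run_order)
# ['training-data', 'pre-processing', 'training', 'model-export']
```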
Pachyderm provides audit trails via Data Provenance
Track every version of data and code that produced a result
Maintain compliance and reproducibility
Manage relationship between historical data states
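One way to picture provenance is as a record, attached to every result, of the inputs and code that produced it. The sketch below is a hypothetical model and not Pachyderm's API: the `Artifact` class, the commit IDs `c1` to `c3`, and the image names are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    commit: str
    code: str = ""  # container image (or script version) that produced it
    inputs: tuple[Artifact, ...] = ()


def provenance(artifact: Artifact) -> list[tuple[str, str, str]]:
    """Walk upstream and list every (name, commit, code) behind a result."""
    trail = []
    stack = [artifact]
    seen = set()
    while stack:
        node = stack.pop()
        if (node.name, node.commit) in seen:
            continue
        seen.add((node.name, node.commit))
        trail.append((node.name, node.commit, node.code))
        stack.extend(node.inputs)
    return trail


# Hypothetical three-step lineage: raw data, cleaned data, trained model.
raw = Artifact("training-data", "c1")
clean = Artifact("pre-processed-data", "c2", code="example/preprocess:1.0", inputs=(raw,))
model = Artifact("model", "c3", code="example/train:1.0", inputs=(clean,))

trail = provenance(model)
for name, commit, code in trail:
    print(name, commit, code)
```

Starting from the model, the trail lists the exact data version and code version at every upstream step, which is the information an audit needs.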
[Figure: Pachyderm Stack Diagram]
Data Provenance In Action
Being able to pinpoint exactly what data is being used is hard enough for most companies. Tack on the requirement of having to edit/remove a specific piece of data without disruption, and that seems next to impossible.
General Data Protection Regulation
GDPR Example - Before
What happens when user “Jared” opts out?
[Diagram: Users Database; Model Training; Model Deployed; “Black Box Problem”]
● File a ticket
● Entire audit of pipeline
● Removal of Jared’s data
● Models need to be re-trained and tested
● Audit to ensure Jared is not part of the future
● Etc.
Time-consuming manual process
GDPR Example - With Pachyderm
What happens when user “Jared” opts out?
[Diagram: Users Database; Model Training; Model Deployed]
Jian Yang, commit: 9fa0a4...74f
Gaven Belson, commit: 8593ef...4d7
Jared Dunn, commit: 60fae8...7d0
“pachctl delete-file jared.info”
Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
GDPR Request Met
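The reason a single delete can be enough is that recorded lineage tells the system which downstream results depended on the removed file. The sketch below illustrates that lookup in plain Python. It is not how Pachyderm is implemented: the `derived_from` mapping and every file name in it are hypothetical.

```python
from __future__ import annotations

# Which inputs each output was computed from (hypothetical file names).
derived_from = {
    "features/jian.csv": {"users/jian.info"},
    "features/jared.csv": {"users/jared.info"},
    "model.bin": {"features/jian.csv", "features/jared.csv"},
}


def affected_by(removed: str) -> set[str]:
    """Return every output that depends, directly or indirectly, on `removed`."""
    stale: set[str] = set()
    frontier = {removed}
    while frontier:
        # Next layer: outputs that read anything in the current frontier.
        frontier = {
            out
            for out, deps in derived_from.items()
            if deps & frontier and out not in stale
        }
        stale |= frontier
    return stale


to_recompute = sorted(affected_by("users/jared.info"))
print(to_recompute)
# ['features/jared.csv', 'model.bin']
```

Only the outputs that actually touched Jared's data are flagged for recomputation; results derived solely from other users are left alone.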
Pachyderm in 60 seconds
Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.
github.com/pachyderm
Thank you

End-to-End Machine learning pipelines for Python driven organizations - Nick Harvey
