Copyright © 2017 by DataKitchen, Inc. All Rights Reserved.
Copyright © 2018 by DataKitchen, Inc. All Rights Reserved.
Agenda
How to go from Data Science to Data Operations (#DataOps)
Introductions
Data Science Challenges
What is DataOps?
Seven Shocking Steps to DataOps
Pulling it together
Keep this question in mind
What can I take from this session and use on Monday?
For slides contact
gil@DataKitchen.io
Speaker – co-Founder of DataKitchen
Gil Benghiat, Founder, VP of Products
gil@datakitchen.io
A series of data centric software projects
🎓 Applied Math / Biology @ Brown
🎓 Computer Science @ Stanford
🏢 Bell Labs, Sybase, PhaseForward, LeapFrogRx
DataKitchen DataOps Software Platform
Main Features
1. Orchestrate complex data pipelines
2. Deploy new ideas to production
3. Automate tests and monitor quality
Enables
1. Fast delivery of analytics
2. High data quality
3. Using your favorite tools and data stores
Agenda
Introductions
• Data Science Challenges
What is DataOps?
Seven Shocking Steps to DataOps
Pulling it together
Figure 1: Only a small fraction of real-world ML systems is
composed of the ML code, as shown by the small black
box in the middle. The required surrounding infrastructure
is vast and complex.
Source: Google, Advances in Neural Information Processing Systems 28 (NIPS 2015)
Business Need
Prep Data
Feature Extraction
Build Model
Evaluate Model
Deploy Model
Monitor Model
Iterate, Test and Improve
Model building
Agenda
Introductions
Data Science Challenges
• What is DataOps?
Seven Shocking Steps to DataOps
Pulling it together
Genesis of DataOps
People, Process, Organization
Technical Environment
= 7 steps
DataOps: It Began With Agile
Data Engineer, Data Scientist, Data Analyst, Business Partner
Agile Development is a mindset:
1. Collaborate with your customers
2. Respond to change
3. Measure progress by working analytics
4. Release frequently (most important first)
5. Get feedback on your releases
6. Adjust your behavior to become more effective
4 Values, 12 Principles
Be Pragmatic, Not Dogmatic
Focus on Value
Agenda
Introductions
Data Science Challenges
What is DataOps?
• Seven Shocking Steps to DataOps
Pulling it together
Seven Steps to DataOps
1. Orchestrate Two Journeys
2. Add Tests
3. Use a Version Control System
4. Branch and Merge
5. Use Multiple Environments
6. Reuse & Containerize
7. Parameterize Your Processing
Journey 1: Orchestrate data to customer value
Analytic processes are like manufacturing: materials (data) and production outputs (refined data, charts, graphs, models)
Access: Python Code
Transform: SQL Code, ETL
Model: R Code
Visualize: Tableau Workbook
Report: Tableau Online
❶
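A minimal sketch of what orchestrating these five stages could look like. The stage functions, the shared context dictionary, and the sample data are all hypothetical stand-ins; in a real pipeline each stage would call out to Python, SQL/ETL, R, and Tableau rather than run in-process.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the five stages named on the slide.
# Each stage takes a shared context dict and returns it with new keys added.

def access(ctx: Dict) -> Dict:
    # Pretend to pull raw data from a source system.
    ctx["rows"] = [{"region": "east", "sales": 120}, {"region": "west", "sales": 80}]
    return ctx

def transform(ctx: Dict) -> Dict:
    # Refine the raw data into an aggregate.
    ctx["total_sales"] = sum(r["sales"] for r in ctx["rows"])
    return ctx

def model(ctx: Dict) -> Dict:
    # Toy "model": naive forecast that next period equals this period.
    ctx["forecast"] = ctx["total_sales"]
    return ctx

def visualize(ctx: Dict) -> Dict:
    # Prepare the points a chart would plot.
    ctx["chart"] = [(r["region"], r["sales"]) for r in ctx["rows"]]
    return ctx

def report(ctx: Dict) -> Dict:
    ctx["report"] = f"Total sales: {ctx['total_sales']}, forecast: {ctx['forecast']}"
    return ctx

PIPELINE: List[Tuple[str, Callable[[Dict], Dict]]] = [
    ("access", access),
    ("transform", transform),
    ("model", model),
    ("visualize", visualize),
    ("report", report),
]

def run_pipeline(steps: List[Tuple[str, Callable[[Dict], Dict]]]) -> Dict:
    """Run each stage in order and record which stages completed."""
    ctx: Dict = {"completed": []}
    for name, step in steps:
        ctx = step(ctx)
        ctx["completed"].append(name)
    return ctx

result = run_pipeline(PIPELINE)
print(result["report"])
```

Keeping the stage order in one list means the whole journey from data to customer value is visible, and runnable, in one place.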
Journey 2: Speed ideas to production
Analytic processes are like software development: deliverables continually move from development to production
❶
Diverse Team: Data Engineers, Data Scientists, Data Analysts
Diverse Tools
Diverse Customers: Business Customer, Products & Systems
Innovation and Value Pipeline Together
Focus on both orchestration and deployment while automating & monitoring quality
Don’t want to break production when I deploy my changes
Don’t want to learn about data quality issues from my customers
❶
Add Tests
Monitor quality
Data Quality Monitoring: To ensure that data quality remains high throughout the Value Pipeline.
Code Quality Monitoring: Before promoting work, running new and old tests gives high confidence that the change did not break anything in the Innovation Pipeline.
❷
Automate Monitoring & Tests In Production
Test Every Step And Every Tool in Your Value Pipeline
Are data inputs free from issues?
Is your business logic still correct?
Are your outputs consistent?
And Save Test Results!
Access: Python Code
Transform: SQL Code, ETL
Model: R Code
Visualize: Tableau Workbook
Report: Tableau Online
❷
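The three questions on this slide can each become an automated check. The sketch below is one possible shape, assuming a small in-memory dataset; the check functions, field names, and the `test_log` list are hypothetical and stand in for whatever storage you use to save test results.

```python
from datetime import datetime, timezone
from typing import Dict, List

def check_inputs(rows: List[Dict]) -> bool:
    """Are data inputs free from issues? Here: no missing customer_id."""
    return all(r.get("customer_id") is not None for r in rows)

def check_business_logic(rows: List[Dict]) -> bool:
    """Is the business logic still correct? Here: every amount is non-negative."""
    return all(r["amount"] >= 0 for r in rows)

def check_outputs(rows_in: List[Dict], rows_out: List[Dict]) -> bool:
    """Are outputs consistent? Here: the transform neither drops nor duplicates rows."""
    return len(rows_in) == len(rows_out)

# Saved test results (hypothetical store; a table or file in practice).
test_log: List[Dict] = []

def record(name: str, passed: bool) -> bool:
    """Append a timestamped result so test history is kept, then pass it through."""
    test_log.append({
        "test": name,
        "passed": passed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return passed

raw = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 2, "amount": 5.5}]
transformed = [dict(r, amount_cents=int(round(r["amount"] * 100))) for r in raw]

record("inputs have customer_id", check_inputs(raw))
record("amounts are non-negative", check_business_logic(raw))
record("row count preserved by transform", check_outputs(raw, transformed))

for entry in test_log:
    print(entry["test"], "PASS" if entry["passed"] else "FAIL")
```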
Support Multiple Types Of Tests
Testing Data Is Not Just Pass/Fail in Your Value Pipeline
Support Test Types
• Error – stop the line
• Warning – investigate later
• Info – list of changes
Keep Test History
• Statistical Process Control
❷
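One way to read the severity levels and the test-history point together, as a sketch only. The `Severity` enum, the `StopTheLine` exception, and the 3-sigma rule are assumptions chosen for illustration; the slide names Statistical Process Control but does not specify which control rule to use.

```python
from enum import Enum
from statistics import mean, stdev
from typing import List, Tuple

class Severity(Enum):
    ERROR = "error"      # stop the line
    WARNING = "warning"  # investigate later
    INFO = "info"        # list of changes

class StopTheLine(Exception):
    """Raised when an error-level test fails (hypothetical name)."""

def handle(name: str, passed: bool, severity: Severity,
           messages: List[Tuple[str, str]]) -> None:
    """Halt on failed error-level tests; collect everything else for later review."""
    if passed:
        return
    if severity is Severity.ERROR:
        raise StopTheLine(name)
    messages.append((severity.value, name))

def within_control_limits(history: List[float], value: float,
                          sigmas: float = 3.0) -> bool:
    """Assumed control rule: value lies within mean +/- sigmas * sample std dev."""
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) <= sigmas * spread

# Saved history of a metric from earlier runs (made-up numbers).
row_count_history = [1000, 1010, 990, 1005, 995]

messages: List[Tuple[str, str]] = []
handle("row count within control limits",
       within_control_limits(row_count_history, 1200),
       Severity.WARNING, messages)

stopped_on = None
try:
    handle("primary key is not null", False, Severity.ERROR, messages)
except StopTheLine as exc:
    stopped_on = str(exc)

print(messages, stopped_on)
```

The control-limit check only works because earlier results were kept, which is the practical reason to save test history.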
Types of Tests
❷
Example Tests
Simple
❷
For the Innovation Pipeline
Tests Are Also For Code: Keep Data Fixed
Deploy Feature
Run all tests here before promoting
❷
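A small sketch of testing code against fixed data. The fixture, the expected result, and the `sales_by_region` function are hypothetical; the point is that when the input never changes, any change in output must come from the code.

```python
from typing import Dict, List

# Fixed test data: never changes, so a differing result means the code changed.
FIXED_INPUT = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 30},
]
EXPECTED = {"east": 150, "west": 80}

def sales_by_region(rows: List[Dict]) -> Dict[str, int]:
    """Code under test (hypothetical): total sales per region."""
    totals: Dict[str, int] = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]
    return totals

def run_code_tests() -> List[str]:
    """Run new and old tests against the fixed data; return names of failures."""
    failures = []
    if sales_by_region(FIXED_INPUT) != EXPECTED:
        failures.append("sales_by_region matches expected totals")
    if sales_by_region([]) != {}:
        failures.append("sales_by_region handles empty input")
    return failures

failures = run_code_tests()
ready_to_promote = not failures
print("ready to promote:", ready_to_promote)
```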
Use a Version Control System
At The End Of The Day, Analytic Work Is All Just Code
Access: Python Code
Transform: SQL Code, ETL Code
Model: R Code
Visualize: Tableau Workbook XML
Report: Tableau Online
Source Code Control
❸
Branch & Merge
Source Code
Control
Branching & Merging enables people to safely work on their own tasks
❹
❹ Example branch and merge pattern
[Diagram: feature branches f1, f2, f3 and f5 branch from and merge back into main / master / trunk across Sprint 1 and Sprint 2]
Use Multiple Environments
Your Analytic Work Requires Coordinating Tools And Hardware
Analytic Environment
Access: Python Code
Transform: SQL Code, ETL Code
Model: R Code
Visualize: Tableau Workbook XML
Report: Tableau Online
❺
Use Multiple Environments
Provide an Analytic Environment for each branch
• Analysts and Data Scientists need a controlled environment for their experiments
• Engineers need a place to develop outside of production
• Update Production only after all tests are run!
❺
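As an illustration of an environment per branch, the sketch below derives an isolated schema and output directory from a branch name and gates production updates on test results. The naming scheme, paths, and function names are assumptions for the example, not something the slides prescribe.

```python
import re
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class AnalyticEnvironment:
    branch: str
    schema: str         # database schema this branch reads and writes
    output_dir: str     # where this branch's outputs land
    is_production: bool

def environment_for(branch: str, production_branch: str = "main") -> AnalyticEnvironment:
    """Map a branch to its own isolated environment (hypothetical naming scheme)."""
    if branch == production_branch:
        return AnalyticEnvironment(branch, "analytics_prod", "/data/prod", True)
    # Turn the branch name into a safe identifier, e.g. feature/f1 -> feature_f1.
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    return AnalyticEnvironment(branch, f"analytics_dev_{slug}", f"/data/dev/{slug}", False)

def can_update_production(test_results: Dict[str, bool]) -> bool:
    """Update Production only after all tests are run and every one passed."""
    return bool(test_results) and all(test_results.values())

dev_env = environment_for("feature/f1-churn-model")
prod_env = environment_for("main")
print(dev_env.schema, prod_env.schema)
print(can_update_production({"inputs ok": True, "outputs consistent": True}))
```

An empty set of results is treated as a failure here, so skipping the tests can never count as passing them.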
Reuse & Containerize
Containerize
1. Manage the environment for each model (e.g. Docker, VM, AMI)
2. Practice Environment Version Control: make production and development areas identical
Reuse
1. The code
2. Data
❻
Parameterize Your Processing
Think Of Your Pipeline Like A Big Function
• Named sets of parameters will increase your velocity
• With parameters, you can vary
  • Inputs
  • Outputs
  • Steps in the workflow
• You can make a time machine
❼
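A sketch of the pipeline as one big function driven by named parameter sets. The set names, table names, step names, and sample data are all hypothetical. The "time machine" is read here as an `as_of` date that reruns the pipeline as if it were an earlier day, which is one plausible interpretation of the slide.

```python
from datetime import date
from typing import Dict, List

# Named sets of parameters (hypothetical): each varies inputs, outputs, or steps.
PARAMETER_SETS: Dict[str, Dict] = {
    "production": {
        "input_table": "sales", "output_table": "sales_report",
        "steps": ["clean", "aggregate"], "as_of": None,
    },
    "dev_small": {
        "input_table": "sales_sample", "output_table": "sales_report_dev",
        "steps": ["clean", "aggregate"], "as_of": None,
    },
    "time_machine": {
        "input_table": "sales", "output_table": "sales_report_2018_01",
        "steps": ["clean", "aggregate"], "as_of": date(2018, 1, 31),
    },
}

# Made-up source tables standing in for a real data store.
TABLES: Dict[str, List[Dict]] = {
    "sales": [
        {"day": date(2018, 1, 15), "amount": 100},
        {"day": date(2018, 1, 20), "amount": None},
        {"day": date(2018, 2, 15), "amount": 50},
    ],
    "sales_sample": [{"day": date(2018, 1, 15), "amount": 100}],
}

def run(params: Dict, tables: Dict[str, List[Dict]]) -> Dict:
    """The whole pipeline as one function of its parameters."""
    rows = tables[params["input_table"]]
    if params["as_of"] is not None:
        # Time machine: only see data that existed on the as_of date.
        rows = [r for r in rows if r["day"] <= params["as_of"]]
    result = rows
    for step in params["steps"]:
        if step == "clean":
            result = [r for r in result if r["amount"] is not None]
        elif step == "aggregate":
            result = sum(r["amount"] for r in result)
        else:
            raise ValueError(f"unknown step: {step}")
    return {params["output_table"]: result}

production_run = run(PARAMETER_SETS["production"], TABLES)
time_machine_run = run(PARAMETER_SETS["time_machine"], TABLES)
print(production_run, time_machine_run)
```

Switching from production to a small development run, or to a historical rerun, is then a matter of picking a different named set rather than editing code.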
Agenda
Introductions
Data Science Challenges
What is DataOps?
Seven Shocking Steps to DataOps
• Pulling it together
Business Need
Prep Data
Feature Extraction
Build Model
Evaluate Model
Deploy Model
Monitor Model
Iterate, Test and Improve
Model building
The 7 Steps and Data Science
Steps (columns): Journeys | Tests | Version Control | Branch and Merge | Environments | Reuse / Containerize | Parameterize
Business Need: Agile
Prep Data: x x x x x x x
Feature Extraction: x x x x x x x
Build Model: x x x x x x x
Evaluate Model: x
Deploy Model: x x x x x x x
Monitor Model: x
Make a note to yourself
What can I take from this session and use on Monday?
For slides contact
gil@DataKitchen.io
Thank you for attending
