Building Data Science into Organizations: Field Experience
Chris Robison
Joseph Bradley
Data + AI Summit 2021
Our perspectives
Joseph Bradley
● Sr. Solutions Architect
● 2nd ML Engineer at Databricks
● Apache Spark committer and PMC member
Chris Robison
● Sr. Solutions Architect
● Former Director of Data Science and Omni-channel Marketing at Overstock.com
● Career data scientist and avid Apache Spark user
The Data and AI Company
5000+ CUSTOMERS across the globe
Lakehouse: One simple platform to unify all of your data, analytics, and AI workloads
ORIGINAL CREATORS
So you want to do Data Science...
98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives.
14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production.
Source: NewVantage Partners
Goals of a DS/ML/AI program
Long-term
● Show business impact
● Increase productivity
● Scale DS across the organization
Short-term
● Validate that DS is worthwhile
● Get resources:
○ Data
○ Data Scientists
○ Executive sponsorship
● Show vision
Challenges of a DS/ML/AI program
Technology and platform
● Poor integration between Data Science and other data teams
● Planning for scale and production, under investment constraints
Organization
● Team building: skill sets, hiring, and training
● Team organization: embedded vs. standalone
● Business and executive alignment
● R&D
Set up reliable and efficient production processes.
Scale and automate DS/ML workloads.
Use popular tools. Emphasize productivity.
Platform
Improve executive visibility and cross-team integration.
Build communication channels.
Think about products.
Get quick wins. Plan for the future.
Philosophy
Strategy
Organization
Crawl Walk Run
Embed Data Science in the organization’s DNA.
Reproduce end-to-end and across multiple verticals.
Execution
Use agile processes for data science
● Iterate with sprints and standups
● Fail fast in R&D
Transparency is key
● Communicate frequently with your business partners and executives
● Make business partners and consumers an integral part of the process
Collaborate with the data and platform teams
● Make your needs known and understood
● Beware shortcuts which build technical debt
Organization building -- “Crawl” stage
ML/AI Success
● Successful MVPs with a few models manually in production
● Starting to build an AI/ML Strategy
● In discovery phase for new projects and low-hanging fruit
Company
● Desire to become data driven
● Smaller in size (startup) or an existing organization with new data initiatives
Team
● 1-2 Data Scientists (likely) reporting to a CTO
● Acting as full stack data scientists
● Typically a math or computer science background
Keep using familiar tools
Common tools | Descriptions
Notebooks and IDEs | Python notebooks, RStudio, local IDEs
Languages | Python, R -- and potentially SQL, Scala, Java, etc.
ML libraries | Standard libraries, plus bring-your-own libraries and versions
Git | Notebook versioning, and syncing across platforms with Git
Data | Pandas, Spark, Koalas; any data sources or formats
Visualization | Matplotlib, Plotly, Seaborn, etc.
Integrations | Platforms must integrate with any libraries, systems, or services. Platforms which are cloud-native and have both UIs and APIs are ideal.
Build around OSS standards for portability
# Downloads / month: 990K, 350K, 1.7M, 516K (project logos not preserved in the text)
Be more productive with self-service analytics
Compute resources
● Start up machines or clusters on demand
● Option 1: Use your own cluster
● Option 2: Share clusters, with separate Python env per user or project.
● Cost controls: Autoscaling, auto-termination, spot instances, cost tracking
● Governance: Cluster policies for enforcement
Libraries and environment
● Plug & play environments with popular ML libraries
● And customization: requirements.txt, conda.yaml
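The cost controls and cluster policies above can be pictured as a check that runs before a self-service cluster request is accepted. The following is a minimal sketch only: `ClusterPolicy`, `ClusterRequest`, and `policy_violations` are hypothetical names invented for illustration, and the limits are made up. It is not the API of any particular platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterPolicy:
    """Hypothetical limits a platform team might enforce."""
    max_workers: int
    max_idle_minutes: int          # upper bound on the auto-termination timeout
    require_autoscaling: bool = True


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical self-service request from a data scientist."""
    min_workers: int
    max_workers: int
    idle_minutes: int              # 0 means "never auto-terminate"


def policy_violations(request: ClusterRequest, policy: ClusterPolicy) -> List[str]:
    """Return human-readable violations; an empty list means the request is allowed."""
    problems = []
    if request.max_workers > policy.max_workers:
        problems.append(
            f"max_workers {request.max_workers} exceeds limit {policy.max_workers}")
    if policy.require_autoscaling and request.min_workers >= request.max_workers:
        problems.append("autoscaling required: min_workers must be below max_workers")
    if request.idle_minutes == 0 or request.idle_minutes > policy.max_idle_minutes:
        problems.append(
            f"auto-termination must be set to at most {policy.max_idle_minutes} minutes")
    return problems


policy = ClusterPolicy(max_workers=8, max_idle_minutes=60)
ok = ClusterRequest(min_workers=1, max_workers=4, idle_minutes=30)
too_big = ClusterRequest(min_workers=16, max_workers=16, idle_minutes=0)

print(policy_violations(ok, policy))        # allowed request
print(policy_violations(too_big, policy))   # oversized, fixed-size, never terminates
```

Returning a list of violations rather than a single pass/fail flag lets the requester fix every problem in one round trip, which fits the self-service goal.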
Running example: ML prioritization of Sales opps
● Discussion with Sales stakeholders to understand the problem and data, and to set expectations
● Explanation of results and future potential to Sales
● Build executive alignment and buy-in for long-term initiatives
● DS team training and hiring
● Get an ML workspace: Simple machine or cluster creation. Ready-to-go DS environments.
● Sync code: Import .py or .ipynb notebook, and sync with Git.
● Load data: Efficient data loading from S3, ADLS, etc.
● Develop DL model: Use notebooks + TensorBoard for interactive development.
● Analyze results: Review auto-logged MLflow metrics to analyze model performance.
● Share results: Share insights with other stakeholders
● Platform enablement and improvement
● Customer history and Sales data access
● Long-term platform and data pipeline planning
Organization building -- “Walk” stage
ML/AI Success
● Successful MVPs and production models in multiple business units
● Uniform testing standards are being established
Company
● Data initiatives being discussed at the executive level
● Business units pushing for data projects
● Emerging business champions for AI/ML
Team
● Data Science team(s) supporting multiple business units
● Integrations with software engineering for production
● Diversifying skill-sets for domain expertise
Scaling in a typical machine learning workflow
Workflow stages: Data Preparation, Feature Engineering, Model Training, Model Evaluation, Model Deployment, Model Tuning, Model Consumption
● Koalas
● Spark DataFrames
● Spark UDFs
● Larger instances
● GPUs
● Distributed training (Spark ML, HorovodRunner, etc.)
● Hyperopt
● MLflow
● Spark DataFrames & UDFs
● Jobs & Model Servers
● MLflow
Automate and reproduce wherever possible
Auto-logging for reproducibility
Reproduce Run feature
Reproducibility checklist:
✓ Code versioning
✓ Data versioning
✓ Cluster configuration
✓ Environment specification
Job scheduling in platform
Automation: schedule, alert, retry, API
Secure: IAM Passthrough | Cluster Policies | Table ACLs
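One way to read the reproducibility checklist is as a record stored with every run. The sketch below is an assumption-laden illustration, not the platform's Reproduce Run feature: `RunRecord`, its field names, and the sample values are all hypothetical. It captures the four checklist items and derives a fingerprint so that two runs can be compared.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run record covering the four checklist items."""
    code_version: str                                           # e.g. a Git commit hash
    data_version: str                                           # e.g. a table version or snapshot id
    cluster_config: Dict[str, str] = field(default_factory=dict)
    environment: Dict[str, str] = field(default_factory=dict)   # library -> pinned version

    def missing_items(self) -> List[str]:
        """Names of checklist items that were left empty."""
        return [name for name, value in asdict(self).items() if not value]

    def fingerprint(self) -> str:
        """Stable hash; equal fingerprints mean all four items match."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = RunRecord("3f2a9c1", "v12",
                  {"node_type": "gpu-small", "workers": "2"},
                  {"scikit-learn": "1.0.2"})
run_b = RunRecord("3f2a9c1", "v12",
                  {"workers": "2", "node_type": "gpu-small"},   # same config, different key order
                  {"scikit-learn": "1.0.2"})
incomplete = RunRecord("3f2a9c1", "")

print(run_a.fingerprint() == run_b.fingerprint())
print(incomplete.missing_items())
```

Serializing with sorted keys keeps the fingerprint independent of dictionary ordering, so only a real change in code, data, cluster, or environment produces a different hash.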
Your Existing Data Lake
Ingestion
Tables
Data Catalog
Feature Store
Azure Data Lake Storage
Amazon S3
Streaming
Batch
3rd Party Data Marketplace
Files
for Data Science and ML
● Schema enforced high quality data
● Optimized performance
● Full data lineage / governance
● Reproducibility through time travel
ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs
Infrastructure
Data Engineering Data Science
ML Engineer
Running example: ML-driven products
● Executive <> Data Science team alignment on data-driven initiatives
● Knowledge sharing across business units for ML-driven projects
● Education for business stakeholders to understand ML models and insights
● Scale up or out: Larger machines. Multiple GPUs. Distributed training.
● Improve modeling process: Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging.
● Schedule training and inference jobs: Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks.
● Integrate with data pipelines: Automate ingestion of new data for ML and output of ML insights for business/product
● Automate for downstream consumption: Integrate with 3rd-party tools and systems to export ML insights to business stakeholders
● Platform adoption by multiple business units
● Increased governance needs for platform, covering needs of more business units and personas
● Platform plays a key role in establishing best practices
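The "retries and alerts" part of job scheduling can be illustrated without any scheduler. This is a rough sketch under stated assumptions: `run_with_retries`, the two example jobs, and the list used as an alert channel are hypothetical stand-ins; a real platform scheduler would provide these behaviors as configuration.

```python
import time
from typing import Callable, List


def run_with_retries(job: Callable[[], str],
                     max_retries: int,
                     alert: Callable[[str], None],
                     backoff_seconds: float = 0.0) -> str:
    """Run `job`; retry on failure up to `max_retries` times, then alert and re-raise."""
    failures = 0
    while True:
        try:
            return job()
        except Exception as exc:
            failures += 1
            if failures > max_retries:
                alert(f"job failed after {failures} attempts: {exc}")
                raise
            time.sleep(backoff_seconds * failures)   # simple linear backoff


alerts: List[str] = []      # stands in for an email or chat alert channel
calls = {"n": 0}


def flaky_inference_job() -> str:
    """Hypothetical job that fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient storage error")
    return "scored"


def always_failing_job() -> str:
    """Hypothetical job with a permanent failure."""
    raise ValueError("schema mismatch")


result = run_with_retries(flaky_inference_job, max_retries=2, alert=alerts.append)
print(result, calls["n"], alerts)

try:
    run_with_retries(always_failing_job, max_retries=1, alert=alerts.append)
except ValueError:
    pass
print(alerts)
```

The alert fires only after the retry budget is exhausted, so transient failures stay quiet while permanent ones reach a person.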
Organization building -- “Run” stage
ML/AI Success
● Successful production models in multiple verticals
● Uniform testing standards established
● Program to grow citizen data scientists
Company
● Data initiatives are reported at the board level
● Data driven decision making across an organization
Team
● Multiple Data Science teams across verticals led by an AI executive
● Standard development and deployment processes for models
● COE across verticals
model lifecycle
Tracking: Parameters, Metrics, Artifacts, Metadata, Models
Models: Flavor 1, Flavor 2, Custom Models
Model Registry: v1, v2; Staging, Production, Archived; Data Scientists, Deployment Engineers
Serving (Model Deployment Options): In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions
Example of ML Ops
● Training: Create model version in the Model Registry
● Webhook for new model versions in staging triggers the Model Validation Job
● Model Validation Job: Comment with test results + transition request to production
● Email: ML Ops person receives email that transition request to production was made
● Approve new production model
● Webhook for new model version in production triggers the Production Batch Inference Job
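The ML Ops flow above (new version, validation in staging, transition request, approval, production inference) can be traced with a toy in-memory registry. Everything here is a hypothetical stand-in written for illustration: `ToyRegistry` is not a real registry client, the hooks imitate webhooks, a list imitates email, and archiving the previous production version on promotion is an assumption rather than something the slide states.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelVersion:
    version: int
    stage: str = "None"                 # "None", "Staging", "Production", or "Archived"
    comments: List[str] = field(default_factory=list)
    production_requested: bool = False


class ToyRegistry:
    """In-memory stand-in for a model registry; not a real registry client."""

    def __init__(self) -> None:
        self.versions: Dict[int, ModelVersion] = {}
        self.hooks: Dict[str, List[Callable[[ModelVersion], None]]] = {}
        self.outbox: List[str] = []     # stands in for email to the ML Ops person

    def on_stage(self, stage: str, hook: Callable[[ModelVersion], None]) -> None:
        """Register a webhook-like callback for versions entering `stage`."""
        self.hooks.setdefault(stage, []).append(hook)

    def create_version(self) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1)
        self.versions[mv.version] = mv
        return mv

    def transition(self, mv: ModelVersion, stage: str) -> None:
        if stage == "Production":
            # Assumption: promoting a version archives the current production one.
            for other in self.versions.values():
                if other.stage == "Production":
                    other.stage = "Archived"
        mv.stage = stage
        for hook in self.hooks.get(stage, []):
            hook(mv)

    def request_production(self, mv: ModelVersion, comment: str) -> None:
        mv.comments.append(comment)
        mv.production_requested = True
        self.outbox.append(f"Transition request to Production for v{mv.version}")

    def approve(self, mv: ModelVersion) -> None:
        if not mv.production_requested:
            raise ValueError("no pending transition request")
        self.transition(mv, "Production")


registry = ToyRegistry()
scored_with: List[int] = []


def validation_job(mv: ModelVersion) -> None:
    # Placeholder check; a real job would evaluate the model on held-out data.
    registry.request_production(mv, "validation passed")


def batch_inference_job(mv: ModelVersion) -> None:
    scored_with.append(mv.version)


registry.on_stage("Staging", validation_job)
registry.on_stage("Production", batch_inference_job)

for _ in range(2):                      # two training runs, each approved by a human
    mv = registry.create_version()
    registry.transition(mv, "Staging")
    registry.approve(mv)

print([(v.version, v.stage) for v in registry.versions.values()])
print(scored_with, registry.outbox)
```

Keeping approval as an explicit call separate from validation mirrors the human sign-off step in the flow: automation proposes, a person approves.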
Modes of deployment
Mode | Latency | Cost
Batch | Minutes | Low
Streaming | Sec - Min | Low - Med
REST API | < 1 Sec | High
Embedded | varies | varies
Other diagram labels: Model training, Model Tracking and Registry, Delta Lake / Feature Store, BI tools
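The latency and cost table can double as a lookup when choosing a deployment mode. The sketch below copies the table's values as-is; the ranking of latency classes and the helper `modes_meeting` are assumptions added for illustration, and "Embedded" is treated as a case-by-case option because the table lists it as "varies".

```python
from typing import Dict, List, NamedTuple


class Mode(NamedTuple):
    latency: str
    cost: str


# Values copied from the table above.
DEPLOYMENT_MODES: Dict[str, Mode] = {
    "Batch": Mode("Minutes", "Low"),
    "Streaming": Mode("Sec - Min", "Low - Med"),
    "REST API": Mode("< 1 Sec", "High"),
    "Embedded": Mode("varies", "varies"),
}

# Assumed ordering of the latency classes, fastest first.
LATENCY_RANK = {"< 1 Sec": 0, "Sec - Min": 1, "Minutes": 2}


def modes_meeting(required_latency: str) -> List[str]:
    """Modes whose listed latency is at least as fast as `required_latency`.

    "Embedded" has no fixed latency in the table, so it is never ranked and is
    always appended last as an option to evaluate case by case.
    """
    limit = LATENCY_RANK[required_latency]
    ranked = [name for name, mode in DEPLOYMENT_MODES.items()
              if mode.latency in LATENCY_RANK and LATENCY_RANK[mode.latency] <= limit]
    return ranked + ["Embedded"]


print(modes_meeting("Minutes"))     # every ranked mode is fast enough
print(modes_meeting("< 1 Sec"))     # only the lowest-latency mode qualifies
```

When several modes qualify, the table suggests picking the cheapest one that meets the latency need, which is why Batch remains the default for reporting-style workloads.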
Repeatable Data Science lifecycle
Business understanding
Executive sponsorship
Center of Excellence for DS & ML
End user feedback
Metric discussions and KPIs
Business value realization
Exploratory data analysis
Data ingestion and preparation
Model deployment and automation
ML modeling
Model monitoring and feedback
ML and Data platform and pipeline integration
Simple onboarding process for new teams and use cases
Data and resource sharing and governance
Standard handoff process for production jobs
Sharable documentation and usage education
Resources to learn more
Related talks and blogs
▪ Building Machine Learning Platforms Webinar
▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features
Customer success stories
▪ Comcast, Starbucks, H&M
▪ Searchable customer stories
Databricks
▪ Data science and machine learning product page
▪ Managed MLflow product page