Building Data Science into Organizations: Field Experience

Chris Robison
Joseph Bradley
Data + AI Summit 2021

Our perspectives
Joseph Bradley
● Sr. Solutions Architect
● 2nd ML Engineer at Databricks
● Apache Spark committer and PMC member
Chris Robison
● Sr. Solutions Architect
● Former Director of Data Science and Omni-channel Marketing at Overstock.com
● Career data scientist and avid Apache Spark user

The Data and AI Company
5000+ customers across the globe
Lakehouse: One simple platform to unify all of your data, analytics, and AI workloads
ORIGINAL CREATORS

So you want to do Data Science...
98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives.
14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production.
Source: NewVantage Partners

Goals of a DS/ML/AI program
Short-term
● Validate that DS is worthwhile
● Get resources:
○ Data
○ Data Scientists
○ Executive sponsorship
● Show vision
Long-term
● Show business impact
● Increase productivity
● Scale DS across the organization

Challenges of a DS/ML/AI program
Technology and platform
● Poor integration between Data Science and other data teams
● Planning for scale and production, under investment constraints
Organization
● Team building: skill sets, hiring, and training
● Team organization: embedded vs. standalone
● Business and executive alignment
● R&D

Set up reliable and efficient production processes.
Scale and automate DS/ML workloads.
Use popular tools. Emphasize productivity.
Platform
Improve executive visibility and cross-team integration.
Build communication channels.
Think about products.
Get quick wins. Plan for the future.
Philosophy
Strategy
Organization
Crawl Walk Run
Embed Data Science in the organization’s DNA.
Reproduce end-to-end and across multiple verticals.

Execution
Use agile processes for data science
● Iterate with sprints and standups
● Fail fast in R&D
Transparency is key
● Communicate frequently to your business partners and executives
● Make business partners and consumers an integral part of process
Collaborate with the data and platform teams
● Make your needs known and understood
● Beware shortcuts which build technical debt

Organization building -- “Crawl” stage
ML/AI Success
● Successful MVPs with a few models manually in production
● Starting to build an AI/ML Strategy
● In discovery phase for new projects and low-hanging fruit
Company
● Desire to become data-driven
● Smaller in size (startup) or an existing organization with new data initiatives
Team
● 1-2 Data Scientists (likely) reporting to a CTO
● Acting as full-stack data scientists
● Typically a math or computer science background

Keep using familiar tools
Common tools          Descriptions
Notebooks and IDEs    Python notebooks, RStudio, local IDEs
Languages             Python, R -- and potentially SQL, Scala, Java, etc.
ML libraries          Standard libraries, plus bring-your-own libraries and versions
Git                   Notebook versioning, and syncing across platforms with Git
Data                  Pandas, Spark, Koalas; any data sources or formats
Visualization         Matplotlib, Plotly, Seaborn, etc.
Integrations          Platforms must integrate with any libraries, systems, or services.
Platforms which are cloud-native and have both UIs and APIs are ideal.

Build around OSS standards for portability
Downloads per month: 990K, 350K, 1.7M, 516K

Be more productive with self-service analytics
Compute resources
● Start up machines or clusters on demand
Libraries and environment
● Plug & play environments with popular ML libraries
● And customization: requirements.txt, conda.yaml
Cost controls: Autoscaling, auto-termination, spot instances, cost tracking
Governance: Cluster policies for enforcement
Option 1: Use your own cluster
Option 2: Share clusters, with separate Python env per user or project.
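The cost controls and governance items on this slide can be pictured as a policy check applied to each cluster request. The following is a minimal sketch and not any platform's API: the `ClusterRequest` and `ClusterPolicy` types and all of their field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterRequest:
    """A user's request for compute (hypothetical fields)."""
    min_workers: int
    max_workers: int             # autoscaling ceiling
    auto_terminate_minutes: int  # 0 means "never terminate"
    use_spot_instances: bool


@dataclass(frozen=True)
class ClusterPolicy:
    """Limits an administrator wants enforced (hypothetical fields)."""
    max_workers_allowed: int
    max_idle_minutes: int
    require_spot_instances: bool


def policy_violations(request: ClusterRequest, policy: ClusterPolicy) -> List[str]:
    """Return human-readable violations; an empty list means the request passes."""
    violations = []
    if request.min_workers > request.max_workers:
        violations.append("min_workers exceeds max_workers")
    if request.max_workers > policy.max_workers_allowed:
        violations.append("autoscaling ceiling is above the policy limit")
    if request.auto_terminate_minutes == 0:
        violations.append("auto-termination is disabled")
    elif request.auto_terminate_minutes > policy.max_idle_minutes:
        violations.append("auto-termination timeout is above the policy limit")
    if policy.require_spot_instances and not request.use_spot_instances:
        violations.append("spot instances are required")
    return violations
```

Returning a list of violations, rather than failing on the first one, lets a self-service user fix every problem with a request in one pass.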

Running example: ML prioritization of Sales opps
Platform enablement and improvement
Customer history and Sales data access
Long-term platform and data pipeline planning
Get an ML workspace: Simple machine or cluster creation. Ready-to-go DS environments.
Sync code: Import .py or .ipynb notebook, and sync with Git.
Load data: Efficient data loading from S3, ADLS, etc.
Develop DL model: Use notebooks + TensorBoard for interactive development.
Analyze results: Review auto-logged MLflow metrics to analyze model performance.
Share results: Share insights with other stakeholders
Discussion with Sales stakeholders to understand the problem and data, and to set expectations
Explanation of results and future potential to Sales
Build executive alignment and buy-in for long-term initiatives
DS team training and hiring

Organization building -- “Walk” stage
ML/AI Success
● Successful MVPs and production models in multiple business units
● Uniform testing standards are being established
Company
● Data initiatives being discussed at the executive level
● Business units pushing for data projects
● Emerging business champions for AI/ML
Team
● Data Science team(s) supporting multiple business units
● Integrations with software engineering for production
● Diversifying skill-sets for domain expertise

Scaling in a typical machine learning workflow
Data Preparation
Feature Engineering
Model Training
Model Evaluation
Model Deployment
Model Tuning
Model Consumption
● Koalas
● Spark DataFrames
● Spark UDFs
● Larger instances
● GPUs
● Distributed training (Spark ML, HorovodRunner, etc.)
● Hyperopt
● MLflow
● Spark DataFrames & UDFs
● Jobs & Model Servers
● MLflow

Auto-logging for reproducibility
Reproducibility checklist:
✓ Code versioning
✓ Data versioning
✓ Cluster configuration
✓ Environment specification
Reproduce Run feature
Job scheduling in platform
Automation: schedule, alert, retry, API
Automate and reproduce wherever possible
Secure: IAM Passthrough | Cluster Policies | Table ACLs
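The four checklist items (code versioning, data versioning, cluster configuration, environment specification) can be treated as fields that every run must record. The sketch below is an illustration only, not the platform's auto-logging: `RunRecord`, `missing_items`, and `fingerprint` are hypothetical names, and the example values in the comments are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    """What was captured for one training run (all field names are hypothetical)."""
    code_version: Optional[str] = None            # e.g. a Git commit hash
    data_version: Optional[str] = None            # e.g. a table version or snapshot ID
    cluster_config: Optional[Dict[str, object]] = None
    environment_spec: Optional[List[str]] = None  # e.g. pinned package requirements


def missing_items(record: RunRecord) -> List[str]:
    """List the checklist items that were not captured for this run."""
    return [name for name, value in asdict(record).items() if not value]


def fingerprint(record: RunRecord) -> str:
    """Stable hash of the record, so two runs can be compared for identical setup."""
    payload = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A run whose `missing_items` list is empty has everything the checklist asks for; two runs with the same fingerprint were launched from the same recorded code, data, cluster, and environment.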

Your Existing Data Lake
Ingestion
Tables
Data Catalog
Feature Store
Azure Data Lake Storage
Amazon S3
Streaming
Batch
3rd Party Data
Marketplace
Files
for Data Science and ML
● Schema-enforced, high-quality data
● Optimized performance
● Full data lineage / governance
● Reproducibility through time travel
ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs
Infrastructure
Data Engineering Data Science
ML Engineer

Running example: ML-driven products
Scale up or out: Larger machines. Multiple GPUs. Distributed training.
Schedule training and inference jobs: Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks.
Automate for downstream consumption: Integrate with 3rd-party tools and systems to export ML insights to business stakeholders
Integrate with data pipelines: Automate ingestion of new data for ML and output of ML insights for business/product
Improve modeling process: Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging.
Executive <> Data Science team alignment on data-driven initiatives
Knowledge sharing across business units for ML-driven projects
Education for business stakeholders to understand ML models and insights
Platform adoption by multiple business units
Increased governance needs for platform, covering needs of more business units and personas
Platform plays a key role in establishing best practices
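The "retries and alerts" part of job scheduling can be sketched in a few lines. This is a generic illustration, not a scheduler or platform API: `run_with_retries` and its parameters are hypothetical, and scheduling itself (when the job starts) is left out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_retries: int,
    alert: Callable[[str], None],
    delay_seconds: float = 0.0,
) -> T:
    """Run `job`; alert on every failure, retry up to `max_retries` times, then re-raise."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as error:
            alert(f"attempt {attempt}/{attempts} failed: {error}")
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay_seconds)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `alert` as a callable keeps the sketch independent of the notification channel; in practice it could send an email or post to a chat channel.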

Organization building -- “Run” stage
ML/AI Success
● Successful production models in multiple verticals
● Uniform testing standards established
● Program to grow citizen data scientists
Company
● Data initiatives are reported at the board level
● Data-driven decision making across an organization
Team
● Multiple Data Science teams across verticals led by an AI executive
● Standard development and deployment processes for models
● COE across verticals

Model lifecycle
Data Scientists, Deployment Engineers
Tracking: Parameters, Metrics, Artifacts, Metadata, Models
Models: Flavor 1, Flavor 2, Custom Models
Model Registry: v1, v2; Staging, Production, Archived
Serving / Deployment Options: In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions
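The version and stage labels on this slide (v1, v2; Staging, Production, Archived) suggest a small state model. The sketch below is an in-memory stand-in, not a real registry API: `ToyModelRegistry` is hypothetical, the initial "None" stage is an assumption not shown on the slide, and archiving the previous production version on promotion is one possible rule rather than the required one.

```python
from enum import Enum
from typing import Dict, Optional


class Stage(Enum):
    NONE = "None"            # assumed starting stage; not shown on the slide
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


class ToyModelRegistry:
    """In-memory stand-in for a model registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._stages: Dict[int, Stage] = {}

    def create_version(self) -> int:
        """Register a new model version (1, 2, ...) and return its number."""
        version = len(self._stages) + 1
        self._stages[version] = Stage.NONE
        return version

    def transition(self, version: int, stage: Stage) -> None:
        """Move a version to `stage`; promoting to Production archives the old one."""
        if version not in self._stages:
            raise KeyError(f"unknown model version: {version}")
        if stage is Stage.PRODUCTION:
            current = self.production_version()
            if current is not None and current != version:
                self._stages[current] = Stage.ARCHIVED
        self._stages[version] = stage

    def stage_of(self, version: int) -> Stage:
        """Return the current stage of a version."""
        return self._stages[version]

    def production_version(self) -> Optional[int]:
        """Return the version currently in Production, if any."""
        for version, stage in self._stages.items():
            if stage is Stage.PRODUCTION:
                return version
        return None
```

Keeping at most one version in Production means downstream jobs can always ask the registry for "the production model" without knowing version numbers.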

Example of ML Ops
Training
Model Validation Job
Production Batch Inference Job
Email
Create model version
Webhook for new model versions in staging
Comment with test results + transition request to production
Webhook for new model version in production
ML Ops person receives email that transition request to production was made
Approve new production model
Model Registry
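The webhook-driven flow on this slide can be sketched as handlers wired to events. This is a toy, in-process illustration rather than a real webhook API: the `EventBus` class, the event names, and the `accuracy` threshold used as the validation check are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Tiny in-process stand-in for registry webhooks (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event: str, handler: Handler) -> None:
        """Register a handler to run whenever `event` is emitted."""
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: Dict[str, Any]) -> None:
        """Call every handler registered for `event`, in registration order."""
        for handler in self._handlers[event]:
            handler(payload)


def build_pipeline(bus: EventBus, log: List[str], min_accuracy: float) -> None:
    """Wire up the steps from the slide; event names are made up for this sketch."""

    def validate(payload: Dict[str, Any]) -> None:
        # Model validation job: triggered when a new version reaches staging.
        passed = payload["accuracy"] >= min_accuracy
        outcome = "passed" if passed else "failed"
        log.append(f"validation job: v{payload['version']} {outcome}")
        if passed:
            bus.emit("transition_requested", payload)

    def notify(payload: Dict[str, Any]) -> None:
        # Email to the ML Ops person that a transition request was made.
        log.append(f"email: approve v{payload['version']} for production?")

    def run_batch_inference(payload: Dict[str, Any]) -> None:
        # Production batch inference job: triggered by the production webhook.
        log.append(f"batch inference job: scoring with v{payload['version']}")

    bus.on("version_in_staging", validate)
    bus.on("transition_requested", notify)
    bus.on("version_in_production", run_batch_inference)
```

In this sketch the human approval step is represented by whoever emits `version_in_production`; nothing reaches the batch inference job until that event fires.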

Modes of deployment
Model training
Model Tracking and Registry
Delta Lake / Feature Store
BI tools
Mode         Latency      Cost
Batch        Minutes      Low
Streaming    Sec - Min    Low - Med
REST API     < 1 Sec      High
Embedded     varies       varies
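The latency and cost trade-off on this slide can be turned into a rough rule of thumb. The function below is a sketch under stated assumptions: `suggest_deployment_mode` is a hypothetical helper, the 1-second and 60-second cutoffs are my reading of "< 1 Sec", "Sec - Min", and "Minutes", and the embedded mode is left out because the slide lists its latency and cost as "varies".

```python
def suggest_deployment_mode(max_latency_seconds: float) -> str:
    """Suggest the lowest-cost mode from the slide that fits a latency budget.

    The cutoffs (1 s and 60 s) are assumptions, not figures from the slide.
    """
    if max_latency_seconds <= 0:
        raise ValueError("latency budget must be positive")
    if max_latency_seconds < 1:
        return "REST API"   # sub-second responses, highest cost
    if max_latency_seconds < 60:
        return "Streaming"  # seconds to minutes, low to medium cost
    return "Batch"          # minutes or more, lowest cost
```

For example, a nightly scoring job with an hours-long budget maps to batch, while an interactive product feature that must answer within a fraction of a second maps to a REST API.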

Repeatable Data Science lifecycle
Business understanding
Executive sponsorship
Center of Excellence for DS & ML
End user feedback
Metric discussions and KPIs
Business value realization
Exploratory data analysis
Data ingestion and preparation
Model deployment and automation
ML modeling
Model monitoring and feedback
ML and Data platform and pipeline integration
Simple onboarding process for new teams and use cases
Data and resource sharing and governance
Standard handoff process for production jobs
Sharable documentation and usage education

Resources to learn more
Related talks and blogs
▪ Building Machine Learning Platforms Webinar
▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features
Customer success stories
▪ Comcast, Starbucks, H&M
▪ Searchable customer stories
Databricks
▪ Data science and machine learning product page
▪ Managed MLflow product page
