State of enterprise data science

Copyright 2018 © Qubole
STATE OF ENTERPRISE DATA SCIENCE
David Roe
Pradeep Reddy

Birth of Data Science in Life Sciences
Cholera Outbreak in 1854; London
- Prevailing Theory: Miasma Theory (Cholera was caused by bad air)
- Dr John Snow refuted the Miasma Theory by marking on a map of London the locations of all known fatal cases of cholera. This marked the birth of “Epidemiology”.
- Reference: The Ghost Map by Steven Johnson

INTRODUCTION
OVERVIEW OF STATE OF DATA SCIENCE TODAY
- KEY TRENDS
- CURRENT PROBLEMS
DATA SCIENCE WORKFLOW IN MODERN ARCHITECTURE
- INSIGHTS FROM 2018 BIG DATA ACTIVATION REPORT
- HOW COMPANIES ARE BECOMING SUCCESSFUL
DEMO OF ML IMPLEMENTATION WITH HADOOP AND SPARK
- END-TO-END BATCH PIPELINE
- MODEL OUTPUTS & VISUALIZATION

The transformational promise of Data Science projects remains elusive
85% of Data Science projects fail to meet expectations
>70% of Analytics potential value is unrealised

Data Science can be successful with a Modern Data Architecture that:
- scales to allow your models to train against production data
- enables you to iterate and prototype quickly
- provides you with a solid hand-off from training to production

COMMON MACHINE LEARNING DATAFLOW

- Data Preparation. Tasks: wrangling, exploration, validation
- Model Build. Tasks: split data, model specification, feature selection
- Model Validation. Tasks: train, visualize, compare / choose models, model report
- Deploy & Monitor. Tasks: build, compile/JAR, reporting dashboard, monitor

Question 1:
How many of you do Big Data and/or Data Science in the Cloud?

QUBOLE BIG DATA ACTIVATION STACK
Users: Data Scientists, Data Engineers, Analysts (each with Third-Party Tools)
Qubole Cloud-Native Big Data Activation Platform: Autoscale, Caching, Spot Buying, AIR, Serverless, Monitoring, …
Cloud Data Lake

AUTOSCALING BIG DATA ENGINES IN CLOUD

DATA SCIENCE REQUIRES SCALABLE BIG DATA
DATA CLOUD
50% savings in cloud spend
1:65 DataOps : Users
10X increase in IoT data

STATE OF BIG DATA ADOPTION
1ST STAGE: ASPIRATION
• Production reporting/DW
• Researching
2ND STAGE: EXPERIMENTATION
• Initial Big Data Deployment
• Targeted use case
3RD STAGE: EXPANSION
• Multiple departments
• Multiple engines
• Top down use cases
4TH STAGE: INVERSION
• Enterprise transformation
• Bottom up use cases
5TH STAGE: NIRVANA
• Digital enterprise
• Ubiquitous insights
• True business transformation

MACHINE LEARNING WORKFLOW IS A PRODUCT LIFECYCLE
1ST STAGE: BUSINESS VALUE
• Identifying stakeholders
• Product roadmap
2ND STAGE: EXPERIMENTATION
• Data Exploration
• Initial Big Data deployment
• Targeted use case
3RD STAGE: DEVELOPMENT
• Multiple Departments
• Model training
• Multiple engines & deployments
• Top Down Use Cases
4TH STAGE: PRODUCTION
• Enterprise transformation
• Bottom up use cases
5TH STAGE: Continuous Integration / Delivery (CI/CD)
• Digital enterprise
• Measuring impact
• True business transformation

Data Science Workflow - Team Data Science Process (TDSP)
Source: Microsoft Azure
“Data that is loved
tends to survive.”
Kurt Bollacker,
Distinguished Data
Scientist

Question 2:
How many of you have Big Data in the cloud?

Other Data Science/Data Mining Process Models
Source: http://www.gmelli.org/RKB/SEMMA_Process_Model
Source: https://www.stellarconsulting.co.nz/blog/data/crisp-dm-still-a-leader/

ENABLING DATA SCIENCE WORKFLOW
Personas Access Use Cases Engines Cloud
Data Engineering
Data Science
Data Analysts
Machine Learning
Campaign Reports
Email analytics
Fraud detection
Presto
Spark
Hive
TensorFlow
AirFlow
AWS
GCP
Marketing
Revenue
Management
Finance
Commercial teams
● Data Science teams are able to scale their products individually (rather than having one shared multi-tenant environment)
● Saw immediate cost savings on existing cloud investments, which allowed the company to focus on R&D
● Able to go to market with new Data Science products in 1-3 months
● Mitigate SLA delays on analytics reports
OUTCOMES

How did they do it?
Send email to request data to tag
Attachment with untagged data
Upload tagged data
Cloud data lake
Rollup tagged data
Train Model
Internal customer data
Email data
classified by
campaign type
Extract email text and join with tagged data
Hive Table &
Dashboards
Browse Campaign
Product
AUTOMATED EMAIL CAMPAIGN CLASSIFICATION
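
The flow on this slide (extract email text, join it with the manually tagged data, train a model, then classify emails by campaign type) can be illustrated with a minimal sketch. The slide does not say which model or libraries were used, so this uses a small multinomial naive Bayes classifier written against the Python standard library as a stand-in. The email IDs, texts and campaign labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample data: email text keyed by email ID, and tags uploaded by reviewers.
emails = {
    "e1": "Save 20 percent on your next order this weekend",
    "e2": "Your invoice for March is attached",
    "e3": "Flash sale: save big on shoes today",
    "e4": "Payment received, invoice attached for your records",
    "e5": "Weekend sale, save on every order",
}
tags = {"e1": "promotion", "e2": "billing", "e3": "promotion", "e4": "billing"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def join_tagged(emails, tags):
    """Join extracted email text with the tagged data (inner join on email ID)."""
    return [(tokenize(emails[eid]), label) for eid, label in tags.items() if eid in emails]


def train(rows):
    """Count words per label; these counts are the naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in rows:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(join_tagged(emails, tags))
predicted = classify(emails["e5"], model)  # e5 has no tag yet
print(predicted)
```

In a real deployment the joined data would come from the cloud data lake and the predictions would be written back to a Hive table, as the slide indicates; those steps are omitted here.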

How did they do it?
KEY CHARACTERISTICS OF
DATA-DRIVEN ORGANIZATIONS

TYPICAL DATA LAKE OPERATION
AVRO AVRO
Raw
(Staged)
Derived ‘Source of
Truth’
PARQUET
Hive / Spark Hive / Spark
Insert/Update/Delete
Export CSV JSON
Analytic Data Warehouse (e.g. Redshift & Snowflake environments)
Data Serving DBs (e.g. Cassandra, DynamoDB, etc.)
SPARK
PRESTO Interactive
ad-hoc queries
Use
Cases
Analytics (e.g. Product Analytics, BI, User insights etc.)
Data Products (e.g. Personalisation, Recommendation etc.)
Data Science (e.g. Time-series Analysis, Research etc.)
Data
Discovery
ML & DL
Cloud
Compute
Object
Storage

ON-PREMISE DATA SCIENCE APPROACH VS. CLOUD
On-premise (DATA LOCALITY)
• Impossible to scale storage without scaling compute, leading to expensive deployments
• Difficult to share HDFS data across Operating Units
• Higher Upfront Cost
• No Autoscaling
• Having to Fit Models in Fixed Infrastructure
• Fewer DS Tools
Cloud Object Store (NO DATA LOCALITY)
• Compute & Storage Separate
• Data is easily shared across Operating Units & accessed from different locations
• Lower Cost
• More Iterative
• Scalable with Automation
• Fast Data and ML Tool Access

How did they do it?
STATE OF BIG DATA TODAY

2018 QUBOLE BIG DATA ACTIVATION REPORT
Download a copy of the Qubole
2018 Big Data Activation Report at
https://go.qubole.com/CA-WP-BigDataIndexReport_NewLP.html
This in-depth research is based
on anonymised insights from
more than 200 global Qubole
customers.

THE ‘BIG THREE’ OPEN SOURCE ENGINES
Characteristics and strengths
Apache Hive/Hadoop
Workhorse for handling massive volumes of data for ETL, ELT or data preparation on structured and semi-structured information
Apache Spark
Powerful for processing complex and memory-intensive workflows such as creating data pipelines or implementing machine learning
Presto
Shines in interactive analytics (business intelligence (BI), data discovery tools) when data is in a semi-structured or structured form

Question 3:
How many of you use multiple big data engines (Hive, Spark & Presto)?

SINGLE VS. MULTIPLE OPEN SOURCE ENGINES
Percentage of companies who use single vs. multiple big data engines
Companies are increasingly deploying multiple engines to solve specific use cases
Multi Engine 75.9%, Single Engine 24.1%
Multi Engine 86%, Single Engine 14%

MEASURING EFFICIENCY BY COMMANDS
YOY Growth in Total No. of Commands Run
439% Apache Spark
365% Presto
129% Apache Hadoop/Hive
24x more commands run per hour in Presto than Apache Spark
6x more commands than Apache Hadoop/Hive
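
As a reading aid, a year-over-year growth percentage converts to a multiplier as 1 + growth/100, so 439% growth means roughly 5.4 times as many commands as the year before. The short sketch below applies that arithmetic to the three growth figures on this slide. The function name is illustrative and the baseline of 1,000 commands is a made-up number, not data from the report.

```python
def growth_multiplier(yoy_growth_percent):
    """Convert a year-over-year growth percentage into a multiplier."""
    return 1 + yoy_growth_percent / 100


# Growth figures as printed on the slide (percent).
yoy_growth = [439, 365, 129]

# Hypothetical baseline: 1,000 commands run in the previous year.
baseline = 1000

for pct in yoy_growth:
    multiplier = growth_multiplier(pct)
    print(f"{pct}% growth -> {multiplier:.2f}x -> {round(baseline * multiplier)} commands")
```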

3 MUST-HAVES
Movement to Multi-Engine
Companies are increasingly deploying multiple open source engines for different use cases (ML, ETL, analytics, etc.)
Users Getting More Access
More users have access to data and are running more commands and collaborating
Cloud Benefits Recognized
Companies are leveraging multiple clouds and automation

How did they do it?
Customer Churn Model Demo

Data Science Notebooks
What are they?
Notebooks are like lab books from high school science, but with a Harry Potter twist. Like the animated images printed in the Daily Prophet, the code in a notebook can be executed and results displayed as part
Purpose:
• Collaboration Suite for Data Science
projects
• Easy access to computing resources
for data science workloads.
• Building blocks that enable self
service data mining.
• Supports a variety of languages like
R, Python and Scala.

Question 4:
How many of you use Data Science
Notebooks for Collaboration?

ML Example: Scalable Data Science
Data Science Customer Churn Overview:
1. Ingest Telco Churn Dataset (ETL)
2. Refine/curate features and labels (ETL), often referred to as feature engineering.
3. Split dataset into test & train samples (70-30 or 60-40 splits)
4. Create multiple 3-stage ML pipelines for various models (e.g. logistic regression, gradient boosting, random forests)
5. Run the pipelines defined above to train on predicting the churn response variable.
6. Plotly visualizations for model comparison/validation, scoring & selection
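
The split, train and compare steps in this overview can be sketched in a few lines. This is not the demo notebook, which runs on Spark: it is a standard-library Python illustration, with a majority-class baseline and a hand-written logistic regression standing in for the Spark ML pipelines. The dataset is synthetic (not the Telco churn data), and the feature names and churn rule are invented for illustration.

```python
import math
import random


def make_dataset(n, seed=0):
    """Generate hypothetical (features, churned) rows; not the Telco dataset."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tenure = rng.uniform(0, 1)      # scaled time as a customer
        complaints = rng.uniform(0, 1)  # scaled number of complaints
        # Invented rule: short tenure plus many complaints makes churn likely.
        p = 1 / (1 + math.exp(-(4 * complaints - 4 * tenure)))
        rows.append(([tenure, complaints], 1 if rng.random() < p else 0))
    return rows


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them, e.g. 70-30."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class MajorityBaseline:
    """Always predicts the most common training label."""

    def fit(self, rows):
        ones = sum(label for _, label in rows)
        self.label = 1 if ones * 2 >= len(rows) else 0
        return self

    def predict(self, features):
        return self.label


class LogisticRegression:
    """Plain logistic regression fitted by batch gradient descent."""

    def __init__(self, lr=0.5, epochs=300):
        self.lr, self.epochs = lr, epochs

    def fit(self, rows):
        n_features = len(rows[0][0])
        self.w, self.b = [0.0] * n_features, 0.0
        for _ in range(self.epochs):
            grad_w, grad_b = [0.0] * n_features, 0.0
            for x, y in rows:
                err = self._prob(x) - y
                grad_b += err
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
            self.b -= self.lr * grad_b / len(rows)
            self.w = [w - self.lr * g / len(rows) for w, g in zip(self.w, grad_w)]
        return self

    def _prob(self, x):
        z = self.b + sum(w * xj for w, xj in zip(self.w, x))
        return 1 / (1 + math.exp(-z))

    def predict(self, features):
        return 1 if self._prob(features) >= 0.5 else 0


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


train_rows, test_rows = train_test_split(make_dataset(500), train_fraction=0.7)
candidates = {"baseline": MajorityBaseline(), "logistic": LogisticRegression()}
scores = {name: accuracy(m.fit(train_rows), test_rows) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Keeping every candidate behind the same `fit`/`predict` interface is what makes the comparison loop a one-liner; adding another model only means adding another entry to `candidates`.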

ML Customer Churn Pipeline

Sign up at www.qubole.com

Appendix: Instructions to Download the Demo Notebook
• Sign up for a free Qubole account on Azure ( www.qubole.com ). This gives you 14 days of free access to try Hive, Spark, Presto & Airflow on Qubole.
• Once signed up, navigate to “Notebooks” in the Home menu in the top-left corner.
• Click New, then Import from URL, and enter the URL below:
• https://goo.gl/ENTqo2
• Once the notebook is imported, you can start the cluster from the notebook and explore it.
