SlideShare a Scribd company logo
1Copyright 2018 © Qubole
STATE OF ENTERPRISE DATA
SCIENCE
David Roe
Pradeep Reddy
2Copyright 2018 © Qubole
Birth of Data Science in Life Sciences
Cholera Outbreak in 1854; London
- Prevailing Theory: Miasma Theory (Cholera was caused by bad air)
- Dr John Snow refuted Miasma Theory and came up with an idea to mark on a map of London the locations of all known
cases of cholera that led to death. This marked the birth of “Epidemiology
- Reference: The Ghost Map by Steven Johnson
3Copyright 2018 © Qubole
INTRODUCTION
OVERVIEW OF STATE OF DATA SCIENCE TODAY
- KEY TRENDS
- CURRENT PROBLEMS
DATA SCIENCE WORKFLOW IN MODERN ARCHITECTURE
- INSIGHTS FROM 2018 BIG DATA ACTIVATION REPORT
- HOW COMPANIES ARE BECOMING SUCCESSFUL
DEMO OF ML IMPLEMENTATION WITH HADOOP AND SPARK
- END-TO-END BATCH PIPELINE
- MODEL OUTPUTS & VISUALIZATION
4Copyright 2018 © Qubole
The transformational promise of
Data Science projects remain elusive
85%of Data Science
projects fail to
meet expectations
>70%of Analytics
potential value is
unrealised
Copyright 2018 © Qubole
5Copyright 2018 © Qubole
Data Science can be successful with
Modern Data Architecture -
that scales to
allow your models
to train against
production data
enables you to
iterate and
prototype quickly
Copyright 2018 © Qubole
provides you with a
solid hand-off from
training to
production
6Copyright 2018 © Qubole
COMMON MACHINE LEARNING DATAFLOW
7Copyright 2018 © Qubole
Copyright 2018 © Qubole
Data Preparation Model Build Model Validation Deploy & Monitor
Tasks: wrangling,
exploration, validation
Tasks: split data, model
specification, feature
selection
Tasks: Train, Visualize,
compare / choose models,
model report
Tasks: build, compile/JAR,
reporting dashboard,
monitor
8Copyright 2018 © Qubole
Question 1:
How many of you do Big Data and/or
Data Science in the Cloud
9Copyright 2018 © Qubole
QUBOLE BIG DATA ACTIVATION STACK
Copyright 2018 © Qubole
Data Scientists
Third-Party
Tools
Data Engineers
Third-Party
Tools
Analysts
Third-Party
Tools
Qubole Cloud-Native Big Data Activation Platform
Autoscale Caching
Spot
Buying
AIR Serverless Monitoring
…
Cloud Data Lake
10Copyright 2018 © Qubole
AUTOSCALING BIG DATA ENGINES IN CLOUD
11Copyright 2018 © Qubole
DATA SCIENCE REQUIRES SCALABLE BIG DATA
DATA CLOUD
50%savings in
cloud spend
1:65DataOps : Users
10Xincrease in
IoT data
Copyright 2017 © Qubole
STATE OF BIG DATA ADOPTION
Copyright 2018 © Qubole
• Production
reporting/DW
• Researching
• Initial Big Data
Deployment
• Targeted use
case
• Multiple
departments
• Multiple engines
• Top down use
cases
• Enterprise
transformation
• Bottom up use
cases
• Digital enterprise
• Ubiquitous insights
• True business
transformation
ASPIRATION
1ST
STAGE
EXPERIMENTATION
2ND
STAGE
EXPANSION
3RD
STAGE
INVERSION
4TH
STAGE
NIRVANA
5TH
STAGE
Copyright 2017 © Qubole
MACHINE LEARNING WORKFLOW IS A PRODUCT
LIFECYCLE
Copyright 2018 © Qubole
BUSINESS
VALUE
EXPERIMENTATION DEVELOPMENT PRODUCTION Continuous
Integration /
Delivery (CI/CD)
• Identifying
stakeholders
• Product
roadmap
• Data Exploration
• Initial Big Data
deployment
• Targeted use
case
• Multiple Departments
• Model training
• Multiple engines &
deployments
• Top Down Use
Cases
• Enterprise
transformation
• Bottom up use
cases
• Digital enterprise
• Measuring impact
• True business
transformation
1ST
STAGE
2ND
STAGE
3RD
STAGE
4TH
STAGE
5TH
STAGE
14Copyright 2018 © Qubole
Data Science Workflow - Team Data Science
Process(TDSP)
Source: Microsoft Azure
“Data that is loved
tends to survive.”
Kurt Bollacker,
Distinguished Data
Scientist
15Copyright 2018 © Qubole
Question 2:
How many of you use have Big data in
the cloud?
16Copyright 2018 © Qubole
Other Data Science/Data Mining Process Models
Source: http://www.gmelli.org/RKB/SEMMA_Process_Model
Source: https://www.stellarconsulting.co.nz/blog/data/crisp-dm-still-a-
leader/
17Copyright 2018 © Qubole
ENABLING DATA SCIENCE WORKFLOW
Personas Access Use Cases Engines Cloud
Data Engineering
Data Science
Data Analysts
Machine Learning
Campaign Reports
Email analytics
Fraud detection
Presto
Spark
Hive
TensorFlow
AirFlow
AWS
GCP
Marketing
Revenue
Management
Finance
Commercial teams
● Data Science teams are able scale their
products individually (rather than having one
shared multi-tenant environment)
● Saw immediate cost savings on existing cloud
investments, which allowed the company to
focus on R&D
● Able to go-to-market with new Data Science
products in 1-3 months
● Mitigate SLA delays on analytics reports
OUTCOMES
18Copyright 2018 © Qubole
How did they do it?
1
8
Copyright 2018 © Qubole
Send email to request data to tag
Attachment with untagged data
Upload tagged data
Cloud data lake
Rollup tagged data
Train Model
Internal customer data
Email data
classified by
campaign type
Extract email text and join with tagged data
Hive Table &
Dashboards
Browse Campaign
Product
AUTOMATED EMAIL CAMPAIGN CLASSIFICATION
19Copyright 2018 © Qubole
How did they do it?
1
9
Copyright 2018 © Qubole
KEY CHARACTERISTICS OF
DATA-DRIVEN ORGANIZATIONS
Copyright 2017 © Qubole
TYPICAL DATA LAKE OPERATION
AVRO AVRO
Raw
(Staged)
Derived ‘Source of
Truth’
PARQUET
Hive / Spark Hive / Spark
Insert/Update/Delete
Export CSV JSON
Analytic Data
Warehouse
(i.e. Redshift &
Snowflake
environments)
Data Serving
DBs
(i.e. Cassandra,
DynamoDB,
etc.)
SPARK
PRESTO Interactive
ad-hoc queries
Use
Cases
Analytics
(i.e. Product
Analytics, BI, User
insights etc.)
Data Products
(i.e. Personalisation,
Recommendation etc.)
Data Science
(i.e. Time-series
Analysis, Research etc.)
Data
Discovery
ML & DL
Cloud
Compute
Object
Storage
21Copyright 2018 © Qubole
ON-PREMISE DATA SCIENCE APPROACH VS. CLOUD
• Impossible to scale storage without scaling
compute leading to expensive deployments
• Difficult to share HDFS data across Operating
Units
• Compute & Storage Separate
• Data is easily shared across Operating Units &
accessed from different locations
Cloud
Object
Store
DATA LOCALITY NO DATA LOCALITY
Higher Upfront Cost
No Autoscaling
Having to Fit
Models in Fixed
Infrastructure
Fewer DS Tools
Lower Cost
More Iterative
Scalable with
Automation
Fast Data and ML
Tool Access
22Copyright 2018 © Qubole
How did they do it?
2
2
Copyright 2018 © Qubole
STATE OF BIG DATA TODAY
23Copyright 2018 © Qubole
2018 QUBOLE BIG DATA ACTIVATION REPORT
Download a copy of the Qubole
2018 Big Data Activation Report at
https://go.qubole.com/CA-WP-
BigDataIndexReport_NewLP.html
This in-depth research is based
on anonymised insights from
more than 200 global Qubole
customers.
24Copyright 2018 © Qubole
THE ‘BIG THREE’ OPEN SOURCE ENGINES
Characteristics and strengths
Apache
Hive/Hadoop
Workhorse for handling
massive volumes of data for
ETL, ELT or data preparation
on structured and semi-
structured information
Apache Spark
Powerful for processing
complex and memory-
intensive workflows such as
creating data pipelines or
implementing machine
learning
Presto
Shines in interactive analytics
- business intelligence (BI),
data discovery tools when
data is in a semi-structured or
structured form.
25Copyright 2018 © Qubole
Question 3:
How many of you use multiple big data
engines(Hive, Spark & Presto)
26Copyright 2018 © Qubole
SINGLE VS. MULTIPLE OPEN SOURCE ENGINES
Percentage of companies who use single vs. multiple big data engines
Companies are increasingly deploying multiple
engines to solve specific uses cases
Copyright 2018 © Qubole
Multi
Engine
75.9%
Single
Engine
24.1%
Multi
Engine
86%
Single
Engine
14%
27Copyright 2018 © Qubole
MEASURING EFFICIENCY BY COMMANDS
YOY Growth in
Total No. of
Commands Run
439%
Apache Spark
365%
Presto
129%
Apache
Hadoop/Hive
24x more commands run per hour in Presto than
Apache Spark
6x more commands than Apache Hadoop/Hive
}
28Copyright 2018 © Qubole
3 MUST-HAVES
Movement to
Multi-Engine
Companies are
increasingly deploying
multiple open source
engines for different
use cases (ML, ETL,
analytics, etc.)
Users Getting
More Access
More users have
access to data and are
running more
commands and
collaborating
Cloud Benefits
Recognized
Companies are
leveraging multiple
clouds and automation
29Copyright 2018 © Qubole
How did they do it?
2
9
Copyright 2018 © Qubole
Customer Churn Model Demo
30Copyright 2018 © Qubole
Data Science Notebooks
What are they?
Notebooks are like lab books from high
school science, but with a Harry Potter
twist. Like animated images in print on
Daily Prophet, the code in a notebook
can be executed and results displayed
as part
Purpose:
• Collaboration Suite for Data Science
projects
• Easy access to computing resources
for data science workloads.
• Building blocks that enable self
service data mining.
• Supports a variety of languages like
R, Python and Scala.
31Copyright 2018 © Qubole
Question 4:
How many of you use Data Science
Notebooks for Collaboration?
32Copyright 2018 © Qubole
ML Example: Scalable Data Science
Data Science Customer Churn Overview:
1. Ingest Telco Churn Dataset (ETL)
2. Refine/Curate features and labels(ETL); Often
referred to as feature engineering.
3. Split dataset into test & train samples (70-30 or
60-40 splits)
4. Create multiple 3-stage ML pipelines for various
models (eg: logistic, gradient boosting, random
forests)
5. Run the multiple pipelines defined above to train
on predicting churn response variable.
6. Plotly visualizations for model
comparison/validation, scoring & selection
33Copyright 2018 © Qubole
ML Customer Churn Pipeline
34Copyright 2018 © Qubole
Sign up at www.qubole.com
35Copyright 2018 © Qubole
Appendix: Instructions to Download the Demo Notebook
• Sign up for a Qubole free account on Azure ( www.qubole.com ). This will give you a 14
day free access to try Hive, Spark, Presto & Airflow on Qubole.
• Once signed up, navigate to “Notebooks” in the Home menu on the left top corner.
• Click New, Import from URL and enter the below URL
• https://goo.gl/ENTqo2
• Once the notebook imports you may start the cluster from the notebook and explore the
notebook.

More Related Content

What's hot (20)

PDF
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
Ellen Friedman
 
PPTX
7 Predictive Analytics, Spark , Streaming use cases
DataWorks Summit/Hadoop Summit
 
PDF
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Carol McDonald
 
PDF
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Tokyo University of Science
 
PPTX
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
Infochimps, a CSC Big Data Business
 
PDF
Digital Transformation - #StrataData London 2017 - Data101
Ellen Friedman
 
PDF
The Future of Data Science
DataWorks Summit
 
PDF
Combining hadoop with big data analytics
The Marketing Distillery
 
PDF
Moving from Artisanal to Industrial Machine Learning
Greg Landrum
 
PDF
Big Data Scotland 2017
Ray Bugg
 
PDF
Real-time Analytics in Financial: Use Case, Architecture and Challenges
DataWorks Summit/Hadoop Summit
 
PDF
DataOps: An Agile Method for Data-Driven Organizations
Ellen Friedman
 
PDF
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
AM Publications,India
 
PDF
DataXDay - A data scientist journey to industrialization of machine learning
DataXDay Conference by Xebia
 
PDF
Let’s talk about reproducible data analysis
Greg Landrum
 
PDF
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Sarah Aerni
 
PPTX
Advanced Analytics for Any Data at Real-Time Speed
danpotterdwch
 
PDF
Leveraging Taxonomy Management with Machine Learning
Semantic Web Company
 
PDF
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Predicting Flight Delays with Spark Machine Learning
Carol McDonald
 
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
Ellen Friedman
 
7 Predictive Analytics, Spark , Streaming use cases
DataWorks Summit/Hadoop Summit
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Carol McDonald
 
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Tokyo University of Science
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
Infochimps, a CSC Big Data Business
 
Digital Transformation - #StrataData London 2017 - Data101
Ellen Friedman
 
The Future of Data Science
DataWorks Summit
 
Combining hadoop with big data analytics
The Marketing Distillery
 
Moving from Artisanal to Industrial Machine Learning
Greg Landrum
 
Big Data Scotland 2017
Ray Bugg
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
DataWorks Summit/Hadoop Summit
 
DataOps: An Agile Method for Data-Driven Organizations
Ellen Friedman
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
AM Publications,India
 
DataXDay - A data scientist journey to industrialization of machine learning
DataXDay Conference by Xebia
 
Let’s talk about reproducible data analysis
Greg Landrum
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Sarah Aerni
 
Advanced Analytics for Any Data at Real-Time Speed
danpotterdwch
 
Leveraging Taxonomy Management with Machine Learning
Semantic Web Company
 
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Predicting Flight Delays with Spark Machine Learning
Carol McDonald
 

Similar to State of enterprise data science (20)

PPTX
Top Trends in Building Data Lakes for Machine Learning and AI
Holden Ackerman
 
PPTX
Atlanta Data Science Meetup | Qubole slides
Qubole
 
PDF
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Krishna Sankar
 
PDF
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
PPTX
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
 
PPTX
Atlanta MLConf
Qubole
 
PPTX
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole
 
PDF
PXL Data Engineering Workshop By Selligent
Jonny Daenen
 
PPTX
Breed data scientists_ A Presentation.pptx
GautamPopli1
 
PPTX
Big data4businessusers
Bob Hardaway
 
PDF
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
PDF
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
PPTX
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
Vivian S. Zhang
 
PDF
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
DATAVERSITY
 
PDF
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
GetInData
 
PDF
The Cloud Data Lake Early Release Rukmani Gopalan
fossylenea
 
PPTX
So your boss says you need to learn data science
Susan Ibach
 
PDF
DevOps for DataScience
Stepan Pushkarev
 
PDF
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
PDF
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
DATAVERSITY
 
Top Trends in Building Data Lakes for Machine Learning and AI
Holden Ackerman
 
Atlanta Data Science Meetup | Qubole slides
Qubole
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Krishna Sankar
 
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
 
Atlanta MLConf
Qubole
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole
 
PXL Data Engineering Workshop By Selligent
Jonny Daenen
 
Breed data scientists_ A Presentation.pptx
GautamPopli1
 
Big data4businessusers
Bob Hardaway
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
Vivian S. Zhang
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
DATAVERSITY
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
GetInData
 
The Cloud Data Lake Early Release Rukmani Gopalan
fossylenea
 
So your boss says you need to learn data science
Susan Ibach
 
DevOps for DataScience
Stepan Pushkarev
 
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
DATAVERSITY
 
Ad

More from Yan Xu (20)

PPTX
Kaggle winning solutions: Retail Sales Forecasting
Yan Xu
 
PDF
Basics of Dynamic programming
Yan Xu
 
PPTX
Walking through Tensorflow 2.0
Yan Xu
 
PPTX
Practical contextual bandits for business
Yan Xu
 
PDF
Introduction to Multi-armed Bandits
Yan Xu
 
PDF
A Data-Driven Question Generation Model for Educational Content - by Jack Wang
Yan Xu
 
PDF
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Yan Xu
 
PDF
Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...
Yan Xu
 
PDF
Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...
Yan Xu
 
PDF
Introduction to Autoencoders
Yan Xu
 
PDF
Long Short Term Memory
Yan Xu
 
PDF
Deep Feed Forward Neural Networks and Regularization
Yan Xu
 
PPTX
Linear algebra and probability (Deep Learning chapter 2&3)
Yan Xu
 
PPTX
HML: Historical View and Trends of Deep Learning
Yan Xu
 
PDF
Secrets behind AlphaGo
Yan Xu
 
PPTX
Optimization in Deep Learning
Yan Xu
 
PDF
Introduction to Recurrent Neural Network
Yan Xu
 
PDF
Convolutional neural network
Yan Xu
 
PDF
Introduction to Neural Network
Yan Xu
 
PDF
Nonlinear dimension reduction
Yan Xu
 
Kaggle winning solutions: Retail Sales Forecasting
Yan Xu
 
Basics of Dynamic programming
Yan Xu
 
Walking through Tensorflow 2.0
Yan Xu
 
Practical contextual bandits for business
Yan Xu
 
Introduction to Multi-armed Bandits
Yan Xu
 
A Data-Driven Question Generation Model for Educational Content - by Jack Wang
Yan Xu
 
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Yan Xu
 
Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...
Yan Xu
 
Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...
Yan Xu
 
Introduction to Autoencoders
Yan Xu
 
Long Short Term Memory
Yan Xu
 
Deep Feed Forward Neural Networks and Regularization
Yan Xu
 
Linear algebra and probability (Deep Learning chapter 2&3)
Yan Xu
 
HML: Historical View and Trends of Deep Learning
Yan Xu
 
Secrets behind AlphaGo
Yan Xu
 
Optimization in Deep Learning
Yan Xu
 
Introduction to Recurrent Neural Network
Yan Xu
 
Convolutional neural network
Yan Xu
 
Introduction to Neural Network
Yan Xu
 
Nonlinear dimension reduction
Yan Xu
 
Ad

Recently uploaded (20)

PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PDF
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PPTX
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 

State of enterprise data science

  • 1. 1Copyright 2018 © Qubole STATE OF ENTERPRISE DATA SCIENCE David Roe Pradeep Reddy
  • 2. 2Copyright 2018 © Qubole Birth of Data Science in Life Sciences Cholera Outbreak in 1854; London - Prevailing Theory: Miasma Theory (Cholera was caused by bad air) - Dr John Snow refuted Miasma Theory and came up with an idea to mark on a map of London the locations of all known cases of cholera that led to death. This marked the birth of “Epidemiology - Reference: The Ghost Map by Steven Johnson
  • 3. 3Copyright 2018 © Qubole INTRODUCTION OVERVIEW OF STATE OF DATA SCIENCE TODAY - KEY TRENDS - CURRENT PROBLEMS DATA SCIENCE WORKFLOW IN MODERN ARCHITECTURE - INSIGHTS FROM 2018 BIG DATA ACTIVATION REPORT - HOW COMPANIES ARE BECOMING SUCCESSFUL DEMO OF ML IMPLEMENTATION WITH HADOOP AND SPARK - END-TO-END BATCH PIPELINE - MODEL OUTPUTS & VISUALIZATION
  • 4. 4Copyright 2018 © Qubole The transformational promise of Data Science projects remain elusive 85%of Data Science projects fail to meet expectations >70%of Analytics potential value is unrealised Copyright 2018 © Qubole
  • 5. 5Copyright 2018 © Qubole Data Science can be successful with Modern Data Architecture - that scales to allow your models to train against production data enables you to iterate and prototype quickly Copyright 2018 © Qubole provides you with a solid hand-off from training to production
  • 6. 6Copyright 2018 © Qubole COMMON MACHINE LEARNING DATAFLOW
  • 7. 7Copyright 2018 © Qubole Copyright 2018 © Qubole Data Preparation Model Build Model Validation Deploy & Monitor Tasks: wrangling, exploration, validation Tasks: split data, model specification, feature selection Tasks: Train, Visualize, compare / choose models, model report Tasks: build, compile/JAR, reporting dashboard, monitor
  • 8. 8Copyright 2018 © Qubole Question 1: How many of you do Big Data and/or Data Science in the Cloud
  • 9. 9Copyright 2018 © Qubole QUBOLE BIG DATA ACTIVATION STACK Copyright 2018 © Qubole Data Scientists Third-Party Tools Data Engineers Third-Party Tools Analysts Third-Party Tools Qubole Cloud-Native Big Data Activation Platform Autoscale Caching Spot Buying AIR Serverless Monitoring … Cloud Data Lake
  • 10. 10Copyright 2018 © Qubole AUTOSCALING BIG DATA ENGINES IN CLOUD
  • 11. 11Copyright 2018 © Qubole DATA SCIENCE REQUIRES SCALABLE BIG DATA DATA CLOUD 50%savings in cloud spend 1:65DataOps : Users 10Xincrease in IoT data
  • 12. Copyright 2017 © Qubole STATE OF BIG DATA ADOPTION Copyright 2018 © Qubole • Production reporting/DW • Researching • Initial Big Data Deployment • Targeted use case • Multiple departments • Multiple engines • Top down use cases • Enterprise transformation • Bottom up use cases • Digital enterprise • Ubiquitous insights • True business transformation ASPIRATION 1ST STAGE EXPERIMENTATION 2ND STAGE EXPANSION 3RD STAGE INVERSION 4TH STAGE NIRVANA 5TH STAGE
  • 13. Copyright 2017 © Qubole MACHINE LEARNING WORKFLOW IS A PRODUCT LIFECYCLE Copyright 2018 © Qubole BUSINESS VALUE EXPERIMENTATION DEVELOPMENT PRODUCTION Continuous Integration / Delivery (CI/CD) • Identifying stakeholders • Product roadmap • Data Exploration • Initial Big Data deployment • Targeted use case • Multiple Departments • Model training • Multiple engines & deployments • Top Down Use Cases • Enterprise transformation • Bottom up use cases • Digital enterprise • Measuring impact • True business transformation 1ST STAGE 2ND STAGE 3RD STAGE 4TH STAGE 5TH STAGE
  • 14. 14Copyright 2018 © Qubole Data Science Workflow - Team Data Science Process(TDSP) Source: Microsoft Azure “Data that is loved tends to survive.” Kurt Bollacker, Distinguished Data Scientist
  • 15. 15Copyright 2018 © Qubole Question 2: How many of you use have Big data in the cloud?
  • 16. 16Copyright 2018 © Qubole Other Data Science/Data Mining Process Models Source: http://www.gmelli.org/RKB/SEMMA_Process_Model Source: https://www.stellarconsulting.co.nz/blog/data/crisp-dm-still-a- leader/
  • 17. 17Copyright 2018 © Qubole ENABLING DATA SCIENCE WORKFLOW Personas Access Use Cases Engines Cloud Data Engineering Data Science Data Analysts Machine Learning Campaign Reports Email analytics Fraud detection Presto Spark Hive TensorFlow AirFlow AWS GCP Marketing Revenue Management Finance Commercial teams ● Data Science teams are able scale their products individually (rather than having one shared multi-tenant environment) ● Saw immediate cost savings on existing cloud investments, which allowed the company to focus on R&D ● Able to go-to-market with new Data Science products in 1-3 months ● Mitigate SLA delays on analytics reports OUTCOMES
  • 18. 18Copyright 2018 © Qubole How did they do it? 1 8 Copyright 2018 © Qubole Send email to request data to tag Attachment with untagged data Upload tagged data Cloud data lake Rollup tagged data Train Model Internal customer data Email data classified by campaign type Extract email text and join with tagged data Hive Table & Dashboards Browse Campaign Product AUTOMATED EMAIL CAMPAIGN CLASSIFICATION
  • 19. 19Copyright 2018 © Qubole How did they do it? 1 9 Copyright 2018 © Qubole KEY CHARACTERISTICS OF DATA-DRIVEN ORGANIZATIONS
  • 20. Copyright 2017 © Qubole TYPICAL DATA LAKE OPERATION AVRO AVRO Raw (Staged) Derived ‘Source of Truth’ PARQUET Hive / Spark Hive / Spark Insert/Update/Delete Export CSV JSON Analytic Data Warehouse (i.e. Redshift & Snowflake environments) Data Serving DBs (i.e. Cassandra, DynamoDB, etc.) SPARK PRESTO Interactive ad-hoc queries Use Cases Analytics (i.e. Product Analytics, BI, User insights etc.) Data Products (i.e. Personalisation, Recommendation etc.) Data Science (i.e. Time-series Analysis, Research etc.) Data Discovery ML & DL Cloud Compute Object Storage
  • 21. 21Copyright 2018 © Qubole ON-PREMISE DATA SCIENCE APPROACH VS. CLOUD • Impossible to scale storage without scaling compute leading to expensive deployments • Difficult to share HDFS data across Operating Units • Compute & Storage Separate • Data is easily shared across Operating Units & accessed from different locations Cloud Object Store DATA LOCALITY NO DATA LOCALITY Higher Upfront Cost No Autoscaling Having to Fit Models in Fixed Infrastructure Fewer DS Tools Lower Cost More Iterative Scalable with Automation Fast Data and ML Tool Access
  • 22. 22Copyright 2018 © Qubole How did they do it? 2 2 Copyright 2018 © Qubole STATE OF BIG DATA TODAY
  • 23. 23Copyright 2018 © Qubole 2018 QUBOLE BIG DATA ACTIVATION REPORT Download a copy of the Qubole 2018 Big Data Activation Report at https://go.qubole.com/CA-WP- BigDataIndexReport_NewLP.html This in-depth research is based on anonymised insights from more than 200 global Qubole customers.
  • 24. 24Copyright 2018 © Qubole THE ‘BIG THREE’ OPEN SOURCE ENGINES Characteristics and strengths Apache Hive/Hadoop Workhorse for handling massive volumes of data for ETL, ELT or data preparation on structured and semi- structured information Apache Spark Powerful for processing complex and memory- intensive workflows such as creating data pipelines or implementing machine learning Presto Shines in interactive analytics - business intelligence (BI), data discovery tools when data is in a semi-structured or structured form.
  • 25. 25Copyright 2018 © Qubole Question 3: How many of you use multiple big data engines(Hive, Spark & Presto)
  • 26. 26Copyright 2018 © Qubole SINGLE VS. MULTIPLE OPEN SOURCE ENGINES Percentage of companies who use single vs. multiple big data engines Companies are increasingly deploying multiple engines to solve specific uses cases Copyright 2018 © Qubole Multi Engine 75.9% Single Engine 24.1% Multi Engine 86% Single Engine 14%
  • 27. 27Copyright 2018 © Qubole MEASURING EFFICIENCY BY COMMANDS YOY Growth in Total No. of Commands Run 439% Apache Spark 365% Presto 129% Apache Hadoop/Hive 24x more commands run per hour in Presto than Apache Spark 6x more commands than Apache Hadoop/Hive }
  • 28. 28Copyright 2018 © Qubole 3 MUST-HAVES Movement to Multi-Engine Companies are increasingly deploying multiple open source engines for different use cases (ML, ETL, analytics, etc.) Users Getting More Access More users have access to data and are running more commands and collaborating Cloud Benefits Recognized Companies are leveraging multiple clouds and automation
  • 29. 29Copyright 2018 © Qubole How did they do it? 2 9 Copyright 2018 © Qubole Customer Churn Model Demo
  • 30. 30Copyright 2018 © Qubole Data Science Notebooks What are they? Notebooks are like lab books from high school science, but with a Harry Potter twist. Like animated images in print on Daily Prophet, the code in a notebook can be executed and results displayed as part Purpose: • Collaboration Suite for Data Science projects • Easy access to computing resources for data science workloads. • Building blocks that enable self service data mining. • Supports a variety of languages like R, Python and Scala.
  • 31. 31Copyright 2018 © Qubole Question 4: How many of you use Data Science Notebooks for Collaboration?
  • 32. 32Copyright 2018 © Qubole ML Example: Scalable Data Science Data Science Customer Churn Overview: 1. Ingest Telco Churn Dataset (ETL) 2. Refine/Curate features and labels(ETL); Often referred to as feature engineering. 3. Split dataset into test & train samples (70-30 or 60-40 splits) 4. Create multiple 3-stage ML pipelines for various models (eg: logistic, gradient boosting, random forests) 5. Run the multiple pipelines defined above to train on predicting churn response variable. 6. Plotly visualizations for model comparison/validation, scoring & selection
  • 33. 33Copyright 2018 © Qubole ML Customer Churn Pipeline
  • 34. 34Copyright 2018 © Qubole Sign up at www.qubole.com
  • 35. 35Copyright 2018 © Qubole Appendix: Instructions to Download the Demo Notebook • Sign up for a Qubole free account on Azure ( www.qubole.com ). This will give you a 14 day free access to try Hive, Spark, Presto & Airflow on Qubole. • Once signed up, navigate to “Notebooks” in the Home menu on the left top corner. • Click New, Import from URL and enter the below URL • https://goo.gl/ENTqo2 • Once the notebook imports you may start the cluster from the notebook and explore the notebook.

Editor's Notes

  • #10: Want to give a bit of background as to how Qubole sits in the big data ecoystem Our cloud-native, big data activation platform Has built in tools and also connects to many 3rd party tools to support all of your use cases The platform itself makes running workloads easy using your choice of open source technology, optimizes price performance automatically and evolves over time through a plug in architecture Finally, we support multiple cloud providers so there’s no lock in
  • #15: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview