© 2017 IBM Corporation
Scaling Data Science on Big Data
Vikram Murali
Program Director, Data Science & Machine Learning, IBM
Sriram Srinivasan
STSM & Architect, Data Science & Machine Learning, IBM
IBM Analytics
Data Scientist Pain Points
• Where is the data I need to drive business insights?
• I don’t want to have to learn Hadoop, Hive, etc.
• How do I collaborate and share my work with others?
• What is the best visualization technique to tell my story?
• How do I bring my familiar R/Python libraries to this new Data Science platform?
• How do I learn to use the latest libraries and techniques (TensorFlow, scikit-learn, XGBoost, …), and how do I ensure the right set of compute resources for them?
• How are my Machine Learning models performing, and how do I improve them?
• I have this Machine Learning model; how do I deploy it in production?
Machine Learning & Data Science
Challenges for the Enterprise
• Ensure secure data access and auditability for governance and compliance
• Control and curate access to data and to all open source libraries used
• Explainability and reproducibility of machine learning activities
• Improve trust in analytics and predictions
• Efficient collaboration and versioning of all source, sample data and models
• Repeatability of process
• Establish continuous integration practices
• Agility in delivery
• Publish/share and identify provenance/lineage with confidence
• Visibility and access control
• Effective resource utilization and ability to scale out on demand
• Balance resources among the workloads of different data scientists and machine learning practitioners
Why has this been hard?
• Rigid toolsets & absence of an integrated platform
  - Have to choose one and only one approach
  - Cannot easily connect all of the capabilities required
  - Difficult to navigate between the various tools used
• Fragmented and time-consuming practices
  - Result of using multiple disjoint environments
  - Separate on-ramp/community for each tool/environment
  - Does not yield meaningful metadata or complete data lineage
• Analytical silos
  - Difficult to maintain and version control project assets
  - Limited means of collaborating with teams
  - Results are difficult to share and audit
• Resource management complexity
  - Lack of scalable infrastructure
  - Inflexible resource prioritization techniques
Introducing IBM Data Science Experience
Community
• Find tutorials and datasets
• Connect with other Data Scientists
• IBM ML Hub for expert assistance
• Open Source evangelism
• Fork and share projects, samples
Open Source
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE
• Anaconda distribution
• Add your favorite libraries
IBM Added Value
• Projects and Version Control
• Spark-in-DSX and Remote Spark
• IBM Machine Learning tech: algorithms & more
• Platform Manager for easy administration
• Compute Elasticity support
IBM Data Science Experience
DSX on Public Cloud
• PayGo consumption with as-a-service delivery, up & running in seconds
• Integrated with IBM Spark-as-a-Service for compute, IBM Object Store for data, as well as other platform assets
• Immediate cloud collaboration via RStudio and Jupyter notebooks
DSX Desktop
• Easily installed on your laptop or PC
• Won’t scale beyond the hardware available on your machine
• Access to RStudio and Jupyter notebooks, powered by one small Spark worker operating locally on your machine
• Load CSV data files into Data Frames
DSX Local on Private Cloud
• Scalable DSX cluster deployed on your private infrastructure
• Dockerized containers via Kubernetes
• DSX Local can also deploy with Hortonworks Data Platform on-premises
• LDAP for user management and authentication
• Easy collaboration, versioning with Projects & git
Built-in Zeppelin & Jupyter Notebooks and RStudio for visualizing and coding on data science tasks using Python, R, & Scala. Built-in Spark parallelizes & accelerates data science tasks.
Machine Learning Workflow in Data Science Experience
• Machine learning detects when models fall out of spec and automatically triggers retraining
• Fully integrated model management means data scientists, app developers & operations can use the same environment
[Figure: machine learning workflow diagram.
Stages: Ingest; Data Processing; Model Training; Deployment & Management.
Other labels: Data; Live System; Feedback Loop (models lose accuracy).
Stage annotations: Historical; Streaming; Data visualization; Feature transform and engineering; Model selection and evaluation; Pipelines, not only models; Versioning; Predict when given new data; Monitoring and live evaluation.
Phase labels: Creating samples & Cleansing; Automating Data Science Workloads; Scalable Deployment.
Roles: Data Engineers; Data Scientists + Researchers; ML Engineers + Production Engineers.]
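The feedback loop on this slide (monitor live accuracy, retrain when the model falls out of spec) can be illustrated with a minimal Python sketch. This is not DSX's implementation; the class name `AccuracyMonitor`, the sliding-window approach, the threshold values and the `retrain` callback are all assumptions made for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track live accuracy over a sliding window of scored predictions."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        # Each entry is True when a prediction matched the observed label.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy  # hypothetical "spec" threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Judge the model only once a full window of evidence is available.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


def feedback_loop(monitor, scored_events, retrain):
    """Feed (predicted, actual) pairs to the monitor; retrain when out of spec.

    Returns the number of times retraining was triggered.
    """
    retrains = 0
    for predicted, actual in scored_events:
        monitor.record(predicted, actual)
        if monitor.needs_retraining():
            retrain()                 # hypothetical hook into a training job
            monitor.outcomes.clear()  # start a fresh window for the new model
            retrains += 1
    return retrains
```

In practice the `retrain` callback would launch a training pipeline and redeploy the resulting model; here it is only a placeholder for that step.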
Data Science Experience
Machine Learning Everywhere – An Open Platform
• Add your favorite libraries
• Publish Open APIs for secure ML applications
DSX Local Architecture
DSX Scale out in Kubernetes is simple
• DSX-Spark scale-out happens automatically when more compute nodes are added (via “Daemon Sets”)
• Remote Spark can be scaled out independently as usual (say, in Hadoop/Yarn)
• Individual workload isolation and scale-out in pods
  - Each individual DSX user (or an entity, in general) is assigned a Kubernetes namespace, making metering simple.
  - All containers (pods) for that user are spawned in that namespace, such as for tools (Jupyter/Zeppelin (Python) or R/RStudio) as well as other non-Spark jobs.
  - The namespace provides the total quota for that user, with resource requests and limits set in each pod deployment.
• “Shared” services are load balanced (with HA support) across all user access by typical Kubernetes techniques, such as replicas of pods and DNS routing via Kubernetes services.
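The per-user namespace and quota scheme described above can be sketched as plain Kubernetes manifests built in Python. This is an illustration only, not DSX's actual configuration: the `dsx-<user>` namespace naming, the `user-quota` object name, the image name `example/jupyter:latest` and all CPU/memory figures are assumptions. The manifest kinds and fields (`Namespace`, `ResourceQuota` with `spec.hard`, and per-container `resources.requests`/`resources.limits`) are standard Kubernetes.

```python
import json


def user_namespace_manifests(user, cpu="8", memory="32Gi", max_pods=10):
    """Return Namespace and ResourceQuota manifests for one user."""
    namespace = f"dsx-{user}"  # hypothetical naming scheme
    namespace_manifest = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace},
    }
    # The quota caps the total resources of all pods in the namespace.
    quota_manifest = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "user-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                "pods": str(max_pods),
            }
        },
    }
    return [namespace_manifest, quota_manifest]


def notebook_pod(user, image="example/jupyter:latest"):
    """Return a Pod manifest for a notebook in the user's namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"jupyter-{user}", "namespace": f"dsx-{user}"},
        "spec": {
            "containers": [{
                "name": "notebook",
                "image": image,  # hypothetical image name
                # Each pod declares requests and limits, which count
                # against the namespace quota above.
                "resources": {
                    "requests": {"cpu": "1", "memory": "2Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                },
            }]
        },
    }


if __name__ == "__main__":
    for manifest in user_namespace_manifests("alice") + [notebook_pod("alice")]:
        print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed manifests could be applied directly to a test cluster. Metering per user then reduces to reading usage per namespace.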
Data Science Experience with Hortonworks Data Platform
[Figure labels: Big Data; DSX]
Data Science Experience with HDP – Roadmap
Today: DSX & HDP interoperability (Side-by-Side Installation)
• DSX on-the-edge integrated & optimized for HDP deployments
• DSX Jupyter, RStudio & Zeppelin and Machine Learning services enabled for HDP data sources
• Yarn managed Spark leveraged by DSX, via Livy
  - Spark jobs pushed to HDP cluster
Q4 2017: Single Cluster (DSX Within HDP Cluster)
• Dedicated nodes for DSX in the HDP cluster with Ambari-based installation/configuration
• Deploy & scale DSX with Yarn managing DSX as a top-level application
• Knox, Ranger & Atlas integration for authentication, authorization & governance
1st Half 2018: Fully Yarn Managed DSX Workloads
• HDP embeds Kubernetes in Yarn, enables launch and integration of Kubernetes pods as Yarn containers
• Yarn manages all workloads in a granular fashion across the entire HDP cluster
  - Python & R workloads (non-Spark) also managed by Yarn
  - GPU affinity, especially for Deep Learning jobs
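As an illustration of pushing a Spark job to a Yarn-managed cluster via Livy, the sketch below builds a batch submission against Livy's REST endpoint `POST /batches`. It shows the general shape of such a call, not how DSX itself invokes Livy. The host `livy.example.com`, the script path and the queue name are hypothetical, and a secured cluster (for example behind Knox or Kerberos) would need authentication that is omitted here.

```python
import json
import urllib.request


def build_batch_request(livy_url, app_file, args=(), queue=None, num_executors=None):
    """Build (but do not send) a Livy batch submission request."""
    payload = {"file": app_file, "args": list(args)}
    if queue is not None:
        payload["queue"] = queue  # Yarn queue to submit to
    if num_executors is not None:
        payload["numExecutors"] = num_executors
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request):
    """Send the request and return Livy's JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical endpoint and job; this only builds and prints the request.
    request = build_batch_request(
        "http://livy.example.com:8998",
        "hdfs:///jobs/train_model.py",
        args=["--epochs", "5"],
        queue="datascience",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect and test without a running Livy server. The response to a real submission includes a batch id that can be polled for the job state.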
Goal: Enterprise IaaS for Data Scientists
• Efficient compute resource management for large-scale Analytics, Machine Learning and Deep Learning workloads
  - Enable Data Scientists to procure resources from a shared compute “grid” for any kind of activity, from interactive notebooks & IDEs to training jobs or scheduled scripts and apps.
  - All compute manifested as Docker containers/Kubernetes pods
• HDP/Yarn as the Resource Manager
  - Enable all workloads, whether MapReduce or Spark jobs or DSX/ML activities, to be uniformly handled by the HDP/Yarn scheduler.
  - Manage queue priorities, balancing of workloads and scale-out for the whole cluster, providing best utilization of all resources.
• Yarn and Kubernetes: the best of both worlds!
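To make the idea of queue priorities over a shared compute “grid” concrete, here is a toy sketch of greedy, priority-ordered admission. It is deliberately simplistic and is not Yarn's scheduling algorithm (Yarn's schedulers add hierarchical queues, capacity guarantees, preemption and more). The function name, the workload names and the core counts are invented for illustration.

```python
def schedule(total_cores, requests):
    """Admit workloads in priority order until the grid is full.

    requests: list of (name, priority, cores); higher priority goes first.
    Returns (admitted, waiting, free_cores).
    """
    free = total_cores
    admitted, waiting = [], []
    # sorted() is stable, so equal priorities keep their submission order.
    for name, priority, cores in sorted(requests, key=lambda r: -r[1]):
        if cores <= free:
            admitted.append(name)
            free -= cores
        else:
            waiting.append(name)  # stays queued until capacity frees up
    return admitted, waiting, free


if __name__ == "__main__":
    workloads = [
        ("notebook-alice", 1, 2),
        ("spark-etl", 3, 8),
        ("dl-training", 2, 8),
        ("rstudio-bob", 1, 2),
    ]
    admitted, waiting, free = schedule(12, workloads)
    print("admitted:", admitted)
    print("waiting:", waiting)
    print("free cores:", free)
```

With a single scheduler seeing every request, interactive notebooks, Spark jobs and training jobs compete for the same pool instead of being confined to separate, statically sized clusters, which is the utilization argument the slide makes.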
Call to Action
Experience DSX & ML Today…
IBM DSX at http://datascience.ibm.com
DSX Local recorded demos
• Machine Learning: https://www.youtube.com/watch?v=htGZ1Iomeec
• Connecting to external Spark: https://www.youtube.com/watch?v=rA0Rlb2M_oI
• Spark submit from external app: https://www.youtube.com/watch?v=TETAT9pC9_o
• Administration experience: https://www.youtube.com/watch?v=htGZ1Iomeec
Birds of a Feather session: 6pm Thursday, C 4.5
THANK YOU
IBM Data Science Experience
Vikram Murali
Program Director, Data Science & Machine Learning, IBM
Sriram Srinivasan
STSM & Architect, Data Science & Machine Learning, IBM
