Deep Learning On Spark
Using BigDL on Qubole
Dash Desai
Technology Evangelist
@iamontheinet
Some Basic Concepts
Copyright 2017 © Qubole
What is Machine Learning?
Gives ‘computers the ability to learn without
being explicitly programmed’ - Wikipedia
What is Deep Learning?
A form of ML that uses a model of computing
inspired by the structure of the brain
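To make the "model of computing" idea concrete, here is a minimal sketch of a two-layer network of artificial neurons. The weights are hand-picked, hypothetical values chosen so the network computes XOR; a real deep learning system would learn them from data rather than have them set by hand.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: each neuron takes a weighted sum of the
    inputs, adds its bias, and applies the sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights (not learned): the hidden layer acts
# roughly like OR and NAND, and the output layer like AND, which gives XOR.
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]
HIDDEN_B = [-10.0, 30.0]
OUT_W = [[20.0, 20.0]]
OUT_B = [-30.0]


def predict(x1, x2):
    """Forward pass: input -> hidden layer -> output layer."""
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B)
    return dense(hidden, OUT_W, OUT_B)[0]


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(predict(a, b)))
```

No single neuron can compute XOR; stacking layers is what makes it possible, which is the basic motivation for "deep" models.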
Deep Learning Applications
Computer Vision / Image Recognition / Object Detection
Speech Recognition / Natural Language Processing (NLP)
Recommendation Systems (Products, Matchmaking, etc.)
Prediction (Stock Market, Healthcare, etc.)
Anomaly Detection (Cybersecurity, etc.)
What is Apache Spark?
A fast and general-purpose engine
for large-scale, distributed data
processing
MLlib
Spark’s scalable machine learning library
High-quality algorithms; up to 100x faster than MapReduce
Usable in Java, Scala, Python, and R
Deep Learning: On Apache Spark
Deep Learning: Other Popular Non-Spark Options
TensorFlow* (Google)
• Natively distributed out-of-the-box
Keras
• Naturally runs on distributed frameworks/back-ends
• Theano, MXNet (CMU, MIT, NYU), TensorFlow, CNTK (Microsoft)
*Not to be confused with TensorFlow On Spark (TFOS) by Yahoo
BigDL
What is BigDL?
Distributed deep learning library
for Apache Spark
Open sourced by Intel (in Dec 2016)
Feature parity with DL frameworks such as Caffe, Torch
Integrates with Spark ML pipeline and Spark Streaming
Supports Model snapshots
Intel MKL (Math Kernel Library); multi-threading within
each Spark task
What is BigDL? (Cont…)
Includes 100+ Layers (highest level building block in DL)
Includes 20+ Loss functions (help with model fitting)
Optimization methods include SGD, Adagrad, L-BFGS
Numeric computing via Tensor & high level neural networks
Scaling: synchronous mini-batch SGD and all-reduce
communication on Spark
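The scaling bullet can be illustrated with a small simulation. This is a sketch of the general technique (synchronous mini-batch SGD with an all-reduce of gradients), written in plain Python with simulated workers; it does not use BigDL or Spark, and every name in it is hypothetical.

```python
def local_gradient(w, b, shard):
    """Gradient of mean squared error for the model y = w*x + b,
    computed only on this worker's shard of the mini-batch."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def all_reduce_mean(grads):
    """Simulated all-reduce: every worker receives the same averaged gradient."""
    k = len(grads)
    mean = tuple(sum(g[i] for g in grads) / k for i in range(2))
    return [mean] * k


def train(data, num_workers=4, batch_size=8, lr=0.1, epochs=500):
    """Synchronous data-parallel SGD.
    Assumes batch_size is a multiple of num_workers so shards are equal-sized."""
    replicas = [(0.0, 0.0)] * num_workers  # each worker holds a model copy
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Split the mini-batch across workers.
            shards = [batch[i::num_workers] for i in range(num_workers)]
            # Each worker computes a gradient on its own shard.
            grads = [local_gradient(w, b, s)
                     for (w, b), s in zip(replicas, shards)]
            # Synchronization point: all workers wait for the averaged gradient.
            reduced = all_reduce_mean(grads)
            # Every replica applies the same update, so the copies stay in sync.
            replicas = [(w - lr * gw, b - lr * gb)
                        for (w, b), (gw, gb) in zip(replicas, reduced)]
    return replicas


# Noise-free toy data drawn from y = 3x + 1.
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(16)]
replicas = train(data)
print(replicas[0])
```

Because every worker applies the identical averaged gradient, all model copies remain equal after each step; that is the "synchronous" part, in contrast to asynchronous schemes where workers update a shared model at different times.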
BigDL vs TensorFrames
TensorFrames — can call TF from individual
partitions of a DataFrame or an RDD (in PySpark)…
However, since TF is not natively integrated
with Spark, it does not support distributed deep
learning such as for model training or fine
tuning.
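The per-partition calling pattern described above can be sketched as follows. This is a plain-Python stand-in, not TensorFrames or Spark code: `map_partitions` and `make_scorer` are hypothetical helpers, and the "model" is a fixed linear threshold standing in for a pre-trained TensorFlow graph.

```python
def map_partitions(partitions, fn):
    """Stand-in for a per-partition map: run fn on each partition independently."""
    return [list(fn(iter(p))) for p in partitions]


def make_scorer(weight, bias):
    """Hypothetical pre-trained model: a fixed linear scorer with a threshold."""
    def score_partition(rows):
        # The model is applied to the rows of one partition only.
        # No state or gradients are exchanged between partitions.
        for x in rows:
            yield 1 if weight * x + bias > 0 else 0
    return score_partition


# Three partitions of a toy dataset.
partitions = [[-2.0, -0.5], [0.5, 1.5], [3.0]]
predictions = map_partitions(partitions, make_scorer(weight=1.0, bias=0.0))
print(predictions)
```

Each partition is processed in isolation, which works well for scoring with an existing model. Distributed training is different: it needs a step that combines results across partitions, and this pattern has none.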
BigDL vs TensorFlow on Spark (TFOS), Caffe
TensorFlow on Spark* (TFOS) and Caffe on Spark —
use Spark executors to launch TF or Caffe instances
on the cluster…
However, model training, predictions, etc. are
performed outside of Spark across multiple TF or
Caffe instances…
• Run as standalone jobs outside of the pipeline
• Very fine-grained/limited interaction with
analytics pipeline
*Not to be confused with natively distributed TensorFlow by Google
How Does BigDL Work?
Distributed Deep Learning: Two Methods
Distributed Deep Learning: BigDL
<Insert Demo Here >> YAY!/>
Demo: Recognize Handwritten Digits
On
Use Model
Train On Dataset
… with everything running on …
…
…
… to recognize handwritten digits …
Data Science on Qubole
Qubole
Qubole automates, controls and orchestrates all big data workloads including Data
Science so that you can optimize for performance, cost and scale.
Built for Anyone Who Uses Data
Analysts
Data Scientists
Data Engineers
Data Admins
A Single Platform for Any Use Case
ETL & Reporting
Ad Hoc Queries
Machine Learning
Streaming
Vertical Apps
Open Source Engines, Optimized for the Cloud
Cloud-Native,
Cloud-Optimized,
Cloud-Agnostic
Data Science on Qubole
Data Science on Qubole
Cluster LifeCycle Management on Qubole
Note: Available on Apache Spark, Hadoop, and Presto as a service on Qubole
Auto-scaling Clusters
• Policy-driven
• One-time setup; Runtime modifications
• Workload-aware upscaling and downscaling
• No wasted resources, resulting in lower TCO
Heterogeneous Clusters
• Mix-and-match instance types
• On-Demand and Spot instances (on AWS)
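As a rough illustration of what "policy-driven" and "workload-aware" scaling could look like, here is a minimal sketch. It is not Qubole's algorithm: the `ScalingPolicy` fields and the sizing rule are assumptions made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy, set once and adjustable at runtime."""
    min_nodes: int       # never scale below this
    max_nodes: int       # never scale above this
    tasks_per_node: int  # how many tasks one node is assumed to handle


def desired_nodes(policy, pending_tasks, running_tasks):
    """Size the cluster to the current workload, clamped to the policy bounds."""
    demand = pending_tasks + running_tasks
    needed = math.ceil(demand / policy.tasks_per_node)
    return max(policy.min_nodes, min(policy.max_nodes, needed))


policy = ScalingPolicy(min_nodes=2, max_nodes=10, tasks_per_node=4)
print(desired_nodes(policy, pending_tasks=30, running_tasks=8))  # busy: scale up
print(desired_nodes(policy, pending_tasks=0, running_tasks=3))   # idle: scale down
```

Scaling down when demand drops is where the lower TCO comes from: nodes that have no work are released instead of sitting idle.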
Qubole: High-level View
[Architecture diagram. Columns: User Access; Qubole Tier; Customer’s Azure Account.
Labels: Qubole UI via browser; SDK; ODBC; REST API (HTTPS); Ephemeral Web Tier; Web Servers; Default Hive Metastore; RDS–Qubole: User, Account Configurations (Encrypted credentials); Encrypted Result Cache; (Optional) Custom Hive Metastore; (Optional) Other RDS; Master; Encrypted HDFS Slave (two); Ephemeral Cluster, Managed by Qubole; Data Flow within Customer’s Cloud.]
Thank you!
Dash Desai
Technology Evangelist
@iamontheinet
Getting Started
Install BigDL on Qubole + Demo App: http://bit.ly/deep_learning_bigdl_qubole
BigDL: https://github.com/intel-analytics/BigDL
