DBG / Feb 20, 2018 / © 2018 IBM Corporation
Spark and AI

—

Fred Reiss



Nick Pentreath

Agenda
The Machine Learning Workflow
Overview of Spark Deep Learning Frameworks
Discussion of Future Directions
Perception
The Machine Learning Workflow
In reality, workflow spans teams …
The Machine Learning Workflow
… and tools
The Machine Learning Workflow
Major Frameworks
Deep Learning on Spark
• Deeplearning4J
• BigDL
• Deep Learning Pipelines
• TensorFlowOnSpark
• Microsoft Machine Learning on Spark (MMLSpark)
Deeplearning4J
Deep Learning on Spark
• Distributed GPU support for all major deep learning architectures
• CPU / Distributed CPU / Single GPU options exist
• Supports Convolutional Nets, LSTMs / RNNs, Feedforward Nets, Word2Vec
• Supported by Skymind
• Backed by its own linear algebra library – ND4J
• APIs in Scala, Java, Python
• Newer Scala API, Keras-like
• Keras import / export for Python API
• Production serving is through a proprietary layer
BigDL
Deep Learning on Spark
• Distributed CPU with Intel MKL
• No GPU support
• Most DL models – CNN, RNN
• Backed by Intel
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Uses internal Spark components for distributed training
Deep Learning Pipelines
Deep Learning on Spark
• Created by Databricks
• Focus on scoring models (TensorFlow / Keras) and basic transfer learning
• No support for training the DL model
• Focus on image data & use cases
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Support for scoring models as a SQL UDF
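The "basic transfer learning" pattern on this slide can be sketched without any framework: a frozen, pre-trained featurizer turns each input into a fixed feature vector, and only a small classifier head is trained on those vectors. The sketch below is plain Python, not the Deep Learning Pipelines API. `pretrained_featurizer` is a hypothetical stand-in for a real pre-trained network, and the "images" are toy data invented for illustration.

```python
import math

def pretrained_featurizer(pixels):
    """Hypothetical stand-in for a frozen pre-trained network.
    Maps a raw input (list of floats) to a fixed 2-d feature vector."""
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, epochs=500):
    """Train only a logistic-regression head; the featurizer is never updated."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy "images": bright ones (label 1) vs dark ones (label 0).
images = [[0.9, 0.8, 1.0], [0.7, 0.9, 0.8], [0.1, 0.2, 0.0], [0.2, 0.1, 0.3]]
labels = [1, 1, 0, 0]

# Featurize once with the frozen model, then fit the small head on top.
features = [pretrained_featurizer(img) for img in images]
w, b = train_head(features, labels)
predictions = [predict(w, b, f) for f in features]
```

Because the featurizer is fixed, featurization is an embarrassingly parallel map over the data, which is the part that fits naturally onto Spark. Only the small head needs training.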
TensorFlowOnSpark
Deep Learning on Spark
• Created by Yahoo
• Scale out TF on Spark clusters
• Use Spark executors to launch TF processes
• Supports distributed training through TF parameter servers
• RDMA / InfiniBand improvement to TF to speed up distributed training
• Good support for TensorBoard
• Good integration with Spark
• But only Python API
• Some support for Spark ML pipelines
• Relatively inactive recently
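The parameter-server idea mentioned above can be illustrated with a minimal, single-process sketch: workers pull the current parameters, compute gradients on their own data shard, and push them back to a server that applies the averaged update. This is not TensorFlow or TensorFlowOnSpark code. The `ParameterServer` class, the `worker_gradient` function and the toy data are all assumptions made for illustration, and the workers run sequentially here, whereas a real cluster would run them as separate processes.

```python
class ParameterServer:
    """Holds the shared model parameters and applies averaged gradients."""

    def __init__(self, n_params, lr):
        self.params = [0.0] * n_params
        self.lr = lr

    def pull(self):
        # Workers get a copy of the current parameters.
        return list(self.params)

    def push(self, gradients):
        # Synchronous update: average the gradients from all workers.
        n = len(gradients)
        for i in range(len(self.params)):
            avg = sum(g[i] for g in gradients) / n
            self.params[i] -= self.lr * avg


def worker_gradient(params, shard):
    """Gradient of mean squared error for y ~ w*x + b on one data shard."""
    w, b = params
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return [gw / len(shard), gb / len(shard)]


# Toy data from y = 2x + 1, split into two equal shards (one per worker).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(20)]
shards = [data[0::2], data[1::2]]

server = ParameterServer(n_params=2, lr=0.1)
for step in range(2000):
    current = server.pull()
    grads = [worker_gradient(current, shard) for shard in shards]
    server.push(grads)

w, b = server.params
```

With equal-sized shards, averaging the per-worker gradients is the same as computing the full-batch gradient, so the sketch behaves like ordinary gradient descent. The communication cost of the pull and push steps is what RDMA-style transports aim to reduce.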
MMLSpark
Deep Learning on Spark
• Created by Microsoft
• Supports training using CNTK, including distributed training
• Image, text data
• Good integration with Spark
• Scala, Python, R API
• Support for Spark ML pipelines
• Relatively active, seems quite well supported
Other Frameworks
Deep Learning on Spark
• H2O.ai / Deep Water
• Apache MXNet Spark integration
• TensorFrames
• CaffeOnSpark
• scalable-deep-learning on GitHub
• MLlib – MultilayerPerceptronClassifier only
• SparkNet (abandoned)
Questions
Future Directions in Spark + AI
• What are the real-world uses of deep learning for Spark users?
• Problems being solved?
• DL component relative to the rest of the system?
• What ETL and pre-processing is being used?
• Transfer learning / using pre-trained models?
• Full-scale training, distributed training?
• What are the integration pain points between Spark and deep learning framework X?
• Serialization issues
• Integrating disparate frameworks and systems
• Model deployment
• Deploying pipelines that span pre-processing, "traditional" ML and DL?
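To make the last two questions concrete, here is a minimal sketch of a pipeline that spans a pre-processing stage, a "traditional" ML stage and a DL-style scoring stage, and is serialized as a single unit for deployment. It is plain Python, not the Spark ML Pipeline API. The stage classes (`Scaler`, `LinearModel`, `TinyNet`), the `STAGE_TYPES` registry and the JSON layout are all hypothetical and chosen only for illustration.

```python
import json


class Scaler:
    """Pre-processing stage: multiply every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [[v * self.factor for v in row] for row in rows]


class LinearModel:
    """'Traditional' ML stage: append a weighted sum as an extra feature."""

    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        return [row + [sum(w * v for w, v in zip(self.weights, row))] for row in rows]


class TinyNet:
    """Hypothetical stand-in for a DL scoring stage: ReLU over the last feature."""

    def __init__(self, bias):
        self.bias = bias

    def transform(self, rows):
        return [max(0.0, row[-1] + self.bias) for row in rows]


# Registry used to rebuild stages from their serialized type names.
STAGE_TYPES = {cls.__name__: cls for cls in (Scaler, LinearModel, TinyNet)}


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

    def to_json(self):
        # Serialize each stage as its type name plus constructor parameters.
        return json.dumps(
            [{"type": type(s).__name__, "params": vars(s)} for s in self.stages]
        )

    @classmethod
    def from_json(cls, text):
        return cls([STAGE_TYPES[d["type"]](**d["params"]) for d in json.loads(text)])


pipeline = Pipeline([Scaler(0.5), LinearModel([1.0, 2.0]), TinyNet(-1.0)])
rows = [[2.0, 4.0], [1.0, 0.0]]
scores = pipeline.transform(rows)

# Round-trip through the serialized form, as a deployment step would.
restored = Pipeline.from_json(pipeline.to_json())
restored_scores = restored.transform(rows)
```

The pain point raised on the slide shows up as soon as one stage wraps a model from another framework: its state no longer fits a simple parameter dictionary, and the whole pipeline can no longer be saved and loaded through one mechanism.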
Thank you!
Nick Pentreath
Principal Engineer
—
nickp@za.ibm.com
@MLnick
ibm.com