@hexadata & @mmalohlava 
present
Sparkling Water 
“Killer App for Spark”
Spark and H2O 
Several months ago…
Sparkling Water 
Before 
Tachyon based 
Unnecessary data duplication 
Now 
Pure H2ORDD 
Transparent use of H2O data and algorithms with 
Spark API
Sparkling Water
RDD (immutable world) + DataFrame (mutable world)
Sparkling Water
[Diagram: RDD and DataFrame]
Sparkling Water Design
[Diagram: Sparkling App (jar file), spark-submit, Spark Master JVM, three Spark Worker JVMs; Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O]
Data Distribution
[Diagram: Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O; Data Source (e.g. HDFS); H2O RDD; Spark RDD]
Hands-on Time
Example 
Load and parse CSV data
Use Spark API, do SQL query 
Create Deep Learning model 
Use model for prediction
Requirements 
Linux or Mac OS X 
Oracle Java 1.7 
Virtual image is provided for Windows users
Download 
http://0xdata.com/download/
Install and Launch 
Unpack zip file 
or 
Open provided virtual image in VirtualBox 
and 
Launch h2o-examples/sparkling-shell
What is Sparkling Shell? 
Standard spark-shell 
Launch H2O extension 
# Spark Master address
export MASTER="local-cluster[3,2,1024]"

# --jars: JAR containing H2O code
# --conf: name of H2O extension provided by JAR
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
…more on launching… 
‣ By default, a single multi-threaded JVM (export MASTER="local[*]"), or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or
‣ Launch a standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
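The three numbers in a local-cluster master string are the worker count, the cores per worker, and the memory per worker in MB. A minimal sketch of decoding that string follows; LocalCluster and parseLocalCluster are hypothetical names used only for illustration and are not part of Spark or Sparkling Water.

```scala
// Hypothetical holder for the three fields of "local-cluster[workers,cores,memoryMb]".
final case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

object MasterString {
  // Matches the whole string, allowing optional spaces around the numbers.
  private val Pattern =
    """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // Returns None for any other master form, such as local[*] or spark://host:port.
  def parseLocalCluster(master: String): Option[LocalCluster] = master match {
    case Pattern(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

With the value used on these slides, MasterString.parseLocalCluster("local-cluster[3,2,1024]") describes 3 workers with 2 cores and 1024 MB each, which is why the H2O client later waits for a cloud of size 3.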
Let's play with Sparkling Shell…
Create H2O Client 
import water.{H2O,H2OClientApp} 
H2OClientApp.start() 
H2O.waitForCloudSize(3, 10000)
Is Spark Running? 
http://localhost:4040
Is H2O running? 
http://localhost:54321/steam/index.html
Data 
Load some data and parse it
import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves parsing the data
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/steam/index.html
Use Spark API 
import org.apache.spark.rdd.RDD
// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._
// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count
// And use Spark RDD API directly: keep only flights to Bay Area airports
val flightsOnlyToSF = airlinesTable.filter(
  f =>
    f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK") )
flightsOnlyToSF.count
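The filter above compares an optional column against Some(...) values, so rows with a missing destination never match. The same predicate can be sketched on a plain Scala collection; Flight here is a hypothetical stand-in for the Airlines row type and models only the one column used.

```scala
// Hypothetical stand-in for the Airlines row type; only Dest is modelled.
final case class Flight(Dest: Option[String])

object BayAreaFilter {
  val bayArea: Set[String] = Set("SFO", "SJC", "OAK")

  // A missing destination (None) never matches, mirroring the == Some(...) tests.
  def toBayArea(f: Flight): Boolean = f.Dest.exists(bayArea.contains)

  def main(args: Array[String]): Unit = {
    val flights = Seq(
      Flight(Some("SFO")), Flight(Some("JFK")), Flight(None), Flight(Some("OAK")))
    println(flights.count(toBayArea)) // prints 2
  }
}
```

Because the predicate is an ordinary function, it could be passed to an RDD's filter in the same way it is passed to count on a Seq here.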
Use Spark SQL 
import org.apache.spark.sql.SQLContext 
// We need to create SQL context 
val sqlContext = new SQLContext(sc) 
import sqlContext._ 
airlinesTable.registerTempTable("airlinesTable")
val query =
  "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke query 
val result = sql(query) // Using a registered context and tables 
result.count 
assert(result.count == flightsOnlyToSF.count)
Launch H2O Algorithms 
import hex.deeplearning._ 
import hex.deeplearning.DeepLearningModel.DeepLearningParameters 
// Setup deep learning parameters 
val dlParams = new DeepLearningParameters() 
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
                                  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
                                  'FlightNum, 'TailNum, 'CRSElapsedTime,
                                  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name
// Create a new model builder 
val dl = new DeepLearning(dlParams) 
val dlModel = dl.train.get
Make a prediction 
// Use model to score data 
val prediction = dlModel.score(result)('predict)

// Collect predicted values via RDD API; missing predictions become NaN
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse(Double.NaN) )
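The last step replaces a missing prediction with NaN. A minimal sketch of that conversion on a plain collection follows; Prediction is a hypothetical stand-in for DoubleHolder, assumed here to carry an Option[Double] in a field named result.

```scala
// Hypothetical stand-in for DoubleHolder: one optional numeric result per row.
final case class Prediction(result: Option[Double])

object CollectPredictions {
  // Unwrap each optional value, substituting NaN where the value is missing.
  def values(rows: Seq[Prediction]): Seq[Double] =
    rows.map(_.result.getOrElse(Double.NaN))
}
```

Note that NaN never compares equal to itself, so missing values in the output should be detected with isNaN rather than with ==.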
What is under the hood?
Spark App Extension 
/** Notion of Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start extension */
  def start(conf: SparkConf): Unit
  /** Method to stop extension */
  def stop(conf: SparkConf): Unit
  /* Point in Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /* User-friendly description of extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}
/** Supported interception points.
  *
  * Currently only Executor life cycle is supported. */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC /* Inject into executor lifecycle */
    = Value
}
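To make the start/stop contract concrete, here is a simplified sketch that compiles without Spark on the classpath. SimpleExtension, RecordingExtension and ExtensionDemo are hypothetical names, and a plain Map stands in for SparkConf; this is not how Spark or H2O wires the extension in, only an illustration of the lifecycle.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the trait on the slide: a Map replaces SparkConf.
trait SimpleExtension extends Serializable {
  def start(conf: Map[String, String]): Unit
  def stop(conf: Map[String, String]): Unit
  def desc: String
}

// Hypothetical extension that only records its lifecycle events.
class RecordingExtension extends SimpleExtension {
  val events: ArrayBuffer[String] = ArrayBuffer.empty[String]

  def start(conf: Map[String, String]): Unit = {
    events += s"start size=${conf.getOrElse("spark.h2o.cluster.size", "?")}"
  }
  def stop(conf: Map[String, String]): Unit = {
    events += "stop"
  }
  def desc: String = "recording extension"
}

object ExtensionDemo {
  // Start the extension, run the work, and stop the extension even if the work throws.
  def runWith[A](ext: SimpleExtension, conf: Map[String, String])(work: => A): A = {
    ext.start(conf)
    try work finally ext.stop(conf)
  }
}
```

Wrapping the work in try/finally is a design choice of this sketch: it guarantees stop is called once for every start, which is the behaviour one would expect from an executor life cycle hook.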
Using App Extensions 
val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Setup expected size of H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add H2O extension
conf.addExtension[H2OPlatformExtension]

// Create Spark Context from the configuration
val sc = new SparkContext(conf)
Spark Changes 
We keep them small (~30 lines of code) 
JIRA SPARK-3270 - Platform App Extensions 
https://issues.apache.org/jira/browse/SPARK-3270
You can participate! 
Epic PUBDEV-21 aka Sparkling Water
PUBDEV-23 Test HDFS reader 
PUBDEV-26 Implement toSchemaRDD 
PUBDEV-27 Boolean transfers 
PUBDEV-31 Support toRDD[ X : Numeric] 
PUBDEV-32/33 Mesos/YARN support
More info 
Check out the 0xdata blog for tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
Thank you! 
Learn more about H2O at 
0xdata.com 
or 
for r in h2o-dev perrier; do
  git clone "git@github.com:0xdata/$r.git"
done
Follow us at @hexadata

2014 09 30_sparkling_water_hands_on
