Avkash Chauhan (avkash@h2o.ai)
VP, Enterprise Customers
Agenda
• H2O Intro
• Installation
• Using H2O from FLOW, R & Python
• Data munging in H2O with Python
• 2 examples of machine learning problems
o GBM, GLM, DRF
o Understanding models and improvements
• Machine learning production pipeline
H2O.ai is a Visionary in the Gartner Magic Quadrant for Data Science Platforms
Introduction & Overview
H2O.ai Company Overview
Founded: 2011. Venture-backed, debuted in 2012
Products:
• H2O Open Source In-Memory AI Prediction Engine
• Sparkling Water
• STEAM
• DEEP WATER
Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products
Team: 60+ employees worldwide
• CA, NY, UT, Japan, UK
• Distributed Systems Engineers doing Machine Learning
• World-class visualization designers
Headquarters: Mountain View, CA
Please visit: http://www.h2o.ai/customers/#
H2O Users List: http://www.h2o.ai/user-list/
Customers and Use Cases
Financial, Insurance, Marketing, Telecom, Healthcare
Open Source Users & Community
H2O Users List: http://www.h2o.ai/user-list/
About Myself
• VP – Enterprise products & customers
– Handling paid enterprise customers' requirements
– Building product(s)
– Helping community
• LinkedIn: https://www.linkedin.com/in/avkashchauhan/
• Blog: https://aichamp.wordpress.com/
• Community: https://community.h2o.ai/index.html
• Twitter: @avkashchauhan
Products and Features
H2O Platform(s)
• In-Memory, Distributed Machine Learning Algorithms with H2O Flow GUI
• H2O AI Open Source Engine Integration with Spark
• DEEP WATER
Key features
• Open Source (Apache 2.0)
• All supported ML algorithms are coded by our engineers
• Designed for speed, scalability and for super large data-sets
• Same distribution for open source community & enterprise
• Very active development, with a release every other week
• Vibrant open source community
o https://community.h2o.ai
• Enterprise Support portal
o https://support.h2o.ai
• We have 70,000 users, 8,000 organizations and growing daily
Usage: Simple Solution
o Single Deployable compiled Java code (jar)
o Ready to use point and click FLOW Interface
o Connection from R and Python after specific packages are installed
o Use Java, Scala natively and any other language through RESTful API
o Deployable models - Binary & Java (POJO & MOJO)
o One click prediction/scoring engine
Usage: Complex Solution
o Multi-node Deployment
o Spark and Hadoop distributed environment
• Sparkling Water (Spark + H2O)
o Data ingested from various inputs
• S3, HDFS, NFS, JDBC, Object store etc.
• Streaming support in Spark (through Sparkling Water)
o Distributed machine learning for every algorithm in the platform
o Prediction service deployment on several machines
Current Algorithm Overview
Statistical Analysis
• Linear Models (GLM)
• Naïve Bayes
Ensembles
• Random Forest
• Distributed Trees
• Gradient Boosting Machine
• R Package - Stacking / Super Learner
Deep Neural Networks
• Multi-layer Feed-Forward Neural Network
• Auto-encoder
• Anomaly Detection
Clustering
• K-Means
Dimension Reduction
• Principal Component Analysis
• Generalized Low Rank Models
Solvers & Optimization
• Generalized ADMM Solver
• L-BFGS (Quasi Newton Method)
• Ordinary Least-Square Solver
• Stochastic Gradient Descent
Data Munging
• Scalable Data Frames
• Sort, Slice, Log Transform
• Data.table (1B rows groupBy record)
Technical Architecture
Core H2O: Architecture
[Architecture diagram. Labels recovered from the figure:]
• Data sources: SQL, NFS, Local, S3, HDFS
• Workflow: Parse, Exploratory Analysis, Feature Engineering, ML Algorithms, Model Evaluation, Scoring, Data/Model Export
• Distributed In-Memory Processing: Fluid Vector Frame, Job, MRTask, Distributed K/V Store, Distributed Fork/Join, Non-Blocking Hash Table
• Interface: REST / JSON
• Output: POJO, Production Environments
Sparkling Water - High Level Architecture
Deep Water: Architecture
[Architecture diagram. Labels recovered from the figure:]
• Client: R/Py/Flow/Scala client, REST API, Web server
• Each node (Node 1 ... Node N): Scala, Spark, H2O, Java, Execution Engine
• Native back ends: TensorFlow/mxnet/Caffe, C++, GPU, CPU
• Communication: RPC, grpc/MPI/RDMA
H2O Installation
What we covered with H2O Installation
• Installation, using H2O with R & Python
o Installing H2O
• Help – help(h2o.init), h2o.cluster_status()
o H2O GitHub repo
• A glance at the source code
o R package installation
o Python Package Installation
o Connecting H2O from R
o Connecting H2O from Python
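The Python side of the steps above can be sketched as follows. This is a minimal illustration, assuming the `h2o` package is installed (for example with `pip install h2o`) and a Java runtime is available. The helper names `cluster_url` and `connect` are hypothetical, and port 54321 is taken from the `curl` commands used in the deployment slides of this deck.

```python
def cluster_url(host="localhost", port=54321):
    """Build the base URL of an H2O node (hypothetical helper)."""
    return "http://{}:{}".format(host, port)


def connect(host="localhost", port=54321):
    """Start a local H2O cluster, or attach to one that is already running.

    The import lives inside the function so that this file can be loaded
    even on a machine where the h2o package is not installed.
    """
    import h2o  # assumption: installed beforehand, e.g. `pip install h2o`

    h2o.init(ip=host, port=port)  # starts or connects to the cluster
    return h2o


# Typical interactive use (requires h2o and Java):
#   h2o = connect()
#   h2o.ls()   # list the keys (frames, models) held by the cluster
print(cluster_url())  # address to open in a browser for FLOW
```

Opening the printed address in a browser should show the FLOW interface once the cluster is up.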
H2O FLOW DEMO
What we covered in FLOW DEMO
• FLOW Intro
• Running Examples
• Generating Data
• Working with UI, Cell, Running FLOW Script
• Importing Data
o Chunk Distribution
o Feature analysis
• Building models from imported data
• Understanding models
o Binary Model, POJO, MOJO
• Listing all Jobs
• Using HELP
• Understanding RESTful Interface
• Reading Logs, Water Meter (CPU analysis), Stack Trace etc.
Data manipulation between H2O, R & Python
Data manipulation between H2O, R, python
• Import data between H2O, Python, R and others
H2O and R, Python, Java, Scala, etc.: 360° view
Data manipulation between H2O, R, python
• Import data in Python
o import pandas as pd
• datasetpath = "/Users/avkashchauhan/tools/datasets/kaggle-imdb-word2vec"
• train_data = pd.read_csv(datasetpath + "/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
o h2o.ls()
o train_data.describe()
o train_data_h2o = h2o.H2OFrame(train_data, destination_frame = "train_data_h2o")
o h2o.ls()
o train_data_h2o.describe()
o Now look at the same frame in FLOW: getFrames
• In R
• > h2o.ls()
• > rdf = h2o.getFrame("train_data_h2o")
• > rdf
• > summary(rdf)
Part 1
Reference: http://localhost:8888/notebooks/H2O-start-test-and-data-switch-python-r.ipynb
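The two options in the `read_csv` call are easy to get wrong, so here is a small standard-library sketch of what they mean: the delimiter is a tab character (`"\t"`), and `quoting=3` is the integer value of `csv.QUOTE_NONE`, which keeps quote characters as ordinary data. The sample row and its column layout (id, sentiment, review) are made up for illustration, and `read_tsv` and `to_h2o` are hypothetical helper names.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, one per row.

    csv.QUOTE_NONE has the integer value 3, which is what `quoting=3`
    selects: quote characters are treated as data, not as field wrappers.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def to_h2o(rows, name):
    """Upload parsed rows to the H2O cluster under the key `name`."""
    import h2o  # assumption: h2o is installed and h2o.init() was called

    # H2OFrame accepts a dict that maps column names to lists of values.
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return h2o.H2OFrame(columns, destination_frame=name)


# Made-up sample row; the real file is read from disk in the slide above.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"A "great" movie"\n'
rows = read_tsv(sample)
print(rows[0]["review"])  # inner quotes survive because quoting is disabled
```

With quoting left at its default, the unbalanced quotes inside review text can make the parser merge or split fields, which is presumably why the workshop disables it.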
Data manipulation between H2O, R, python
• Import data in R
o > iris
o > mydf = as.h2o(iris)
• The frame is imported under its original name, iris
o > summary(mydf)
o > summary(iris)
o > h2o.ls()
• You will see the iris entry in the H2O frames list
• Check FLOW as well and you will see iris there too
o > mydf = as.h2o(iris, destination_frame = "mydf")
o > h2o.ls()
• In Python
• h2o.ls()
• my_python_df = h2o.get_frame("mydf")
• my_python_df
• h2o.ls()
Part 2
Data munging in H2O with Python
What we covered: Data munging in H2O with Python
• H2O and Python
• Jupyter notebook Demo
• Data import
• Row, column, data frames, slicing, binding, exporting, factoring
• Using functions
Reference:
[1] http://localhost:8888/notebooks/H2O%20Data%20Ingest%20Demo.ipynb
[2] http://localhost:8888/notebooks/H2O%20Frame%20manipulation.ipynb
Price Prediction using GLM, GBM, DRF
Problem Description
• Kaggle
o https://www.kaggle.com/harlfoxem/housesalesprediction
• Local Datasets
o /Users/avkashchauhan/learn/seattle-workshop/kc_house_data.csv
o /Users/avkashchauhan/learn/seattle-workshop/kc_house_orig.csv
• Documentation
o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
What we covered
• Data Import
• Understood data, frame distribution, chunk, compression etc.
• Understood all features, through histograms, enums, etc.
• Split data frames for train and test
• Converting features to Factors/Enum
• Imputation of values
• Training
o with Training frame only
o With cross-validation
o with validation frame
• Understanding Model details
• Documentation
o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
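To make the preparation steps above concrete, here is a plain-Python sketch of three of them: a random train/test split, mean imputation, and a simple fold assignment for cross-validation. It illustrates the ideas only and is not how H2O implements them; all function names are hypothetical.

```python
import random


def split_rows(rows, ratio=0.8, seed=42):
    """Send each row to train with probability `ratio`, otherwise to test.

    The resulting sizes are therefore approximate, not exact.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def kfold_assignment(n_rows, nfolds):
    """Assign row i to fold i mod nfolds (a simple modulo scheme)."""
    return [i % nfolds for i in range(n_rows)]


train, test = split_rows(list(range(100)))
print(len(train), len(test))            # roughly an 80/20 split
print(impute_mean([1.0, None, 3.0]))    # the gap is filled with the mean
print(kfold_assignment(5, 2))           # fold index for each of 5 rows
```

In cross-validation each fold is held out once while the model trains on the remaining folds, so every row is used for validation exactly one time.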
What we covered
• Response Column - Price
• buildModel 'glm', {"model_id":"glm-2fcdf58a-da59-4b3f-8a68-1177bce9531c","training_frame":"kc90_house_data.hex","nfolds":"10","seed":-1,"response_column":"price","ignored_columns":["id"],"ignore_const_cols":true,"family":"gaussian","solver":"AUTO","alpha":[],"lambda":[],"lambda_search":false,"standardize":true,"non_negative":false,"fold_assignment":"AUTO","score_each_iteration":false,"compute_p_values":false,"remove_collinear_columns":false,"max_iterations":-1,"link":"family_default","max_runtime_secs":0,"keep_cross_validation_predictions":false,"keep_cross_validation_fold_assignment":false,"missing_values_handling":"MeanImputation","intercept":true,"objective_epsilon":-1,"beta_epsilon":0.0001,"gradient_epsilon":-1,"prior":-1,"max_active_predictors":-1}
o r2: 0.017224
• buildModel 'glm', {"model_id":"glm-2fcdf58a-da59-4b3f-8a68-1177bce9531c","training_frame":"kc90_house_data.hex","nfolds":"10","seed":-1,"response_column":"price","ignored_columns":["id"],"ignore_const_cols":true,"family":"gaussian","solver":"AUTO","alpha":[0.001],"lambda":[0.1],"lambda_search":false,"standardize":false,"non_negative":false,"fold_assignment":"AUTO","score_each_iteration":false,"compute_p_values":false,"remove_collinear_columns":false,"max_iterations":-1,"link":"family_default","max_runtime_secs":0,"keep_cross_validation_predictions":false,"keep_cross_validation_fold_assignment":false,"missing_values_handling":"MeanImputation","intercept":true,"objective_epsilon":-1,"beta_epsilon":0.0001,"gradient_epsilon":-1,"prior":-1,"max_active_predictors":-1}
o r2: 0.619503
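The r2 values quoted for the two models are the coefficient of determination. As a reminder of what that number measures, here is a short sketch using the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The prices below are invented for illustration.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented house prices, for illustration only.
prices = [200000.0, 350000.0, 500000.0]
print(r2_score(prices, prices))            # perfect predictions
print(r2_score(prices, [350000.0] * 3))    # always predicting the mean
```

A model that always predicts the mean scores 0, so the first model's r2 of about 0.017 is barely better than that baseline, while 0.62 explains a good share of the price variance.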
GLM - Regularization
• L1 Lasso and L2 Ridge:
o Regularization is used to solve problems with overfitting in GLM.
o Penalties are introduced to avoid overfitting,
o To reduce variance of the prediction error
o To handle correlated predictors.
• H2O - Elastic Net
o Alpha (0 to 1) / Lambda (0 to 1, often a very small fraction, e.g. 0.0001)
o Alpha = 0 -> Ridge
o Alpha = 1 -> LASSO
o Lambda = 0.0 -> No regularization
• Lambda Search
o Enable lambda_search
o lambda_min_ratio
o nlambdas
o max_active_predictors
• Look for Intercept, and p-values in the docs too
• Documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html#regularization
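How alpha and lambda interact can be seen from the elastic net penalty itself. The sketch below uses the form given in the H2O GLM documentation, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared); the function name and the example coefficients are illustrative only.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Elastic net penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in beta)        # LASSO part
    l2_squared = sum(b * b for b in beta)  # ridge part
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


beta = [3.0, -4.0]  # example coefficients, excluding the intercept
print(elastic_net_penalty(beta, alpha=1.0, lam=0.1))  # pure LASSO
print(elastic_net_penalty(beta, alpha=0.0, lam=0.1))  # pure ridge
print(elastic_net_penalty(beta, alpha=0.5, lam=0.0))  # no regularization
```

Alpha only chooses the mix between the two penalties; lambda sets how strongly the mix is applied, which is why lambda = 0 switches regularization off whatever alpha is.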
Linear Regression - Prediction
• Import data to predict
• Feature Engineering
o Proper encoding for numerical data
o Hide – ID and other features
• Experimentation with Alpha and Lambda (alpha & lambda: r2)
o 0.001 & 0.1: 0.61 (did prediction)
o 0.0001 & 0.1: 0.61
o 0.0001 & 0.5: 0.58
o 0.001 & 0.01: 0.67 (did prediction)
o 0.004 & 0.01: 0.70 (did prediction)
• Understanding training and validation r2 values
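The same experiment can be scripted from Python rather than clicked through in FLOW. This is a hedged sketch, assuming the `h2o` package is installed, a cluster is running, and `train` is an H2OFrame with a `price` column; `glm_params` and `fit_glm` are hypothetical helpers, and the settings mirror the second `buildModel` command shown earlier (gaussian family, 10 folds, no standardization).

```python
# (alpha, lambda) pairs tried on the slide above.
TRIALS = [(0.001, 0.1), (0.0001, 0.1), (0.0001, 0.5),
          (0.001, 0.01), (0.004, 0.01)]


def glm_params(alpha, lam, nfolds=10):
    """Keyword arguments for H2OGeneralizedLinearEstimator."""
    return {"family": "gaussian", "alpha": alpha, "lambda_": lam,
            "nfolds": nfolds, "standardize": False}


def fit_glm(train, alpha, lam, response="price", ignored=("id",)):
    """Train one GLM and return its cross-validated r2."""
    # Imported here so the file loads without h2o installed.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    predictors = [c for c in train.columns
                  if c != response and c not in ignored]
    model = H2OGeneralizedLinearEstimator(**glm_params(alpha, lam))
    model.train(x=predictors, y=response, training_frame=train)
    return model.r2(xval=True)


# With a running cluster and a loaded frame `train`:
#   for alpha, lam in TRIALS:
#       print(alpha, lam, fit_glm(train, alpha, lam))
```

The numbers obtained this way will not necessarily match the slide, since they depend on the data split, the seed and the H2O version.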
Price Prediction - GBM
• Feature Engineering
o Proper encoding for numerical data
o Hide – ID and other features
- All default
- Training r2 = 0.95 & Validation r2 = 0.76 – Try prediction
- Learning rate 0.5, all else default: training r2 = 0.98, validation r2 = 0.71
- Learning rate 0.01, all else default: training r2 = 0.50, validation r2 = 0.40
- Setting stopping rounds
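The idea behind stopping rounds can be sketched in a few lines: training stops once a moving average of the validation metric no longer improves. The code below is a simplified illustration of that rule under stated assumptions (a lower-is-better metric and a relative tolerance); it is not H2O's exact implementation, and the function names are hypothetical.

```python
def moving_averages(history, k):
    """Simple moving averages of window length k over the score history."""
    return [sum(history[i - k + 1:i + 1]) / k
            for i in range(k - 1, len(history))]


def should_stop(history, stopping_rounds, tolerance=0.001):
    """Return True when the latest moving average has stopped improving.

    `history` holds a lower-is-better metric (for example validation MSE),
    one value per scoring event. The latest average is compared with the
    best of the `stopping_rounds` averages before it.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    averages = moving_averages(history, k)
    latest = averages[-1]
    reference = min(averages[-k - 1:-1])
    return latest > reference * (1.0 - tolerance)


print(should_stop([10, 9, 8, 7, 6, 5], stopping_rounds=2))  # still improving
print(should_stop([10, 9, 8, 8, 8, 8], stopping_rounds=2))  # plateau reached
```

Early stopping is what keeps a high learning rate such as 0.5 from adding trees that only fit noise, which is the gap visible between the training and validation r2 values above.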
Understanding GLM, GBM, DRF
- Validation and Cross Validation
- Scoring History, Validation History (Validation Frame/CV)
- Training and Validation Metrics
- Using stopping metrics in tree-based algorithms – DRF/GBM
- How increasing tree depth changes the results
- Variable Importance
- Gains and Lift Chart
- Confusion Matrix (TPR/FPR)
- Solvers
Note: Using the FLOW interface, build 3 models: GLM, GBM, DRF
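For the confusion matrix item, the true positive rate (TPR) and false positive rate (FPR) follow directly from the four cell counts. A minimal sketch with invented labels and scores; the function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def rates(actual, scores, threshold):
    """TPR and FPR after turning scores into labels at `threshold`."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn), fp / (fp + tn)


actual = [1, 1, 0, 0]             # invented labels
scores = [0.9, 0.4, 0.6, 0.1]     # invented model scores
print(rates(actual, scores, 0.5))
print(rates(actual, scores, 0.3))
```

Lowering the threshold can only raise or keep both rates, and sweeping it from 1 to 0 traces out the ROC curve.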
Improving Overall Results
- Feature Engineering
- Adding proper categories
- Year, Waterfront, view, condition, grade, zipcode – Factors
- How r2 values help better prediction
- GBM Improvements with CV
- Learning rate 0.3, all else default: r2 = 0.795 (perform prediction)
- ntrees = 60, max depth = 5, learning rate = 0.21, row sample rate = 0.8, column sample rate = 0.8: r2 = 0.80
- GBM with Validation frame
- ntrees = 60, max depth = 5, learning rate = 0.21, row sample rate = 0.8, column sample rate = 0.8: r2 = 0.82
- DRF with CV and default settings: r2 = 0.80
- Finally, ignore the Date column
- DRF: r2 = 0.86
- GBM: r2 = 0.86
- With updated feature engineering
- GLM: alpha = 0.004, lambda = 0.01: r2 = 0.77
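The tuned GBM settings above map onto the H2O Python estimator roughly as follows. This is a sketch under assumptions: `h2o` is installed, a cluster is running, and `train` (and optionally `valid`) are H2OFrames with a `price` column. `select_predictors` and `fit_gbm` are hypothetical helpers, and the column names passed as `factor_columns` must match the actual dataset, which the slide does not spell out.

```python
# Settings from the slide: ntrees, depth, learning rate, row/column sampling.
GBM_PARAMS = {"ntrees": 60, "max_depth": 5, "learn_rate": 0.21,
              "sample_rate": 0.8, "col_sample_rate": 0.8}


def select_predictors(columns, response, ignored):
    """All columns except the response and the ignored ones."""
    return [c for c in columns if c != response and c not in ignored]


def fit_gbm(train, valid=None, response="price",
            ignored=("id", "date"), factor_columns=()):
    """Convert the chosen columns to factors, then train a GBM."""
    # Imported here so the file loads without h2o installed.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    for name in factor_columns:
        train[name] = train[name].asfactor()
        if valid is not None:
            valid[name] = valid[name].asfactor()

    model = H2OGradientBoostingEstimator(**GBM_PARAMS)
    model.train(x=select_predictors(train.columns, response, ignored),
                y=response, training_frame=train, validation_frame=valid)
    return model


# With a running cluster (column names are assumptions about the dataset):
#   model = fit_gbm(train, valid, factor_columns=["waterfront", "zipcode"])
#   print(model.r2(valid=True))
```

Treating columns such as zipcode as factors stops the trees from reading them as ordered numbers, which is the feature engineering step the slide credits for part of the improvement.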
R - Demo
- Start R
- > h2o.init()
- > h2o.clusterStatus()
- ?h2o.gbm > Use the sample
- > h2o.varimp(gbm)
- > h2o.varimp_plot(gbm)
- Performance
- > perf <- h2o.performance(gbm, australia.hex)
- > h2o.mse(perf)
- > h2o.r2(perf)
- Add CV – Rebuild the model with nfolds = 5
- > h2o.cross_validation_models(gbm)
- Kmeans - ?h2o.kmeans
- Run the example
- Want to estimate k?
- kmodel = h2o.kmeans(training_frame = prostate.hex, k = 10, x = c("AGE", "RACE", "VOL", "GLEASON"))
- kmodel = h2o.kmeans(training_frame = prostate.hex, k = 100, x = c("AGE", "RACE", "VOL", "GLEASON"), estimate_k = T)
Supplemental Information
- https://www.kaggle.com/harlfoxem/d/harlfoxem/housesalesprediction/house-price-prediction-part-1/discussion
- https://www.kaggle.com/auygur/d/harlfoxem/housesalesprediction/step-by-step-house-price-prediction-r-2-0-77/code
- https://rpubs.com/MagicSea/property_price
Solving a Binomial Classification problem
Binomial Classification Problem Description
• Titanic Survivors list
o Of 2224 people on board, 1317 were passengers; the others were crew
• Kaggle
o https://www.kaggle.com/c/titanic
• Download Dataset:
o http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
• Local Datasets
o /Users/avkashchauhan/learn/seattle-workshop/titanic_list.csv
• Documentation
o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
Predicting Titanic Survival Rate
• Reference:
o Using scikit-learn
• http://localhost:8888/notebooks/Titanic%20Survival%20Demo%20in%20Python.ipynb
o Using H2O Python estimators
• http://localhost:8888/notebooks/Titanic%20Survival%20Demo%20in%20H2O%20and%20Python.ipynb
What we learned: ROC Curve
AUC of 0.5 is random and 1 is perfect
What we did
• GLM/GBM/RF in Python + scikit-learn
• GLM in H2O from Python
• Grid Search
• Working in FLOW
o Ingest, split frame
o GLM in H2O from FLOW
• Response must be numeric
– Understanding r2
o GBM in FLOW
• Response numeric: regression
– r2
• Response enum: classification
– AUC (Area under the curve)
– Confusion matrix
– ROC Curve – NEXT PAGE
• AUC
o AUC of 0.5 is random and 1 is perfect
o Improve AUC
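One way to see why an AUC of 0.5 is random and 1 is perfect: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison, which is fine for small examples; the labels and scores are invented.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]                       # invented labels
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))    # every positive outranks every negative
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))    # scores carry no information
```

Improving AUC therefore means improving the ranking of survivors above non-survivors, independent of any particular classification threshold.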
What we covered
• Data Import
• Understood data, frame distribution, chunk, compression etc.
• Understood all features, through histograms, enums, etc.
• Split data frames for train and test
• Converting features to Factors/Enum
• Imputation of values
Model Deployment
Supported Model Types in H2O
• Binary
• POJO
• MOJO
exportModel "word2vec-0dfb7bfd-7a5d-42a9-9ebf-82706304a4fe", "/Users/avkashchauhan/Downloads/word2vec-0dfb7bfd-7a5d-42a9-9ebf-82706304a4fe", overwrite: true
POJO Demo
• Run “GBM_Airlines_Classification” demo from FLOW examples
• Download POJO from FLOW
• Create a temp folder and work inside it
o Move gbm_pojo_test.java
• Get Model from RESTful API
– $ curl -X GET "http://localhost:54321/3/Models.java/gbm_pojo_test" >> gbm_pojo_test_1.java
o Get H2O gen-model
• curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
o Create main.java
o Compile
• $ javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m gbm_pojo_test.java main.java
• Verify with compiled class files
o Run
• $ java -cp .:h2o-genmodel.jar main
• See Results
• Docs: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/POJO_QuickStart.md
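The two `curl` downloads in the steps above can also be scripted. The sketch below builds the same REST URLs as the commands on this slide (`/3/Models.java/<model_id>` and `/3/h2o-genmodel.jar`) and fetches them with the standard library. The helper names are hypothetical, and the download only works while an H2O node holding the model is running at the given address.

```python
import os
from urllib.request import urlretrieve

BASE_URL = "http://localhost:54321"  # H2O node address used on the slide


def pojo_url(model_id, base=BASE_URL):
    """REST URL that returns the model as Java source (POJO)."""
    return "{}/3/Models.java/{}".format(base, model_id)


def genmodel_url(base=BASE_URL):
    """REST URL of the h2o-genmodel.jar needed to compile the POJO."""
    return "{}/3/h2o-genmodel.jar".format(base)


def download_pojo(model_id, folder=".", base=BASE_URL):
    """Save <model_id>.java and h2o-genmodel.jar into `folder`."""
    java_path = os.path.join(folder, model_id + ".java")
    jar_path = os.path.join(folder, "h2o-genmodel.jar")
    urlretrieve(pojo_url(model_id, base), java_path)
    urlretrieve(genmodel_url(base), jar_path)
    return java_path, jar_path


print(pojo_url("gbm_pojo_test"))
print(genmodel_url())
# With a running cluster: download_pojo("gbm_pojo_test", folder="tmp")
```

The file name must match the model id because the generated Java class is named after the model, which is why the compile step refers to gbm_pojo_test.java.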
MOJO Demo
• Run “GBM_Airlines_Classification” demo from FLOW examples
• Download MOJO from FLOW
• Create a temp folder and work inside it
o Move gbm_pojo_test.zip
• Get Model from RESTful API
– $ curl -X GET "http://localhost:54321/3/Models.java/gbm_pojo_test.zip" >> gbm_pojo_test.zip
o Get H2O gen-model
• curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
o Create main_mojo.java
o Compile
• $ javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m gbm_pojo_test.java main_mojo.java
– Make sure the main class name is main_mojo
• Verify with compiled class files
o Run
• $ java -cp .:h2o-genmodel.jar main_mojo
• See Results
• Docs: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/MOJO_QuickStart.md
Thank you so much

Applied Machine learning using H2O, python and R Workshop

  • 2. Agenda • H2O Intro • Installation • Using H2O from FLOW, R & Python • Data munging in H2O with Python • 2 examples of machine learning problems o GBM, GLM, DRF o Understanding Models, improvements, • Machine learning production pipeline H2O.ai is a Visionary in the Gartner Magic Quadrant for Data Science Platforms
  • 4. H2O.ai Company Overview Founded 2011 Venture-backed, debuted in 2012 Products • H2O Open Source In-Memory AI Prediction Engine • Sparkling Water • STEAM • DEEP WATER Mission Operationalize Data Science, and provide a platform for users to build beautiful data products Team 60+ employees worldwide • CA, NY, UT, Japan, UK • Distributed Systems Engineers doing Machine Learning • World-class visualization designers Headquarters Mountain View, CA
  • 6. Customers and Use Cases Financial Insurance MarketingTelecom Healthcare
  • 7. Open Source Users & Community H2O Users List: http://www.h2o.ai/user-list/
  • 8. About Myself • VP – Enterprise products & customers – Handling paid enterprise customer’s requirements – building product(s) – Helping community • LinkedIn: https://www.linkedin.com/in/avkashchauhan/ • Blog: https://aichamp.wordpress.com/ • Community: https://community.h2o.ai/index.html • Twitter: @avkashchauhan
  • 11. Key features • Open Source (Apache 2.0) • All supported ML algorithms are coded by our engineers • Designed for speed, scalability and for super large data-sets • Same distribution for open source community & enterprise • Very active production, every other week release • Vibrant open source community o https://community.h2o.ai • Enterprise Support portal o https://support.h2o.ai • We have 70,000 users, 8,000 organizations and growing daily
  • 12. Usage: Simple Solution o Single Deployable compiled Java code (jar) o Ready to use point and click FLOW Interface o Connection from R and Python after specific packages are installed o Use Java, Scala natively and any other language through RESTful API o Deployable models - Binary & Java (POJO & MOJO) o One click prediction/scoring engine
  • 13. Usage: Complex Solution o Multi-node Deployment o Spark and Hadoop distributed environment • Sparkling Water (Spark + H2O) o Data ingested from various inputs • S3, HDFS, NFS, JDBC, Object store etc. • Streaming support in Spark (through Sparking Water) o Distributed machine learning for every algorithm in platform o Prediction service deployment on several machines
  • 14. Current Algorithm Overview Statistical Analysis • Linear Models (GLM) • Naïve Bayes Ensembles • Random Forest • Distributed Trees • Gradient Boosting Machine • R Package - Stacking / Super Learner Deep Neural Networks • Multi-layer Feed-Forward Neural Network • Auto-encoder • Anomaly Detection Clustering • K-Means Dimension Reduction • Principal Component Analysis • Generalized Low Rank Models Solvers & Optimization • Generalized ADMM Solver • L-BFGS (Quasi Newton Method) • Ordinary Least-Square Solver • Stochastic Gradient Descent Data Munging • Scalable Data Frames • Sort, Slice, Log Transform • Data.table (1B rows groupBy record)
  • 16. H2O: Architecture (diagram) • Interface: REST / JSON • Workflow: Parse, Exploratory Analysis, Feature Engineering, ML Algorithms, Model Evaluation, Scoring, Data/Model Export • Distributed In-Memory Processing Core: Job, Fluid Vector Frame, MRTask, Distributed K/V Store, Distributed Fork/Join, Non-Blocking Hash Table • Data sources: SQL, NFS, Local, S3, HDFS • Output: POJO, Production Environments
  • 17. Sparkling Water - High Level Architecture
  • 18. Deep Water: Architecture (diagram) • Client: R/Py/Flow/Scala client, REST API, Web server • Each node (Node 1 ... Node N): Scala Spark, H2O Java Execution Engine, TensorFlow/mxnet/Caffe (C++) on GPU/CPU • Links shown in the diagram: RPC, grpc/MPI/RDMA
  • 20. What we covered with H2O Installation • Installation, using H2O with R & Python o Installing H2O • Help – help(h2o.init), h2o.cluster_status() o H2O GitHub repo • Source code - glance o R package installation o Python package installation o Connecting to H2O from R o Connecting to H2O from Python
  • 22. What we covered in FLOW DEMO • FLOW Intro • Running Examples • Generating Data • Working with UI, Cell, Running FLOW Script • Importing Data o Chunk Distribution o Feature analysis • Building models from imported data • Understanding models o Binary Model, POJO, MOJO • Listing all Jobs • Using HELP • Understanding RESTful Interface • Reading Logs, Water Meter (CPU analysis), Stack Trace etc.
  • 24. Data manipulation between H2O, R, python • Import data between H2O, python, R and others • 360° view: H2O and R, python, Java, Scala, etc.
  • 25. Data manipulation between H2O, R, python (Part 1) • Import data in python o import pandas as pd o import h2o • datasetpath = "/Users/avkashchauhan/tools/datasets/kaggle-imdb-word2vec" • train_data = pd.read_csv(datasetpath + "/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3) o h2o.ls() o train_data.describe() o train_data_h2o = h2o.H2OFrame(train_data, destination_frame = "train_data_h2o") o h2o.ls() o train_data_h2o.describe() o Now look at the same frame in FLOW: getFrames • In R • > h2o.ls() • > rdf = h2o.getFrame("train_data_h2o") • > rdf • > summary(rdf) • Reference: http://localhost:8888/notebooks/H2O-start-test-and-data-switch-python-r.ipynb
  • 26. Data manipulation between H2O, R, python (Part 2) • Import data in R o > iris o > mydf = as.h2o(iris) • Without destination_frame, the frame is imported under its original name, iris o > summary(mydf) o > summary(iris) o > h2o.ls() • The H2O frame list now includes an iris entry • Check FLOW as well and you will see iris there too o > mydf = as.h2o(iris, destination_frame = "mydf") o > h2o.ls() o In Python • h2o.ls() • my_python_df = h2o.get_frame("mydf") • my_python_df • h2o.ls()
  • 27. Data munging in H2O with Python
  • 28. What we covered: Data munging in H2O with python • H2O and Python • Jupyter notebook Demo • Data import • Row, column, data frames, slicing, binding, exporting, factoring • Using functions • Reference: [1] http://localhost:8888/notebooks/H2O%20Data%20Ingest%20Demo.ipynb [2] http://localhost:8888/notebooks/H2O%20Frame%20manipulation.ipynb
  • 30. Problem Description • Kaggle o https://www.kaggle.com/harlfoxem/housesalesprediction • Local Datasets o /Users/avkashchauhan/learn/seattle-workshop/kc_house_data.csv o /Users/avkashchauhan/learn/seattle-workshop/kc_house_orig.csv • Documentation o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
  • 31. What we covered • Data Import • Understood data, frame distribution, chunk, compression etc. • Understood all features, through histograms, enums, etc. • Split data frames for train and test • Converting features to Factors/Enum • Imputation of values • Training o with Training frame only o With cross-validation o with validation frame • Understanding Model details • Documentation o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
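The "split data frames for train and test" and "with cross-validation" steps above can be sketched in plain Python. This is only an illustration of the idea, not how H2O implements it internally: the function names `split_rows` and `kfold_indices`, the 75/25 ratio, and the seed are assumptions made for this example, and the fold rule shown is a simple modulo-style assignment.

```python
import random


def split_rows(n_rows, train_ratio=0.75, seed=42):
    """Shuffle row indices and cut them into a train part and a test part."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_ratio)
    return sorted(indices[:cut]), sorted(indices[cut:])


def kfold_indices(n_rows, nfolds=10):
    """Modulo-style fold assignment: row i goes to fold i % nfolds."""
    folds = [[] for _ in range(nfolds)]
    for i in range(n_rows):
        folds[i % nfolds].append(i)
    return folds


train, test = split_rows(100)
folds = kfold_indices(len(train), nfolds=10)
print(len(train), len(test), [len(f) for f in folds])
```

With cross-validation, each fold is held out once while the model trains on the remaining folds, so every training row is scored exactly once by a model that never saw it.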
  • 32. What we covered • Response Column - Price • buildModel 'glm', {"model_id":"glm-2fcdf58a-da59-4b3f-8a68-1177bce9531c","training_frame":"kc90_house_data.hex","nfolds":"10","seed":-1,"response_column":"price","ignored_columns":["id"],"ignore_const_cols":true,"family":"gaussian","solver":"AUTO","alpha":[],"lambda":[],"lambda_search":false,"standardize":true,"non_negative":false,"fold_assignment":"AUTO","score_each_iteration":false,"compute_p_values":false,"remove_collinear_columns":false,"max_iterations":-1,"link":"family_default","max_runtime_secs":0,"keep_cross_validation_predictions":false,"keep_cross_validation_fold_assignment":false,"missing_values_handling":"MeanImputation","intercept":true,"objective_epsilon":-1,"beta_epsilon":0.0001,"gradient_epsilon":-1,"prior":-1,"max_active_predictors":-1} o r2 = 0.017224 • buildModel 'glm', {"model_id":"glm-2fcdf58a-da59-4b3f-8a68-1177bce9531c","training_frame":"kc90_house_data.hex","nfolds":"10","seed":-1,"response_column":"price","ignored_columns":["id"],"ignore_const_cols":true,"family":"gaussian","solver":"AUTO","alpha":[0.001],"lambda":[0.1],"lambda_search":false,"standardize":false,"non_negative":false,"fold_assignment":"AUTO","score_each_iteration":false,"compute_p_values":false,"remove_collinear_columns":false,"max_iterations":-1,"link":"family_default","max_runtime_secs":0,"keep_cross_validation_predictions":false,"keep_cross_validation_fold_assignment":false,"missing_values_handling":"MeanImputation","intercept":true,"objective_epsilon":-1,"beta_epsilon":0.0001,"gradient_epsilon":-1,"prior":-1,"max_active_predictors":-1} o r2 = 0.619503
  • 33. GLM - Regularization • L1 (Lasso) and L2 (Ridge): o Regularization is used to solve overfitting problems in GLM o Penalties are introduced to avoid overfitting, o to reduce the variance of the prediction error, o and to handle correlated predictors • H2O - Elastic Net o Alpha (0 – 1) / Lambda (0 – 1, down to very small fractions, e.g. 0.0001) o Alpha = 0 -> Ridge o Alpha = 1 -> LASSO o Lambda = 0.0 -> No regularization • Lambda Search o Enable lambda_search o lambda_min_ratio o nlambdas o max_active_predictors • Look for intercept and p-values in the docs too • Documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html#regularization
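To make the roles of alpha and lambda concrete, here is a minimal sketch of the elastic net penalty in the form lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared). The function name `elastic_net_penalty` and the coefficient values are made up for illustration; this computes only the penalty term, not a fitted GLM.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Penalty added to the GLM objective for a given coefficient vector.

    alpha mixes the two norms (1 = pure L1/LASSO, 0 = pure L2/Ridge);
    lam scales the whole penalty (0 = no regularization).
    """
    l1 = sum(abs(b) for b in coefs)        # L1 norm
    l2_squared = sum(b * b for b in coefs)  # squared L2 norm
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


betas = [0.5, -2.0, 0.0, 1.5]  # illustrative coefficients, not from a real model
print("LASSO  (alpha=1):", elastic_net_penalty(betas, alpha=1.0, lam=0.1))
print("Ridge  (alpha=0):", elastic_net_penalty(betas, alpha=0.0, lam=0.1))
print("No reg (lam=0):  ", elastic_net_penalty(betas, alpha=0.5, lam=0.0))
```

Because the L1 term grows linearly with each coefficient, it tends to push small coefficients to exactly zero, while the L2 term shrinks correlated coefficients together.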
  • 34. Linear Regression - Prediction • Import data to predict • Feature Engineering o Proper encoding for numerical data o Hide (ignore) ID and other features • Experimentation with Alpha and Lambda (alpha & lambda -> r2) o 0.001 & 0.1 -> 0.61 (did prediction) o 0.0001 & 0.1 -> 0.61 o 0.0001 & 0.5 -> 0.58 o 0.001 & 0.01 -> 0.67 (did prediction) o 0.004 & 0.01 -> 0.70 (did prediction) • Understanding training and validation r2 values
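The r2 values compared above follow the standard definition, 1 minus the ratio of residual to total sum of squares. A minimal sketch, with invented prices and a hypothetical helper name `r_squared`:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative house prices (in thousands); not taken from the workshop dataset.
prices = [200.0, 300.0, 400.0, 500.0]
print(r_squared(prices, prices))             # perfect predictions
print(r_squared(prices, [350.0] * 4))        # always predicting the mean
print(r_squared(prices, [220.0, 280.0, 410.0, 470.0]))
```

A model that always predicts the mean scores 0, so an r2 near 0.02 (as in the first GLM run) means the model explains almost none of the price variance.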
  • 35. Price Prediction - GBM • Feature Engineering o Proper encoding for numerical data o Hide (ignore) ID and other features • Experiments (training r2 / validation r2) o All default: training r2 = 0.95, validation r2 = 0.76 (try prediction) o Learning rate 0.5, all else default: 0.98/0.71 o Learning rate 0.01, all else default: 0.50/0.40 o Setting stopping rounds
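The "stopping rounds" setting can be pictured as a rule that halts tree building once the validation metric stops improving. The sketch below is a simplified stand-in, assuming a lower-is-better error and comparing the best recent value against the best earlier value; H2O's own rule is based on a moving average of the stopping metric, so treat the function names and the exact comparison here as assumptions.

```python
def should_stop(history, stopping_rounds=3, tolerance=0.001):
    """Return True if the last `stopping_rounds` scores show no real improvement.

    history: validation errors, one per scoring round (lower is better).
    tolerance: minimum relative improvement that still counts as progress.
    """
    if len(history) <= stopping_rounds:
        return False  # not enough rounds to compare yet
    best_before = min(history[:-stopping_rounds])
    best_recent = min(history[-stopping_rounds:])
    return best_recent > best_before * (1.0 - tolerance)


def first_stopping_round(errors, stopping_rounds=3, tolerance=0.001):
    """Number of scoring rounds after which training would stop, or None."""
    for n in range(1, len(errors) + 1):
        if should_stop(errors[:n], stopping_rounds, tolerance):
            return n
    return None


plateau = [10.0, 8.0, 7.0, 6.9999, 7.0, 7.01]   # made-up validation errors
improving = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0]
print(first_stopping_round(plateau))
print(first_stopping_round(improving))
```

Early stopping is one way to tame the gap seen above between training and validation r2 at high learning rates, since it stops adding trees once validation error flattens.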
  • 36. Understanding GLM, GBM, DRF - Validation and Cross Validation - Scoring History, Validation History (Validation Frame/CV) - Training and Validation Metrics - Using stopping metrics in tree-based algorithms (DRF/GBM) - How increasing tree depth changes the results - Variable Importance - Gains and Lift Chart - Confusion Matrix (TPR/FPR) - Solvers Note: Three models (GLM, GBM, DRF) were built using the FLOW interface
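The confusion matrix and its TPR/FPR rates listed above can be computed by hand for a binary classifier. A minimal sketch with made-up labels and predicted probabilities; the helper names and the 0.5 threshold are assumptions for this example:

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def tpr_fpr(tp, fp, tn, fn):
    """True positive rate (recall) and false positive rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


labels = [1, 1, 1, 0, 0, 0, 0, 1]                     # illustrative data
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
counts = confusion_counts(labels, scores)
print(counts, tpr_fpr(*counts))
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair traces out the ROC curve.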
  • 37. Improving Overall Results - Feature Engineering - Adding proper categories - Year, Waterfront, view, condition, grade, zipcode as Factors - How r2 values help improve prediction - GBM improvements with CV - Learning rate 0.3, all else default: 0.795 >> Perform prediction - Ntree=60, depth=5, l-rate=0.21, row-sr=0.8, col-sr=0.8 > 0.80 - GBM with validation frame - Ntree=60, depth=5, l-rate=0.21, row-sr=0.8, col-sr=0.8 > 0.82 - DRF with CV and default settings > 0.80 - Finally: ignore Date column - DRF: 0.86 - GBM: 0.86 - With updated feature engineering - GLM: Alpha: 0.004, lambda: 0.01 > r2 = 0.77
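Converting a column such as zipcode to a factor means treating each distinct value as a category rather than a number on a scale. A minimal sketch of that encoding, assuming sorted levels and integer codes; the helper name `to_factor` and the sample values are illustrative only:

```python
def to_factor(values):
    """Encode a column as (levels, codes): each distinct value becomes a category."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return levels, codes


zipcodes = ["98178", "98125", "98028", "98178"]  # sample values for illustration
levels, codes = to_factor(zipcodes)
print(levels)
print(codes)
```

As a plain number, zipcode 98178 would look "larger" than 98028, which carries no meaning for price; as a factor, tree-based models can split on groups of categories instead.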
  • 38. R - Demo - Start R - > h2o.init() - > h2o.clusterStatus() - ?h2o.gbm > Use the sample - > h2o.varimp(gbm) - > h2o.varimp_plot(gbm) - Performance - > perf <- h2o.performance(gbm, australia.hex) - > h2o.mse(perf) - > h2o.r2(perf) - Add CV – Rebuild the model with nfolds = 5 - > h2o.cross_validation_models(gbm) - Kmeans - ?h2o.kmeans - Run the example - Want to eval the K? - kmodel = h2o.kmeans(training_frame = prostate.hex, k = 10, x = c("AGE", "RACE", "VOL", "GLEASON")) - kmodel = h2o.kmeans(training_frame = prostate.hex, k = 100, x = c("AGE", "RACE", "VOL", "GLEASON"), estimate_k = T)
  • 39. Supplemental Information - https://www.kaggle.com/harlfoxem/d/harlfoxem/housesalesprediction/house-price-prediction-part-1/discussion - https://www.kaggle.com/auygur/d/harlfoxem/housesalesprediction/step-by-step-house-price-prediction-r-2-0-77/code - https://rpubs.com/MagicSea/property_price
  • 41. Binomial Classification Problem Description • Titanic Survivors list o 1,317 passengers on board; the rest of the 2,224 total were crew • Kaggle o https://www.kaggle.com/c/titanic • Download Dataset: o http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls • Local Datasets o /Users/avkashchauhan/learn/seattle-workshop/titanic_list.csv • Documentation o http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
  • 42. Predicting Titanic Survival Rate • Reference: o Using scikit-learn • http://localhost:8888/notebooks/Titanic%20Survival%20Demo%20in%20Python.ipynb o Using H2O python estimators • http://localhost:8888/notebooks/Titanic%20Survival%20Demo%20in%20H2O%20and%20Python.ipynb
  • 43. What we learned: ROC Curve • AUC of 0.5 is random and 1 is perfect
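One way to see why an AUC of 0.5 is random and 1 is perfect: AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as half). The sketch below computes it by comparing all positive/negative pairs; the function name `auc` and the data are made up for illustration, and the pairwise loop is meant for small examples only.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0     # positive correctly ranked above negative
            elif p == n:
                wins += 0.5     # a tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0, 0, 1]                    # illustrative data
print(auc(labels, [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]))
print(auc(labels, [0.5] * 8))                         # uninformative scores
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))        # perfect separation
```

A model that gives every row the same score cannot rank anything, which is why it lands at exactly 0.5.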
  • 44. What we did • GLM/GBM/RF in Python + scikit-learn • GLM in H2O from Python • Grid Search • Working in FLOW o Ingest, split frame o GLM in H2O from FLOW • Response must be numeric: understanding r2 o GBM in FLOW • Response numeric: regression, r2 • Response enum: classification, AUC (area under the curve), confusion matrix, ROC curve (NEXT PAGE) • AUC o AUC of 0.5 is random and 1 is perfect o Improve AUC
  • 45. What we covered • Data Import • Understood data, frame distribution, chunk, compression etc. • Understood all features, through histograms, enums, etc. • Split data frames for train and test • Converting features to Factors/Enum • Imputation of values
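The "imputation of values" step replaces missing entries before training. Mean imputation is the simplest form, sketched here in plain Python; the helper name `impute_mean` and the sample column are assumptions for illustration, with `None` standing in for a missing value.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


ages = [22.0, None, 38.0, 26.0, None]   # made-up column with gaps
print(impute_mean(ages))
```

Mean imputation keeps the column average unchanged but shrinks its variance, so for a feature with many gaps it is worth comparing against median imputation or an explicit "missing" indicator.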
  • 47. Supported Model Types in H2O • Binary • POJO • MOJO • Example: exportModel "word2vec-0dfb7bfd-7a5d-42a9-9ebf-82706304a4fe", "/Users/avkashchauhan/Downloads/word2vec-0dfb7bfd-7a5d-42a9-9ebf-82706304a4fe", overwrite: true
  • 48. POJO Demo • Run “GBM_Airlines_Classification” demo from FLOW examples • Download POJO from FLOW • Create a temp folder/Be in temp folder o Move gbm_pojo_test.java • Get Model from RESTful API – $ curl -X GET "http://localhost:54321/3/Models.java/gbm_pojo_test" >> gbm_pojo_test_1.java o Get H2O gen-model • curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar o Create main.java o Compile • $ javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m gbm_pojo_test.java main.java • Verify with compiled class files o Run • $ java -cp .:h2o-genmodel.jar main • See Results • Docs: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/POJO_QuickStart.md
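The two curl downloads in the POJO demo can also be scripted. The sketch below only mirrors the endpoints shown on the slide (`/3/Models.java/<model_id>` and `/3/h2o-genmodel.jar`); the helper names are invented for this example, and `download_pojo_artifacts` needs a running H2O instance, so it is defined here but not called.

```python
import os
from urllib.parse import quote
from urllib.request import urlretrieve


def pojo_url(base_url, model_id):
    """URL of the generated POJO source for a model, as used in the curl command."""
    return f"{base_url.rstrip('/')}/3/Models.java/{quote(model_id, safe='')}"


def genmodel_url(base_url):
    """URL of the h2o-genmodel.jar served by the H2O instance."""
    return f"{base_url.rstrip('/')}/3/h2o-genmodel.jar"


def download_pojo_artifacts(base_url, model_id, target_dir="."):
    """Fetch the POJO source and h2o-genmodel.jar (requires a running H2O)."""
    java_path = os.path.join(target_dir, f"{model_id}.java")
    jar_path = os.path.join(target_dir, "h2o-genmodel.jar")
    urlretrieve(pojo_url(base_url, model_id), java_path)
    urlretrieve(genmodel_url(base_url), jar_path)
    return java_path, jar_path


print(pojo_url("http://localhost:54321", "gbm_pojo_test"))
print(genmodel_url("http://localhost:54321"))
```

After the download, the javac and java steps from the slide apply unchanged to the two files.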
  • 49. MOJO Demo • Run “GBM_Airlines_Classification” demo from FLOW examples • Download MOJO from FLOW • Create a temp folder/Be in temp folder o Move gbm_pojo_test.zip • Get Model from RESTful API – $ curl -X GET "http://localhost:54321/3/Models.java/gbm_pojo_test.zip" >> gbm_pojo_test.zip o Get H2O gen-model • curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar o Create main_mojo.java o Compile • $ javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m main_mojo.java – Make sure the main class is named main_mojo • Verify with compiled class files o Run • $ java -cp .:h2o-genmodel.jar main_mojo • See Results • Docs: https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/MOJO_QuickStart.md
  • 50. Thank you so much