H2O AutoML Roadmap 2016.10
Raymond Peck
Director of Product Engineering, H2O.ai
rpeck@h2o.ai
© H2O.ai, 2016 1
What Will We Cover?
• What is AutoML?
• What is the roadmap for H2O AutoML?
What is AutoML?
H2O AutoML automates parts of data preparation and model training to help both Machine Learning / Data Science experts and complete novices.
Other AutoML projects concentrate on novices.
Outside AutoML Projects
• auto-sklearn
• AutoCompete
• TPOT
• DataRobot
• Automatic Statistician
• BigML
• and others
Who is the Target Audience?
• "Big green button" for novice users such as software developers and business analysts;
• Iterative, interactive use and controls for expert users:
  • Machine Learning experts
  • Descriptive Data Scientists
What Are the Pieces?
• data cleaning
• feature engineering / feature generation
• feature selection
  • for both the original and generated features
• model hyperparameter tuning
• automatic smart ensemble generation
Prior Work @ H2O
• ensembles (stacking), from Erin LeDell
• random hyperparameter search with automatic stopping, from Raymond Peck
• some dataset characterization and feature engineering, from Spencer Aiello
• hyperopt Bayesian hyperparameter optimization, from Abhishek Malali
Current Work
• random hyperparameter search with parameter values based on open datasets
• moving ensembles into the back end
• working on basic metalearning for hyperparameter vectors, starting with 140 OpenML datasets
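One simple form that metalearning for hyperparameter vectors could take is a nearest-neighbor lookup: describe each previously tuned dataset by a few meta-features, then start the search for a new dataset from the hyperparameters of its closest neighbors. The sketch below assumes this reading. The meta-features, the `TUNED` table, and the `warm_start` helper are all hypothetical and not taken from H2O.

```python
import math

# Hypothetical meta-features for previously tuned datasets:
# (log10 rows, log10 columns, fraction of categorical columns)
TUNED = [
    {"meta": (3.0, 1.0, 0.0), "params": {"max_depth": 4, "learn_rate": 0.1}},
    {"meta": (5.0, 2.0, 0.5), "params": {"max_depth": 8, "learn_rate": 0.05}},
    {"meta": (6.0, 1.5, 0.1), "params": {"max_depth": 10, "learn_rate": 0.02}},
]


def warm_start(meta, tuned, k=2):
    """Return the hyperparameter vectors of the k most similar known datasets."""
    nearest = sorted(tuned, key=lambda d: math.dist(d["meta"], meta))
    return [d["params"] for d in nearest[:k]]


new_dataset = (4.8, 1.9, 0.4)
starting_points = warm_start(new_dataset, TUNED)
print(starting_points)
```

In practice the meta-features would need to be scaled to comparable ranges before taking Euclidean distances; the log transforms here are a rough stand-in for that.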
Future Work
• feature selection
• feature engineering for IID data
• Bayesian hyperparameter search with warm start
• feature engineering for non-IID data, e.g. time series
• iterate with larger datasets that are typical of our customers
• distribution guesser for regression
How Do We Evaluate Our Work?
• public datasets from
  • OpenML
  • ChaLearn AutoML challenge
  • Kaggle
• our own Data Scientists' work with customer datasets
• customer feedback (soon)
Data Cleaning
• outlier analysis (with user feedback)
• sentinel value detection
  • as a side-effect of outlier analysis
  • type-based heuristics (e.g., 999999, 1970.01.01)
• identifier detection (e.g., customer ID)
• smart imputation
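As a rough illustration of the type-based heuristics above, the sketch below flags likely sentinel values in a numeric column and likely identifier columns. It is not H2O's implementation: the list of suspect values, the thresholds, and both function names are assumptions made for this example, and date sentinels such as 1970.01.01 are left out for brevity.

```python
from collections import Counter

# Hypothetical list of values that often stand in for "missing"; not from H2O.
SUSPECT_NUMERIC_SENTINELS = {-1, -99, -999, -9999, 999, 9999, 99999, 999999}


def find_sentinels(values, min_share=0.01):
    """Return suspect sentinel values that occur often enough in a column to matter."""
    counts = Counter(values)
    n = len(values)
    return sorted(
        v for v, c in counts.items()
        if v in SUSPECT_NUMERIC_SENTINELS and c / n >= min_share
    )


def looks_like_identifier(values, min_unique_ratio=0.95):
    """Flag a column whose values are almost all distinct, e.g. a customer ID."""
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return False
    return len(set(non_missing)) / len(non_missing) >= min_unique_ratio


incomes = [52000, 48000, 999999, 61000, 999999, 57000]
customer_ids = ["C001", "C002", "C003", "C004", "C005", "C006"]

print(find_sentinels(incomes))
print(looks_like_identifier(customer_ids))
print(looks_like_identifier(incomes))
```

A uniqueness ratio alone would also flag a continuous numeric column with no repeated values, so a real detector would likely combine it with the column's type and name.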
Feature Generation
We will be using several techniques including:
• type-based heuristics
• date/time expansion
• log and other transforms of numerics
• interactions (product, ratio, etc.)
• feature generation with Deep Learning deepfeatures()
• clustering
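Three of the techniques above (date/time expansion, log transforms, and interactions) are simple enough to sketch in plain Python. The function names and the example row are assumptions made for illustration; the Deep Learning and clustering approaches are not shown.

```python
import math
from datetime import datetime


def expand_datetime(ts):
    """Date/time expansion: split one timestamp into several simpler features."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),  # Monday == 0
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
    }


def log_transform(x):
    """log(1 + x) for a non-negative numeric; compresses long right tails."""
    return math.log1p(x)


def interactions(a, b):
    """Pairwise product and ratio; the ratio is None when the denominator is zero."""
    return {"product": a * b, "ratio": a / b if b != 0 else None}


row = {"signup": datetime(2016, 10, 15, 14, 30), "visits": 12, "purchases": 3}
features = expand_datetime(row["signup"])
features["log_visits"] = log_transform(row["visits"])
features.update(interactions(row["visits"], row["purchases"]))
print(features)
```

Generating every pairwise interaction grows quadratically with the number of columns, which is one reason generated features are usually followed by a selection step.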
Feature Selection
We will be evaluating several techniques including:
• Mutual Information (non-linear correlation)
• variable importance from GBM and Deep Learning
• PCA
• GLM with Elastic Net / LASSO
We may use different selectors for the original data and for transforms / interactions, to trade off speed against the detection of non-linear relationships.
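To make the Mutual Information option concrete, here is a minimal sketch that scores discrete features against a discrete target and ranks them. It uses the plain plug-in estimate from observed counts; continuous columns would need binning or a different estimator first. The toy columns and the `mutual_information` helper are assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


target = [0, 0, 1, 1, 0, 0, 1, 1]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]  # determines the target
noise = ["u", "v", "u", "v", "u", "v", "u", "v"]        # independent of the target

scores = {
    "informative": mutual_information(informative, target),
    "noise": mutual_information(noise, target),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Unlike a linear correlation coefficient, this score does not depend on the relationship being monotonic or on the categories having any order.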
Hyperparameter Tuning
• currently do random hyperparameter search with metric-based smart stopping
• hyperparameter values taken from hand-tuning 140 OpenML datasets
• soon adding simple "nearest neighbors" warm start (basic metalearning)
• then adding Bayesian hyperparameter optimization
• possibly integrating hyperopt into the back end
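Random search with metric-based stopping can be sketched in a few lines: sample hyperparameter combinations at random and stop once the best score has stalled. This is a generic illustration, not H2O's back-end code. The parameter names, the stopping rule, and the `toy_score` objective (a stand-in for training and validating a model) are assumptions of this sketch.

```python
import random


def random_search(space, evaluate, max_models=50, stopping_rounds=5,
                  stopping_tolerance=1e-3, seed=42):
    """Sample random hyperparameter combinations until the best score stalls.

    Stops early when the best score (higher is better) has not improved by
    more than stopping_tolerance over the last stopping_rounds models.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    since_improvement = 0
    history = []
    for _ in range(max_models):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)
        history.append((params, score))
        if score > best_score + stopping_tolerance:
            best_params, best_score = params, score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= stopping_rounds:
                break
    return best_params, best_score, history


def toy_score(p):
    """Hypothetical validation metric that peaks at max_depth=6, learn_rate=0.1."""
    return -abs(p["max_depth"] - 6) - 10 * abs(p["learn_rate"] - 0.1)


space = {"max_depth": [2, 4, 6, 8, 10], "learn_rate": [0.01, 0.05, 0.1, 0.2]}
best_params, best_score, history = random_search(space, toy_score)
print(best_params, best_score, len(history))
```

A warm start would replace the first few random draws with hyperparameter vectors that worked well on similar datasets, leaving the rest of the loop unchanged.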
Automatic Smart Ensemble Generation
• currently adding Erin LeDell's stacking / SuperLearner into the back end
• initially, ensemble top N models from hyperparameter searches
• optional "use original features"
• smarter ensemble generation for faster scoring, less overfitting:
  • greedy ensemble creation
  • ensemble models with uncorrelated residuals
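One common way to build an ensemble greedily is forward selection: repeatedly add whichever model most reduces the validation error of the averaged ensemble, and stop when nothing helps. The sketch below assumes that approach, with simple averaging in place of the stacking / SuperLearner metalearner. The model names and prediction vectors are made up for the example.

```python
def mse(pred, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def average(preds):
    """Element-wise mean of several prediction vectors."""
    return [sum(col) / len(col) for col in zip(*preds)]


def greedy_ensemble(model_preds, actual, max_size=10):
    """Forward-select models (with replacement) while validation MSE keeps dropping."""
    chosen, best_err = [], float("inf")
    while len(chosen) < max_size:
        candidates = {
            name: mse(average([model_preds[m] for m in chosen] + [preds]), actual)
            for name, preds in model_preds.items()
        }
        name = min(candidates, key=candidates.get)
        if candidates[name] >= best_err:
            break  # no candidate improves the ensemble
        chosen.append(name)
        best_err = candidates[name]
    return chosen, best_err


actual = [1.0, 2.0, 3.0, 4.0]
model_preds = {
    "gbm": [1.25, 2.25, 3.25, 4.25],  # over-predicts by 0.25
    "glm": [0.75, 1.75, 2.75, 3.75],  # under-predicts by 0.25
    "drf": [1.5, 2.5, 3.5, 4.5],      # over-predicts by 0.5, like gbm but worse
}
chosen, best_err = greedy_ensemble(model_preds, actual)
print(chosen, best_err)
```

In this toy data the selection keeps the two models whose errors cancel and skips the one whose residuals move in step with an already chosen model, which also keeps the ensemble small and therefore faster to score.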
© H2O.ai, 2016 15
Possible Futures
• multiple concurrent H2O clusters for speed
• freeze/thaw model training
• outlier analysis with user feedback
• residuals analysis with user feedback
• composite models using pre-clustering step
• try to predict accuracy from dataset metadata
• training time prediction
• scoring time prediction