Machine Learning Libraries: H2O.ai and XGBoost
David Pinto
June 16, 2016
H2O.ai Tutorial
Installation
# Remove any previously installed H2O packages for R
if ("package:h2o" %in% search())
  detach("package:h2o", unload = TRUE)
if ("h2o" %in% rownames(installed.packages()))
  remove.packages("h2o")

# Download packages that H2O depends on
pkgs <- c("methods", "statmod", "stats", "graphics",
          "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages())))
    install.packages(pkg)
}

# Download and install the H2O package for R
install.packages("h2o", type = "source", repos = (
  c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/R")
))
Library Initialization
library('h2o')

## Allowed RAM: 2GB, allowed cores: all
h2o.init(max_mem_size = '2G', nthreads = -1)
h2o.removeAll()

## Kill h2o cluster
# h2o.shutdown(prompt = FALSE)
Loading the Data
label.name <- 'Target'

### Load train data
train.hex <- h2o.importFile(path = normalizePath("train.csv.zip"),
                            destination_frame = 'train.hex')
train.hex[, label.name] <- as.factor(train.hex[, label.name])

### Load test data
test.hex <- h2o.importFile(path = normalizePath("test.csv.zip"),
                           destination_frame = 'test.hex')

### Get feature names
input.names <- setdiff(colnames(train.hex), c('Id', label.name))
Random Forest
rf.model <- h2o.randomForest(x = input.names, y = label.name,
                             training_frame = train.hex, ntrees = 10)
Predicting on the Test Data
## Predict probabilities (scipen avoids scientific notation)
options(scipen = 999)
rf.pred <- as.data.frame(predict(rf.model, test.hex))
rf.pred$predict <- NULL
rf.pred <- round(rf.pred, 8)

## Save prediction
rf.pred <- cbind.data.frame(Id = as.integer(1:nrow(test.hex)), rf.pred)
write.csv(
  rf.pred,
  file = gzfile('rf_pred.csv.gz'),
  row.names = FALSE
)
Variable Importance
### Variable ranking
var.imp <- h2o.varimp(rf.model)

### Best 15 features
best.feat <- var.imp$variable[1:15]
k-Fold Cross-Validation
rf.cv <- h2o.randomForest(
  x = input.names, y = label.name, nfolds = 5,
  training_frame = train.hex, ntrees = 20,
  stopping_rounds = 10, stopping_metric = 'logloss',
  score_tree_interval = 5
)
train.loss <- rf.cv@model$training_metrics@metrics$logloss
test.loss  <- rf.cv@model$cross_validation_metrics@metrics$logloss
Model Parameters

The main parameters to tune in a Random Forest model are:

· mtries: number of candidate variables randomly sampled at each split of each tree. Default: sqrt(p) for classification and p/3 for regression
· sample_rate: fraction of observations randomly sampled for each tree. Default: 0.632 (bootstrap)
· max_depth: maximum depth of the trees. Default: 20
Parameter Tuning via Random Search
rf.grid <- h2o.grid(
  'randomForest', grid_id = 'rf_search', x = input.names,
  y = label.name, training_frame = train.hex, is_supervised = TRUE,
  nfolds = 5, ntrees = 300, stopping_rounds = 10,
  stopping_metric = 'logloss', score_tree_interval = 5,
  search_criteria = list(
    strategy = "RandomDiscrete", max_models = 100,
    max_runtime_secs = 3 * 60 * 60
  ),
  hyper_params = list(
    mtries = seq(from = 5, to = 30, by = 1),
    max_depth = seq(from = 3, to = 15, by = 1),
    sample_rate = seq(from = 0.3, to = 0.9, by = 0.05)
  )
)
summary(rf.grid)
Other Algorithms

· Classification and/or regression: Naive Bayes (h2o.naiveBayes), Regularized GLM (h2o.glm), Gradient Boosting Machine (h2o.gbm), Deep Learning (h2o.deeplearning)
· Feature extraction: Principal Component Analysis (h2o.prcomp), Generalized Low-Rank Models (h2o.glrm), and Deep Features (h2o.deepfeatures)
· Clustering: k-Means (h2o.kmeans)
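All of these share the same x/y interface used for h2o.randomForest above. As a minimal self-contained sketch (using the built-in iris data as a stand-in for the tutorial's train.hex, and illustrative ntrees/learn_rate values, not values from these slides):

```r
library('h2o')
h2o.init(max_mem_size = '2G', nthreads = -1)

## Illustrative: iris stands in for the tutorial's training frame
iris.hex <- as.h2o(iris, destination_frame = 'iris.hex')

## Gradient Boosting Machine with the same x/y interface
gbm.model <- h2o.gbm(
  x = setdiff(colnames(iris.hex), 'Species'), y = 'Species',
  training_frame = iris.hex,
  ntrees = 50, learn_rate = 0.1, nfolds = 5
)

## Cross-validated logloss, comparable to the Random Forest runs
h2o.logloss(gbm.model, xval = TRUE)
```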
XGBoost Tutorial
Installation
Note: Windows users must install the Rtools package first.
devtools::install_github('tqchen/xgboost', subdir = 'R-package')
Loading the Data
Use the script from the first class to create the variables x, x.new, and y. Then:
library('xgboost')
library('Matrix')

dtrain <- xgb.DMatrix(
  Matrix(x, sparse = TRUE),
  label = as.integer(y) - 1
)
dtest <- xgb.DMatrix(
  Matrix(x.new, sparse = TRUE)
)
Model Training
xgb.params <- list(
  "booster"     = "gbtree",  # or "gblinear"
  "num_class"   = 7,
  "objective"   = "multi:softprob",
  "eval_metric" = "mlogloss",
  "silent"      = 1,
  "nthread"     = 4
)
xgb.model <- xgb.train(
  data = dtrain,
  params = xgb.params,
  nrounds = 10
)
Predicting on the Test Data
### Avoid scientific notation
options(scipen = 999)

### Generate matrix of predictions
xgb.pred <- matrix(predict(xgb.model, dtest), ncol = 7, byrow = TRUE)

### Reduce number of decimal places
xgb.pred <- round(xgb.pred, 8)

### Transform to data.frame
xgb.pred <- as.data.frame(xgb.pred)
names(xgb.pred) <- levels(y)
xgb.pred <- cbind.data.frame(Id = as.integer(dt.te$Id), xgb.pred)

### Save result
write.csv(
  xgb.pred,
  file = gzfile('xgb_pred.csv.gz'),
  row.names = FALSE
)
k-Fold Cross-Validation
### 5-fold CV
cv.out <- xgb.cv(params = xgb.params, data = dtrain, nrounds = 20,
                 nfold = 5, prediction = FALSE, stratified = TRUE,
                 verbose = TRUE, showsd = FALSE, print.every.n = 10,
                 early.stop.round = 5, maximize = FALSE)

### Get best performance
best.train <- min(cv.out$train.mlogloss.mean)
best.test  <- min(cv.out$test.mlogloss.mean)
best.iter  <- which.min(cv.out$test.mlogloss.mean)
Parameter Tuning

An XGBoost model has many parameters (12) that can be tuned with k-fold cross-validation. The most important are:

· eta: learning rate. Default: 0.3
· max_depth: maximum depth of the trees. Default: 6
· subsample: fraction of observations randomly sampled for each tree. Default: 1 (all)
· colsample_bytree: fraction of variables randomly sampled for each tree. Default: 1 (all)
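These parameters can be searched by scoring each combination with xgb.cv, as in the cross-validation slide. A minimal sketch, using a toy 100-row synthetic dataset and an illustrative 2x2 grid (the grid values are not recommendations); the cv$test.mlogloss.mean accessor follows the 2016-era xgboost API used throughout these slides, newer releases expose it as cv$evaluation_log$test_mlogloss_mean:

```r
library('xgboost')

## Toy multiclass data standing in for dtrain (illustrative only)
set.seed(1)
x.toy <- matrix(rnorm(100 * 5), nrow = 100)
y.toy <- sample(0:2, 100, replace = TRUE)
dtoy  <- xgb.DMatrix(x.toy, label = y.toy)

## Score each (eta, max_depth) pair by 5-fold CV logloss
grid <- expand.grid(eta = c(0.1, 0.3), max_depth = c(3, 6))
grid$mlogloss <- apply(grid, 1, function(g) {
  cv <- xgb.cv(
    params = list("objective" = "multi:softprob", "num_class" = 3,
                  "eval_metric" = "mlogloss",
                  "eta" = g[['eta']], "max_depth" = g[['max_depth']]),
    data = dtoy, nrounds = 20, nfold = 5, verbose = FALSE
  )
  min(cv$test.mlogloss.mean)
})

## Best combination
grid[which.min(grid$mlogloss), ]
```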
Variable Ranking
### Variable ranking
var.imp <- xgb.importance(colnames(x), model = xgb.model)

### Best 15 features
best.feat <- var.imp$Feature[1:15]
x.sub     <- x[, best.feat]
x.new.sub <- x.new[, best.feat]
Questions?
