XGBoost: A Scalable Tree Boosting System
Simon Lia-Jonassen
Motivation
• Used by the majority of winning solutions on Kaggle (17 of the 29 winning solutions published in 2015, per the paper); deep neural networks were the second most popular method.
• Also used by every top-10 team in KDDCup'15.
• Applies to classification, regression and learning-to-rank tasks.
• Usually outperforms alternatives in an out-of-the-box setting.
• Combines a good theoretical foundation with a highly efficient implementation.
• So, how does it work?
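Decision Tree Boosting
The prediction for an instance is the sum of the outputs of K trees:
• K: number of trees.
• f_k: tree function; routes an instance to one of the tree's leaves and returns that leaf's weight.
• x_i: instance features.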
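The formula on this slide did not survive extraction. Assuming the slide follows the notation of the XGBoost paper, the additive model it annotates can be written as follows (the symbols \phi and \mathcal{F} are taken from the paper, not from the slide text):

```latex
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}
```

Here K is the number of trees, each f_k is a tree function from the space \mathcal{F} of regression trees, and x_i is the feature vector of instance i.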
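Regularized Learning Objective
The objective is a prediction loss plus a complexity penalty for each tree:
• Prediction loss: measures the error of the predictions against the targets.
• Complexity penalty: a cost per leaf (number of leaves) plus L2 regularization on the leaf weights.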
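The objective itself was an image on the slide. A reconstruction in the paper's notation, offered as a sketch rather than a copy of the slide, is:

```latex
\mathcal{L}(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

The first sum is the prediction loss; \Omega is the complexity penalty, where T is the number of leaves of a tree, w its vector of leaf weights, and \gamma and \lambda are the regularization strengths.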
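Regularized Learning Objective
By the additive definition, the prediction at step t is the previous prediction plus the output of the new tree f_t. The loss is then approximated by a second-order Taylor expansion around the previous prediction, where:
• g_i: first-order gradient of the loss function.
• h_i: second-order gradient of the loss function.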
However, for example:
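The equations and the example on this slide were lost in extraction. The step it describes can be sketched in the paper's notation as below; the squared-error case in the last line is an assumed illustration, not necessarily the example the slide showed.

```latex
\begin{align*}
\mathcal{L}^{(t)} &= \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \\
g_i &= \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad
h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big) \\
% Assumed example: squared error
l(y, \hat{y}) &= \tfrac{1}{2}(y - \hat{y})^2 \;\Rightarrow\; g_i = \hat{y}_i^{(t-1)} - y_i, \quad h_i = 1
\end{align*}
```

The first term inside the brackets does not depend on f_t, so it can be dropped when choosing the new tree.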
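Regularized Learning Objective
By expansion, the sum over instances can be regrouped into a sum over leaves:
• Before: one term for each instance.
• After: one term for each leaf, each summing over the instances in that leaf.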
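The regrouping is easier to follow written out. The following is a reconstruction in the paper's notation, where I_j = \{ i \mid q(x_i) = j \} is the set of instances that the tree structure q assigns to leaf j, and the constant loss term has been dropped:

```latex
\begin{align*}
\tilde{\mathcal{L}}^{(t)}
&= \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big]
   + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j
   + \tfrac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma T
\end{align*}
```

The step uses the fact that f_t(x_i) = w_j for every instance i in leaf j, so each instance contributes to exactly one leaf term.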
Regularized Learning Objective
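Optimal leaf weight for a fixed tree structure: each leaf's term is a quadratic in its weight, so it can be minimized in closed form.
By substitution: inserting the optimal weights back into the objective gives a score for the tree structure itself.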
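As a small numeric sketch of these two results, the code below computes w_j* = -G_j / (H_j + lambda) and the structure score -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T, where G_j and H_j are the sums of g_i and h_i in leaf j. The function names, the squared-error loss and the toy data are assumptions made for illustration; they do not come from the slides.

```python
def squared_error_grad_hess(y_true, y_pred):
    """g_i and h_i for the assumed loss l = 0.5 * (y - y_hat)^2."""
    grads = [p - y for y, p in zip(y_true, y_pred)]
    hess = [1.0 for _ in y_true]
    return grads, hess


def optimal_leaf_weights(grads, hess, leaf_of, n_leaves, lam):
    """Return (weights, G, H) with w_j* = -G_j / (H_j + lam)."""
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for g, h, j in zip(grads, hess, leaf_of):
        G[j] += g  # sum of first-order gradients in leaf j
        H[j] += h  # sum of second-order gradients in leaf j
    weights = [-G[j] / (H[j] + lam) for j in range(n_leaves)]
    return weights, G, H


def structure_score(G, H, lam, gamma):
    """Objective value after substituting the optimal weights (lower is better)."""
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * len(G)


# Toy data: four instances, previous prediction 0 everywhere,
# and a fixed structure that puts the first three instances in leaf 0.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
leaf_of = [0, 0, 0, 1]

grads, hess = squared_error_grad_hess(y_true, y_pred)
weights, G, H = optimal_leaf_weights(grads, hess, leaf_of, n_leaves=2, lam=1.0)
score = structure_score(G, H, lam=1.0, gamma=0.5)
print(weights, score)
```

With lambda = 1 the leaf weights are pulled toward zero compared with the plain leaf means (2.0 and 10.0), which is the effect of the L2 term.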
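Gradient Tree Boosting
The gain of a split compares the scores of the two children with the score of the unsplit node:
• Before we split: score of the unsplit node, which is subtracted.
• Left split: score of the instances sent to the left child.
• Right split: score of the instances sent to the right child.
• Split penalty: cost of adding one more leaf, which is subtracted.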
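The split gain formula was an image on the slide. A minimal sketch, assuming the paper's form gain = 1/2 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma, is shown below; the function names are hypothetical.

```python
def node_score(G, H, lam):
    """Score of a node holding gradient sum G and hessian sum H."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Loss reduction from splitting a node into a left and a right child."""
    left = node_score(G_left, H_left, lam)      # left split
    right = node_score(G_right, H_right, lam)   # right split
    before = node_score(G_left + G_right, H_left + H_right, lam)  # before we split
    return 0.5 * (left + right - before) - gamma  # gamma is the split penalty


# Toy numbers: three instances on the left, one on the right.
gain = split_gain(G_left=-6.0, H_left=3.0, G_right=-10.0, H_right=1.0,
                  lam=1.0, gamma=0.5)
print(gain)
```

A split is only worth making when this value is positive, so a larger gamma prunes more aggressively.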
Gradient Tree Boosting
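The content of this slide was not extracted. One plausible reading is that it showed the exact greedy split search from the paper, which can be sketched for a single feature as follows: sort the instances by feature value, sweep left to right while accumulating the gradient sums, and score every boundary between distinct values. The function name and toy data are assumptions.

```python
def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature.

    Returns (threshold, gain); threshold is None if no split has positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    parent = G * G / (H + lam)

    best_gain, best_threshold = 0.0, None
    G_left = H_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += grads[i]
        H_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate equal feature values
        G_right, H_right = G - G_left, H - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam)
                      - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (values[i] + values[nxt])  # midpoint
    return best_threshold, best_gain


# Toy data: one feature, squared-error gradients with previous prediction 0.
values = [3.0, 1.0, 2.0, 8.0]
grads = [-3.0, -1.0, -2.0, -10.0]
hess = [1.0, 1.0, 1.0, 1.0]
threshold, gain = best_split(values, grads, hess, lam=1.0, gamma=0.5)
print(threshold, gain)
```

A full tree learner would repeat this search over every feature at every node and keep the best candidate overall.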
Optimizations
• Shrinkage
  - More trees
• Column subsampling
  - Prevents over-fitting
• Approximate split finding
  - Faster AUC convergence
• Sparsity-aware split finding
  - Visit only non-missing values
• Cache-aware parallel column block access
  - Fewer cache misses on large datasets
• Block compression and sharding
  - Faster I/O for out-of-core computation
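Of the items above, sparsity-aware split finding is the easiest to illustrate in a few lines. The sketch below is a simplified, single-feature version of the idea: only the non-missing values are sorted and visited, and each candidate threshold is scored twice, once with all missing instances sent right and once with them sent left, so that the split also learns a default direction. Representing missing values as None, the function name and the toy data are assumptions; splits that separate only missing from non-missing values are not considered here.

```python
def sparsity_aware_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default_direction) for one feature.

    Missing feature values are given as None and are never visited in the scan.
    """
    present = sorted((i for i, v in enumerate(values) if v is not None),
                     key=lambda i: values[i])
    G, H = sum(grads), sum(hess)  # totals include the missing instances
    G_miss = G - sum(grads[i] for i in present)
    H_miss = H - sum(hess[i] for i in present)
    parent = G * G / (H + lam)

    def gain_for(G_left, H_left):
        G_right, H_right = G - G_left, H - H_left
        return 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam)
                      - parent) - gamma

    best = (0.0, None, None)
    G_left = H_left = 0.0
    for pos in range(len(present) - 1):
        i, nxt = present[pos], present[pos + 1]
        G_left += grads[i]
        H_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate equal feature values
        threshold = 0.5 * (values[i] + values[nxt])
        candidates = (
            ("right", G_left, H_left),                   # missing go right
            ("left", G_left + G_miss, H_left + H_miss),  # missing go left
        )
        for direction, g_l, h_l in candidates:
            cand = gain_for(g_l, h_l)
            if cand > best[0]:
                best = (cand, threshold, direction)
    return best


# Toy data: the two instances with a missing value have large negative
# gradients, similar to the instance with feature value 8.0.
values = [1.0, 2.0, None, 8.0, None]
grads = [-1.0, -2.0, -9.0, -10.0, -11.0]
hess = [1.0, 1.0, 1.0, 1.0, 1.0]
best_gain, best_threshold, default_direction = sparsity_aware_split(values, grads, hess)
print(best_gain, best_threshold, default_direction)
```

The scan touches only the three non-missing entries, which is where the speed-up on sparse data comes from.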
Further reading
• The paper: https://arxiv.org/pdf/1603.02754.pdf
• XGBoost tutorial: http://xgboost.readthedocs.io/en/latest/model.html
• A great deck of slides: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
• A simple usage example: https://www.kaggle.com/kevalm/xgboost-implementation-on-iris-dataset-python
• DataCamp mini-course: https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost
