Poo Kuan Hoong
Build an effective
Machine Learning
Model with LightGBM
Agenda
• Introduction
• Decision Tree
• Ensemble Method
• Gradient Boosting
• Motivation for Gradient Boosting on Decision Trees
• LightGBM
• Demo
About Me
Poo Kuan Hoong
• Google Developer Expert (GDE) in Machine Learning
• Founder and organiser of the Malaysia R User Group and the TensorFlow & Deep Learning Malaysia User Group
Malaysia R User Group
https://www.facebook.com/groups/MalaysiaRUserGroup/
Questions?
www.sli.do #X490
Introduction
• Everyone is jumping on the Deep Learning hype.
• However, Deep Learning is not always the best model.
• Deep Learning requires a lot of data, hyperparameter tuning and training time.
• Often, the best model is the simplest model.
Decision Tree
Goal
1. Partition input space
2. Pure class distribution in each partition
Decision Trees: Guillotine cuts
Finding The Best Split
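For a classification tree, "best" usually means the split whose partitions are purest. A minimal sketch in Python, assuming Gini impurity as the criterion (function names are illustrative):

```python
def gini(labels):
    # Gini impurity: 1 - sum over classes of p_c^2
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    # Try every threshold between sorted feature values and keep the one
    # with the lowest weighted impurity over the two partitions.
    best = (None, float("inf"))
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < thr]
        right = [y for x, y in pairs if x >= thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best
```

Real implementations score candidate splits from pre-sorted (or, in LightGBM's case, histogram-binned) feature values rather than re-partitioning the data for every threshold.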
Greedily Constructing A Decision Tree
Ensemble Methods
1. Weighted combination of
weak learners
2. Prediction is based on
committee votes
3. Boosting:
1. Train the ensemble one weak learner at a time
2. Focus new learners on
wrongly predicted examples
Gradient Boosting
1. Learn a regressor
2. Compute the error residual (the analogue of the gradient in deep learning)
3. Then build a new model to predict that residual
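These three steps can be sketched with one-split regression stumps as the weak learners (a toy illustration, not how any particular library implements it):

```python
def fit_stump(xs, residuals):
    # Weak learner: a one-split regression stump that predicts the mean
    # residual on each side of the best threshold (squared-error criterion).
    best = None
    for thr in xs:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def gradient_boost(xs, ys, rounds=10, lr=0.5):
    pred = [sum(ys) / len(ys)] * len(ys)              # start from the mean
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]     # 2. error residuals
        stump = fit_stump(xs, resid)                  # 3. fit a model to them
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred
```

Each round shrinks the remaining residual, so the ensemble's predictions converge toward the targets.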
Motivation for Gradient Boosting on Decision Trees
Single decision tree can easily overfit the data
Naïve Gradient Boosting
Gradient boosting on decision trees
• Let’s define our objective function
Gradient boosting on decision trees – regularization
Tricks from XGBoost
• The tree is grown breadth-first (as opposed to depth-first, as in the original C4.5 implementation). This makes it possible to sort and traverse the data only once per level.
• Furthermore, the sorted features can be cached, so the data need not be re-sorted many times.
LightGBM
• LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
• A newer library, developed by Microsoft as part of the Distributed Machine Learning Toolkit.
• Main idea: make training faster. First release: 24 April 2017.
Gradient Boosting Machine (GBM)
Why LightGBM?
• LightGBM grows trees leaf-wise (best-first), whereas most other algorithms grow them level-wise.
• It chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise one.
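The leaf-wise strategy can be sketched as repeatedly splitting whichever leaf promises the largest loss reduction; the gain values below are made up purely for illustration:

```python
import heapq

def grow_leaf_wise(gains, max_leaves):
    # 'gains' maps a leaf id to (split_gain, left_child, right_child);
    # leaves absent from the map cannot be split further.
    # Leaf-wise growth: always split the leaf with the largest gain,
    # until the tree has max_leaves leaves.
    leaves = {"root"}
    heap = [(-gains["root"][0], "root")] if "root" in gains else []
    split_order = []
    while heap and len(leaves) < max_leaves:
        _, leaf = heapq.heappop(heap)  # leaf with the max delta loss
        if leaf not in leaves:
            continue
        _, left, right = gains[leaf]
        leaves.remove(leaf)
        leaves.update([left, right])
        split_order.append(leaf)
        for child in (left, right):
            if child in gains:
                heapq.heappush(heap, (-gains[child][0], child))
    return split_order, leaves
```

A level-wise grower would instead split every leaf at the current depth before moving deeper, even when some of those splits reduce the loss very little.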
Features
Speed
• LightGBM is prefixed ‘Light’ because of its high speed. It can handle large datasets while using less memory.
Accuracy
• LightGBM focuses on accuracy of results.
Distributed/Parallel Computing
• LightGBM also supports GPU learning.
Tips to fine tune LightGBM
• The following practices can be used to improve your model’s efficiency.
• num_leaves: the main parameter controlling the complexity of the tree model. Ideally, num_leaves should be less than or equal to 2^(max_depth); larger values will result in overfitting.
• min_data_in_leaf: setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, hundreds or thousands is enough for a large dataset.
• max_depth: can also be used to limit the tree depth explicitly.
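As a sketch, the num_leaves ≤ 2^(max_depth) rule of thumb can be encoded directly when building a parameter dict (parameter names follow the LightGBM docs; the values are illustrative, not recommendations):

```python
max_depth = 7
params = {
    "objective": "binary",      # illustrative task
    "max_depth": max_depth,
    # keep num_leaves at or below 2^max_depth to curb overfitting
    "num_leaves": min(70, 2 ** max_depth),
    # a large min_data_in_leaf guards against overly deep, overfit trees
    "min_data_in_leaf": 500,
}
```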
Tips to fine tune LightGBM
• For Faster Speed:
• Use bagging by setting bagging_fraction and
bagging_freq
• Use feature sub-sampling by setting
feature_fraction
• Use small max_bin
• Use save_binary to speed up data loading in future runs
• Use parallel learning; refer to the parallel learning guide
Tips to fine tune LightGBM
• For better accuracy:
• Use large max_bin (may be slower)
• Use small learning_rate with large num_iterations
• Use large num_leaves (may cause over-fitting)
• Use bigger training data
• Try dart
• Try using categorical features directly
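Put together as a LightGBM-style parameter dict (parameter names follow the LightGBM docs; the values are illustrative only):

```python
params = {
    "boosting": "dart",        # try dart, as suggested above
    "learning_rate": 0.01,     # small learning rate ...
    "num_iterations": 5000,    # ... paired with many iterations
    "max_bin": 511,            # large max_bin (may be slower)
    "num_leaves": 255,         # large num_leaves (may over-fit)
}
```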
Conclusion
• LightGBM works well on multiple datasets, and its accuracy is as good as or better than that of other boosting algorithms.
• Given its speed and accuracy, LightGBM is well worth trying.
To install LightGBM R Package
• Build and install R-package with the following commands:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
Rscript build_r.R
https://github.com/Microsoft/LightGBM/tree/master/R-package
DEMO
Data
• Porto Seguro’s Safe Driver Prediction
• https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
Poo Kuan Hoong
kuanhoong@gmail.com
http://www.linkedin.com/in/kuanhoong
https://twitter.com/kuanhoong