Winning Data Science 
Competitions 
Some (hopefully) useful pointers 
Owen Zhang 
Data Scientist
A plug for myself 
Current 
● Chief Product Officer 
Previous 
● VP, Science
Agenda 
● Structure of a Data Science Competition 
● Philosophical considerations 
● Sources of competitive advantage 
● Some technical tips 
● Apply what we learn outside of competitions 
Philosophy 
Strategy 
Technique
Structure of a Data Science Competition 
Build model using Training Data to 
predict outcomes on Private LB Data 
Training → Public LB (validation) → Private LB (holdout) 
The public LB gives quick but often misleading feedback 
Data Science Competitions remind us that the purpose of a 
predictive model is to predict on data that we have NOT seen.
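This structure can be mimicked locally. A minimal sketch (with made-up data) of carving out a "public LB" validation set for iteration and a "private LB" holdout that is scored only once:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for competition data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

# 60% training, then split the rest into "public" and "private" halves
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_public, X_private, y_public, y_private = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Iterate against X_public; touch X_private only for the final check
```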
A little “philosophy” 
● There are many ways to overfit 
● Beware of “multiple comparison fallacy” 
○ There is a cost in “peeking at the answer” 
○ Usually the first idea (if it works) is the best 
“Think” more, “try” less
Sources of Competitive Advantage 
● Discipline (once bitten twice shy) 
○ Proper validation framework 
● Effort 
● (Some) Domain knowledge 
● Feature engineering 
● The “right” model structure 
● Machine/statistical learning packages 
● Coding/data manipulation efficiency 
● Luck
Technical Tricks -- GBM 
● My confession: I (over)use GBM 
○ When in doubt, use GBM 
● GBM automatically approximates 
○ Non-linear transformations 
○ Subtle and deep interactions 
● GBM handles missing values gracefully 
● GBM is invariant to monotonic transformation of 
features
Technical Tricks -- GBM needs TLC too 
● Tuning parameters 
○ Learning rate + number of trees 
■ Usually small learning rate + many trees work 
well. I target 1000 trees and tune learning rate 
○ Number of obs in leaf 
■ How many obs do you need to get a good mean 
estimate? 
○ Interaction depth 
■ Don’t be afraid to use 10+, this is (roughly) the 
number of leaf nodes
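A minimal sketch of these tuning knobs, using scikit-learn's GradientBoostingRegressor as a stand-in (the parameter names here are sklearn's; R's gbm uses different ones, e.g. interaction.depth):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0,
                       random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=1000,    # many trees ...
    learning_rate=0.01,   # ... with a small learning rate, tuned together
    min_samples_leaf=20,  # enough obs per leaf for a stable mean estimate
    max_depth=10,         # don't be afraid of deeper interactions
    random_state=0,
)
gbm.fit(X_tr, y_tr)
score = gbm.score(X_va, y_va)  # R^2 on held-out data
```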
Technical Tricks -- when GBM needs help 
● High cardinality features 
○ Convert into numerical features with preprocessing -- 
out-of-fold average, counts, etc. 
○ Use Ridge regression (or similar) and 
■ use out-of-fold prediction as input to GBM 
■ or blend 
○ Be brave, use N-way interactions 
■ I used a 7-way interaction in the Amazon 
competition. 
● GBM with out-of-fold treatment of high-cardinality 
feature performs very well
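The out-of-fold treatment can be sketched as follows, on made-up data: each row's encoding is computed from the other folds only, so the new feature carries no direct label leakage.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data: a high-cardinality categorical and a binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": rng.integers(0, 100, size=1000).astype(str),
    "y": rng.integers(0, 2, size=1000),
})

# Out-of-fold target mean encoding
df["cat_oof_mean"] = np.nan
global_mean = df["y"].mean()
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True,
                            random_state=0).split(df):
    fold_means = df.iloc[tr_idx].groupby("cat")["y"].mean()
    df.loc[df.index[va_idx], "cat_oof_mean"] = (
        df.iloc[va_idx]["cat"].map(fold_means).fillna(global_mean).values
    )

# A plain count feature is often useful alongside the mean
df["cat_count"] = df.groupby("cat")["cat"].transform("count")
```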
Technical Tricks -- feature engineering in GBM 
● GBM only APPROXIMATES interactions and non-linear 
transformations 
● Strong interactions benefit from being explicitly 
defined 
○ Especially ratios/sums/differences among 
features 
● GBM cannot capture complex features such as 
“average sales in the previous period for this type of 
product”
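Both kinds of hand-built features can be sketched with pandas on hypothetical sales data (column names are illustrative, not from the talk):

```python
import pandas as pd

# Hypothetical sales data: two raw numerics plus a product type
df = pd.DataFrame({
    "product_type": ["a", "a", "b", "b", "a", "b"],
    "period":       [1, 1, 1, 1, 2, 2],
    "price":        [10.0, 12.0, 8.0, 9.0, 11.0, 8.5],
    "cost":         [6.0, 7.0, 5.0, 5.5, 6.5, 5.2],
    "sales":        [100, 90, 150, 140, 95, 155],
})

# Explicit differences/ratios -- GBM can only approximate these via splits
df["margin"] = df["price"] - df["cost"]
df["margin_ratio"] = df["margin"] / df["price"]

# "Average sales in the previous period for this type of product":
# a lagged group aggregate a tree ensemble cannot construct on its own
prev = (df.groupby(["product_type", "period"])["sales"].mean()
          .rename("prev_type_avg_sales").reset_index())
prev["period"] += 1  # shift so period t sees the average from period t-1
df = df.merge(prev, on=["product_type", "period"], how="left")
```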
Technical Tricks -- Glmnet 
● From a methodology perspective, the opposite of 
GBM 
● Captures (log/logistic) linear relationships 
● Works with a very small # of rows (a few hundred or 
even less) 
● Complements GBM very well in a blend 
● Needs a lot more work 
○ missing values, outliers, transformations (log?), 
interactions 
● The sparsity assumption -- L1 vs L2
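A minimal sketch, using scikit-learn's ElasticNetCV as a stand-in for R's glmnet: l1_ratio=1.0 is the lasso (sparse), 0.0 is ridge, and cross-validation picks the penalty strength. The synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Tiny dataset: glmnet-style models stay usable with a few hundred rows
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
true_coef = np.zeros(30)
true_coef[:3] = [2.0, -1.5, 1.0]  # sparse ground truth
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Unlike GBM, the linear model needs the TLC up front: scaling here,
# and (not shown) imputation, outlier handling, log transforms,
# hand-built interactions
Xs = StandardScaler().fit_transform(X)

model = ElasticNetCV(l1_ratio=1.0, cv=5, random_state=0).fit(Xs, y)
n_nonzero = int(np.sum(model.coef_ != 0))  # sparsity of the fitted model
```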
Technical Tricks -- Text mining 
● tau package in R 
● Python’s sklearn 
● L2 penalty a must 
● N-grams work well. 
● Don’t forget the “trivial features”: length of text, 
number of words, etc. 
● Many “text-mining” competitions on Kaggle are 
actually dominated by structured fields -- KDD2014
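A sketch of this recipe with sklearn, on made-up review snippets: word 1-2 grams, an L2-penalized linear model, and the "trivial" length/word-count features stacked alongside the text matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "great product would buy again",
    "terrible quality broke after one day",
    "love it great value",
    "awful do not buy terrible",
]
labels = [1, 0, 1, 0]

# Word 1-2 grams; tf-idf weighting
vec = TfidfVectorizer(ngram_range=(1, 2))
X_text = vec.fit_transform(texts)

# The "trivial" features: raw text length and word count
trivial = csr_matrix(np.array(
    [[len(t), len(t.split())] for t in texts], dtype=float))
X = hstack([X_text, trivial]).tocsr()

# L2 penalty is LogisticRegression's default
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)
```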
Technical Tricks -- Blending 
● All models are wrong, but some are useful (George 
Box) 
○ The hope is that they are wrong in different ways 
● When in doubt, use average blender 
● Beware of temptation to overfit public leaderboard 
○ Use public LB + training CV 
● The strongest individual model does not necessarily 
make the best blend 
○ Sometimes intentionally built weak models are good blending 
candidates -- Liberty Mutual Competition
Technical Tricks -- blending continued 
● Try to build “diverse” models 
○ Different tools -- GBM, Glmnet, RF, SVM, etc. 
○ Different model specifications -- Linear, 
lognormal, poisson, 2 stage, etc. 
○ Different subsets of features 
○ Subsampled observations 
○ Weighted/unweighted 
○ … 
● But, do not try “blindly” -- think more, try less
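The average blender can be sketched with two deliberately different models (a GBM and a ridge regression on synthetic data): since RMSE is a norm, the average blend can never do worse than the worse of the two, and often beats both.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=15, noise=20.0,
                       random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Two different tools -- the hope is they are wrong in different ways
gbm = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
lin = Ridge().fit(X_tr, y_tr)

# When in doubt: a plain average blend
pred_gbm = gbm.predict(X_va)
pred_lin = lin.predict(X_va)
pred_blend = (pred_gbm + pred_lin) / 2.0

def rmse(truth, pred):
    return float(np.sqrt(np.mean((truth - pred) ** 2)))

scores = {
    "gbm": rmse(y_va, pred_gbm),
    "ridge": rmse(y_va, pred_lin),
    "blend": rmse(y_va, pred_blend),
}
```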
Apply what we learn outside of competitions 
● Competitions give us really good models, but we also need to 
○ Select the right problem and structure it correctly 
○ Find good (at least useful) data 
○ Make sure models are used the right way 
Competitions help us 
● Understand how much “signal” exists in the data 
● Identify flaws in data or data creation process 
● Build generalizable models 
● Broaden our technical horizon 
● …
Acknowledgement 
● My fellow competitors and data scientists at large 
○ You have taught me (almost) everything 
● Xavier Conort -- my colleague at DataRobot 
○ Thanks for collaboration and inspiration for some 
material 
● Kaggle 
○ Thanks for all the fun we have competing!
