LOCAL VS. GLOBAL
MODELS FOR EFFORT
ESTIMATION AND DEFECT
PREDICTION
TIM MENZIES, ANDREW BUTCHER   (WVU)
ANDRIAN MARCUS                (WAYNE STATE)
THOMAS ZIMMERMANN             (MICROSOFT)
DAVID COK                     (GRAMMATECH)
PREMISE

Something is very wrong with data mining
research in software engineering
    •  Need less “algorithm mining” and more “data mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




12/1/2011




Less “algorithm mining”

       More “data mining”


TOO MUCH MINING?

 Porter & Selby, 1990
     •  Evaluating Techniques for Generating Metric-Based Classification Trees, JSS.
     •  Empirically Guided Software Development Using Metric-Based Classification
        Trees. IEEE Software
     •  Learning from Examples: Generation and Evaluation of Decision Trees for
        Software Resource Analysis. IEEE TSE

 In 2011, Hall et al. (TSE, pre-print)
     •  reported 100s of similar studies
     •  L learners on D data sets in an M*N cross-val

 What is your next paper?
     •  Hopefully not D*L*M*N

THE FIELD IS CALLED “DATA MINING”,
 NOT “ALGORITHM MINING”

To understand data
mining, look at the data,
not the algorithms


Our results should be
insights about data,
  •  not trivia about (say)
     decision tree algorithms


Besides, the thing that
most predicts for
performance is the data,
not the algorithm,
  •  Domingos & Pazzani: Optimality of the Simple Bayesian Classifier
     under Zero-One Loss, Machine Learning, Volume 29, 103-130, 1997
Handle
    “Conclusion instability”



CONCLUSION INSTABILITY:
WHAT WORKS THERE DOES NOT WORK HERE




Conclusion Instability:
what works there does not work here

Posnett et al. [2011]
Zimmermann [2009]: learned defect predictors from 622 pairs of
projects ⟨project1, project2⟩.
  •  In only 4% of pairs did project1’s predictors work for project2.
Kitchenham [2007]: studies comparing effort models learned from local
or imported data
  •  1/3 better, 1/3 same, 1/3 worse
Jørgensen [2004]: 15 studies comparing model-based to expert-based
estimation.
  •  1/3 better, 1/3 same, 1/3 worse
Mair [2005]: studies comparing regression to analogy methods for
effort estimation
  •  7/20 better, 4/20 same, 9/20 worse
ROOT CAUSE OF
CONCLUSION INSTABILITY?

HYPOTHESIS #1: Any one of…
    •  Over-generalization across different kinds of projects?
          •  Solve with “delphi localization”
    •  Noisy data?
    •  Too little data?
    •  Poor statistical technique?
    •  Stochastic choice within data miner (e.g. random forests)
    •  Insert idea here

HYPOTHESIS #2: SE is an inherently varied activity
    •  So conclusion instability can’t be fixed
    •  It must be managed
    •  Needs different kinds of data miners
          •  Cluster, then learn
          •  Learning via “envy”
SOLVE CONCLUSION INSTABILITY
WITH “DELPHI LOCALIZATIONS” ?
Restrict data mining to just related projects


Ask an expert to find the right local context
    •  Are we sure they’re right?
    •  Posnett et al. 2011:
            •    What is the right level for learning?
            •    Files or packages?
            •    Methods or classes?
            •    Changes from study to study



And even if they are “right”:
    •  Should we use those contexts?
    •  What if there is not enough data in our own delphi localization?




DELPHI LOCALIZATIONS

Q: What to do about rare zones?

A: Select the nearest ones from the rest.

But how?
Cluster then learn




KOCAGUNELI [2011]
CLUSTERING TO FIND “LOCAL”
TEAK: estimates from “k”
nearest-neighbors
    •  “k” auto-selected
       per test case
    •  Pre-processor to cluster data,
       remove worrisome regions
    •  IEEE TSE, Jan’11




ESEM’11
    •    Train within one delphi localization
    •    Or train on all and see what it picks
    •    Result #1: usually, cross is as good as within
    •    Result #2: given a choice of both, TEAK picks “within” as often as “cross”
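TEAK itself auto-selects “k” per test case and prunes high-variance regions of the case base; as a rough illustration of the underlying analogy-based idea only, here is a minimal k-nearest-neighbor effort estimator (function name and project data are hypothetical, and the pruning step is not shown):

```python
import math

def knn_estimate(train, query, k=3):
    """Analogy-based estimate: mean effort of the k nearest past projects.
    train: list of (feature_vector, effort); query: feature vector.
    TEAK additionally auto-selects k and prunes high-variance regions
    of the case base -- neither is shown here."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

# hypothetical projects: (size, complexity) -> effort
projects = [((10, 2), 100), ((12, 2), 120), ((50, 8), 900), ((55, 9), 950)]
print(knn_estimate(projects, (11, 2), k=2))  # 110.0
```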
LESSON : DATA MAY NOT DIVIDE
NEATLY ON RAW DIMENSIONS




The best description for SE projects may be synthesized
dimensions extracted from the raw dimensions.
SYNTHESIZED DIMENSIONS

PCA: e.g. Nagappan [2006]
Finds orthogonal “components”; O(N²) to generate
    •  Transforms N correlated variables to fewer uncorrelated “components”
    •  Component[i]: accounts for as much variability as possible
    •  Component[j>i]: accounts for the remaining variability

Fastmap: Faloutsos [1995]
O(2N) generation of an axis of large variability
    •  Pick any point W
    •  Find X furthest from W
    •  Find Y furthest from X
Let c = dist(X,Y). Every point has distances a, b to (X,Y):
    •  x = (a² + c² − b²)/2c
    •  y = sqrt(a² − x²)
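The Fastmap recipe fits in a few lines of Python. A minimal sketch (the `fastmap_axis` name and toy points are ours, not Faloutsos’s code):

```python
import math

def fastmap_axis(points, dist):
    """One synthesized dimension via Fastmap (Faloutsos 1995).
    O(2N): pick any W, find X furthest from W, then Y furthest from X;
    each point's coordinate then follows from the cosine rule."""
    w = points[0]                              # pick any point W
    x = max(points, key=lambda p: dist(w, p))  # X: furthest from W
    y = max(points, key=lambda p: dist(x, p))  # Y: furthest from X
    c = dist(x, y)
    # a = dist(X, p), b = dist(Y, p); coordinate = (a² + c² − b²) / 2c
    return [(dist(x, p)**2 + c**2 - dist(y, p)**2) / (2 * c) for p in points]

# toy demo: four collinear 2-D points, Euclidean distance
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(fastmap_axis(pts, math.dist))  # [10.0, 9.0, 8.0, 0.0]
```

Note that only 2N distance calls are needed to find the axis endpoints, versus the O(N²) matrix work of PCA.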
HIERARCHICAL PARTITIONING

Grow:
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on four quadrants

Prune:
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by median score of class variable
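Assuming the two synthesized dimensions are already computed, the Grow step might be sketched as follows (a hypothetical simplification: `min_size` and the depth cap are our stopping choices, not from the slides):

```python
from statistics import median

def grow(items, depth=0, min_size=4):
    """Grow step: split (x, y) points at median(x), median(y) and
    recurse on the four quadrants until they get small."""
    if len(items) <= min_size or depth > 8:
        return [items]                          # one leaf cluster
    mx = median(p[0] for p in items)
    my = median(p[1] for p in items)
    quads = [[], [], [], []]
    for p in items:
        quads[(p[0] > mx) * 2 + (p[1] > my)].append(p)
    leaves = []
    for q in quads:
        if q:
            leaves.extend(grow(q, depth + 1, min_size))
    return leaves

pts = [(i, j) for i in range(4) for j in range(4)]  # 16 toy points on a grid
print([len(leaf) for leaf in grow(pts)])            # [4, 4, 4, 4]
```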
Q: WHY CLUSTER VIA FASTMAP?

A1: Circular methods (e.g. k-means) assume
round clusters.
    •  But density-based clustering allows
      clusters to be any shape



A2: No need to pre-set the number of clusters



A3: the O(2N) heuristic is very fast,
    •  even in unoptimized Python
Learning via “envy”




Q: WHY TRAIN ON NEIGHBORING
CLUSTERS WITH BETTER SCORES?

A1: Why learn from
your own mistakes?
    •  When there exists
       a smarter
       neighbor?




    •  The “grass is
       greener” principle




HIERARCHICAL PARTITIONING

Grow:
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on four quadrants

Prune:
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by median score of class variable

Where is the grass greenest?
    •  C1 envies the neighbor C2 with max abs(score(C2) − score(C1))
    •  Train on C2, test on C1
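The envy rule reduces to a single comparison over neighboring clusters. A minimal sketch (function and cluster names are hypothetical):

```python
def envied_neighbor(c1, neighbors, score):
    """C1's 'envy' pick: the neighboring cluster C2 maximizing
    abs(score(C2) - score(C1)); then train on C2, test on C1."""
    return max(neighbors, key=lambda c2: abs(score(c2) - score(c1)))

# hypothetical per-cluster scores (say, median defects per cluster)
scores = {"c1": 10.0, "c2": 3.0, "c3": 9.0, "c4": 12.0}
print(envied_neighbor("c1", ["c2", "c3", "c4"], scores.get))  # c2
```

Here "c2" wins because |3.0 − 10.0| = 7.0 is the biggest score gap among c1's neighbors.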
Q: HOW TO LEARN RULES FROM
NEIGHBORING CLUSTERS

A: it doesn’t really matter
   •  But when comparing global & intra-cluster rules
   •  Use the same rule learner

This study uses WHICH (Menzies [2010])
   • Customizable scoring operator
   • Faster termination
   • Generates very small rules (good for explanation)




DATA FROM
HTTP://PROMISEDATA.ORG/DATA

Effort reduction = { NasaCoc, China }: COCOMO or function points
Defect reduction = { lucene, xalan, jedit, synapse, etc. }: CK metrics (OO)

Clusters have an untreated class distribution.
Rules select a subset of the examples:
    •  generate a treated class distribution

[Chart: percentile plots (25th, 50th, 75th, 100th) comparing the untreated
distribution against distributions treated with rules learned from all data
(global) and with rules learned from the neighboring cluster (local).]
BY ANY MEASURE,
PER-CLUSTER LEARNING IS BEST

Lower median efforts/defects (50th percentile)
Greater stability (75th – 25th percentile)
Decreased worst case (100th percentile)




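These three measures can be read straight off a distribution's quartiles. A minimal sketch using Python's statistics module (the `summarize` name and toy data are ours):

```python
from statistics import quantiles

def summarize(values):
    """The slide's three measures for one treatment's outcome distribution:
    median (50th pct), stability (75th - 25th pct), worst case (100th pct)."""
    q25, q50, q75 = quantiles(values, n=4)   # the three quartile cut points
    return {"median": q50, "stability": q75 - q25, "worst": max(values)}

print(summarize([1, 2, 3, 4, 5, 6, 7]))
# {'median': 4.0, 'stability': 4.0, 'worst': 7}
```

A treatment is better when all three numbers drop: lower typical cost, tighter spread, and a smaller worst case.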
CLUSTERS GENERATE
DIFFERENT RULES




What works “here” does not work “there”
    •  Misguided to try and tame conclusion instability
    •  Inherent in the data

Don’t tame it, use it: build lots of local models




Related work




RELATED WORK

Defect & effort prediction: 1,000 papers
  •  All about making predictions
  •  This work: learning controllers to change predictions

Outlier removal:
  •  Yin [2011], Yoon [2010], Kocaguneli [2011]
  •  Subsumed by this work

Clustering & case-based reasoning:
  •  Kocaguneli [2011], Turhan [2009], Cuadrado [2007]
  •  No model generated, nothing to reflect on
  •  Needs indexing (runtime speed)

Structured literature reviews:
  •  Kitchenham [2007] + many more besides
  •  May be over-generalizing across cluster boundaries

Design of experiments:
  •  Don’t learn from immediate data, learn from better neighbors
  •  Here: train once per cluster (small subset of whole data)
  •  Orders of magnitude faster than N*M cross-val

Localizations:
  •  Expert-based, Petersen [2009]: how do we know it is correct?
  •  Source code-based, ecological inference: Posnett [2011]
  •  This work: auto-learning of contexts; beneficial
Conclusion




THIS TALK

Something is fundamentally wrong with data mining research in
software engineering
    •  Needs more “data mining”, less “algorithm mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




NOT “ONE RING TO RULE THEM ALL”

Trite global statements about multiple SE
projects are… trite


Need effective ways to learn local lessons
    •  Automatic clustering tools
    •  Rule learning (per cluster, using envy)




THE WISDOM OF THE CROWDS
THE WISDOM OF THE COWS

•  Seek the fence where
   the grass is greener
   on the other side.
     •  Learn from there
     •  Test on here


•  Don’t rely on trite
   definitions of “there”
   and “here”
     •  Cluster to find
        “here” and “there”




