LOCAL VS. GLOBAL
MODELS FOR EFFORT
ESTIMATION AND DEFECT
PREDICTION
TIM MENZIES, ANDREW BUTCHER   (WVU)
ANDRIAN MARCUS                (WAYNE STATE)
THOMAS ZIMMERMANN             (MICROSOFT)
DAVID COK                     (GRAMMATECH)
PREMISE

Something is very wrong with data mining
research in software engineering
    •  Need less “algorithm mining” and more “data mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




12/1/2011




Less “algorithm mining”

       More “data mining”


TOO MUCH MINING?

 Porter & Selby, 1990
     •  Evaluating Techniques for Generating Metric-Based Classification Trees, JSS.
     •  Empirically Guided Software Development Using Metric-Based Classification
        Trees. IEEE Software
     •  Learning from Examples: Generation and Evaluation of Decision Trees for
        Software Resource Analysis. IEEE TSE

 In 2011, Hall et al. (TSE, pre-print)
     •  reported 100s of similar studies
     •  L learners on D data sets in an M*N cross-val

 What is your next paper?
     •  Hopefully not D*L*M*N

THE FIELD IS CALLED “DATA MINING”,
 NOT “ALGORITHM MINING”

To understand data
mining, look at the data,
not the algorithms


Our results should be
insights about data,
  •  not trivia about (say)
     decision tree algorithms


Besides, the thing that
most predicts for
performance is the data,
not the algorithm,
  •  Domingos & Pazzani: Optimality of the Simple Bayesian Classifier
     under Zero-One Loss, Machine Learning, Volume 29, 103-130, 1997
Handle
    “Conclusion instability”



CONCLUSION INSTABILITY:
WHAT WORKS THERE DOES NOT WORK HERE




Conclusion Instability:
what works there does not work here

Posnett et al. [2011]
Zimmermann [2009]: learned defect predictors from 622 pairs of
projects ⟨project1, project2⟩.
  •  In only 4% of pairs did project1’s predictors work for project2.
Kitchenham [2007]: studies comparing effort models learned from local
or imported data
  •  1/3 better, 1/3 same, 1/3 worse
Jørgensen [2004]: 15 studies comparing model-based to expert-based
estimation.
  •  1/3 better, 1/3 same, 1/3 worse
Mair [2005]: studies comparing regression to analogy methods for
effort estimation
  •  7/20 better, 4/20 same, 9/20 worse
ROOT CAUSE OF
CONCLUSION INSTABILITY?

HYPOTHESIS #1: Any one of…
    •  Over-generalization across different kinds of projects?
          •  Solve with “delphi localization”
    •  Noisy data?
    •  Too little data?
    •  Poor statistical technique?
    •  Stochastic choice within data miner (e.g. random forests)
    •  Insert idea here

HYPOTHESIS #2: SE is an inherently varied activity
    •  So conclusion instability can’t be fixed
    •  It must be managed
    •  Needs different kinds of data miners
          •  Cluster, then learn
          •  Learning via “envy”
SOLVE CONCLUSION INSTABILITY
WITH “DELPHI LOCALIZATIONS” ?
Restrict data mining to just related projects


Ask an expert to find the right local context
    •  Are we sure they’re right?
    •  Posnett et al. 2011:
            •    What is the right level for learning?
            •    Files or packages?
            •    Methods or classes?
            •    Changes from study to study



And even if they are “right”:
    •  Should we use those contexts?
    •  What if there is not enough data in our own delphi localization?




DELPHI LOCALIZATIONS

Q: What to do about rare zones?

A: Select the nearest ones from the rest.

But how?
Cluster then learn




KOCAGUNELI [2011]
CLUSTERING TO FIND “LOCAL”
TEAK: estimates from “k”
nearest-neighbors
    •  “k” auto-selected
       per test case
    •  Pre-processor to cluster data,
       remove worrisome regions
    •  IEEE TSE, Jan’11




ESEM’11
    •    Train within one delphi localization
    •    Or train on all and see what it picks
    •    Result #1: usually, cross is as good as within
    •    Result #2: given a choice of both, TEAK picks “within” as often as “cross”
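TEAK itself auto-selects “k” per test case and prunes high-variance regions of the case base; as a rough illustration of the underlying analogy-based idea only, here is a minimal k-nearest-neighbor effort estimator (function name and project data are hypothetical, and the pruning step is not shown):

```python
import math

def knn_estimate(train, query, k=3):
    """Analogy-based estimate: mean effort of the k nearest past projects.
    train: list of (feature_vector, effort); query: feature vector.
    TEAK additionally auto-selects k and prunes high-variance regions
    of the case base -- neither is shown here."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

# hypothetical projects: (size, complexity) -> effort
projects = [((10, 2), 100), ((12, 2), 120), ((50, 8), 900), ((55, 9), 950)]
print(knn_estimate(projects, (11, 2), k=2))  # 110.0
```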
LESSON : DATA MAY NOT DIVIDE
NEATLY ON RAW DIMENSIONS




The best description for SE projects may be synthesized
dimensions extracted from the raw dimensions.
SYNTHESIZED DIMENSIONS

PCA: e.g. Nagappan [2006]
Finds orthogonal “components”; O(N²) to generate
    •  Transforms N correlated variables to fewer uncorrelated “components”
    •  Component[i]: accounts for as much variability as possible
    •  Component[j>i]: accounts for the remaining variability

Fastmap: Faloutsos [1995]
O(2N) generation of an axis of large variability
    •  Pick any point W
    •  Find X furthest from W
    •  Find Y furthest from X
Let c = dist(X,Y). Every point has distances a, b to (X,Y):
    •  x = (a² + c² − b²)/2c
    •  y = sqrt(a² − x²)
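The Fastmap recipe fits in a few lines of Python. A minimal sketch (the `fastmap_axis` name and toy points are ours, not Faloutsos’s code):

```python
import math

def fastmap_axis(points, dist):
    """One synthesized dimension via Fastmap (Faloutsos 1995).
    O(2N): pick any W, find X furthest from W, then Y furthest from X;
    each point's coordinate then follows from the cosine rule."""
    w = points[0]                              # pick any point W
    x = max(points, key=lambda p: dist(w, p))  # X: furthest from W
    y = max(points, key=lambda p: dist(x, p))  # Y: furthest from X
    c = dist(x, y)
    # a = dist(X, p), b = dist(Y, p); coordinate = (a² + c² − b²) / 2c
    return [(dist(x, p)**2 + c**2 - dist(y, p)**2) / (2 * c) for p in points]

# toy demo: four collinear 2-D points, Euclidean distance
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(fastmap_axis(pts, math.dist))  # [10.0, 9.0, 8.0, 0.0]
```

Note that only 2N distance calls are needed to find the axis endpoints, versus the O(N²) matrix work of PCA.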
HIERARCHICAL PARTITIONING

Grow:
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on four quadrants

Prune:
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by median score of class variable
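Assuming the two synthesized dimensions are already computed, the Grow step might be sketched as follows (a hypothetical simplification: `min_size` and the depth cap are our stopping choices, not from the slides):

```python
from statistics import median

def grow(items, depth=0, min_size=4):
    """Grow step: split (x, y) points at median(x), median(y) and
    recurse on the four quadrants until they get small."""
    if len(items) <= min_size or depth > 8:
        return [items]                          # one leaf cluster
    mx = median(p[0] for p in items)
    my = median(p[1] for p in items)
    quads = [[], [], [], []]
    for p in items:
        quads[(p[0] > mx) * 2 + (p[1] > my)].append(p)
    leaves = []
    for q in quads:
        if q:
            leaves.extend(grow(q, depth + 1, min_size))
    return leaves

pts = [(i, j) for i in range(4) for j in range(4)]  # 16 toy points on a grid
print([len(leaf) for leaf in grow(pts)])            # [4, 4, 4, 4]
```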
Q: WHY CLUSTER VIA FASTMAP?

A1: Circular methods (e.g. k-means) assume
round clusters.
    •  But density-based clustering allows
      clusters to be any shape



A2: No need to pre-set the number of clusters



A3: the O(2N) heuristic is very fast,
    •  even in unoptimized Python
Learning via “envy”




Q: WHY TRAIN ON NEIGHBORING
CLUSTERS WITH BETTER SCORES?

A1: Why learn from
your own mistakes?
    •  When there exists
       a smarter
       neighbor?




    •  The “grass is
       greener” principle




HIERARCHICAL PARTITIONING

Grow:
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on four quadrants

Prune:
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by median score of class variable

Where is the grass greenest?
    •  C1 envies the neighbor C2 with max abs(score(C2) − score(C1))
    •  Train on C2, test on C1
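The envy rule reduces to a single comparison over neighboring clusters. A minimal sketch (function and cluster names are hypothetical):

```python
def envied_neighbor(c1, neighbors, score):
    """C1's 'envy' pick: the neighboring cluster C2 maximizing
    abs(score(C2) - score(C1)); then train on C2, test on C1."""
    return max(neighbors, key=lambda c2: abs(score(c2) - score(c1)))

# hypothetical per-cluster scores (say, median defects per cluster)
scores = {"c1": 10.0, "c2": 3.0, "c3": 9.0, "c4": 12.0}
print(envied_neighbor("c1", ["c2", "c3", "c4"], scores.get))  # c2
```

Here "c2" wins because |3.0 − 10.0| = 7.0 is the biggest score gap among c1's neighbors.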
Q: HOW TO LEARN RULES FROM
NEIGHBORING CLUSTERS

A: it doesn’t really matter
   •  But when comparing global & intra-cluster rules
   •  Use the same rule learner

This study uses WHICH (Menzies [2010])
   • Customizable scoring operator
   • Faster termination
   • Generates very small rules (good for explanation)




DATA FROM
HTTP://PROMISEDATA.ORG/DATA

Effort reduction = { NasaCoc, China }: COCOMO or function points
Defect reduction = { lucene, xalan, jedit, synapse, etc. }: CK metrics (OO)

Clusters have an untreated class distribution.
Rules select a subset of the examples:
    •  generate a treated class distribution

[Chart: percentile plots (25th, 50th, 75th, 100th) comparing the untreated
distribution against distributions treated with rules learned from all data
(global) and with rules learned from the neighboring cluster (local).]
BY ANY MEASURE,
PER-CLUSTER LEARNING IS BEST

Lower median efforts/defects (50th percentile)
Greater stability (75th – 25th percentile)
Decreased worst case (100th percentile)




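These three measures can be read straight off a distribution's quartiles. A minimal sketch using Python's statistics module (the `summarize` name and toy data are ours):

```python
from statistics import quantiles

def summarize(values):
    """The slide's three measures for one treatment's outcome distribution:
    median (50th pct), stability (75th - 25th pct), worst case (100th pct)."""
    q25, q50, q75 = quantiles(values, n=4)   # the three quartile cut points
    return {"median": q50, "stability": q75 - q25, "worst": max(values)}

print(summarize([1, 2, 3, 4, 5, 6, 7]))
# {'median': 4.0, 'stability': 4.0, 'worst': 7}
```

A treatment is better when all three numbers drop: lower typical cost, tighter spread, and a smaller worst case.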
CLUSTERS GENERATE
DIFFERENT RULES




What works “here” does not work “there”
    •  Misguided to try and tame conclusion instability
    •  Inherent in the data

Don’t tame it, use it: build lots of local models




Related work




RELATED WORK

Defect & effort prediction: 1,000 papers
  •  All about making predictions
  •  This work: learning controllers to change predictions

Outlier removal:
  •  Yin [2011], Yoon [2010], Kocaguneli [2011]
  •  Subsumed by this work

Clustering & case-based reasoning:
  •  Kocaguneli [2011], Turhan [2009], Cuadrado [2007]
  •  No model generated, nothing to reflect on
  •  Needs indexing (runtime speed)

Structured literature reviews:
  •  Kitchenham [2007] + many more besides
  •  May be over-generalizing across cluster boundaries

Design of experiments:
  •  Don’t learn from immediate data, learn from better neighbors
  •  Here: train once per cluster (small subset of whole data)
  •  Orders of magnitude faster than N*M cross-val

Localizations:
  •  Expert-based, Petersen [2009]: how do we know it is correct?
  •  Source code-based, ecological inference: Posnett [2011]
  •  This work: auto-learning of contexts; beneficial
Conclusion




THIS TALK

Something is fundamentally wrong with data mining research in
software engineering
    •  Needs more “data mining”, less “algorithm mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




NOT “ONE RING TO RULE THEM ALL”

Trite global statements about multiple SE
projects are… trite


Need effective ways to learn local lessons
    •  Automatic clustering tools
    •  Rule learning (per cluster, using envy)




THE WISDOM OF THE CROWDS
THE WISDOM OF THE COWS

•  Seek the fence where
   the grass is greener
   on the other side.
     •  Learn from there
     •  Test on here


•  Don’t rely on trite
   definitions of “there”
   and “here”
     •  Cluster to find
        “here” and “there”




