A Principled Evaluation of Ensembles of Learning
            Machines for Software Effort Estimation

                                   Leandro Minku, Xin Yao
                              {L.L.Minku,X.Yao}@cs.bham.ac.uk

               CERCIA, School of Computer Science, The University of Birmingham




Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   1 / 22
Outline




            Introduction (Background and Motivation)
            Research Questions (Aims)
            Experiments (Method and Results)
            Answers to Research Questions (Conclusions)
            Future Work




Introduction

     Software cost estimation:
            The set of techniques and procedures that an organisation uses
            to arrive at a cost estimate.
            The major contributing factor is effort (in person-hours,
            person-months, etc.).
            Overestimation vs. underestimation.

     Several software cost/effort estimation models have been proposed.

     Machine learning (ML) models have been receiving increasing attention:
            They make no or minimal assumptions about the data and the
            function being modelled.


Introduction
     Ensembles of learning machines are groups of learning machines
     trained to perform the same task and combined with the aim of
     improving predictive performance.

     Studies comparing ensembles against single learners in software
     effort estimation are contradictory:
            Braga et al. (IJCNN'07) claim that Bagging slightly improves
            the effort estimates produced by single learners.
            Kultur et al. (KBS'09) claim that an adapted Bagging provides
            large improvements.
            Kocaguneli et al. (ISSRE'09) claim that combining different
            learners does not improve effort estimates.

     These studies either lack statistical tests or do not report their
     parameter choices. None of them analyses the reasons for the
     results achieved.
Research Questions

     Question 1
     Do readily available ensemble methods generally improve the effort
     estimates given by single learners? Which of them would be most
     useful?

            The existing studies are contradictory: they either do not
            perform statistical comparisons or do not explain their
            parameter choices.
            It is worth investigating several different ensemble
            approaches.
            We build upon existing work by addressing these points.

     Question 2
     If a particular method is singled out, what insight into how to
     improve effort estimates can we gain by analysing its behaviour
     and the reasons for its better performance?

            Principled experiments, not just intuition or speculation.

     Question 3
     How can one determine which model to use for a particular
     data set?

            Our study complements previous work; parameter choice is
            important.
Data Sets and Preprocessing



            Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, and 7
            ISBSG organisation-type subsets.
                    They cover a wide range of features.
                    In particular, the ISBSG subsets' productivity rates are
                    statistically different from each other.
            Attributes: COCOMO attributes for the PROMISE data; functional
            size, development type and language type for ISBSG.
            Missing values: deletion for PROMISE, k-NN imputation for
            ISBSG.
            Outliers: k-means-based detection and elimination.




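The k-NN imputation step used for the ISBSG data can be sketched as follows. This is a minimal hand-rolled version for illustration only; the slides do not specify the exact implementation, and the toy matrix is hypothetical:

```python
import numpy as np

def knn_impute(X, k=2):
    """Replace each missing value (NaN) with the mean of that column over
    the k nearest rows, using Euclidean distance on the jointly observed
    columns. A sketch, not the implementation used in the study."""
    X = np.array(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        for j in np.where(np.isnan(row))[0]:
            candidates = []                       # (distance, donor value)
            for m, other in enumerate(X):
                if m == i or np.isnan(other[j]):
                    continue                      # donor must observe column j
                shared = ~np.isnan(row) & ~np.isnan(other)
                if shared.any():
                    dist = np.linalg.norm((row - other)[shared])
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            filled[i, j] = np.mean([v for _, v in candidates[:k]])
    return filled
```

The donor rows must themselves observe the column being filled; distances are computed only over the features both rows observe.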
Experimental Framework – Step 1: choice of learning
machines


            Single learners:
                    MultiLayer Perceptrons (MLPs) – universal approximators;
                    Radial Basis Function networks (RBFs) – local learning; and
                    Regression Trees (RTs) – simple and comprehensible.


            Ensemble learners:
                    Bagging with MLPs, with RBFs and with RTs – widely and
                    successfully used;
                    Random ensembles with MLPs – each learner uses the full
                    training set; and
                    Negative Correlation Learning (NCL) with MLPs – for
                    regression.




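Of these ensemble methods, bagging is the simplest to sketch: each member is trained on a bootstrap sample of the projects and the members' estimates are averaged. A minimal illustration (the `MeanLearner` base model is a placeholder for exposition, not one of the learners used in the study):

```python
import numpy as np

class BaggingEnsemble:
    """Minimal bagging for regression: train each base learner on a
    bootstrap sample (drawn with replacement) and average the members'
    predictions."""
    def __init__(self, make_learner, n_members=10, seed=0):
        self.make_learner = make_learner
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.n_members):
            idx = self.rng.integers(0, n, size=n)   # bootstrap sample
            self.members.append(self.make_learner().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.members], axis=0)

class MeanLearner:
    """Placeholder base model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean)
```

The "Random ensembles" variant differs only in the sampling line: every member sees the full training set and diversity comes from random initialisation alone.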
Experimental Framework – Step 2: choice of evaluation
method
     Executions were done in 30 rounds, with 10 projects used for testing
     and the remainder for training, as suggested by Menzies et al. (TSE'06).

     Evaluation was done in two steps:
       1 Menzies et al. (TSE'06)'s survival rejection rules:

                    If the MMREs are significantly different according to a
                    paired t-test with 95% confidence, the best model is the
                    one with the lowest average MMRE.
                    If not, the best method is the one with the best:
                       1   Correlation
                       2   Standard deviation
                       3   PRED(N)
                       4   Number of attributes
        2   Wilcoxon tests with 95% confidence to compare the two
            methods most often among the best in terms of MMRE and
            PRED(25).
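The 30-round evaluation above can be sketched as repeated random holdout (the function name is illustrative):

```python
import numpy as np

def holdout_rounds(n_projects, n_test=10, rounds=30, seed=0):
    """Yield (train_idx, test_idx) pairs: in each round, n_test randomly
    chosen projects are held out for testing and the remainder are used
    for training."""
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        idx = rng.permutation(n_projects)
        yield idx[n_test:], idx[:n_test]
```

Each model would be trained and scored on every round's split, after which the survival rules and Wilcoxon tests are applied to the 30 scores.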
Experimental Framework – Step 2: choice of evaluation
method

     Mean Magnitude of the Relative Error:
     $MMRE = \frac{1}{T} \sum_{i=1}^{T} MRE_i$, where
     $MRE_i = \frac{|predicted_i - actual_i|}{actual_i}$

     Percentage of estimates within N% of the actual values:
     $PRED(N) = \frac{1}{T} \sum_{i=1}^{T} \begin{cases} 1, & \text{if } MRE_i \le \frac{N}{100} \\ 0, & \text{otherwise} \end{cases}$

     Correlation between estimated and actual effort:
     $CORR = \frac{S_{pa}}{\sqrt{S_p S_a}}$, where
     $S_{pa} = \frac{\sum_{i=1}^{T} (predicted_i - \bar{p})(actual_i - \bar{a})}{T - 1}$,
     $S_p = \frac{\sum_{i=1}^{T} (predicted_i - \bar{p})^2}{T - 1}$,
     $S_a = \frac{\sum_{i=1}^{T} (actual_i - \bar{a})^2}{T - 1}$,
     $\bar{p} = \frac{\sum_{i=1}^{T} predicted_i}{T}$,
     $\bar{a} = \frac{\sum_{i=1}^{T} actual_i}{T}$.
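The three measures translate directly into code; this is a straightforward NumPy rendering of the formulas:

```python
import numpy as np

def mmre(predicted, actual):
    """Mean Magnitude of the Relative Error: mean of |pred - act| / act."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.mean(np.abs(predicted - actual) / actual)

def pred_n(predicted, actual, n=25):
    """Fraction of estimates whose MRE is within n% of the actual effort."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    mre = np.abs(predicted - actual) / actual
    return np.mean(mre <= n / 100.0)

def corr(predicted, actual):
    """Pearson correlation between estimated and actual effort."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    T = len(actual)
    p_bar, a_bar = predicted.mean(), actual.mean()
    s_pa = np.sum((predicted - p_bar) * (actual - a_bar)) / (T - 1)
    s_p = np.sum((predicted - p_bar) ** 2) / (T - 1)
    s_a = np.sum((actual - a_bar) ** 2) / (T - 1)
    return s_pa / np.sqrt(s_p * s_a)
```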
Experimental Framework – Step 3: choice of parameters




            Preliminary experiments using 5 runs.
            Each approach was run with all combinations of 3 or 5
            candidate values per parameter.
            The parameters with the lowest MMRE were chosen for the
            further 30 runs.
            Base learners do not necessarily have the same parameters as
            the corresponding single learners.




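The parameter-selection step can be sketched as an exhaustive grid search over the candidate values, keeping the combination with the lowest mean MMRE over the preliminary runs. Here `run_model` is a hypothetical stand-in for training and evaluating one configuration:

```python
import itertools
import numpy as np

def pick_parameters(grid, run_model, runs=5):
    """Try every combination in `grid` (dict of name -> candidate values);
    return the combination with the lowest mean MMRE over `runs`
    preliminary runs. `run_model(params, seed)` must return the MMRE of
    one run with the given parameters."""
    best_params, best_mmre = None, np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mean_mmre = np.mean([run_model(params, seed) for seed in range(runs)])
        if mean_mmre < best_mmre:
            best_params, best_mmre = params, mean_mmre
    return best_params, best_mmre
```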
Comparison of Learning Machines – Menzies et al.
TSE’06’s survival rejection rules

     Table: Number of data sets in which each method survived. Methods
     that never survived are omitted.
                               PROMISE Data      ISBSG Data       All Data
                               RT:          2    MLP:         2   RT:           3
                               Bag + MLP:   1    Bag + RT:    2   Bag + MLP:    2
                               NCL + MLP:   1    Bag + MLP:   1   NCL + MLP:    2
                               Rand + MLP:  1    RT:          1   Bag + RT:     2
                                                 Bag + RBF:   1   MLP:          2
                                                 NCL + MLP:   1   Rand + MLP:   1
                                                                  Bag + RBF:    1

            No approach is consistently the best, even considering
            ensembles!
Comparison of Learning Machines
  What methods are usually among the best?

  Table: Number of data sets in which each method was ranked first or
  second according to MMRE and PRED(25). Methods never ranked first or
  second are omitted.

              (a) According to MMRE
     PROMISE Data       ISBSG Data          All Data
     RT:          4     RT:           5     RT:            9
     Bag + MLP:   3     Bag + MLP:    5     Bag + MLP:     8
     Bag + RT:    2     Bag + RBF:    3     Bag + RBF:     3
     MLP:         1     MLP:          1     MLP:           2
                        Rand + MLP:   1     Bag + RT:      2
                        NCL + MLP:    1     Rand + MLP:    1
                                            NCL + MLP:     1

              (b) According to PRED(25)
     PROMISE Data       ISBSG Data          All Data
     Bag + MLP:   3     RT:           5     RT:            6
     Rand + MLP:  3     Rand + MLP:   3     Rand + MLP:    6
     Bag + RT:    2     Bag + MLP:    2     Bag + MLP:     5
     RT:          1     MLP:          2     Bag + RT:      3
     MLP:         1     RBF:          2     MLP:           3
                        Bag + RBF:    1     RBF:           2
                        Bag + RT:     1     Bag + RBF:     1

  Observations:
     RTs and bag+MLPs are more frequently among the best according to
     MMRE than according to PRED(25).
     The first-ranked method's MMRE is statistically different from the
     others' in 35.16% of the cases.
     The second-ranked method's MMRE is statistically different from the
     lower-ranked methods' in 16.67% of the cases.
     RTs and bag+MLPs are usually statistically equal in terms of MMRE
     and PRED(25).
Research Questions – Revisited



     Question 1
     Do readily available ensemble methods generally improve the effort
     estimates given by single learners? Which of them would be most
     useful?
            Even though bag+MLPs is frequently among the best methods,
            it is statistically similar to RTs.
            RTs are more comprehensible and faster to train.
            Bag+MLPs seem to have more potential for improvement.




Why Were RTs Singled Out?


            Hypothesis: as RTs split based on information gain, they may
            give more importance to the more relevant attributes.
            A further study using correlation-based feature selection
            revealed that RTs usually place the features ranked higher by
            the feature selection method in higher-level splits of the tree.
            Feature selection by itself was not always able to improve
            accuracy.

     It may be important to weight features when using ML
     approaches.



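The hypothesis can be probed directly: compute, for each attribute, the variance reduction (the regression analogue of information gain) achieved by its best split, and check whether the highest-ranked attribute (e.g. LOC) wins the root. A toy sketch with made-up data, not the study's actual procedure:

```python
import numpy as np

def best_split_feature(X, y):
    """Which feature would an RT put at the root split? For each feature,
    try every threshold and measure the reduction in total variance;
    return the index of the feature with the largest reduction."""
    n, d = X.shape
    base = np.var(y) * n                    # total impurity before splitting
    best_feat, best_gain = None, -np.inf
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:   # thresholds between observed values
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = base - (np.var(left) * len(left) + np.var(right) * len(right))
            if gain > best_gain:
                best_feat, best_gain = j, gain
    return best_feat
```

With effort strongly driven by the first feature and the second feature being noise, the root split lands on feature 0, mirroring how RTs promoted LOC to level 0 in the cocomo81 trees.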
Why Were RTs Singled Out?


     Table: Correlation-based feature selection ranking and relative
     attribute importance in RTs for cocomo81.
                Attributes ranking              First tree level in which the attribute   Percentage of
                                                appears in more than 50% of the trees     trees
                LOC                             Level 0                                   100.00%
                Development mode
                Required software reliability   Level 1                                   90.00%
                 Modern programming practices
                Time constraint for cpu         Level 2                                   73.33%
                Data base size                  Level 2                                   83.34%
                Main memory constraint
                Turnaround time
                Programmers capability
                Analysts capability
                Language experience
                Virtual machine experience
                Schedule constraint
                Application experience          Level 2                                   66.67%
                Use of software tools
                Machine volatility




Why Were Bag+MLPs Singled Out?

            Hypothesis: bag+MLPs may have led to a more adequate
            level of diversity.
            Using correlation as the diversity measure, bag+MLPs usually
            had more moderate values when they were the 1st or 2nd
            ranked method in terms of MMRE.
            However, the correlation between diversity and MMRE was
            usually quite low.
  Table:  Correlation Considering Data Sets in which
  Bag+MLPs Were Ranked 1st or 2nd.                           Table:    Correlation Considering All Data Sets.

         Approach       Correlation interval                     Approach       Correlation interval
                        across different data sets                               across different data sets
         Bag+MLP        0.74-0.92                                Bag+MLP        0.47-0.98
         Bag+RBF        0.40-0.83                                Bag+RBF        0.40-0.83
         Bag+RT         0.51-0.81                                Bag+RT         0.37-0.88
         NCL+MLP        0.59-1.00                                NCL+MLP        0.59-1.00
         Rand+MLP       0.93-1.00                                Rand+MLP       0.93-1.00




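The diversity measure behind the tables above can be sketched as the mean pairwise Pearson correlation between members' predictions (lower correlation = more diverse members); this is an illustrative rendering, not necessarily the exact computation used:

```python
import numpy as np

def mean_pairwise_correlation(member_predictions):
    """Diversity proxy: average pairwise Pearson correlation between the
    ensemble members' prediction vectors (rows)."""
    P = np.asarray(member_predictions, float)  # shape (n_members, n_projects)
    c = np.corrcoef(P)                         # member-by-member correlation matrix
    iu = np.triu_indices(len(P), k=1)          # upper triangle, excluding diagonal
    return c[iu].mean()
```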
Taking a Closer Look...


     Table: Correlations between ensemble covariance (diversity) and
     train/test MMRE for the data sets in which bag+MLP obtained the best
     MMREs and was ranked 1st or 2nd, against the data sets in which it
     obtained the worst MMREs.
                                                           Cov. vs       Cov. vs
                                                         Test MMRE     Train MMRE
                              Best MMRE (desharnais)         0.24          0.14
                              2nd best MMRE (org2)           0.70          0.38
                              2nd worst MMRE (org7)         -0.42         -0.37
                              Worst MMRE (cocomo2)          -0.99         -0.99

     Diversity is affected not only by the ensemble method, but also by
     the data set:
            Software effort estimation data sets are very different from
            each other.

     The correlation between diversity and performance on the test set
     follows the tendency on the training set.
            Why do we have a negative correlation in the worst cases?
            Could a method that self-adapts diversity help to improve
            estimates? How?
Research Questions – Revisited


     Question 2
     If a particular method is singled out, what insight into how to
     improve effort estimates can we gain by analysing its behaviour
     and the reasons for its better performance?
            RTs give more importance to the more relevant features.
            Weighting attributes may be helpful when using ML for
            software effort estimation.
            Ensembles seem to have more room for improvement in
            software effort estimation.
            A method that self-adapts diversity might help to improve
            estimates.



Research Questions – Revisited


     Question 3
     How can one determine which model to use for a particular
     data set?
            Effort estimation data sets dramatically affect the behaviour
            and performance of different learning machines, even
            when ensembles are considered.
            It is therefore necessary to run experiments (parameter
            choice is important) using existing data from the particular
            company to determine which method is likely to be the best.
            If the software manager does not have enough knowledge of
            the models, RTs are a good choice.



Risk Analysis


     The learning machines singled out (RTs and bagging+MLPs) were
     further tested on the outlier projects.
            MMRE was similar or lower (i.e., better), usually better than
            on the outlier-free data sets.
            PRED(25) was similar or lower (i.e., worse), usually lower.


     Even though outliers are the projects for which the learning machines
     have more difficulty predicting within 25% of the actual effort,
     they are not the projects for which they give the worst estimates.
Conclusions and Future Work

            RQ1 – readily available ensembles do not generally provide
            better effort estimations.
                    Principled experiments (parameters, statistical analysis, several
                    data sets, more ensemble approaches) to deal with validity
                    issues.
            RQ2 – RTs + weighting features; bagging with MLPs +
            self-adapting diversity.
                    Insight based on experiments, not just intuition or speculation.
            RQ3 – principled experiments to choose the model; RTs if no
            resources.
                    No universally good model, even when using ensembles;
                    parameter choice matters within the framework.
            Future work:
                    Learning feature weights in ML for effort estimation.
                    Can we use self-tuning diversity in ensembles of learning
                    machines to improve estimations?

Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   21 / 22
Acknowledgements




            Search Based Software Engineering (SEBASE) research group.
            Dr. Rami Bahsoon.
            This work was funded by EPSRC grant No. EP/D052785/1.




Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   22 / 22


Promise 2011: "A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation"

  • 1. A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk CERCIA, School of Computer Science, The University of Birmingham Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 1 / 22
  • 2. Outline Introduction (Background and Motivation) Research Questions (Aims) Experiments (Method and Results) Answers to Research Questions (Conclusions) Future Work Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 2 / 22
  • 3. Introduction Software cost estimation: Set of techniques and procedures that an organisation uses to arrive at an estimate. Major contributing factor is effort (in person-hours, person-month, etc). Overestimation vs. underestimation. Several software cost/effort estimation models have been proposed. ML models have been receiving increased attention: They make no or minimal assumptions about the data and the function being modelled. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 3 / 22
  • 4. Introduction Ensembles of Learning Machines are groups of learning machines trained to perform the same task and combined with the aim of improving predictive performance. Studies comparing ensembles against single learners in software effort estimation are contradictory: Braga et al IJCNN’07 claims that Bagging improves a bit effort estimations produced by single learners. Kultur et al KBS’09 claims that an adapted Bagging provides large improvements. Kocaguneli et al ISSRE’09 claims that combining different learners does not improve effort estimations. These studies either miss statistical tests or do not present the parameters choice. None of them analyse the reason for the achieved results. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 4 / 22
  • 8. Research Questions Question 1: Do readily available ensemble methods generally improve effort estimations given by single learners? Which of them would be more useful? (The current studies are contradictory: they either do not perform statistical comparisons or do not explain the parameter choice; it would be worth investigating different ensemble approaches, and we build upon current work by considering these points.) Question 2: If a particular method is singled out, what insight on how to improve effort estimations can we gain by analysing its behaviour and the reasons for its better performance? (Principled experiments, not just intuition or speculation.) Question 3: How can someone determine which model to use for a particular data set? (Our study complements previous work; parameter choice is important.) Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 5 / 22
  • 14. Data Sets and Preprocessing Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, 7 ISBSG organization type subsets. Cover a wide range of features. In particular, ISBSG subsets’ productivity rate is statistically different. Attributes: cocomo attributes for PROMISE data, functional size, development type and language type for ISBSG. Missing values: delete for PROMISE, k-NN imputation for ISBSG. Outliers: K-means detection / elimination. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 6 / 22
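The K-means outlier detection/elimination step can be sketched in one dimension. This is a simplification under stated assumptions: the slide does not give the exact clustering setup, so clustering on effort alone with k = 2 and the `min_frac` small-cluster threshold are hypothetical choices.

```python
def kmeans_1d(values, iters=20):
    """Tiny 1-D k-means with k=2: centroids start at the extremes."""
    cents = [min(values), max(values)]
    for _ in range(iters):
        clusters = [[] for _ in cents]
        for v in values:
            nearest = min(range(len(cents)), key=lambda i: abs(v - cents[i]))
            clusters[nearest].append(v)
        cents = [sum(c) / len(c) if c else cents[i]
                 for i, c in enumerate(clusters)]
    return cents

def outliers(values, min_frac=0.25):
    """Flag values whose cluster holds under min_frac of the data."""
    cents = kmeans_1d(values)
    assign = [min(range(len(cents)), key=lambda i: abs(v - cents[i]))
              for v in values]
    sizes = [assign.count(i) for i in range(len(cents))]
    return [v for v, a in zip(values, assign) if sizes[a] < min_frac * len(values)]

efforts = [100, 110, 105, 98, 500]  # one extreme project
print(outliers(efforts))  # -> [500]
```

Projects landing in a cluster much smaller than the rest are treated as outliers and set aside for the separate risk analysis.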
  • 15. Experimental Framework – Step 1: choice of learning machines Single learners: MultiLayer Perceptrons (MLPs) – universal approximators; Radial Basis Function networks (RBFs) – local learning; and Regression Trees (RTs) – simple and comprehensible. Ensemble learners: Bagging with MLPs, with RBFs and with RTs – widely and successfully used; Random with MLPs – use full training set for each learner; and Negative Correlation Learning (NCL) with MLPs – regression. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 7 / 22
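Bagging itself is simple to sketch: train each member on a bootstrap resample and average their outputs. The sketch below uses a hypothetical 1-NN base regressor on toy (size, effort) pairs rather than the MLPs/RBFs/RTs of the study, and `n_learners=25` is an arbitrary choice.

```python
import random

def bagged_predictor(train, n_learners=25, seed=1):
    """Bagging sketch: each member is a 1-NN regressor fit on a
    bootstrap sample; the ensemble averages member outputs."""
    rng = random.Random(seed)
    samples = [[rng.choice(train) for _ in train] for _ in range(n_learners)]
    def predict(size):
        # each member predicts the effort of its nearest stored project
        preds = [min(s, key=lambda t: abs(t[0] - size))[1] for s in samples]
        return sum(preds) / len(preds)  # average for regression
    return predict

train = [(10, 35), (20, 60), (30, 85)]  # toy (size, effort) projects
predict = bagged_predictor(train)
print(predict(20))
```

Because members see different resamples, they disagree on some inputs; averaging their outputs is what gives bagging its variance reduction.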
  • 16. Experimental Framework – Step 2: choice of evaluation method Executions were done in 30 rounds, 10 projects for testing and remaining for training, as suggested by Menzies et al. TSE’06. Evaluation was done in two steps: 1 Menzies et al. TSE’06’s survival rejection rules: If MMREs are significantly different according to a paired t-test with 95% of confidence, the best model is the one with the lowest average MMRE. If not, the best method is the one with the best: 1 Correlation 2 Standard deviation 3 PRED(N) 4 Number of attributes 2 Wilcoxon tests with 95% of confidence to compare the two methods more often among the best in terms of MMRE and PRED(25). Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 8 / 22
  • 19. Experimental Framework – Step 2: choice of evaluation method Mean Magnitude of the Relative Error: MMRE = (1/T) * sum_{i=1..T} MRE_i, where MRE_i = |predicted_i - actual_i| / actual_i. Percentage of estimations within N% of the actual values: PRED(N) = (1/T) * sum_{i=1..T} [1 if MRE_i <= N/100, 0 otherwise]. Correlation between estimated and actual effort: CORR = S_pa / (S_p * S_a), where S_pa = sum_{i=1..T} (predicted_i - pbar)(actual_i - abar) / (T - 1), S_p = sqrt(sum_{i=1..T} (predicted_i - pbar)^2 / (T - 1)), S_a = sqrt(sum_{i=1..T} (actual_i - abar)^2 / (T - 1)), pbar = (1/T) * sum_{i=1..T} predicted_i, abar = (1/T) * sum_{i=1..T} actual_i. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 9 / 22
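These three measures translate directly from their definitions into code; a small sketch with toy prediction vectors (T is the number of test projects):

```python
import math

def mmre(pred, actual):
    """Mean Magnitude of the Relative Error."""
    return sum(abs(p - a) / a for p, a in zip(pred, actual)) / len(actual)

def pred_n(pred, actual, n=25):
    """Fraction of estimates within n% of the actual effort."""
    hits = sum(1 for p, a in zip(pred, actual) if abs(p - a) / a <= n / 100)
    return hits / len(actual)

def corr(pred, actual):
    """Pearson correlation between estimated and actual effort."""
    t = len(pred)
    pbar, abar = sum(pred) / t, sum(actual) / t
    spa = sum((p - pbar) * (a - abar) for p, a in zip(pred, actual)) / (t - 1)
    sp = math.sqrt(sum((p - pbar) ** 2 for p in pred) / (t - 1))
    sa = math.sqrt(sum((a - abar) ** 2 for a in actual) / (t - 1))
    return spa / (sp * sa)

pred, actual = [100, 240, 330], [110, 200, 300]
print(round(mmre(pred, actual), 3))  # -> 0.13
print(pred_n(pred, actual))          # -> 1.0
```

Note that MMRE penalises relative error symmetrically around the actual value, while PRED(25) only counts whether an estimate lands inside the 25% band.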
  • 22. Experimental Framework – Step 3: choice of parameters Preliminary experiments using 5 runs. Each approach was run with all the combinations of 3 or 5 parameter values. Parameters with the lowest MMRE were chosen for further 30 runs. Base learners will not necessarily have the same parameters as single learners. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 10 / 22
  • 23. Comparison of Learning Machines – Menzies et al. TSE’06’s survival rejection rules Table: Number of Data Sets in which Each Method Survived (methods that never survived are omitted). PROMISE Data: RT: 2, Bag+MLP: 1, NCL+MLP: 1, Rand+MLP: 1, Bag+RBF: 1. ISBSG Data: MLP: 2, Bag+RTs: 2, Bag+MLP: 1, RT: 1, NCL+MLP: 1. All Data: RT: 3, Bag+MLP: 2, NCL+MLP: 2, Bag+RTs: 2, MLP: 2, Rand+MLP: 1, Bag+RBF: 1. No approach is consistently the best, even considering ensembles! Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 11 / 22
  • 25. Comparison of Learning Machines What methods are usually among the best? Table: Number of Data Sets in which Each Method Was Ranked First or Second According to MMRE and PRED(25) (methods never ranked first or second are omitted). (a) According to MMRE – PROMISE Data: RT: 4, Bag+MLP: 3, Bag+RT: 2, MLP: 1, Rand+MLP: 1, NCL+MLP: 1. ISBSG Data: RT: 5, Bag+MLP: 5, Bag+RBF: 3, MLP: 1. All Data: RT: 9, Bag+MLP: 8, Bag+RBF: 3, MLP: 2, Bag+RT: 2, Rand+MLP: 1, NCL+MLP: 1. (b) According to PRED(25) – PROMISE Data: Bag+MLP: 3, Rand+MLP: 3, Bag+RT: 2, RT: 1, MLP: 1, Bag+RBF: 1. ISBSG Data: RT: 5, Rand+MLP: 3, Bag+MLP: 2, MLP: 2, RBF: 2, Bag+RT: 1. All Data: RT: 6, Rand+MLP: 6, Bag+MLP: 5, Bag+RT: 3, MLP: 3, RBF: 2, Bag+RBF: 1. Observations: RTs and bag+MLPs are more frequently among the best considering MMRE than considering PRED(25). The first-ranked method’s MMRE is statistically different from the others in 35.16% of the cases; the second-ranked method’s MMRE is statistically different from the lower-ranked methods in 16.67% of the cases. RTs and bag+MLPs are usually statistically equal in terms of MMRE and PRED(25). Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 12 / 22
  • 26. Research Questions – Revisited Question 1 Do readily available ensemble methods generally improve effort estimations given by single learners? Which of them would be more useful? Even though bag+MLPs is frequently among the best methods, it is statistically similar to RTs. RTs are more comprehensible and have faster training. Bag+MLPs seem to have more potential for improvements. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 13 / 22
  • 27. Why Were RTs Singled Out? Hypothesis: As RTs have splits based on information gain, they may work in such a way as to give more importance to more relevant attributes. A further study using correlation-based feature selection revealed that RTs usually place features ranked higher by the feature selection method in higher-level splits of the tree. Feature selection by itself was not always able to improve accuracy. It may be important to give weights to features when using ML approaches. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 14 / 22
  • 28. Why Were RTs Singled Out? Table: Correlation-Based Feature Selection and RT Attributes Relative Importance for Cocomo81. Columns: attributes ranking; first tree level in which the attribute appears in more than 50% of the trees; percentage of trees. LOC: Level 0, 100.00%. Development mode, Required software reliability: Level 1, 90.00%. Modern programming practices, Time constraint for cpu: Level 2, 73.33%. Data base size: Level 2, 83.34%. Main memory constraint, Turnaround time, Programmers capability, Analysts capability, Language experience, Virtual machine experience, Schedule constraint, Application experience: Level 2, 66.67%. Use of software tools, Machine volatility. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 15 / 22
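The hypothesised mechanism (splits that favour relevant attributes) can be illustrated with the variance-reduction criterion a regression tree uses to pick its root split. The data and attribute names below are hypothetical; a relevant attribute yields a larger SSE reduction than a noise attribute and so ends up higher in the tree.

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def split_gain(values, efforts):
    """Best SSE reduction achievable by one binary split on this
    attribute: the criterion an RT uses to choose its top splits."""
    base = sse(efforts)
    best = 0.0
    for cut in sorted(set(values))[:-1]:
        left = [e for v, e in zip(values, efforts) if v <= cut]
        right = [e for v, e in zip(values, efforts) if v > cut]
        best = max(best, base - sse(left) - sse(right))
    return best

# hypothetical attributes: LOC drives effort, team size is noise
loc = [10, 20, 30, 40]
team = [3, 5, 3, 5]
effort = [25, 50, 75, 100]
print(split_gain(loc, effort) > split_gain(team, effort))  # -> True
```

This mirrors the table above: LOC-like attributes win the root split in (nearly) every tree, while weakly relevant attributes only appear at deeper levels.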
• 29. Why Were Bag+MLPs Singled Out?
Hypothesis: bag+MLPs may have led to a more adequate level of diversity.
Using correlation as the diversity measure, bag+MLPs usually had more moderate values when they were the 1st or 2nd ranked method according to MMRE.
However, the correlation between diversity and MMRE was usually quite low.

Table: Correlation Interval Across Different Data Sets, Considering Data Sets in which Bag+MLPs Were Ranked 1st or 2nd.
Approach   Correlation interval
Bag+MLP    0.74-0.92
Bag+RBF    0.40-0.83
Bag+RT     0.51-0.81
NCL+MLP    0.59-1.00
Rand+MLP   0.93-1.00

Table: Correlation Interval Across Different Data Sets, Considering All Data Sets.
Approach   Correlation interval
Bag+MLP    0.47-0.98
Bag+RBF    0.40-0.83
Bag+RT     0.37-0.88
NCL+MLP    0.59-1.00
Rand+MLP   0.93-1.00
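Diversity as measured here is the pairwise correlation between the predictions of ensemble members: values near 1 mean the members behave almost identically (low diversity). A minimal sketch on toy data, using bootstrap-trained bin-mean predictors as stand-ins for the MLP base learners:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0.0, 3.0, n)

def fit_bin_means(xs, ys, edges):
    """Predict the mean y of training points in each x-bin — a crude stand-in
    for one base learner (e.g. a small regression model)."""
    ids = np.digitize(xs, edges)
    return np.array([ys[ids == k].mean() if np.any(ids == k) else ys.mean()
                     for k in range(len(edges) + 1)])

# Bagging: each member is trained on a different bootstrap sample
edges = np.linspace(0, 10, 6)[1:-1]  # 5 bins over [0, 10]
preds = []
for _ in range(10):
    idx = rng.integers(0, n, n)      # bootstrap resample
    means = fit_bin_means(x[idx], y[idx], edges)
    preds.append(means[np.digitize(x, edges)])
preds = np.array(preds)

# Diversity measure: average pairwise correlation of member predictions
m = len(preds)
pair_corrs = [np.corrcoef(preds[i], preds[j])[0, 1]
              for i in range(m) for j in range(i + 1, m)]
avg_corr = float(np.mean(pair_corrs))
print("average pairwise correlation:", round(avg_corr, 3))
```

Bootstrap resampling alone yields members whose predictions are still highly correlated, which illustrates why the intervals in the tables above sit well above zero.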
• 30. Taking a Closer Look...
Table: Correlations between ensemble covariance (diversity) and train/test MMRE for the data sets in which bag+MLP obtained the best MMREs and was ranked 1st or 2nd, against the data sets in which it obtained the worst MMREs.

                          Cov. vs Test MMRE   Cov. vs Train MMRE
Best MMRE (desharnais)    0.24                0.14
2nd best MMRE (org2)      0.70                0.38
2nd worst MMRE (org7)     -0.42               -0.37
Worst MMRE (cocomo2)      -0.99               -0.99
• 31. Taking a Closer Look...
Diversity is not only affected by the ensemble method, but also by the data set: software effort estimation data sets are very different from each other.
• 32. Taking a Closer Look...
The correlation between diversity and performance on the test set follows the tendency on the training set.
Why do we have a negative correlation in the worst cases?
Could a method that self-adapts diversity help to improve estimations? How?
• 33. Research Questions – Revisited
Question 2: If a particular method is singled out, what insight on how to improve effort estimations can we gain by analysing its behaviour and the reasons for its better performance?
RTs give more importance to more important features; weighting attributes may be helpful when using ML for software effort estimation.
Ensembles seem to have more room for improvement for software effort estimation; a method that self-adapts diversity might help to improve estimations.
• 34. Research Questions – Revisited
Question 3: How can someone determine which model to use for a particular data set?
Effort estimation data sets dramatically affect the behaviour and performance of different learning machines, even ensembles.
It is therefore necessary to run experiments (parameter choice is important) using existing data from a particular company to determine which method is likely to be the best.
If the software manager does not have enough knowledge of the models, RTs are a good choice.
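The "run experiments on your own data" advice can be sketched as a simple holdout comparison. The data set and both candidate estimators below are hypothetical stand-ins (a mean-effort baseline versus a linear model standing in for an RT or ensemble), chosen only to show the mechanics:

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error."""
    return float(np.mean(np.abs(actual - predicted) / actual))

rng = np.random.default_rng(2)
n = 80
size = rng.uniform(5, 100, n)                          # project size attribute
effort = 4.0 * size + 30.0 + rng.normal(0.0, 10.0, n)  # historical effort

# Holdout split over the company's historical projects
train, test = np.arange(0, 60), np.arange(60, n)

# Candidate 1: mean-effort baseline
baseline_pred = np.full(len(test), effort[train].mean())

# Candidate 2: simple linear model fitted on the training projects
a, b = np.polyfit(size[train], effort[train], 1)
linear_pred = a * size[test] + b

scores = {"baseline": mmre(effort[test], baseline_pred),
          "linear": mmre(effort[test], linear_pred)}
print("chosen model:", min(scores, key=scores.get))
```

In practice the comparison would cover several candidate methods, parameter settings, and a proper statistical analysis, as the slide emphasises; the holdout above is only the skeleton of that procedure.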
• 35. Risk Analysis
The learning machines singled out (RTs and bagging+MLPs) were further tested using the outlier projects.
MMRE was similar or lower (better), usually better than for the outlier-free data sets.
PRED(25) was similar or lower (worse), usually lower.
Even though outliers are projects for which the learning machines have more difficulty predicting within 25% of the actual effort, they are not the projects for which they give the worst estimates.
• 37. Conclusions and Future Work
RQ1 – Readily available ensembles do not provide generally better effort estimations. Principled experiments (parameters, statistical analysis, several data sets, more ensemble approaches) were used to deal with validity issues.
RQ2 – RTs + weighting features; bagging with MLPs + self-adapting diversity. Insight based on experiments, not just intuition or speculation.
RQ3 – Principled experiments to choose the model; RTs if resources are lacking. There is no universally good model, even when using ensembles; parameter choice matters in the framework.
Future work:
Learning feature weights in ML for effort estimation.
Can we use self-tuning diversity in ensembles of learning machines to improve estimations?
• 41. Acknowledgements
Search Based Software Engineering (SEBASE) research group.
Dr. Rami Bahsoon.
This work was funded by EPSRC grant No. EP/D052785/1.