A Principled Evaluation of Ensembles of Learning
            Machines for Software Effort Estimation

                                   Leandro Minku, Xin Yao
                              {L.L.Minku,X.Yao}@cs.bham.ac.uk

               CERCIA, School of Computer Science, The University of Birmingham




Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   1 / 22
Outline




            Introduction (Background and Motivation)
            Research Questions (Aims)
            Experiments (Method and Results)
            Answers to Research Questions (Conclusions)
            Future Work




Introduction

     Software cost estimation:
            The set of techniques and procedures that an organisation uses
            to arrive at a cost estimate.
            The major contributing factor is effort (in person-hours,
            person-months, etc.).
            Overestimation vs. underestimation.

     Several software cost/effort estimation models have been proposed.

     Machine learning (ML) models have been receiving increasing attention:
            They make no or minimal assumptions about the data and the
            function being modelled.


Introduction
     Ensembles of learning machines are groups of learning machines
     trained to perform the same task and combined with the aim of
     improving predictive performance.

     Studies comparing ensembles against single learners in software
     effort estimation are contradictory:
            Braga et al. (IJCNN'07) claim that Bagging slightly improves
            the effort estimates produced by single learners.
            Kultur et al. (KBS'09) claim that an adapted Bagging provides
            large improvements.
            Kocaguneli et al. (ISSRE'09) claim that combining different
            learners does not improve effort estimates.

     These studies either lack statistical tests or do not report their
     parameter choices. None of them analyses the reasons for the
     results achieved.
Research Questions

     Question 1
     Do readily available ensemble methods generally improve the effort
     estimates given by single learners? Which of them would be most
     useful?

            The existing studies are contradictory: they either do not
            perform statistical comparisons or do not explain their
            parameter choices.
            It is worth investigating several different ensemble
            approaches.
            We build upon existing work by addressing these points.

     Question 2
     If a particular method is singled out, what insight into how to
     improve effort estimates can we gain by analysing its behaviour
     and the reasons for its better performance?

            Principled experiments, not just intuition or speculation.

     Question 3
     How can one determine which model to use for a particular
     data set?

            Our study complements previous work; parameter choice is
            important.
Data Sets and Preprocessing



            Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, and 7
            ISBSG organisation-type subsets.
                    They cover a wide range of features.
                    In particular, the ISBSG subsets' productivity rates are
                    statistically different from each other.
            Attributes: COCOMO attributes for the PROMISE data; functional
            size, development type and language type for ISBSG.
            Missing values: deletion for PROMISE, k-NN imputation for
            ISBSG.
            Outliers: k-means-based detection and elimination.




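The k-NN imputation step used for the ISBSG data can be sketched as follows. This is a minimal hand-rolled version for illustration only; the slides do not specify the exact implementation, and the toy matrix is hypothetical:

```python
import numpy as np

def knn_impute(X, k=2):
    """Replace each missing value (NaN) with the mean of that column over
    the k nearest rows, using Euclidean distance on the jointly observed
    columns. A sketch, not the implementation used in the study."""
    X = np.array(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        for j in np.where(np.isnan(row))[0]:
            candidates = []                       # (distance, donor value)
            for m, other in enumerate(X):
                if m == i or np.isnan(other[j]):
                    continue                      # donor must observe column j
                shared = ~np.isnan(row) & ~np.isnan(other)
                if shared.any():
                    dist = np.linalg.norm((row - other)[shared])
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            filled[i, j] = np.mean([v for _, v in candidates[:k]])
    return filled
```

The donor rows must themselves observe the column being filled; distances are computed only over the features both rows observe.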
Experimental Framework – Step 1: choice of learning
machines


            Single learners:
                    MultiLayer Perceptrons (MLPs) – universal approximators;
                    Radial Basis Function networks (RBFs) – local learning; and
                    Regression Trees (RTs) – simple and comprehensible.


            Ensemble learners:
                    Bagging with MLPs, with RBFs and with RTs – widely and
                    successfully used;
                    Random ensembles with MLPs – each learner uses the full
                    training set; and
                    Negative Correlation Learning (NCL) with MLPs – for
                    regression.




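Of these ensemble methods, bagging is the simplest to sketch: each member is trained on a bootstrap sample of the projects and the members' estimates are averaged. A minimal illustration (the `MeanLearner` base model is a placeholder for exposition, not one of the learners used in the study):

```python
import numpy as np

class BaggingEnsemble:
    """Minimal bagging for regression: train each base learner on a
    bootstrap sample (drawn with replacement) and average the members'
    predictions."""
    def __init__(self, make_learner, n_members=10, seed=0):
        self.make_learner = make_learner
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.n_members):
            idx = self.rng.integers(0, n, size=n)   # bootstrap sample
            self.members.append(self.make_learner().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.members], axis=0)

class MeanLearner:
    """Placeholder base model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean)
```

The "Random ensembles" variant differs only in the sampling line: every member sees the full training set and diversity comes from random initialisation alone.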
Experimental Framework – Step 2: choice of evaluation
method
     Executions were done in 30 rounds, with 10 projects used for testing
     and the remainder for training, as suggested by Menzies et al. (TSE'06).

     Evaluation was done in two steps:
       1 Menzies et al. (TSE'06)'s survival rejection rules:

                    If the MMREs are significantly different according to a
                    paired t-test with 95% confidence, the best model is the
                    one with the lowest average MMRE.
                    If not, the best method is the one with the best:
                       1   Correlation
                       2   Standard deviation
                       3   PRED(N)
                       4   Number of attributes
        2   Wilcoxon tests with 95% confidence to compare the two
            methods most often among the best in terms of MMRE and
            PRED(25).
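The 30-round evaluation above can be sketched as repeated random holdout (the function name is illustrative):

```python
import numpy as np

def holdout_rounds(n_projects, n_test=10, rounds=30, seed=0):
    """Yield (train_idx, test_idx) pairs: in each round, n_test randomly
    chosen projects are held out for testing and the remainder are used
    for training."""
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        idx = rng.permutation(n_projects)
        yield idx[n_test:], idx[:n_test]
```

Each model would be trained and scored on every round's split, after which the survival rules and Wilcoxon tests are applied to the 30 scores.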
Experimental Framework – Step 2: choice of evaluation
method

     Mean Magnitude of the Relative Error:
     $MMRE = \frac{1}{T} \sum_{i=1}^{T} MRE_i$, where
     $MRE_i = \frac{|predicted_i - actual_i|}{actual_i}$

     Percentage of estimates within N% of the actual values:
     $PRED(N) = \frac{1}{T} \sum_{i=1}^{T} \begin{cases} 1, & \text{if } MRE_i \le \frac{N}{100} \\ 0, & \text{otherwise} \end{cases}$

     Correlation between estimated and actual effort:
     $CORR = \frac{S_{pa}}{\sqrt{S_p S_a}}$, where
     $S_{pa} = \frac{\sum_{i=1}^{T} (predicted_i - \bar{p})(actual_i - \bar{a})}{T - 1}$,
     $S_p = \frac{\sum_{i=1}^{T} (predicted_i - \bar{p})^2}{T - 1}$,
     $S_a = \frac{\sum_{i=1}^{T} (actual_i - \bar{a})^2}{T - 1}$,
     $\bar{p} = \frac{\sum_{i=1}^{T} predicted_i}{T}$,
     $\bar{a} = \frac{\sum_{i=1}^{T} actual_i}{T}$.
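The three measures translate directly into code; this is a straightforward NumPy rendering of the formulas:

```python
import numpy as np

def mmre(predicted, actual):
    """Mean Magnitude of the Relative Error: mean of |pred - act| / act."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.mean(np.abs(predicted - actual) / actual)

def pred_n(predicted, actual, n=25):
    """Fraction of estimates whose MRE is within n% of the actual effort."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    mre = np.abs(predicted - actual) / actual
    return np.mean(mre <= n / 100.0)

def corr(predicted, actual):
    """Pearson correlation between estimated and actual effort."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    T = len(actual)
    p_bar, a_bar = predicted.mean(), actual.mean()
    s_pa = np.sum((predicted - p_bar) * (actual - a_bar)) / (T - 1)
    s_p = np.sum((predicted - p_bar) ** 2) / (T - 1)
    s_a = np.sum((actual - a_bar) ** 2) / (T - 1)
    return s_pa / np.sqrt(s_p * s_a)
```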
Experimental Framework – Step 3: choice of parameters




            Preliminary experiments using 5 runs.
            Each approach was run with all combinations of 3 or 5
            candidate values per parameter.
            The parameters with the lowest MMRE were chosen for the
            further 30 runs.
            Base learners do not necessarily have the same parameters as
            the corresponding single learners.




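The parameter-selection step can be sketched as an exhaustive grid search over the candidate values, keeping the combination with the lowest mean MMRE over the preliminary runs. Here `run_model` is a hypothetical stand-in for training and evaluating one configuration:

```python
import itertools
import numpy as np

def pick_parameters(grid, run_model, runs=5):
    """Try every combination in `grid` (dict of name -> candidate values);
    return the combination with the lowest mean MMRE over `runs`
    preliminary runs. `run_model(params, seed)` must return the MMRE of
    one run with the given parameters."""
    best_params, best_mmre = None, np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mean_mmre = np.mean([run_model(params, seed) for seed in range(runs)])
        if mean_mmre < best_mmre:
            best_params, best_mmre = params, mean_mmre
    return best_params, best_mmre
```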
Comparison of Learning Machines – Menzies et al.
TSE’06’s survival rejection rules

     Table: Number of data sets in which each method survived. Methods
     that never survived are omitted.
                               PROMISE Data      ISBSG Data       All Data
                               RT:          2    MLP:         2   RT:           3
                               Bag + MLP:   1    Bag + RT:    2   Bag + MLP:    2
                               NCL + MLP:   1    Bag + MLP:   1   NCL + MLP:    2
                               Rand + MLP:  1    RT:          1   Bag + RT:     2
                                                 Bag + RBF:   1   MLP:          2
                                                 NCL + MLP:   1   Rand + MLP:   1
                                                                  Bag + RBF:    1

            No approach is consistently the best, even considering
            ensembles!
Comparison of Learning Machines
  What methods are usually among the best?

  Table: Number of data sets in which each method was ranked first or
  second according to MMRE and PRED(25). Methods never ranked first or
  second are omitted.

              (a) According to MMRE
     PROMISE Data       ISBSG Data          All Data
     RT:          4     RT:           5     RT:            9
     Bag + MLP:   3     Bag + MLP:    5     Bag + MLP:     8
     Bag + RT:    2     Bag + RBF:    3     Bag + RBF:     3
     MLP:         1     MLP:          1     MLP:           2
                        Rand + MLP:   1     Bag + RT:      2
                        NCL + MLP:    1     Rand + MLP:    1
                                            NCL + MLP:     1

              (b) According to PRED(25)
     PROMISE Data       ISBSG Data          All Data
     Bag + MLP:   3     RT:           5     RT:            6
     Rand + MLP:  3     Rand + MLP:   3     Rand + MLP:    6
     Bag + RT:    2     Bag + MLP:    2     Bag + MLP:     5
     RT:          1     MLP:          2     Bag + RT:      3
     MLP:         1     RBF:          2     MLP:           3
                        Bag + RBF:    1     RBF:           2
                        Bag + RT:     1     Bag + RBF:     1

  Observations:
     RTs and bag+MLPs are more frequently among the best according to
     MMRE than according to PRED(25).
     The first-ranked method's MMRE is statistically different from the
     others' in 35.16% of the cases.
     The second-ranked method's MMRE is statistically different from the
     lower-ranked methods' in 16.67% of the cases.
     RTs and bag+MLPs are usually statistically equal in terms of MMRE
     and PRED(25).
Research Questions – Revisited



     Question 1
     Do readily available ensemble methods generally improve the effort
     estimates given by single learners? Which of them would be most
     useful?
            Even though bag+MLPs is frequently among the best methods,
            it is statistically similar to RTs.
            RTs are more comprehensible and faster to train.
            Bag+MLPs seem to have more potential for improvement.




Why Were RTs Singled Out?


            Hypothesis: as RTs split based on information gain, they may
            give more importance to the more relevant attributes.
            A further study using correlation-based feature selection
            revealed that RTs usually place the features ranked higher by
            the feature selection method in higher-level splits of the tree.
            Feature selection by itself was not always able to improve
            accuracy.

     It may be important to weight features when using ML
     approaches.



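The hypothesis can be probed directly: compute, for each attribute, the variance reduction (the regression analogue of information gain) achieved by its best split, and check whether the highest-ranked attribute (e.g. LOC) wins the root. A toy sketch with made-up data, not the study's actual procedure:

```python
import numpy as np

def best_split_feature(X, y):
    """Which feature would an RT put at the root split? For each feature,
    try every threshold and measure the reduction in total variance;
    return the index of the feature with the largest reduction."""
    n, d = X.shape
    base = np.var(y) * n                    # total impurity before splitting
    best_feat, best_gain = None, -np.inf
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:   # thresholds between observed values
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = base - (np.var(left) * len(left) + np.var(right) * len(right))
            if gain > best_gain:
                best_feat, best_gain = j, gain
    return best_feat
```

With effort strongly driven by the first feature and the second feature being noise, the root split lands on feature 0, mirroring how RTs promoted LOC to level 0 in the cocomo81 trees.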
Why Were RTs Singled Out?


     Table: Correlation-based feature selection ranking and relative
     attribute importance in RTs for cocomo81.
                Attributes ranking              First tree level in which the attribute   Percentage of
                                                appears in more than 50% of the trees     trees
                LOC                             Level 0                                   100.00%
                Development mode
                Required software reliability   Level 1                                   90.00%
                 Modern programming practices
                Time constraint for cpu         Level 2                                   73.33%
                Data base size                  Level 2                                   83.34%
                Main memory constraint
                Turnaround time
                Programmers capability
                Analysts capability
                Language experience
                Virtual machine experience
                Schedule constraint
                Application experience          Level 2                                   66.67%
                Use of software tools
                Machine volatility




Why Were Bag+MLPs Singled Out?

            Hypothesis: bag+MLPs may have led to a more adequate
            level of diversity.
            Using correlation as the diversity measure, bag+MLPs usually
            had more moderate values when they were the 1st or 2nd
            ranked method in terms of MMRE.
            However, the correlation between diversity and MMRE was
            usually quite low.
  Table:  Correlation Considering Data Sets in which
  Bag+MLPs Were Ranked 1st or 2nd.                           Table:    Correlation Considering All Data Sets.

         Approach       Correlation interval                     Approach       Correlation interval
                        across different data sets                               across different data sets
         Bag+MLP        0.74-0.92                                Bag+MLP        0.47-0.98
         Bag+RBF        0.40-0.83                                Bag+RBF        0.40-0.83
         Bag+RT         0.51-0.81                                Bag+RT         0.37-0.88
         NCL+MLP        0.59-1.00                                NCL+MLP        0.59-1.00
         Rand+MLP       0.93-1.00                                Rand+MLP       0.93-1.00




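The diversity measure behind the tables above can be sketched as the mean pairwise Pearson correlation between members' predictions (lower correlation = more diverse members); this is an illustrative rendering, not necessarily the exact computation used:

```python
import numpy as np

def mean_pairwise_correlation(member_predictions):
    """Diversity proxy: average pairwise Pearson correlation between the
    ensemble members' prediction vectors (rows)."""
    P = np.asarray(member_predictions, float)  # shape (n_members, n_projects)
    c = np.corrcoef(P)                         # member-by-member correlation matrix
    iu = np.triu_indices(len(P), k=1)          # upper triangle, excluding diagonal
    return c[iu].mean()
```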
Taking a Closer Look...


     Table: Correlations between ensemble covariance (diversity) and
     train/test MMRE for the data sets in which bag+MLP obtained the best
     MMREs and was ranked 1st or 2nd, against the data sets in which it
     obtained the worst MMREs.
                                                           Cov. vs       Cov. vs
                                                         Test MMRE     Train MMRE
                              Best MMRE (desharnais)         0.24          0.14
                              2nd best MMRE (org2)           0.70          0.38
                              2nd worst MMRE (org7)         -0.42         -0.37
                              Worst MMRE (cocomo2)          -0.99         -0.99

     Diversity is affected not only by the ensemble method, but also by
     the data set:
            Software effort estimation data sets are very different from
            each other.

     The correlation between diversity and performance on the test set
     follows the tendency on the training set.
            Why do we have a negative correlation in the worst cases?
            Could a method that self-adapts diversity help to improve
            estimates? How?
Research Questions – Revisited


     Question 2
     If a particular method is singled out, what insight into how to
     improve effort estimates can we gain by analysing its behaviour
     and the reasons for its better performance?
            RTs give more importance to the more relevant features.
            Weighting attributes may be helpful when using ML for
            software effort estimation.
            Ensembles seem to have more room for improvement in
            software effort estimation.
            A method that self-adapts diversity might help to improve
            estimates.



Research Questions – Revisited


     Question 3
     How can one determine which model to use for a particular
     data set?
            Effort estimation data sets dramatically affect the behaviour
            and performance of different learning machines, even
            when ensembles are considered.
            It is therefore necessary to run experiments (parameter
            choice is important) using existing data from the particular
            company to determine which method is likely to be the best.
            If the software manager does not have enough knowledge of
            the models, RTs are a good choice.



Risk Analysis


     The learning machines singled out (RTs and bagging+MLPs) were
     further tested on the outlier projects.
            MMRE was similar or lower (i.e., better), usually better than
            on the outlier-free data sets.
            PRED(25) was similar or lower (i.e., worse), usually lower.


     Even though outliers are the projects for which the learning machines
     have more difficulty predicting within 25% of the actual effort,
     they are not the projects for which they give the worst estimates.
Conclusions and Future Work

            RQ1 – readily available ensembles do not generally provide
            better effort estimations.
                    Principled experiments (parameters, statistical analysis, several
                    data sets, more ensemble approaches) to deal with validity
                    issues.
            RQ2 – RTs + weighting features; bagging with MLPs +
            self-adapting diversity.
                    Insight based on experiments, not just intuition or speculation.
            RQ3 – principled experiments to choose the model; RTs if no
            resources.
                    No universally good model, even when using ensembles;
                    parameter choice matters within the framework.
            Future work:
                    Learning feature weights in ML for effort estimation.
                    Can we use self-tuning diversity in ensembles of learning
                    machines to improve estimations?

Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   21 / 22
Acknowledgements




            Search Based Software Engineering (SEBASE) research group.
            Dr. Rami Bahsoon.
            This work was funded by EPSRC grant No. EP/D052785/1.




Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk   Ensembles for Software Effort Estimation   22 / 22


Promise 2011: "A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation"

  • 1. A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk CERCIA, School of Computer Science, The University of Birmingham Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 1 / 22
  • 2. Outline Introduction (Background and Motivation) Research Questions (Aims) Experiments (Method and Results) Answers to Research Questions (Conclusions) Future Work Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 2 / 22
  • 3. Introduction Software cost estimation: Set of techniques and procedures that an organisation uses to arrive at an estimate. Major contributing factor is effort (in person-hours, person-month, etc). Overestimation vs. underestimation. Several software cost/effort estimation models have been proposed. ML models have been receiving increased attention: They make no or minimal assumptions about the data and the function being modelled. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 3 / 22
  • 4. Introduction Ensembles of Learning Machines are groups of learning machines trained to perform the same task and combined with the aim of improving predictive performance. Studies comparing ensembles against single learners in software effort estimation are contradictory: Braga et al IJCNN’07 claims that Bagging improves a bit effort estimations produced by single learners. Kultur et al KBS’09 claims that an adapted Bagging provides large improvements. Kocaguneli et al ISSRE’09 claims that combining different learners does not improve effort estimations. These studies either miss statistical tests or do not present the parameters choice. None of them analyse the reason for the achieved results. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 4 / 22
  • 8. Research Questions Question 1: Do readily available ensemble methods generally improve effort estimations given by single learners? Which of them would be more useful? (The current studies are contradictory: they either do not perform statistical comparisons or do not explain the parameter choice; it would be worth investigating different ensemble approaches, and we build upon current work by considering these points.) Question 2: If a particular method is singled out, what insight on how to improve effort estimations can we gain by analysing its behaviour and the reasons for its better performance? (Principled experiments, not just intuition or speculation.) Question 3: How can someone determine which model to use for a particular data set? (Our study complements previous work; parameter choice is important.) Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 5 / 22
  • 14. Data Sets and Preprocessing Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, 7 ISBSG organization type subsets. Cover a wide range of features. In particular, ISBSG subsets’ productivity rate is statistically different. Attributes: cocomo attributes for PROMISE data, functional size, development type and language type for ISBSG. Missing values: delete for PROMISE, k-NN imputation for ISBSG. Outliers: K-means detection / elimination. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 6 / 22
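The K-means outlier detection/elimination step can be sketched in one dimension. This is a simplification under stated assumptions: the slide does not give the exact clustering setup, so clustering on effort alone with k = 2 and the `min_frac` small-cluster threshold are hypothetical choices.

```python
def kmeans_1d(values, iters=20):
    """Tiny 1-D k-means with k=2: centroids start at the extremes."""
    cents = [min(values), max(values)]
    for _ in range(iters):
        clusters = [[] for _ in cents]
        for v in values:
            nearest = min(range(len(cents)), key=lambda i: abs(v - cents[i]))
            clusters[nearest].append(v)
        cents = [sum(c) / len(c) if c else cents[i]
                 for i, c in enumerate(clusters)]
    return cents

def outliers(values, min_frac=0.25):
    """Flag values whose cluster holds under min_frac of the data."""
    cents = kmeans_1d(values)
    assign = [min(range(len(cents)), key=lambda i: abs(v - cents[i]))
              for v in values]
    sizes = [assign.count(i) for i in range(len(cents))]
    return [v for v, a in zip(values, assign) if sizes[a] < min_frac * len(values)]

efforts = [100, 110, 105, 98, 500]  # one extreme project
print(outliers(efforts))  # -> [500]
```

Projects landing in a cluster much smaller than the rest are treated as outliers and set aside for the separate risk analysis.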
  • 15. Experimental Framework – Step 1: choice of learning machines Single learners: MultiLayer Perceptrons (MLPs) – universal approximators; Radial Basis Function networks (RBFs) – local learning; and Regression Trees (RTs) – simple and comprehensible. Ensemble learners: Bagging with MLPs, with RBFs and with RTs – widely and successfully used; Random with MLPs – use full training set for each learner; and Negative Correlation Learning (NCL) with MLPs – regression. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 7 / 22
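Bagging itself is simple to sketch: train each member on a bootstrap resample and average their outputs. The sketch below uses a hypothetical 1-NN base regressor on toy (size, effort) pairs rather than the MLPs/RBFs/RTs of the study, and `n_learners=25` is an arbitrary choice.

```python
import random

def bagged_predictor(train, n_learners=25, seed=1):
    """Bagging sketch: each member is a 1-NN regressor fit on a
    bootstrap sample; the ensemble averages member outputs."""
    rng = random.Random(seed)
    samples = [[rng.choice(train) for _ in train] for _ in range(n_learners)]
    def predict(size):
        # each member predicts the effort of its nearest stored project
        preds = [min(s, key=lambda t: abs(t[0] - size))[1] for s in samples]
        return sum(preds) / len(preds)  # average for regression
    return predict

train = [(10, 35), (20, 60), (30, 85)]  # toy (size, effort) projects
predict = bagged_predictor(train)
print(predict(20))
```

Because members see different resamples, they disagree on some inputs; averaging their outputs is what gives bagging its variance reduction.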
  • 16. Experimental Framework – Step 2: choice of evaluation method Executions were done in 30 rounds, 10 projects for testing and remaining for training, as suggested by Menzies et al. TSE’06. Evaluation was done in two steps: 1 Menzies et al. TSE’06’s survival rejection rules: If MMREs are significantly different according to a paired t-test with 95% of confidence, the best model is the one with the lowest average MMRE. If not, the best method is the one with the best: 1 Correlation 2 Standard deviation 3 PRED(N) 4 Number of attributes 2 Wilcoxon tests with 95% of confidence to compare the two methods more often among the best in terms of MMRE and PRED(25). Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 8 / 22
  • 19. Experimental Framework – Step 2: choice of evaluation method Mean Magnitude of the Relative Error: MMRE = (1/T) * sum_{i=1..T} MRE_i, where MRE_i = |predicted_i - actual_i| / actual_i. Percentage of estimations within N% of the actual values: PRED(N) = (1/T) * sum_{i=1..T} [1 if MRE_i <= N/100, 0 otherwise]. Correlation between estimated and actual effort: CORR = S_pa / (S_p * S_a), where S_pa = sum_{i=1..T} (predicted_i - pbar)(actual_i - abar) / (T - 1), S_p = sqrt(sum_{i=1..T} (predicted_i - pbar)^2 / (T - 1)), S_a = sqrt(sum_{i=1..T} (actual_i - abar)^2 / (T - 1)), pbar = (1/T) * sum_{i=1..T} predicted_i, abar = (1/T) * sum_{i=1..T} actual_i. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 9 / 22
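These three measures translate directly from their definitions into code; a small sketch with toy prediction vectors (T is the number of test projects):

```python
import math

def mmre(pred, actual):
    """Mean Magnitude of the Relative Error."""
    return sum(abs(p - a) / a for p, a in zip(pred, actual)) / len(actual)

def pred_n(pred, actual, n=25):
    """Fraction of estimates within n% of the actual effort."""
    hits = sum(1 for p, a in zip(pred, actual) if abs(p - a) / a <= n / 100)
    return hits / len(actual)

def corr(pred, actual):
    """Pearson correlation between estimated and actual effort."""
    t = len(pred)
    pbar, abar = sum(pred) / t, sum(actual) / t
    spa = sum((p - pbar) * (a - abar) for p, a in zip(pred, actual)) / (t - 1)
    sp = math.sqrt(sum((p - pbar) ** 2 for p in pred) / (t - 1))
    sa = math.sqrt(sum((a - abar) ** 2 for a in actual) / (t - 1))
    return spa / (sp * sa)

pred, actual = [100, 240, 330], [110, 200, 300]
print(round(mmre(pred, actual), 3))  # -> 0.13
print(pred_n(pred, actual))          # -> 1.0
```

Note that MMRE penalises relative error symmetrically around the actual value, while PRED(25) only counts whether an estimate lands inside the 25% band.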
  • 22. Experimental Framework – Step 3: choice of parameters Preliminary experiments using 5 runs. Each approach was run with all the combinations of 3 or 5 parameter values. Parameters with the lowest MMRE were chosen for further 30 runs. Base learners will not necessarily have the same parameters as single learners. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 10 / 22
  • 23. Comparison of Learning Machines – Menzies et al. TSE’06’s survival rejection rules Table: Number of Data Sets in which Each Method Survived (methods that never survived are omitted). PROMISE Data: RT: 2, Bag+MLP: 1, NCL+MLP: 1, Rand+MLP: 1, Bag+RBF: 1. ISBSG Data: MLP: 2, Bag+RTs: 2, Bag+MLP: 1, RT: 1, NCL+MLP: 1. All Data: RT: 3, Bag+MLP: 2, NCL+MLP: 2, Bag+RTs: 2, MLP: 2, Rand+MLP: 1, Bag+RBF: 1. No approach is consistently the best, even considering ensembles! Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 11 / 22
  • 25. Comparison of Learning Machines What methods are usually among the best? Table: Number of Data Sets in which Each Method Was Ranked First or Second According to MMRE and PRED(25) (methods never ranked first or second are omitted). (a) According to MMRE – PROMISE Data: RT: 4, Bag+MLP: 3, Bag+RT: 2, MLP: 1, Rand+MLP: 1, NCL+MLP: 1. ISBSG Data: RT: 5, Bag+MLP: 5, Bag+RBF: 3, MLP: 1. All Data: RT: 9, Bag+MLP: 8, Bag+RBF: 3, MLP: 2, Bag+RT: 2, Rand+MLP: 1, NCL+MLP: 1. (b) According to PRED(25) – PROMISE Data: Bag+MLP: 3, Rand+MLP: 3, Bag+RT: 2, RT: 1, MLP: 1, Bag+RBF: 1. ISBSG Data: RT: 5, Rand+MLP: 3, Bag+MLP: 2, MLP: 2, RBF: 2, Bag+RT: 1. All Data: RT: 6, Rand+MLP: 6, Bag+MLP: 5, Bag+RT: 3, MLP: 3, RBF: 2, Bag+RBF: 1. Observations: RTs and bag+MLPs are more frequently among the best considering MMRE than considering PRED(25). The first-ranked method’s MMRE is statistically different from the others in 35.16% of the cases; the second-ranked method’s MMRE is statistically different from the lower-ranked methods in 16.67% of the cases. RTs and bag+MLPs are usually statistically equal in terms of MMRE and PRED(25). Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 12 / 22
  • 26. Research Questions – Revisited Question 1 Do readily available ensemble methods generally improve effort estimations given by single learners? Which of them would be more useful? Even though bag+MLPs is frequently among the best methods, it is statistically similar to RTs. RTs are more comprehensible and have faster training. Bag+MLPs seem to have more potential for improvements. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 13 / 22
  • 27. Why Were RTs Singled Out? Hypothesis: As RTs have splits based on information gain, they may work in such a way as to give more importance to more relevant attributes. A further study using correlation-based feature selection revealed that RTs usually place features ranked higher by the feature selection method in higher-level splits of the tree. Feature selection by itself was not always able to improve accuracy. It may be important to give weights to features when using ML approaches. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 14 / 22
  • 28. Why Were RTs Singled Out? Table: Correlation-Based Feature Selection and RT Attributes Relative Importance for Cocomo81. Columns: attributes ranking; first tree level in which the attribute appears in more than 50% of the trees; percentage of trees. LOC: Level 0, 100.00%. Development mode, Required software reliability: Level 1, 90.00%. Modern programming practices, Time constraint for cpu: Level 2, 73.33%. Data base size: Level 2, 83.34%. Main memory constraint, Turnaround time, Programmers capability, Analysts capability, Language experience, Virtual machine experience, Schedule constraint, Application experience: Level 2, 66.67%. Use of software tools, Machine volatility. Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk Ensembles for Software Effort Estimation 15 / 22
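The hypothesised mechanism (splits that favour relevant attributes) can be illustrated with the variance-reduction criterion a regression tree uses to pick its root split. The data and attribute names below are hypothetical; a relevant attribute yields a larger SSE reduction than a noise attribute and so ends up higher in the tree.

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def split_gain(values, efforts):
    """Best SSE reduction achievable by one binary split on this
    attribute: the criterion an RT uses to choose its top splits."""
    base = sse(efforts)
    best = 0.0
    for cut in sorted(set(values))[:-1]:
        left = [e for v, e in zip(values, efforts) if v <= cut]
        right = [e for v, e in zip(values, efforts) if v > cut]
        best = max(best, base - sse(left) - sse(right))
    return best

# hypothetical attributes: LOC drives effort, team size is noise
loc = [10, 20, 30, 40]
team = [3, 5, 3, 5]
effort = [25, 50, 75, 100]
print(split_gain(loc, effort) > split_gain(team, effort))  # -> True
```

This mirrors the table above: LOC-like attributes win the root split in (nearly) every tree, while weakly relevant attributes only appear at deeper levels.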
• 29. Why Were Bag+MLPs Singled Out?
Hypothesis: bag+MLPs may have led to a more adequate level of diversity.
Using correlation as the diversity measure, bag+MLPs usually had more moderate values when they were the 1st or 2nd ranked method according to MMRE.
However, the correlation between diversity and MMRE was usually quite low.

Table: Correlation Interval Across Different Data Sets, Considering Data Sets in which Bag+MLPs Were Ranked 1st or 2nd.
Approach   Correlation interval
Bag+MLP    0.74-0.92
Bag+RBF    0.40-0.83
Bag+RT     0.51-0.81
NCL+MLP    0.59-1.00
Rand+MLP   0.93-1.00

Table: Correlation Interval Across Different Data Sets, Considering All Data Sets.
Approach   Correlation interval
Bag+MLP    0.47-0.98
Bag+RBF    0.40-0.83
Bag+RT     0.37-0.88
NCL+MLP    0.59-1.00
Rand+MLP   0.93-1.00
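Diversity as measured here is the pairwise correlation between the predictions of ensemble members: values near 1 mean the members behave almost identically (low diversity). A minimal sketch on toy data, using bootstrap-trained bin-mean predictors as stand-ins for the MLP base learners:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0.0, 3.0, n)

def fit_bin_means(xs, ys, edges):
    """Predict the mean y of training points in each x-bin — a crude stand-in
    for one base learner (e.g. a small regression model)."""
    ids = np.digitize(xs, edges)
    return np.array([ys[ids == k].mean() if np.any(ids == k) else ys.mean()
                     for k in range(len(edges) + 1)])

# Bagging: each member is trained on a different bootstrap sample
edges = np.linspace(0, 10, 6)[1:-1]  # 5 bins over [0, 10]
preds = []
for _ in range(10):
    idx = rng.integers(0, n, n)      # bootstrap resample
    means = fit_bin_means(x[idx], y[idx], edges)
    preds.append(means[np.digitize(x, edges)])
preds = np.array(preds)

# Diversity measure: average pairwise correlation of member predictions
m = len(preds)
pair_corrs = [np.corrcoef(preds[i], preds[j])[0, 1]
              for i in range(m) for j in range(i + 1, m)]
avg_corr = float(np.mean(pair_corrs))
print("average pairwise correlation:", round(avg_corr, 3))
```

Bootstrap resampling alone yields members whose predictions are still highly correlated, which illustrates why the intervals in the tables above sit well above zero.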
• 30. Taking a Closer Look...
Table: Correlations between ensemble covariance (diversity) and train/test MMRE for the data sets in which bag+MLP obtained the best MMREs and was ranked 1st or 2nd, against the data sets in which it obtained the worst MMREs.

                          Cov. vs Test MMRE   Cov. vs Train MMRE
Best MMRE (desharnais)    0.24                0.14
2nd best MMRE (org2)      0.70                0.38
2nd worst MMRE (org7)     -0.42               -0.37
Worst MMRE (cocomo2)      -0.99               -0.99
• 31. Taking a Closer Look...
Diversity is not only affected by the ensemble method, but also by the data set: software effort estimation data sets are very different from each other.
• 32. Taking a Closer Look...
The correlation between diversity and performance on the test set follows the tendency on the training set.
Why do we have a negative correlation in the worst cases?
Could a method that self-adapts diversity help to improve estimations? How?
• 33. Research Questions – Revisited
Question 2: If a particular method is singled out, what insight on how to improve effort estimations can we gain by analysing its behaviour and the reasons for its better performance?
RTs give more importance to more important features; weighting attributes may be helpful when using ML for software effort estimation.
Ensembles seem to have more room for improvement for software effort estimation; a method that self-adapts diversity might help to improve estimations.
• 34. Research Questions – Revisited
Question 3: How can someone determine which model to use for a particular data set?
Effort estimation data sets dramatically affect the behaviour and performance of different learning machines, even ensembles.
It is therefore necessary to run experiments (parameter choice is important) using existing data from a particular company to determine which method is likely to be the best.
If the software manager does not have enough knowledge of the models, RTs are a good choice.
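The "run experiments on your own data" advice can be sketched as a simple holdout comparison. The data set and both candidate estimators below are hypothetical stand-ins (a mean-effort baseline versus a linear model standing in for an RT or ensemble), chosen only to show the mechanics:

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error."""
    return float(np.mean(np.abs(actual - predicted) / actual))

rng = np.random.default_rng(2)
n = 80
size = rng.uniform(5, 100, n)                          # project size attribute
effort = 4.0 * size + 30.0 + rng.normal(0.0, 10.0, n)  # historical effort

# Holdout split over the company's historical projects
train, test = np.arange(0, 60), np.arange(60, n)

# Candidate 1: mean-effort baseline
baseline_pred = np.full(len(test), effort[train].mean())

# Candidate 2: simple linear model fitted on the training projects
a, b = np.polyfit(size[train], effort[train], 1)
linear_pred = a * size[test] + b

scores = {"baseline": mmre(effort[test], baseline_pred),
          "linear": mmre(effort[test], linear_pred)}
print("chosen model:", min(scores, key=scores.get))
```

In practice the comparison would cover several candidate methods, parameter settings, and a proper statistical analysis, as the slide emphasises; the holdout above is only the skeleton of that procedure.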
• 35. Risk Analysis
The learning machines singled out (RTs and bagging+MLPs) were further tested using the outlier projects.
MMRE was similar or lower (better), usually better than for the outlier-free data sets.
PRED(25) was similar or lower (worse), usually lower.
Even though outliers are projects for which the learning machines have more difficulty predicting within 25% of the actual effort, they are not the projects for which they give the worst estimates.
• 37. Conclusions and Future Work
RQ1 – Readily available ensembles do not provide generally better effort estimations. Principled experiments (parameters, statistical analysis, several data sets, more ensemble approaches) were used to deal with validity issues.
RQ2 – RTs + weighting features; bagging with MLPs + self-adapting diversity. Insight based on experiments, not just intuition or speculation.
RQ3 – Principled experiments to choose the model; RTs if resources are lacking. There is no universally good model, even when using ensembles; parameter choice matters in the framework.
Future work:
Learning feature weights in ML for effort estimation.
Can we use self-tuning diversity in ensembles of learning machines to improve estimations?
• 41. Acknowledgements
Search Based Software Engineering (SEBASE) research group.
Dr. Rami Bahsoon.
This work was funded by EPSRC grant No. EP/D052785/1.