Approximating Change Sets
   at Philips Healthcare:
       A Case Study


 Adam Vanya   Rahul Premraj   Hans van Vliet

          VU University Amsterdam
Some healthcare products that are
     subjects of this study...
Achieva 3.0T TX
Intera 1.5T MRI
Panorama HFO
Philips MRI Systems
• Eight million lines of code across 34,000
  files.
• C, C++, and C# used.
• Hundreds of developers across 3 sites.
• An old version of IBM ClearCase used for
  version control.
• Nine years of version control data available.
Software architects at Philips
“Are our software systems
 easy to maintain and
 evolve in the future?”
“Re-engineering in the large...”
                                                  Jens Borchers




     Reengineering from a Practitioner’s View –
      A Personal Lesson’s Learned Assessment
                   Invited talk at CSMR 2011
Re-engineering in the large

By functionality   By team location   By other criteria
“Before coming up with
 a solution, how about
 first finding what and
 where the problem is?”
We need data!
              •   Which developers change
                  which files?

              •   Which functionality is
                  implemented in a file?

              •   Which sub-systems are
                  often changed together?

              •   ...
Change Sets
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING,          VOL. 31,   NO. 6,   JUNE 2005                                                               429




                              Mining Version Histories to
                              Guide Software Changes
                      Thomas Zimmermann, Student Member, IEEE, Peter Weißgerber,
                     Stephan Diehl, and Andreas Zeller, Member, IEEE Computer Society

     Abstract—We apply data mining to version histories in order to guide programmers along related changes: “Programmers who
     changed these functions also changed....” Given a set of existing changes, the mined association rules 1) suggest and predict likely
     further changes, 2) show up item coupling that is undetectable by program analysis, and 3) can prevent errors due to incomplete
     changes. After an initial change, our ROSE prototype can correctly predict further locations to be changed; the best predictive power is
     obtained for changes to existing software. In our evaluation based on the history of eight popular open source projects, ROSE’s
     topmost three suggestions contained a correct location with a likelihood of more than 70 percent.

     Index Terms—Programming environments/construction tools, distribution, maintenance, enhancement, configuration management,
     clustering, classification, association rules, data mining.
Identifying Transactions

developer:  hugo
log msg.:   Fixed bug #13463
timestamp:  Jul 23 2005 02:16:57

A:1.3   B:1.6   C:1.1   D:1.3   E:1.5

same author + same log message
time window: 200 seconds
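
The grouping rule on this slide (same author, same log message, 200-second time window) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CheckIn` record, the function name, and the choice to measure the window between consecutive check-ins are hypothetical, and the sixth check-in (`F:1.2`) and all time offsets are made up to show where a transaction ends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckIn:
    """One file revision committed to version control (hypothetical record)."""
    file: str
    revision: str
    developer: str
    message: str
    time: datetime


def identify_transactions(check_ins, window=timedelta(seconds=200)):
    """Group check-ins that share developer and log message and follow
    the previous check-in of the same group within `window`."""
    transactions = []
    open_tx = {}  # (developer, message) -> transaction still being extended
    for ci in sorted(check_ins, key=lambda c: c.time):
        key = (ci.developer, ci.message)
        tx = open_tx.get(key)
        if tx is not None and ci.time - tx[-1].time <= window:
            tx.append(ci)
        else:
            tx = [ci]
            open_tx[key] = tx
            transactions.append(tx)
    return transactions


# Data from the slide; the 10-second spacing is an assumption.
start = datetime(2005, 7, 23, 2, 16, 57)
files = [("A", "1.3"), ("B", "1.6"), ("C", "1.1"), ("D", "1.3"), ("E", "1.5")]
check_ins = [
    CheckIn(f, r, "hugo", "Fixed bug #13463", start + timedelta(seconds=10 * i))
    for i, (f, r) in enumerate(files)
]
# Hypothetical later check-in, outside the 200-second window.
check_ins.append(
    CheckIn("F", "1.2", "hugo", "Fixed bug #13463", start + timedelta(minutes=10))
)

transactions = identify_transactions(check_ins)
for tx in transactions:
    print([f"{c.file}:{c.revision}" for c in tx])
```

With these inputs the five revisions from the slide end up in one transaction and the late check-in in a second one.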
Change Sets

Task 1:   A:1.3   B:1.6   C:1.1   D:1.3   E:1.5
Task 2:   A:1.4   J:1.2   E:1.6
Environment at Philips

• Developers often commit files associated
  with more than one task.
• No buildable system required for commit.
• Multiple developers may work together on
  complex tasks.
Environment at Philips
        Developers rarely add commit messages!




A:1.3    B:1.6   C:1.1   D:1.3   E:1.5
Identifying Change Sets
• 200 seconds
• 1 hour
• 1 day
• 1 week
• 1 month
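
Without log messages, only the developer and the timestamp remain as grouping criteria. The sketch below tries the five time intervals on a handful of made-up check-ins. It assumes that check-ins by the same developer are chained whenever the gap to the previous check-in is at most δ, and that a month is 30 days; the slides state neither, and the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta

# The five intervals from the slide; "1 month" as 30 days is an assumption.
DELTAS = {
    "200 seconds": timedelta(seconds=200),
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
    "1 month": timedelta(days=30),
}


def approximate_change_sets(check_ins, delta):
    """Group (developer, file, time) tuples per developer, starting a new
    change set whenever the gap to the previous check-in exceeds delta."""
    current = {}  # developer -> change set still being extended
    change_sets = []
    for dev, file, time in sorted(check_ins, key=lambda c: c[2]):
        cs = current.get(dev)
        if cs is not None and time - cs[-1][2] <= delta:
            cs.append((dev, file, time))
        else:
            cs = [(dev, file, time)]
            current[dev] = cs
            change_sets.append(cs)
    return change_sets


# Made-up check-ins for illustration only.
t0 = datetime(2006, 1, 3, 15, 0, 0)
check_ins = [
    ("hugo", "A", t0),
    ("anna", "J", t0 + timedelta(seconds=50)),
    ("hugo", "B", t0 + timedelta(seconds=100)),
    ("hugo", "C", t0 + timedelta(minutes=30)),
    ("hugo", "D", t0 + timedelta(hours=5)),
]

counts = {
    name: len(approximate_change_sets(check_ins, delta))
    for name, delta in DELTAS.items()
}
print(counts)
```

Larger intervals merge more check-ins into fewer, bigger sets, which is the trend Table I reports for the real data.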
Approximated Change Sets
  Table I: The Approximated Change Sets (ACS)

                          #Check-ins per ACS
       δ    #ACSs    Min.    Max.    Med.      Avg.
  200 sec   115487    1      14002    2          8
   1 hour   82571     1      14551     2        11
    1 day   42447     1      14551    4        22
  1 week    13568     1      19404    9        69
 1 month     3408     1      27502    27       275
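
The columns of Table I are plain summary statistics over the number of check-ins per approximated change set. A minimal sketch, assuming each change set is given as a list of check-ins; the helper name and the sample sizes are made up:

```python
from statistics import mean, median


def summarize(change_sets):
    """Return min, max, median and average number of check-ins per change set."""
    sizes = [len(cs) for cs in change_sets]
    return {
        "min": min(sizes),
        "max": max(sizes),
        "med": median(sizes),
        "avg": mean(sizes),
    }


# Hypothetical change sets with 1, 2, 2, 3 and 12 check-ins.
sample = [["x"] * n for n in (1, 2, 2, 3, 12)]
summary = summarize(sample)
print(summary)
```

A single very large change set pulls the average well above the median, the same skew visible in Table I, where the maxima are in the tens of thousands while the medians stay small.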
Evaluating Change Sets

• Cross check with developers
• Postlists
Developer survey

• Ten most active developers invited to
  participate. Eight responded.
• Participants briefed on purpose of survey,
  how change sets were approximated, and
  possibility to discontinue survey.
• Randomly drawn change sets presented.
• Developers evaluated 75 change sets.
Precision from survey
  Table II: Precision estimated with help of developers

  Time interval      Precision (in %)           #ACS*
            (δ)    Max.   Min.   Avg.    Analyzed   Skipped
    200 seconds     100     50     91          19         3
         1 hour     100     33     91          15         4
          1 day     100     40     78          21         4
         1 week     100      6     66          14         7
        1 month     100      2     36           6         8

  * ACS stands for Approximated Change Sets
Example Postlist
              Unique ID   <POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114
     Stream information   <DEVSTREAM>FEMAIN
              Developer   <USER>Anna
                   Date   <DATE_DD_MMM_YYYY>3 JAN 2006
Is it a problem report?   <PR_SECTION>Y
  Problem report number   <SOLVED_PR>MR00035599
Rationale behind change   <REASON_TEXT>Improve simulation of the RF-Amplifier
                          <CODING_STANDARD>N
                          <PNS_SAR>N
     Developers playing   <BBLOCK_OWNER>James
          a role in the   <REVIEWER>Robert
         review process   <TEAMLEADER>David
                          <DOC_SECTION>N
Changed files submitted   <POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5
             for review   <POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3
                          <POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5
                          <TEST_SECTION>Y
                          <TEST_DONE>@OTM6
 Last versions of files   <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain2
   used to build system   <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main1
                          <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain4
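
Judging only from the example above, a postlist is a sequence of `<TAG>value` lines, so reading one could look roughly like the sketch below. The function names are hypothetical, the tag pattern is inferred from this single example, and treating the text before `@@` as the file element is an assumption about the ClearCase-style version suffix.

```python
import re
from collections import defaultdict

# One "<TAG>value" pair per line, as in the example postlist.
TAG_LINE = re.compile(r"^\s*<([A-Z_]+)>(.*)$")


def parse_postlist(text):
    """Map each tag to the list of values it carries, in order of appearance."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            fields[match.group(1)].append(match.group(2).strip())
    return dict(fields)


def changed_files(fields):
    """Files submitted for review, without the version part after '@@'."""
    return [entry.split("@@", 1)[0] for entry in fields.get("POSTLIST_FILE", [])]


# Excerpt of the example postlist, copied as it appears on the slide.
example = """\
<POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114
<USER>Anna
<DATE_DD_MMM_YYYY>3 JAN 2006
<SOLVED_PR>MR00035599
<POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5
<POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3
<POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5
"""

parsed = parse_postlist(example)
files = changed_files(parsed)
print(parsed["USER"], files)
```

The set of `POSTLIST_FILE` entries is what an approximated change set can then be compared against.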
Results from postlists

  Table III: Precision estimated using postlists

  Time interval      Precision (in %)
            (δ)    Max.   Min.   Avg.
    200 seconds     100     50     93
         1 hour     100     20     89
          1 day     100      1     69
         1 week     100     <1     31
        1 month      95     <1      8

  Table IV: Recall estimated using postlists

  Time interval       Recall (in %)
            (δ)    Max.   Min.   Avg.
    200 seconds     100     20     74
         1 hour     100     20     84
          1 day     100     22     92
         1 week     100     24     94
        1 month     100     25     94
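
The slides do not spell out how precision and recall are computed. The sketch below uses the usual set-based definitions, which is an assumption: precision is the share of files in an approximated change set that also appear in the matching postlist, and recall is the share of postlist files that the approximated change set contains. The function name and file names are illustrative.

```python
def precision_recall(approximated, reference):
    """Set-based precision and recall of an approximated change set
    against a reference change set (for example, the files of a postlist)."""
    approximated, reference = set(approximated), set(reference)
    if not approximated or not reference:
        raise ValueError("both change sets must be non-empty")
    common = approximated & reference
    return len(common) / len(approximated), len(common) / len(reference)


# Hypothetical example: the approximated set holds five files, the postlist four.
acs_files = {"A", "B", "C", "D", "E"}
postlist_files = {"A", "B", "C", "J"}

precision, recall = precision_recall(acs_files, postlist_files)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Widening the time interval can only add files to an approximated change set, which fits the pattern in Tables III and IV: recall rises with δ while precision falls.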
So, what works?



• The one hour time interval works best for
  our environment.
• Optimal time interval may differ from one
  environment to another.
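
One way to see the trade-off behind this conclusion is to combine the average precision and recall from Tables III and IV into a single score. The harmonic mean (F1) used below is an assumption for illustration; the slides do not say which criterion the authors applied, and an F1 of averages is only a rough indicator.

```python
# Average precision and recall (in %) per time interval, from Tables III and IV.
AVERAGES = {
    "200 seconds": (93, 74),
    "1 hour": (89, 84),
    "1 day": (69, 92),
    "1 week": (31, 94),
    "1 month": (8, 94),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


scores = {interval: f1(p, r) for interval, (p, r) in AVERAGES.items()}
best = max(scores, key=scores.get)

for interval, score in scores.items():
    print(f"{interval:>12}: {score:.1f}")
print("best:", best)
```

Under this scoring the one hour interval comes out on top, ahead of 200 seconds and 1 day.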
Threats to validity
• Assumed developers have a good recall of
  their own change sets.
• Postlists were carefully selected, but in rare
  cases may relate to more than one change
  set.
• Change sets with multiple developers
  involved were not captured completely.
Summary

Evaluate your pre-processed data before use!

More Related Content

PDF
Vhdl ppt
PPTX
TCI 2016 What drives prosperity?
PPT
The Value Of Distribution Philips Medical Systems Ck
PPTX
TCI 2016 Philips: Disruptive Innovations and new business models in Health Care
PPTX
Philips presentation3
DOCX
A report on MARKETING MIX in philips
PPTX
Static Code Analysis PHP[tek] 2023
PDF
Git vs. Mercurial
Vhdl ppt
TCI 2016 What drives prosperity?
The Value Of Distribution Philips Medical Systems Ck
TCI 2016 Philips: Disruptive Innovations and new business models in Health Care
Philips presentation3
A report on MARKETING MIX in philips
Static Code Analysis PHP[tek] 2023
Git vs. Mercurial

Similar to Approximating Change Sets at Philips Healthcare: A Case Study (20)

PPTX
Top Java Performance Problems and Metrics To Check in Your Pipeline
PPT
OOUG - Oracle Performance Tuning with AAS
PDF
Production Readiness Strategies in an Automated World
PPTX
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
PPTX
FluentMigrator - Dayton .NET - July 2023
PDF
Become a Performance Diagnostics Hero
PDF
Resilience Engineering: A field of study, a community, and some perspective s...
PDF
A tale of bug prediction in software development
PPTX
Taking Database Development to the 21st Century
PDF
JS Fest 2019. Олег Докука и Даниил Дробот. RSocket - future Reactive Applicat...
KEY
Turbocharge your automated tests with ci
ODP
Performance Optimization of Rails Applications
PDF
Fosdem10
PPTX
Devops is all greek
PDF
Consul administration at scale
PDF
Real-time collaboration in distributed systems for JavaScript developers. 
PDF
Cより速いRubyプログラム
PDF
AI For Software Engineering: Two Industrial Experience Reports
PDF
Mining and Untangling Change Genealogies (PhD Defense Talk)
PDF
Seaside Portability
Top Java Performance Problems and Metrics To Check in Your Pipeline
OOUG - Oracle Performance Tuning with AAS
Production Readiness Strategies in an Automated World
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
FluentMigrator - Dayton .NET - July 2023
Become a Performance Diagnostics Hero
Resilience Engineering: A field of study, a community, and some perspective s...
A tale of bug prediction in software development
Taking Database Development to the 21st Century
JS Fest 2019. Олег Докука и Даниил Дробот. RSocket - future Reactive Applicat...
Turbocharge your automated tests with ci
Performance Optimization of Rails Applications
Fosdem10
Devops is all greek
Consul administration at scale
Real-time collaboration in distributed systems for JavaScript developers. 
Cより速いRubyプログラム
AI For Software Engineering: Two Industrial Experience Reports
Mining and Untangling Change Genealogies (PhD Defense Talk)
Seaside Portability
Ad

More from Rahul Premraj (7)

PDF
An Empirical Analysis of Software Productivity Over Time
PDF
How Developer Communication Frequency Relates to Bug Introducing Changes
ZIP
Improving Bug Tracking Systems
PDF
What makes a good bug report?
PDF
Predicting Software Metrics at Design Time
PDF
On the Treatment of Bug Reports in Open-Source Projects
PDF
Building Cost Estimation Models using Homogeneous Data
An Empirical Analysis of Software Productivity Over Time
How Developer Communication Frequency Relates to Bug Introducing Changes
Improving Bug Tracking Systems
What makes a good bug report?
Predicting Software Metrics at Design Time
On the Treatment of Bug Reports in Open-Source Projects
Building Cost Estimation Models using Homogeneous Data
Ad

Recently uploaded (20)

PDF
sbt 2.0: go big (Scala Days 2025 edition)
PDF
Co-training pseudo-labeling for text classification with support vector machi...
PPTX
Module 1 Introduction to Web Programming .pptx
PDF
Comparative analysis of machine learning models for fake news detection in so...
PDF
“A New Era of 3D Sensing: Transforming Industries and Creating Opportunities,...
PDF
Flame analysis and combustion estimation using large language and vision assi...
PDF
Transform-Your-Streaming-Platform-with-AI-Driven-Quality-Engineering.pdf
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PDF
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
PDF
The influence of sentiment analysis in enhancing early warning system model f...
PDF
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PDF
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
DOCX
Basics of Cloud Computing - Cloud Ecosystem
PDF
Advancing precision in air quality forecasting through machine learning integ...
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
PPTX
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
sbt 2.0: go big (Scala Days 2025 edition)
Co-training pseudo-labeling for text classification with support vector machi...
Module 1 Introduction to Web Programming .pptx
Comparative analysis of machine learning models for fake news detection in so...
“A New Era of 3D Sensing: Transforming Industries and Creating Opportunities,...
Flame analysis and combustion estimation using large language and vision assi...
Transform-Your-Streaming-Platform-with-AI-Driven-Quality-Engineering.pdf
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
The influence of sentiment analysis in enhancing early warning system model f...
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
Dell Pro Micro: Speed customer interactions, patient processing, and learning...
Basics of Cloud Computing - Cloud Ecosystem
Advancing precision in air quality forecasting through machine learning integ...
4 layer Arch & Reference Arch of IoT.pdf
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
sustainability-14-14877-v2.pddhzftheheeeee
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf

Approximating Change Sets at Philips Healthcare: A Case Study

  • 1. Approximating Change Sets at Philips Healthcare: A Case Study Adam Vanya Rahul Premraj Hans van Vliet VU University Amsterdam
  • 2. some healthcare products that are subjects of this study...
  • 6. Philips MRI Systems • Eight million lines of code across 34,000 files. • C, C++, and C# used. • Hundreds of developers across 3 sites. • An old version of IBM ClearCase used for version control. • Nine years of version control data available.
  • 9. !"#$%&"$'%()*"#$'+',#-'$ #*'+$,%$-*./,*./$*/0$ ” #1%21#$./$,3#$4&,&"#5
  • 10. “Re-engineering in the large...” Jens Borchers Reengineering from a Practitioner’s View – A Personal Lesson’s Learned Assessment Invited talk at CSMR 2011
  • 11. Re-engineering in the large By functionality By team location By other criteria
  • 13. We need data! • Which developers change which files? • Which functionality is implemented in a file? • Which sub-systems are often changed together? • ... Change Sets
  • 14. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 31, NO. 6, JUNE 2005 429 Mining Version Histories to Guide Software Changes Thomas Zimmermann, Student Member, IEEE, Peter Weißgerber, Stephan Diehl, and Andreas Zeller, Member, IEEE Computer Society Abstract—We apply data mining to version histories in order to guide programmers along related changes: “Programmers who changed these functions also changed....” Given a set of existing changes, the mined association rules 1) suggest and predict likely further changes, 2) show up item coupling that is undetectable by program analysis, and 3) can prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict further locations to be changed; the best predictive power is obtained for changes to existing software. In our evaluation based on the history of eight popular open source projects, ROSE’s topmost three suggestions contained a correct location with a likelihood of more than 70 percent. Index Terms—Programming environments/construction tools, distribution, maintenance, enhancement, configuration management, clustering, classification, association rules, data mining.
  • 15. Identifying Transactions A:1.3 B:1.6 C:1.1 D:1.3 E:1.5
  • 16. Identifying Transactions developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5
  • 17. Identifying Transactions developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5 same author + same log message
  • 18. Identifying Transactions developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5 same author + same log message 200 seconds
  • 19. Identifying Transactions developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5 same author + same log message 200 seconds
  • 20. Identifying Transactions developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5 same author + same log message 200 seconds
  • 21. Change Sets Task 1 Task 2 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5 A:1.4 J:1.2 E:1.6
  • 22. Environment at Philips • Developers often commit files associated to more than one task. • No build-able system required for commit. • Multiple developers may work together for complex tasks.
  • 23. Environment at Philips Developers rarely add commit messages! A:1.3 B:1.6 C:1.1 D:1.3 E:1.5
  • 24. Environment at Philips Developers rarely add commit messages! developer:
hugo log
msg.
:
Fixed
bug
#13463 timestamp:
Jul
23
2005
02:16:57 A:1.3 B:1.6 C:1.1 D:1.3 E:1.5
  • 28. Identifying Change Sets 200 seconds 1 hour
  • 29. Identifying Change Sets 200 seconds 1 hour 1 day 1 week 1 month
  • 30. Approximated Change Sets Table I: The Approximated Change Sets (ACS) #Check-ins per ACS δ #ACSs Min. Max. Med. Avg. 200 sec 115487 1 14002 2 8 1 hour 82571 1 14551 2 11 1 day 42447 1 14551 4 22 1 week 13568 1 19404 9 69 1 month 3408 1 27502 27 275
  • 31. Approximated Change Sets #Check-ins per ACS “ !"#$"%&'(%#)*(+,-.%#/% δ #ACSs Min. Max. Med. Avg. ” 200 sec 115487 1 14002 2 8 0(/*%*1%2/(314551 4 22 1 hour 82571 1 day 42447 1 1 14551 2 11 1 week 13568 1 19404 9 69 1 month 3408 1 27502 27 275
  • 32. Evaluating Change Sets ! ! !"#$$%&'(&) *+,'% !"#$%&#$# -(.(/#0("$
  • 33. Evaluating Change Sets ! ! !"#$$%&'(&) *+,'% !"#$%&#$# -(.(/#0("$
  • 40. Developer survey • Ten most active developers invited to participate. Eight responded. • Participants briefed on purpose of survey, how change sets were approximated, and possibility to discontinue survey. • Randomly drawn change sets presented. • Developers evaluated 75 change sets.
  • 41. Precision from survey Table II: Precision estimated with help of developers Time interval Precision (in %) #ACS ∗ (δ) Max. Min. Avg. Analyzed Skipped 200 seconds 100 50 91 19 3 1 hour 100 33 91 15 4 1 day 100 40 78 21 4 1 week 100 6 66 14 7 1 month 100 2 36 6 8 ∗ ACS stands for Approximated Change Sets
  • 42. Precision from survey Table II: Precision estimated with help of developers Time interval Precision (in %) #ACS ∗ (δ) Max. Min. Avg. Analyzed Skipped 200 seconds 100 50 91 19 3 1 hour 100 33 91 15 4 1 day 100 40 78 21 4 1 week 100 6 66 14 7 1 month 100 2 36 6 8 ∗ ACS stands for Approximated Change Sets
  • 43. Evaluating Change Sets ! ! !"#$$%&'(&) *+,'% !"#$%&#$# -(.(/#0("$
  • 44. Evaluating Change Sets ! ! !"#$$%&'(&) *+,'% !"#$%&#$# -(.(/#0("$
  • 45. Example Postlist Unique ID <POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114 Stream information <DEVSTREAM>FEMAIN Developer <USER>Anna Date <DATE_DD_MMM_YYYY>3 JAN 2006 Is it a problem report? <PR_SECTION>Y Problem report number <SOLVED_PR>MR00035599 Rationale behind change <REASON_TEXT>Improve simulation of the RF-Amplifier <CODING_STANDARD>N <PNS_SAR>N Developers playing <BBLOCK_OWNER>James a role in the <REVIEWER>Robert review process <TEAMLEADER>David <DOC_SECTION>N Changed files submitted <POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5 for review <POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3 <POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5 <TEST_SECTION>Y <TEST_DONE>@OTM6 Last versions of files <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain2 used to build system <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main1 <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain4
  • 46. Example Postlist Unique ID <POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114 Stream information <DEVSTREAM>FEMAIN Developer <USER>Anna Date <DATE_DD_MMM_YYYY>3 JAN 2006 Is it a problem report? <PR_SECTION>Y Problem report number <SOLVED_PR>MR00035599 Rationale behind change <REASON_TEXT>Improve simulation of the RF-Amplifier <CODING_STANDARD>N <PNS_SAR>N Developers playing <BBLOCK_OWNER>James a role in the <REVIEWER>Robert review process <TEAMLEADER>David <DOC_SECTION>N Changed files submitted <POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5 for review <POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3 <POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5 <TEST_SECTION>Y <TEST_DONE>@OTM6 Last versions of files <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain2 used to build system <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main1 <PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain4
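The example postlist suggests a simple tag-prefixed record format. A minimal sketch of reading such a record follows, assuming (this is not stated on the slides) that each field sits on its own line as `<TAG>value` and that repeated tags such as `<POSTLIST_FILE>` should be collected in order. The function name `parse_postlist` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: one "<TAG>value" field per line; tags are upper-case with underscores.
TAG_LINE = re.compile(r"^\s*<([A-Z_]+)>(.*)$")

def parse_postlist(text):
    """Return a dict mapping each tag to the list of its values, in file order."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:  # lines without a leading tag are ignored
            tag, value = match.groups()
            fields[tag].append(value.strip())
    return dict(fields)

# Field values copied from the example postlist on the slide.
sample = """\
<POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114
<DEVSTREAM>FEMAIN
<USER>Anna
<DATE_DD_MMM_YYYY>3 JAN 2006
<SOLVED_PR>MR00035599
<POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5
<POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3
<POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5
"""

postlist = parse_postlist(sample)
changed_files = postlist["POSTLIST_FILE"]  # files submitted for review
```

Keeping every tag as a list avoids special-casing the fields that occur more than once.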
  • 50. Results from postlists
      Table III: Precision estimated using postlists
        Time interval (δ)   Precision (in %)
                            Max.   Min.   Avg.
        200 seconds         100    50     93
        1 hour              100    20     89
        1 day               100    1      69
        1 week              100    <1     31
        1 month             95     <1     8
      Table IV: Recall estimated using postlists
        Time interval (δ)   Recall (in %)
                            Max.   Min.   Avg.
        200 seconds         100    20     74
        1 hour              100    20     84
        1 day               100    22     92
        1 week              100    24     94
        1 month             100    25     94
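The precision and recall figures can be read with the following sketch in mind. It assumes the standard information-retrieval definitions, with the files listed in a postlist taken as the actual change set; the slides do not spell out the formulas, and the function and file names are hypothetical.

```python
def precision_recall(approximated_files, postlist_files):
    """Return (precision, recall) in percent for one approximated change set.

    Assumption: the postlist files are the ground truth.
    precision = |approximated ∩ postlist| / |approximated|
    recall    = |approximated ∩ postlist| / |postlist|
    """
    approximated, actual = set(approximated_files), set(postlist_files)
    if not approximated or not actual:
        raise ValueError("both file sets must be non-empty")
    overlap = len(approximated & actual)
    return 100.0 * overlap / len(approximated), 100.0 * overlap / len(actual)

# Hypothetical example: the approximation found 4 files, 2 of them in the postlist.
approximated = {"a.h", "a.cpp", "x.c", "y.c"}
postlist_files = {"a.h", "a.cpp", "b.hcf"}
precision, recall = precision_recall(approximated, postlist_files)
```

Under these definitions a longer time interval tends to pull in more unrelated files (lower precision) while missing fewer related ones (higher recall), which matches the trend in the two tables.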
  • 52. So, what works? • The one hour time interval works best for our environment. • Optimal time interval may differ from one environment to another.
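To make the role of the time interval concrete, here is a minimal sketch of one common heuristic. It assumes that a check-in joins a developer's current change set when it follows that developer's previous check-in by at most δ. The slides in this section do not state the exact grouping rule, so the rule, the function name, and the sample data are all assumptions.

```python
from datetime import datetime, timedelta

def approximate_change_sets(checkins, delta):
    """Group (developer, timestamp, path) check-ins into approximate change sets.

    Assumed heuristic: a check-in extends the developer's open change set if it
    happens at most `delta` after that developer's previous check-in.
    """
    open_sets = {}    # developer -> that developer's most recent change set
    change_sets = []  # all change sets, in order of their first check-in
    for developer, timestamp, path in sorted(checkins, key=lambda c: c[1]):
        current = open_sets.get(developer)
        if current is not None and timestamp - current["last"] <= delta:
            current["files"].add(path)
            current["last"] = timestamp
        else:
            current = {"developer": developer, "last": timestamp, "files": {path}}
            open_sets[developer] = current
            change_sets.append(current)
    return [(cs["developer"], cs["files"]) for cs in change_sets]

# Hypothetical check-in history.
checkins = [
    ("anna", datetime(2006, 1, 3, 15, 0), "a.h"),
    ("anna", datetime(2006, 1, 3, 15, 20), "a.cpp"),
    ("james", datetime(2006, 1, 3, 15, 30), "x.c"),
    ("anna", datetime(2006, 1, 3, 17, 0), "b.hcf"),
]
sets_1h = approximate_change_sets(checkins, timedelta(hours=1))
sets_1d = approximate_change_sets(checkins, timedelta(days=1))
```

With the one hour interval, Anna's late check-in starts a new change set; with one day, all of her check-ins merge into a single set.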
  • 53. Threats to validity • Assumed developers have a good recall of their own change sets. • Postlists were carefully selected, but in rare cases may relate to more than one change set. • Change sets with multiple developers involved were not captured completely.
  • 56. Summary: Evaluate your pre-processed data before use!

Editor's Notes

  • #8-#10:
      - increase efficiency
      - reduce overheads
      - reduce costs
  • #12: few changes in files across different subsystems
  • #29:
      - series of intermediate commits
      - protect against data loss
      - parallelization of tasks: develop an interface so everyone can keep working
  • #30: not possible to learn the reason for change
  • #44: time slots chosen in consultation with developers and managers
  • #49-#52: explained pr
  • #54: a maximum of 100% for the 1 month interval means some tasks are really long
  • #56-#61:
      - not change set specific
      - incomplete changes
      - generic reasons