Data-driven software engineering @Microsoft 
Michaela Greiler
Data-driven software engineering @Microsoft 
•How can we optimize the testing process? 
•Do code reviews make a difference? 
•Are coding velocity and quality always a tradeoff? 
•What’s the optimal way to organize work on a large team? 
MSR Redmond/TSE: Michaela Greiler, Jacek Czerwonka, Wolfram Schulte, Suresh Thummalapenta 
MSR Redmond: Christian Bird, Kathryn McKinley, Nachi Nagappan, Thomas Zimmermann 
MSR Cambridge: Brendan Murphy, Kim Herzig
{Chart: Code Coverage trigger of Checkins. Monthly percentage of check-ins (0 to 100%) from November 2010 through October 2013, split into % completely covered, % somewhat covered, and % not covered.}
Reviewer recommendation: Does experience matter?
Can we change with what we can measure? 
Michaela Greiler
YES
YES, that’s the danger!
What is measured? 
{Bar chart: number of bugs per engineer (Carl, Lisa, Rob, Danny), scale 0 to 8.}
What is changed? 
{Bar chart: number of bugs per engineer (Carl, Lisa, Rob, Danny), scale 0 to 2.5.}
Code Quality
SOCIO-TECHNICAL CONGRUENCE 
“Design and programming are human activities; forget that and all is lost” – Bjarne Stroustrup
So should we go without any measurements?
Interpretation 
Data Collection 
Usage 
Lessons learned 
No 
Garbage!
•What is CodeMine? What data does CodeMine have?
GQM vs. opportunistic data collection 
•Easily available ≠ what’s needed 
•Determine the needed data 
•Find proxy measures if needed 
•Know the analysis before collecting the data 
Otherwise, data is not usable for the intended purpose 
•Goal –Question –Metric 
•Check for completeness, cleanness/noise, and usefulness 
•Data background 
•How was data generated? 
•Why was it generated? 
•Who consumes the data? 
•What about outliers? 
•How was the data processed?
Interpretation needs domain knowledge
Tools, processes, 
practices and policies. 
Release schedule 
Time 
Engineers 
What roles exist? 
Who does what? 
Responsibilities? 
M1 
M2 
Beta 
Organization of code bases 
Team structure and culture.
You cannot compare 1:1
Engineers want to understand the nitty-gritty 
•How do you calculate the recommended reviewers? 
•Why was that person recommended? 
•Why is Lisa not recommended?
Simplicity first 
{Diagram: two sets of files, those with bugs and those without.}
Files without bugs: main contributor made > 50% of all edits 
Files with bugs: main contributor made < 60% of all edits 
Ownership metric: 
Proportion of all edits made by the contributor with the most edits 
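The ownership metric above can be sketched in a few lines; the function name and the example edit counts here are invented for illustration:

```python
def ownership(edits_per_contributor):
    """Ownership: share of all edits made by the top contributor."""
    total = sum(edits_per_contributor.values())
    return max(edits_per_contributor.values()) / total

# Hypothetical edit counts for one file: the main contributor
# made 6 of 10 edits, so ownership is 0.6.
print(ownership({"Carl": 6, "Lisa": 3, "Rob": 1}))  # 0.6
```

This is exactly the kind of simple, explainable measure the next slide argues for: an engineer can recompute it by hand.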
Reporting vs. Prediction 
Comprehension 
vs. automation 
If you can do it with a decision tree… do it…
Iterative process with very close involvement of product teams and domain experts. 
It’s a dialog 
It’s a back and forth
Mixed Method Research 
Is a research approach or methodology 
•for questions that call for real-life contextual understandings; 
•employing rigorous quantitative research assessing the magnitude and frequency of constructs; and 
•rigorous qualitative research exploring the meaning and understanding of constructs 
DR. MARGARET-ANNE STOREY 
Professor of Computer Science University of Victoria 
All methods are inherently flawed! 
Generalizability 
Precision 
Realism 
DR. ARIE VAN DEURSEN 
Professor of Software Engineering Delft University of Technology
Foundations of Mixed 
Methods Research 
Designing 
Social Inquiry 
Qualitative Research: Mixed Method Research 
•Interviews 
•Observations 
•Focus groups 
•Contextual Inquiry 
•Grounded Theory 
•…
A Grounded Theory Study 
Systematic procedure to discover a theory from (qualitative) data 
S. Adolph, W. Hall, Ph. Kruchten. Using grounded theory to study the experience of software development. Empirical Software Engineering, 2011. 
B. Glaser and J. Holton. Remodeling grounded theory. Forum Qualitative Res., 2004. 
Glaser and Strauss
Deductive versus inductive 
A deductive approach is concerned with developing a hypothesis (or hypotheses) based on existing theory, and then designing a research strategy to test the hypothesis (Wilson, 2010, p.7) 
An inductive approach starts with observations. Theories emerge towards the end of the research, as a result of careful examination of patterns in observations (Goddard and Melville, 2004). 
Theory 
Hypotheses 
Observation 
Confirm/Reject 
Observation 
Patterns 
Theory
All models are wrong but some are useful 
(George E. P. Box)
Theo: Test Effectiveness Optimization from History 
Kim Herzig*, Michaela Greiler+, Jacek Czerwonka+, Brendan Murphy* 
*Microsoft Research, Cambridge 
+Microsoft Corporation, US
Improving Development Processes 
Product / 
Service 
Legacy 
changes 
New product 
features 
Technology 
changes 
Development Environment 
Speed, Cost, Quality / Risk 
(should be well balanced) 
Microsoft aims for shorter release cycles 
Empirical data to support & drive decisions 
• Speed up development processes (e.g. code velocity) 
• More frequent releases 
• Maintaining / increasing product quality 
Joint effort by MSR & product teams 
• MSR Cambridge: Brendan Murphy, Kim Herzig 
• TSE Redmond: Jacek Czerwonka, Michaela Greiler 
• MSR Redmond: Tom Zimmermann, Chris Bird, Nachi Nagappan 
• Windows, Windows Phone, Office, Dynamics product teams
Software Testing for Windows 
Winmain (main branch) 
Quality gate 
(system testing) 
Quality gate 
(system & component testing) 
Quality gate 
(component testing) 
time 
Development branch 
Multiple area branches 
Multiple component branches 
Software testing is very expensive 
• Thousands of test suites and millions of test cases executed 
• On different branches, architectures, languages, etc. 
• We tend to repeat the same tests over and over again 
• Too many false alarms (failures due to test and infrastructure issues) 
• Each test failure slows down product development 
• Aims to find code issues as early as possible 
• At the cost of slower product development 
Actual problem 
Current process aims for maximal protection 
{Simplified illustration}
Software Testing for Office 
Software testing is very expensive 
• Thousands of test suites and millions of test cases executed 
• On different branches, architectures, languages, etc. 
• We tend to repeat the same tests over and over again 
• Too many false alarms (failures due to test and infrastructure issues) 
• Each test failure slows down product development 
• Aims to find code issues as early as possible 
• At the cost of slower product development 
Actual problem 
Current process aims for maximal protection 
Dev Inner Loop 
BVT and CVT 
on main 
Dog food 
Different 
• Branching structure 
• Development process 
• Testing process 
• Release schedules 
• … 
{Simplified illustration}
Goal 
Reduce the number of test executions … 
… without sacrificing code quality 
Dynamic, self-adaptive optimization model
Solution 
Reduce the number of test executions … 
•Run every test at least once before integrating a code change into the main branch (e.g., winmain). 
•We eventually find all code issues but take the risk of finding them later (on higher-level branches). 
… without sacrificing code quality 
High cost, unknown value: $$$$$ 
High cost, low value: $$$$ 
Low cost, low value: $ 
Low cost, good value: $$ 
How likely is a test causing: 
1) false positives, or 
2) finding code issues? 
Analyze historic data: 
- Test events 
- Builds 
- Code integrations 
Analyze past test results: 
- Passing tests, false alarms, detected code issues
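From such a history, per-test probabilities can be estimated by simple counting; the outcome labels and this estimator are my own sketch, not Theo's actual model, which, as the next slide notes, would also have to account for context:

```python
from collections import Counter

def failure_probabilities(history):
    """Estimate a test's false-alarm (Prob_FP) and defect-finding
    (Prob_TP) probabilities from past outcomes, where each outcome
    is 'pass', 'false_alarm', or 'code_issue'."""
    counts = Counter(history)
    runs = len(history)
    return counts["false_alarm"] / runs, counts["code_issue"] / runs

# Hypothetical history: 8 passes, 1 infrastructure flake, 1 real defect.
prob_fp, prob_tp = failure_probabilities(
    ["pass"] * 8 + ["false_alarm"] + ["code_issue"])
print(prob_fp, prob_tp)  # 0.1 0.1
```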
Bug finding capabilities change with context
Solution 
Using a cost function to model risk: 

Cost_Execution > Cost_Skip ? suspend : execute test 

Cost_Execution = Cost_Machine/Time × Time_Execution + "cost of a potential false alarm" 
               = Cost_Machine/Time × Time_Execution + (Prob_FP × Cost_Developer/Time × Time_Triage) 

Cost_Skip = "potential cost of finding a defect later" 
          = Prob_TP × Cost_Developer/Time × Time_FreezeBranch × #Developers_Branch 
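The skip-or-execute decision can be sketched directly from the cost function; all function names, parameter names, and the example numbers below are illustrative assumptions, not Theo's actual implementation:

```python
def cost_execution(cost_machine_per_time, time_execution,
                   prob_fp, cost_developer_per_time, time_triage):
    """Machine time plus the expected cost of triaging a false alarm."""
    return (cost_machine_per_time * time_execution
            + prob_fp * cost_developer_per_time * time_triage)

def cost_skip(prob_tp, cost_developer_per_time, time_freeze_branch,
              num_developers_branch):
    """Expected cost of finding the defect later: a broken integration
    freezes the branch for every developer working on it."""
    return (prob_tp * cost_developer_per_time
            * time_freeze_branch * num_developers_branch)

def decide(execution_cost, skip_cost):
    """Suspend the test when running it costs more than skipping it."""
    return "suspend" if execution_cost > skip_cost else "execute"

# Hypothetical numbers: a flaky test (30% false alarms) that rarely
# finds real defects (2%) on a branch with 50 developers.
exec_cost = cost_execution(0.5, 2.0, 0.30, 80.0, 1.0)  # 1.0 + 24.0 = 25.0
skip_cost = cost_skip(0.02, 80.0, 4.0, 50)             # 320.0
print(decide(exec_cost, skip_cost))  # execute
```

Even a flaky test stays enabled here because a missed defect would freeze a heavily populated branch; the same test on a small component branch would tip the other way.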
Test 
Cost to run a test. 
Value of output.
Current Results 
Simulated on Windows 8.1 development period (BVT only)
Dynamic, Self-Adaptive 
Decision points are connected to each other 
Skipping tests influences the risk factorsof higher level branches 
We re-enable testsif code quality drops (e.g. different milestone) 
{Chart: relative test reduction rate (0% to 70%) over time during Windows 8.1 development, with the training period marked.}
Bug Finding Performance of Tests 
How many test executions fail? 
{Chart: number of test executions and #failed test executions, by branch level.}
How many of the failed test executions result in bug reports? 
{Chart: failed test executions by branch level, split into FP, TP test-unspecific, and TP test-specific.}
Impact on Development Process 
Secondary Improvements 
•Machine setup: we may lower the number of machines allocated to the testing process 
•Developer satisfaction: removing false test failures increases confidence in the testing process 
…hard to estimate the speed improvement through simulation 
“We used the data […] to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)
Michaela Greiler 
@mgreiler 
www.michaelagreiler.com 
http://research.microsoft.com/en-us/projects/tse/
