Zap the Flakes!
Leveraging AI to Combat Flaky Tests with CANNIER
stackconf ‘25
Daniel Hiller
agenda
● about me
● about flakes
● pre-merge-detection v1
● CANNIER
● pre-merge-detection v2
● Q&A
about me
● Software Engineer | Red Hat OpenShift Virtualization
● KubeVirt | CI & automation in general
● prow.ci.kubevirt.io
● CI-Health
● Flaky Tests
kubevirt.io - Virtualization for Kubernetes
about flakes
source: https://prow.ci.kubevirt.io/pr-history/?org=kubevirt&repo=kubevirt&pr=9445
about flakes
a flake
is a test that,
without any code change,
will either fail or pass in successive runs
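As a hypothetical Go illustration of this definition: the test below races a fixed timeout against a variable startup time, so successive runs pass or fail with no code change. All names here are invented for the example.

```go
package demo

import (
	"math/rand"
	"testing"
	"time"
)

// startServerAsync simulates an asynchronous startup whose duration
// varies between runs (a hypothetical stand-in for a real dependency).
func startServerAsync() <-chan struct{} {
	ready := make(chan struct{})
	go func() {
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		close(ready)
	}()
	return ready
}

// TestEventuallyReady is flaky by construction: with a 50ms timeout
// and a 0-100ms startup it passes or fails across successive runs
// without any code change.
func TestEventuallyReady(t *testing.T) {
	select {
	case <-startServerAsync():
		// fast startup: test passes
	case <-time.After(50 * time.Millisecond):
		t.Fatal("server not ready in time")
	}
}
```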
about flakes
“... In terms of severity, of the 91% of developers who claimed to deal
with flaky tests at least a few times a year,
… 23% [of developers] thought that they were
a serious problem. …”
source: “A survey of flaky tests”
about flakes
“… test flakiness was a frequently encountered problem,
with … 15% [of developers] dealing with it daily”
source: “A survey of flaky tests”
impact of flakes
in CI automated testing MUST give a reliable signal of stability
any failed test run signals that the product is unstable
test runs failed due to flakes do not give this reliable signal
they only waste time
impact of flakes
Flaky tests cause
● for individual contributors
○ prolonged feedback cycles
○ test trust issues
● for the project community
○ slowdown of merging pull requests - the “retest trap”
○ reversal of acceleration effects (e.g. batch testing)
○ waste of CI resources
pre-merge-detection v1
approach:
● look at the files touched in the pull request
● determine the changed files containing e2e tests
● run these 5 times in random order (see the sketch below)
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
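The real lane is driven by automation/repeated_test.sh (Bash); the following is a minimal Go sketch of the same idea, assuming a Ginkgo-style runner (KubeVirt's e2e tests use Ginkgo) whose --focus and --randomize-all flags select tests and shuffle execution order. Test names and the ./tests path are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// rerunTests re-runs the changed e2e tests several times in random
// order; assumes a Ginkgo-style runner where --focus selects tests
// by regex and --randomize-all shuffles execution order.
func rerunTests(testNames []string, attempts int) error {
	focus := strings.Join(testNames, "|")
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("ginkgo", "--randomize-all", "--focus", focus, "./tests")
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("run %d/%d failed: %w\n%s", i, attempts, err, out)
		}
	}
	return nil
}

func main() {
	// a single failure across the 5 attempts marks the test as a flake suspect
	if err := rerunTests([]string{"VMI lifecycle"}, 5); err != nil {
		fmt.Println("flake suspected:", err)
	}
}
```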
pre-merge-detection v1
check-tests-for-flakes test lane
why: catch flakes before entering main (source)
pre-merge-detection v1
why:
● eliminate order dependencies
● running five times is enough to detect 88% of flaky tests (see the sketch below)
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
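The slides don't show the derivation behind the 88% figure; one plausible reading (an assumption here, not from the talk) is the at-least-one-failure model 1 - (1 - p)^n, which reaches roughly 88% after five runs when the per-run failure probability p is about 0.35:

```go
package main

import "fmt"

// detectionProbability returns the chance that at least one of n
// re-runs fails, given a per-run failure probability p of the flake:
// 1 - (1-p)^n
func detectionProbability(p float64, n int) float64 {
	q := 1.0
	for i := 0; i < n; i++ {
		q *= 1 - p
	}
	return 1 - q
}

func main() {
	// an assumed per-run failure probability of 0.35 yields
	// 1 - 0.65^5 ≈ 0.884, i.e. ~88% detection after five runs
	fmt.Printf("%.3f\n", detectionProbability(0.35, 5))
}
```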
pre-merge-detection v1
initial situation:
● most e2e tests take 10 sec - 2 min to run
● a 5x re-run has “only” an 88% chance of detection
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
pre-merge-detection v1
problems discovered with the v1 approach:
● the shotgun approach causes too many tests to be run
○ the number of re-run tests has to be capped to avoid timeouts
○ CI users complain about failing tests they didn’t touch
● the re-run lane takes ~1h on average
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
CANNIER
CANNIER is an approach
for reducing the time cost of rerun-based detection techniques
by combining them with machine learning models
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
CANNIER
“... we found that CANNIER was able to reduce the time cost (and therefore
monetary cost) [of re-running tests] by an average of 88% ...”
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
single model
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set (runtime)
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set (static analysis)
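To make the feature set concrete: an illustrative Go struct with a subset of CANNIER-style features. The split mirrors the slides (runtime vs. static analysis); the field names are assumptions for this sketch, not the paper's exact identifiers.

```go
package features

// FeatureVector is an illustrative subset of the per-test features a
// CANNIER-style model consumes; the runtime half comes from a single
// instrumented run, the static half from source analysis.
type FeatureVector struct {
	// runtime features
	ExecutionTime float64 // seconds for one run
	CoveredLines  int     // lines covered while the test ran
	PeakMemory    uint64  // bytes

	// static analysis features
	TestLinesOfCode      int
	AssertionCount       int
	CyclomaticComplexity int
	ExternalModules      int // third-party packages the test touches
}
```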
pre-merge-detection v2.0
● CANNIER
○ implemented in Python
○ KubeVirt uses Go
● runtime data
○ where to store it?
○ when to capture it?
● data science
○ Python has well-known frameworks
○ the state of the Go ecosystem is unclear
questions
pre-merge-detection v2.0
steps:
● gather the set of changed tests
● for each test
○ run it once to get the runtime data
○ extract the static analysis data
○ fetch the prediction using the entire feature vector
○ add it to the set of re-run tests if it is in iDFClass
● re-run the reduced set of tests
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
algorithm
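A minimal Go sketch of this selection loop; Model, measureOnce and analyzeSource are hypothetical stand-ins for the trained model and the two feature-extraction steps, and the threshold plays the role of the classification step above.

```go
package premerge

type Test struct{ Name string }

type Features struct {
	Runtime []float64 // e.g. execution time, covered lines
	Static  []float64 // e.g. cyclomatic complexity, assertion count
}

type Model interface {
	// Predict returns the model's flakiness probability for one test.
	Predict(f Features) float64
}

func measureOnce(t Test) []float64   { return nil } // one run, runtime features
func analyzeSource(t Test) []float64 { return nil } // static analysis features

// selectForRerun keeps only the tests the model flags as likely
// flaky, so the expensive repeated re-run happens on a reduced set.
func selectForRerun(changed []Test, m Model, threshold float64) []Test {
	var rerun []Test
	for _, t := range changed {
		f := Features{Runtime: measureOnce(t), Static: analyzeSource(t)}
		if m.Predict(f) >= threshold {
			rerun = append(rerun, t)
		}
	}
	return rerun
}
```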
pre-merge-detection v2.0
parts:
● code (Go)
○ test set extraction
○ feature extraction
○ model prediction
○ model generation
● test lane (Bash)
● model deployment (YAML)
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
implementation
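The actual feature-extraction code is not shown on the slides; as a rough illustration of the static-analysis half, this hypothetical Go sketch estimates cyclomatic complexity with the standard go/ast package:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// cyclomaticComplexity estimates a function's cyclomatic complexity
// as 1 + the number of branching nodes in its AST.
func cyclomaticComplexity(fn *ast.FuncDecl) int {
	complexity := 1
	ast.Inspect(fn, func(n ast.Node) bool {
		switch n.(type) {
		case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt,
			*ast.CaseClause, *ast.CommClause:
			complexity++
		}
		return true
	})
	return complexity
}

func main() {
	src := `package demo
func Example(a int) int {
	if a > 0 {
		return a
	}
	for i := 0; i < 3; i++ {
		a += i
	}
	return a
}`
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			fmt.Printf("%s: complexity %d\n", fn.Name.Name, cyclomaticComplexity(fn))
		}
	}
}
```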
pre-merge-detection v2.1
parts:
● model
○ generation
○ deployment
● test lane
● code
○ test set extraction
○ feature extraction
○ model prediction
parts:
● model
○ …
○ automatic updates
● test lane -> prow external-plugin
○ runs on presubmits
○ runs on postsubmits (probably with a larger test set)
○ adds helpful feedback
● code
○ test set extraction
○ feature extraction
○ model prediction
improvements
pre-merge-detection v2.1
prow external-plugin
● runs on presubmits
● adds helpful feedback
● gives advice, based on the feature set, on what can be improved
pre-merge-detection v2.2
● components of the feature vector provide insightful advice to the contributor
○ e.g. high cyclomatic complexity suggests reducing it
○ therefore it’s valuable to attach the analysis data to the PR
● possibly increase the number of re-runs (>5)
○ the reduced overall runtime leaves time for more re-runs
○ the feature vector contains runtimes, so we can better estimate the total re-run time and optimize for it, e.g. group tests by runtime class (see the sketch below)
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
improvements
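A minimal Go sketch of the runtime-class grouping idea; the class names and boundaries are hypothetical, picked to match the 10 sec - 2 min spread mentioned earlier, so that each class can get its own re-run count within the CI time budget.

```go
package premerge

import "time"

// Test carries the mean runtime that is already part of the
// feature vector.
type Test struct {
	Name    string
	Runtime time.Duration
}

// groupByRuntime buckets tests into runtime classes so each class
// can be re-run with its own repetition count (fast tests can
// afford more than five repetitions, slow ones fewer).
func groupByRuntime(tests []Test) map[string][]Test {
	groups := map[string][]Test{}
	for _, t := range tests {
		switch {
		case t.Runtime < 30*time.Second:
			groups["fast"] = append(groups["fast"], t)
		case t.Runtime < 2*time.Minute:
			groups["medium"] = append(groups["medium"], t)
		default:
			groups["slow"] = append(groups["slow"], t)
		}
	}
	return groups
}
```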
pre-merge-detection v1.5 (the present)
approach:
● gather the exact set of changed tests
● run these 5 times in random order
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
Links
● KubeVirt resources
○ Initial pull request (draft): https://github.com/kubevirt/project-infra/pull/3930
○ Presentation Squash The Flakes @ FOSDEM ‘24:
https://archive.fosdem.org/2024/schedule/event/fosdem-2024-1805-squash-the-flakes-how-to-minimize-the-impact-of-flaky-tests/
● CANNIER
○ paper: https://www.gregorykapfhammer.com/research/papers/parry2023/
○ implementation: https://github.com/flake-it/cannier-framework/
Q&A
Any questions?
Any suggestions for improvement?
Who else is trying to tackle this problem?
What have you done to solve this?
download slides:
slides.pdf
Thank you for attending!
Further questions?
Feel free to send questions and comments to:
dhiller@redhat.com
kubernetes.slack.com/
@dhiller
@dhiller@fosstodon.org
dhiller.dev
kubevirt.io
Virtualization for Kubernetes
KubeVirt welcomes all kinds of contributions!
● Weekly community meeting every Wed 3PM CET
● Links:
● KubeVirt website
● KubeVirt user guide
● KubeVirt Contribution Guide
● GitHub
● Kubernetes Slack channels
○ #virtualization
○ #kubevirt-dev