Zap the Flakes!
Leveraging AI to Combat Flaky Tests with CANNIER
stackconf ‘25
Daniel Hiller
agenda
● about me
● about flakes
● pre-merge-detection v1
● CANNIER
● pre-merge-detection v2
● Q&A
about me
● Software Engineer | Red Hat OpenShift Virtualization
● KubeVirt | CI & automation in general
● prow.ci.kubevirt.io
● CI-Health
● Flaky Tests
kubevirt.io - Virtualization for Kubernetes
about flakes
source: https://prow.ci.kubevirt.io/pr-history/?org=kubevirt&repo=kubevirt&pr=9445
about flakes
a flake
is a test that,
without any code change,
will either fail or pass in successive runs
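As a hypothetical Go illustration of this definition: the test below races a fixed timeout against a variable startup time, so successive runs pass or fail with no code change. All names here are invented for the example.

```go
package demo

import (
	"math/rand"
	"testing"
	"time"
)

// startServerAsync simulates an asynchronous startup whose duration
// varies between runs (a hypothetical stand-in for a real dependency).
func startServerAsync() <-chan struct{} {
	ready := make(chan struct{})
	go func() {
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		close(ready)
	}()
	return ready
}

// TestEventuallyReady is flaky by construction: with a 50ms timeout
// and a 0-100ms startup it passes or fails across successive runs
// without any code change.
func TestEventuallyReady(t *testing.T) {
	select {
	case <-startServerAsync():
		// fast startup: test passes
	case <-time.After(50 * time.Millisecond):
		t.Fatal("server not ready in time")
	}
}
```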
about flakes
“... In terms of severity, of the 91% of developers who claimed to deal
with flaky tests at least a few times a year,
… 23% [of developers] thought that they were
a serious problem. …”
source: “A survey of flaky tests”
about flakes
“… test flakiness was a frequently encountered problem,
with … 15% [of developers] dealing with it daily”
source: “A survey of flaky tests”
impact of flakes
in CI automated testing MUST give a reliable signal of stability
any failed test run signals that the product is unstable
test runs failed due to flakes do not give this reliable signal
they only waste time
impact of flakes
Flaky tests cause
● for individual contributors
○ prolonged feedback cycles
○ test trust issues
● for the project community
○ slowdown of merging pull requests - the “retest trap”
○ reversal of acceleration effects (e.g. batch testing)
○ waste of CI resources
pre-merge-detection v1
approach:
● look at the files touched in the pull request
● determine the changed files containing e2e tests
● run these 5 times in random order (see the sketch below)
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
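The real lane is driven by automation/repeated_test.sh (Bash); the following is a minimal Go sketch of the same idea, assuming a Ginkgo-style runner (KubeVirt's e2e tests use Ginkgo) whose --focus and --randomize-all flags select tests and shuffle execution order. Test names and the ./tests path are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// rerunTests re-runs the changed e2e tests several times in random
// order; assumes a Ginkgo-style runner where --focus selects tests
// by regex and --randomize-all shuffles execution order.
func rerunTests(testNames []string, attempts int) error {
	focus := strings.Join(testNames, "|")
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("ginkgo", "--randomize-all", "--focus", focus, "./tests")
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("run %d/%d failed: %w\n%s", i, attempts, err, out)
		}
	}
	return nil
}

func main() {
	// a single failure across the 5 attempts marks the test as a flake suspect
	if err := rerunTests([]string{"VMI lifecycle"}, 5); err != nil {
		fmt.Println("flake suspected:", err)
	}
}
```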
pre-merge-detection v1
check-tests-for-flakes test lane
why: catch flakes before entering main (source)
pre-merge-detection v1
why:
● eliminate order dependencies
● running five times is enough to detect 88% of flaky tests (see the sketch below)
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
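The slides don't show the derivation behind the 88% figure; one plausible reading (an assumption here, not from the talk) is the at-least-one-failure model 1 - (1 - p)^n, which reaches roughly 88% after five runs when the per-run failure probability p is about 0.35:

```go
package main

import "fmt"

// detectionProbability returns the chance that at least one of n
// re-runs fails, given a per-run failure probability p of the flake:
// 1 - (1-p)^n
func detectionProbability(p float64, n int) float64 {
	q := 1.0
	for i := 0; i < n; i++ {
		q *= 1 - p
	}
	return 1 - q
}

func main() {
	// an assumed per-run failure probability of 0.35 yields
	// 1 - 0.65^5 ≈ 0.884, i.e. ~88% detection after five runs
	fmt.Printf("%.3f\n", detectionProbability(0.35, 5))
}
```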
pre-merge-detection v1
initial situation:
● most e2e tests take 10 sec - 2 min to run
● a 5x re-run has “only” an 88% chance of detection
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
pre-merge-detection v1
problems discovered with the v1 approach:
● the shotgun approach causes too many tests to be run
○ the number of re-run tests has to be capped to avoid timeouts
○ CI users complain about failing tests they didn’t touch
● the re-run lane takes ~1h on average
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
CANNIER
CANNIER is an approach
for reducing the time cost of rerun-based detection techniques
by combining them with machine learning models
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
CANNIER
“... we found that CANNIER was able to reduce the time cost (and therefore
monetary cost) [of re-running tests] by an average of 88% ...”
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
single model
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set (runtime)
CANNIER
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
feature set (static analysis)
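To make the feature set concrete: an illustrative Go struct with a subset of CANNIER-style features. The split mirrors the slides (runtime vs. static analysis); the field names are assumptions for this sketch, not the paper's exact identifiers.

```go
package features

// FeatureVector is an illustrative subset of the per-test features a
// CANNIER-style model consumes; the runtime half comes from a single
// instrumented run, the static half from source analysis.
type FeatureVector struct {
	// runtime features
	ExecutionTime float64 // seconds for one run
	CoveredLines  int     // lines covered while the test ran
	PeakMemory    uint64  // bytes

	// static analysis features
	TestLinesOfCode      int
	AssertionCount       int
	CyclomaticComplexity int
	ExternalModules      int // third-party packages the test touches
}
```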
pre-merge-detection v2.0
● CANNIER
○ implemented in Python
○ KubeVirt uses Go
● runtime data
○ where to store it?
○ when to capture it?
● data science
○ Python has well-known frameworks
○ the state of the Go ecosystem is unclear
questions
pre-merge-detection v2.0
steps:
● gather the set of changed tests
● for each test
○ run it once to get the runtime data
○ extract the static analysis data
○ fetch the prediction using the entire feature vector
○ add it to the set of re-run tests if it is in iDFClass
● re-run the reduced set of tests
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
algorithm
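A minimal Go sketch of this selection loop; Model, measureOnce and analyzeSource are hypothetical stand-ins for the trained model and the two feature-extraction steps, and the threshold plays the role of the classification step above.

```go
package premerge

type Test struct{ Name string }

type Features struct {
	Runtime []float64 // e.g. execution time, covered lines
	Static  []float64 // e.g. cyclomatic complexity, assertion count
}

type Model interface {
	// Predict returns the model's flakiness probability for one test.
	Predict(f Features) float64
}

func measureOnce(t Test) []float64   { return nil } // one run, runtime features
func analyzeSource(t Test) []float64 { return nil } // static analysis features

// selectForRerun keeps only the tests the model flags as likely
// flaky, so the expensive repeated re-run happens on a reduced set.
func selectForRerun(changed []Test, m Model, threshold float64) []Test {
	var rerun []Test
	for _, t := range changed {
		f := Features{Runtime: measureOnce(t), Static: analyzeSource(t)}
		if m.Predict(f) >= threshold {
			rerun = append(rerun, t)
		}
	}
	return rerun
}
```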
pre-merge-detection v2.0
parts:
● code (Go)
○ test set extraction
○ feature extraction
○ model prediction
○ model generation
● test lane (Bash)
● model deployment (YAML)
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
implementation
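The actual feature-extraction code is not shown on the slides; as a rough illustration of the static-analysis half, this hypothetical Go sketch estimates cyclomatic complexity with the standard go/ast package:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// cyclomaticComplexity estimates a function's cyclomatic complexity
// as 1 + the number of branching nodes in its AST.
func cyclomaticComplexity(fn *ast.FuncDecl) int {
	complexity := 1
	ast.Inspect(fn, func(n ast.Node) bool {
		switch n.(type) {
		case *ast.IfStmt, *ast.ForStmt, *ast.RangeStmt,
			*ast.CaseClause, *ast.CommClause:
			complexity++
		}
		return true
	})
	return complexity
}

func main() {
	src := `package demo
func Example(a int) int {
	if a > 0 {
		return a
	}
	for i := 0; i < 3; i++ {
		a += i
	}
	return a
}`
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			fmt.Printf("%s: complexity %d\n", fn.Name.Name, cyclomaticComplexity(fn))
		}
	}
}
```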
pre-merge-detection v2.1
parts:
● model
○ generation
○ deployment
● test lane
● code
○ test set extraction
○ feature extraction
○ model prediction
parts:
● model
○ …
○ automatic updates
● test lane -> prow external-plugin
○ runs on presubmits
○ runs on postsubmits (probably with a larger test set)
○ adds helpful feedback
● code
○ test set extraction
○ feature extraction
○ model prediction
improvements
pre-merge-detection v2.1
prow external-plugin
● runs on presubmits
● adds helpful feedback
● gives advice, based on the feature set, on what can be improved
pre-merge-detection v2.2
● components of the feature vector provide insightful advice to the contributor
○ e.g. high cyclomatic complexity suggests reducing it
○ therefore it’s valuable to attach the analysis data to the PR
● possibly increase the number of re-runs (>5)
○ the reduced overall runtime leaves time for more re-runs
○ the feature vector contains runtimes, so we can better estimate the total re-run time and optimize for it, e.g. group tests by runtime class (see the sketch below)
source: https://www.gregorykapfhammer.com/research/papers/parry2023/
improvements
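A minimal Go sketch of the runtime-class grouping idea; the class names and boundaries are hypothetical, picked to match the 10 sec - 2 min spread mentioned earlier, so that each class can get its own re-run count within the CI time budget.

```go
package premerge

import "time"

// Test carries the mean runtime that is already part of the
// feature vector.
type Test struct {
	Name    string
	Runtime time.Duration
}

// groupByRuntime buckets tests into runtime classes so each class
// can be re-run with its own repetition count (fast tests can
// afford more than five repetitions, slow ones fewer).
func groupByRuntime(tests []Test) map[string][]Test {
	groups := map[string][]Test{}
	for _, t := range tests {
		switch {
		case t.Runtime < 30*time.Second:
			groups["fast"] = append(groups["fast"], t)
		case t.Runtime < 2*time.Minute:
			groups["medium"] = append(groups["medium"], t)
		default:
			groups["slow"] = append(groups["slow"], t)
		}
	}
	return groups
}
```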
pre-merge-detection v1.5 (the present)
approach:
● gather the exact set of changed tests
● run these 5 times in random order
sources: KubeVirt e2e test runtimes, check-tests-for-flakes job history, automation/repeated_test.sh
Links
● KubeVirt resources
○ Initial pull request (draft): https://github.com/kubevirt/project-infra/pull/3930
○ Presentation Squash The Flakes @ FOSDEM ‘24:
https://archive.fosdem.org/2024/schedule/event/fosdem-2024-1805-squash-the-flakes-how-to-minimize-the-impact-of-flaky-tests/
● CANNIER
○ paper: https://www.gregorykapfhammer.com/research/papers/parry2023/
○ implementation: https://github.com/flake-it/cannier-framework/
Q&A
Any questions?
Any suggestions for improvement?
Who else is trying to tackle this problem?
What have you done to solve this?
download slides:
slides.pdf
Thank you for attending!
Further questions?
Feel free to send questions and comments to:
dhiller@redhat.com
kubernetes.slack.com/
@dhiller
@dhiller@fosstodon.org
dhiller.dev
kubevirt.io
Virtualization for Kubernetes
KubeVirt welcomes all kinds of contributions!
● Weekly community meeting every Wed 3PM CET
● Links:
● KubeVirt website
● KubeVirt user guide
● KubeVirt Contribution Guide
● GitHub
● Kubernetes Slack channels
○ #virtualization
○ #kubevirt-dev