Joel Maher
Open Source hacker for the Mozilla project.
Stockwell
Reduce the impact of intermittent tests
Teammates
Geoff Brown (:gbrown)
Joel Maher (:jmaher)
William Lachance (:wlach)
What is the impact?
Sheriff time to star failures
Developer distractions
more tooling needed
increased load on limited resources
How do intermittents impact your job?
What has changed?
Hired sheriffs
More platforms
More configurations (e10s, asan)
More tests and test suites
Many changes in Firefox
What do we run?
19 build/config types
1.05M possible tests/push
490K tests run/push on average
11 failures / push (OrangeFactor, OF = 11.0)
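The figures on this slide are related by simple arithmetic. A minimal sketch, assuming OF (OrangeFactor) means the average number of intermittent failures per push, as the last bullet suggests. The function names are hypothetical, and the 110 failures over 10 pushes are made-up values chosen only to reproduce OF = 11.0.

```python
def orange_factor(failures: int, pushes: int) -> float:
    """Average failures per push (assumed definition of OF)."""
    if pushes <= 0:
        raise ValueError("pushes must be positive")
    return failures / pushes


def run_fraction(tests_run: int, tests_possible: int) -> float:
    """Fraction of the possible tests that actually run on a push."""
    if tests_possible <= 0:
        raise ValueError("tests_possible must be positive")
    return tests_run / tests_possible


# 490K of 1.05M possible tests run on an average push.
print(round(run_fraction(490_000, 1_050_000), 2))  # 0.47

# Illustrative only: 110 failures spread over 10 pushes.
print(orange_factor(110, 10))  # 11.0
```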
How many intermittents?
Between 700 and 950 bugs / week
For 6 months (April-September):
7332 bugs occurred / 249279 failures
3310 bugs occurred <10 times
6018 (82%) low-frequency bugs = 14% of failures
560 (7%) high-frequency bugs = 68% of failures
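To see how concentrated the failures are, the six-month totals can be turned into failures per bug. This is a rough sketch: it uses the rounded percentages from the slide, so the results are approximate, and the helper name is hypothetical.

```python
TOTAL_BUGS = 7332
TOTAL_FAILURES = 249_279


def failures_per_bug(bug_count: int, failure_share_pct: float) -> float:
    """Approximate failures per bug for a group holding a share of all failures."""
    group_failures = TOTAL_FAILURES * failure_share_pct / 100
    return group_failures / bug_count


# Overall average: roughly 34 failures per bug.
print(round(TOTAL_FAILURES / TOTAL_BUGS, 1))

# Low frequency: 6018 bugs share 14% of failures, roughly 6 each.
print(round(failures_per_bug(6018, 14), 1))

# High frequency: 560 bugs share 68% of failures, roughly 300 each.
print(round(failures_per_bug(560, 68), 1))
```

On these numbers a high-frequency bug accounts for about fifty times as many failures as a low-frequency one, which is why the later slides focus triage on the high-frequency group.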
What is intermittent?
High frequency >=50 times/week
Medium frequency 10<x<50 times/week
Low frequency <=10 times/week
What is your definition of intermittent?
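The three frequency bands above can be written as a small classifier. A minimal sketch using the thresholds from this slide; the function name and the string labels are hypothetical.

```python
def frequency_bucket(failures_per_week: int) -> str:
    """Map a weekly failure count to the band used in this talk."""
    if failures_per_week < 0:
        raise ValueError("failure count cannot be negative")
    if failures_per_week >= 50:
        return "high"
    if failures_per_week > 10:
        return "medium"
    return "low"


for count in (3, 10, 11, 49, 50, 400):
    print(count, frequency_bucket(count))
```

The boundaries follow the slide exactly: 10 failures per week is still low, and 50 is already high.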
What fails?
test timeouts
test failures
harness/task timeouts
Firefox crash/leak/assertion/hang
harness/infrastructure
Bad tests?
Majority of fixes are test fixes
178 mochitests do not run with --repeat
many uses of setTimeout()
poor use of APIs
old tests written for old Firefox
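The setTimeout() bullet refers to tests that sleep for a fixed delay and hope the thing they wait for has happened by then. The sketch below is a Python analogy, not mochitest code: `wait_for` is a hypothetical helper that polls a condition until it holds or a deadline passes, so the test does not depend on guessing the right delay.

```python
import threading
import time


def wait_for(condition, timeout=5.0, interval=0.01):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated asynchronous work that finishes after an unpredictable delay.
done = threading.Event()
threading.Timer(0.05, done.set).start()

# Fragile: time.sleep(0.03) followed by an assert would fail on this run.
# Robust: wait for the condition itself, with a generous upper bound.
assert wait_for(done.is_set, timeout=2.0)
```

The timeout is only a ceiling for the failure case. On a fast machine the wait returns as soon as the condition holds, and on a loaded machine it simply waits longer instead of failing.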
Do we care?
Talked to dozens of engineers
Everyone wants to help
Not all intermittents have a clear owner
Engineers have deliverables
Engineers don't want to waste time
What prevents you from fixing intermittent tests?
Experiments in Q4
quarantine jobs
test-lint jobs
manual triage
OrangeFactor enhancements
Quarantine jobs
Always orange, long run times
Difficult to hack manifests
Leaks/Crashes/etc. still in other jobs
These would be ignored; value is unclear
Test Lint
Run extra tests on new/edited test cases
Did this for mochitests: 178 failures
Improves trust in tests
Will deploy in Q1 for mochitest
What causes you to not trust tests?
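The idea behind running extra passes on new or edited tests can be sketched as a repeat runner. This is an illustration under stated assumptions, not the actual test-lint job: `repeat_test`, `verdict`, and the run count are hypothetical, and a test here is just a callable that raises on failure.

```python
def repeat_test(test, runs=20):
    """Run test() repeatedly and return how many runs raised an exception."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            failures += 1
    return failures


def verdict(failures, runs):
    """Summarize repeated runs as stable, always failing, or intermittent."""
    if failures == 0:
        return "stable"
    if failures == runs:
        return "always fails"
    return "intermittent"


# A deliberately flaky test: it leaks state between runs and fails every third call.
state = {"calls": 0}


def leaky_test():
    state["calls"] += 1
    assert state["calls"] % 3 != 0


runs = 9
print(verdict(repeat_test(leaky_test, runs), runs))  # intermittent
```

A test that passes once but fails under repetition usually depends on leftover state or timing, which is the kind of problem worth catching before the test lands.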
Manual Triage
In 2 weeks dropped OF from 23 to 11
Many patterns between bugs
Added info to make bugs actionable
Will continue to do this in Q1
Orange Factor++
Bugzilla comments improved
relative frequency
ranking and priority
updated dev.tree-alerts to highlight number of high/mid/low frequency bugs
The Master Plan
Accept the fact that intermittents are here to stay
Develop a positive relationship with intermittent failures
Intermittent test failures are not seen on Treeherder
On January 4, 2018, what would you expect to see?
Q1 Plan
P1 intermittents >=30 times/week
Make triaging easier
doing it full time
finding test owners
component filters on OrangeFactor
Increase confidence in tests/bugs
test-lint jobs
more data in bugs
More Experiments
More Experiments?
Don't you have enough data?
What experiments should we be doing?
Q1+ - More Experiments
Dashboards - more data for you
Triage bugs by component in OF
Disabled bugs in your component
New bugs in your component
Q1+ - More Experiments
Triage++
Identify common actionable data
List of data to include in new bugs
Create tools for getting common data
Identify spikes in occurrences faster
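One simple way to flag a spike in occurrences is to compare today's count for a bug against its recent daily average. A minimal sketch; the function name, the 3x factor, and the minimum count are assumptions chosen for illustration, not values from OrangeFactor.

```python
from statistics import mean


def is_spike(history, today, factor=3.0, min_count=5):
    """Return True if today's failures far exceed the recent daily average.

    history: daily failure counts for the preceding days (may be empty).
    min_count: ignore very small counts so one or two failures never alert.
    """
    baseline = mean(history) if history else 0.0
    return today >= min_count and today > factor * baseline


print(is_spike([1, 2, 1, 0, 1], 12))  # True: 12 against a baseline of 1 per day
print(is_spike([10, 12, 11], 13))     # False: within the normal range
print(is_spike([], 2))                # False: below the minimum count
```

A new bug with no history still triggers once it passes the minimum count, which matches the goal of catching regressions on the day they land.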
Q2+ - More Experiments
Reduce Noise / Better Tests
Improve auto classification
Consider ignoring low frequency failures
Look at rr chaos mode for the lint jobs
Best practices for writing and reviewing tests
Our Expectations
Assume good intent and common goals
Actionable bug == fix it!
disabling tests can be a good thing
Q&A
Goal: Reduce the impact of intermittents
What is your definition of intermittent?
How do intermittents impact your job?
What prevents you from fixing intermittent tests?
What causes you to not trust tests?
On January 4, 2018, what would you expect to see?
What experiments should we be doing?