feat(ci): mark FTR retry green when previously-failing tests recover#269605
feat(ci): mark FTR retry green when previously-failing tests recover#269605TamerlanG wants to merge 40 commits into
Conversation
Adds `retry_result_checker.ts` to `@kbn/failed-test-reporter-cli` with two CLI commands exposed via `scripts/ftr_check_retry_result.js`: - `list-failures <junit-dir>` — prints the name of every failed test found in the JUnit XML files under a directory, one per line. - `check-intersection --junit-dir <dir> --prev-failures-file <file>` — compares the current attempt's failures against a saved list from a previous attempt; exits 0 if the intersection is empty (all previously-failing tests recovered), exits 1 otherwise. Also refactors `failed_tests_reporter_cli.ts` to export `runFailedTestsReporterCli()` rather than executing on import, so the package can export both CLIs from a single entry point without one triggering when the other is called.
On the first automatic retry of a failing FTR step, the step is now marked green if every test that failed in attempt 1 passes in attempt 2 — even if a different (previously-passing) test happens to fail on retry, which would indicate a separate flake unrelated to the original failure. How it works: - End of attempt 1: JUnit XML is parsed and the failing test names are stored in Buildkite metadata. - End of attempt 2: the stored names are retrieved and intersected with the current attempt's failures. An empty intersection overrides the exit code to 0. Skipped for flaky-test-runner runs (KIBANA_FLAKY_TEST_RUNNER_CONFIG). Known limitation: if --bail causes attempt 2 to stop on a different test before reaching the originally-failing test, the intersection will appear empty and the step will be marked green even though the original failing test was never verified.
Plants a deliberately-flaky test pair inside the unused_urls_task FTR
config to validate the retry intersection logic introduced in this PR.
Setup (relies on FTR's --bail and BUILDKITE_RETRY_COUNT):
- Attempt 1: TEST_A fails, --bail stops the run. JUnit records TEST_A.
- Attempt 2: TEST_A passes (recovered). TEST_B now fails, --bail stops.
JUnit records TEST_B.
Stored prev failures: {TEST_A}. Current failures: {TEST_B}.
Intersection is empty → ftr_configs.sh overrides exit code to 0 and
the step turns green.
Expected outcome in CI: red attempt 1, red attempt 2 internally, but
the step ends green because no previously-failing test failed again.
DELETE before merging.
|
I think one other improvement I could do is utilize job annotations to show this information better. I'll play around with that idea next, maybe in some other PR. |
|
Pinging @elastic/kibana-operations (Team:Operations) |
Surface per-config status, failures, and retry recovery on the job detail page via a job-scoped Buildkite annotation, so that retry/pass /fail outcomes aren't buried in the step logs.
…MERGE" This reverts commit 1be5cd8.
There was a problem hiding this comment.
Nice approach overall — the intersection logic is small and well-tested, and routing the new CLI through @kbn/failed-test-reporter-cli keeps the JUnit parsing in one place. Two concrete concerns left inline:
- The
retry_validation_delete_before_merge.tsfixture and itsloadTestFilewiring are still in the diff. The PR is no longer in draft, so this needs to come out (or be put behind an env flag) before merge. - An empty failure-intersection on attempt 2 can mean recovered or never ran (the
--bailopen question, plus the FTR-crashed/missing-junit-dir case). Suggest tightening the green check to require that previously-failing tests are actually observed in attempt 2's JUnit (passed or failed) before promoting to green — that resolves the open question without dropping--bail.
Low-priority nits:
node scripts/ftr_check_retry_result list-failures "$junitDir" 2>/dev/null || truesilently swallows any error from the script. Acceptable as a fail-open, but worth at least logging to stderr (without failing the step) so we can spot regressions in the helper itself.- Consider asserting in
ftr_configs.shthattarget/junit/$JOBactually exists / has at least one XML before declaring recovery, as a complementary safeguard to the executed-set check above.
|
Update: We're using buildkite agent version 3.109.1 which doesn't have the job annotation feature. So created a PR to update that. Depending on how that PR plays out I'll see if I can add the job annotations or not. (if not then it can be on the build level but I'd prefer to not have that) |
|
/ci |
|
Confirmed that the job annotations work, I had to bump buildkite agent for kibana to the latest version.
|
… BEFORE MERGE"" This reverts commit 443e1cf.
Extract inline XML diff logic into collect_config_failures(), remove the redundant failed-configs summary from write_job_annotation() (the table rows already show still-failing/new-failure/recovered per config), and simplify the failedConfigs concatenation.
Replace the intersection-of-failures check with collectPassedTestNames, which requires each previously-failing test to appear as an explicit pass in the retry JUnit output. This closes three false-green gaps: runner crash (empty JUnit dir), beforeAll hook failure (tests reported as skipped rather than failed), and stale XML files from attempt 1 persisting on persistent-workspace agents.
…ify recovered message - Wrap scout upload in set+e so a non-zero exit code does not abort the config loop; log the exit code and continue rather than silently swallowing it - Log a clear message when smart-retry is inactive on a third-or-later manual retry - Update the "recovered on retry" annotation to note that new failures on retry are not counted against recovery
Extract smart-retry logic into ftr_smart_retry.sh (store_failing_tests / apply_smart_retry) and annotation helpers into ftr_job_annotation.sh (collect_config_failures / write_job_annotation). The main script is now a thin orchestrator that sources both and reads as a linear narrative.
…e CLI Add snapshot/list-new-failures subcommands to ftr_check_retry_result so the per-config JUnit attribution diff lives in TypeScript rather than bash (find/sort/comm/cp plumbing). Add --prev-failures-stdin to check-intersection so the temp-file handshake in apply_smart_retry becomes a plain pipe. The bash loop now reads as two CLI calls per config instead of managing temp files and directory diffs inline.
Moving annotation logic (write_job_annotation, per-config status table, per-config failing test names, snapshot/list-new-failures CLI subcommands) to a separate PR so this branch focuses purely on the retry mechanism.
There was a problem hiding this comment.
Prior-review concerns from this PR have been addressed in the current diff: the temporary FTR fixture is gone, the silent-green hole is closed by requiring explicit passes via collectPassedTestNames, and the stale attempt-1 XML scenario is handled (and covered by tests in retry_result_checker.test.ts). Two small cleanup notes inline — dead retry_recovered shell variable + stale header comment, and the now-unused computeIntersection helper. Non-blocking.
Generated by Claude Reviewer for issue #269605 · ● 9.7M
…etry_recovered - Drop computeIntersection function and its tests — unused in production code - Initialize retry_recovered=false in ftr_configs.sh so the variable is bound under set -u before ftr_smart_retry.sh sets it on recovery - Revert Scout reporter error handling to original form
…MERGE" This reverts commit 1be5cd8.
…exist anymore" This reverts commit 5201524.
… BEFORE MERGE"" This reverts commit 69c8326.
…t_retry.sh Both variables are only used within ftr_smart_retry.sh, so they belong there.
There was a problem hiding this comment.
One concrete regression risk from the recent move of FAILED_TESTS_KEY into ftr_smart_retry.sh at module scope — inline. Other prior-review concerns (temp fixture, silent-green hole, stale XMLs, computeIntersection) look resolved in the current diff; the previously-flagged dead retry_recovered variable was kept intentionally per commit c3712f999 so I won't re-flag.
Generated by Claude Reviewer for issue #269605 · ● 9.3M
| # Reads/writes globals: exitCode, failedConfigs, | ||
| # FAILED_CONFIGS_KEY, JOB, BUILDKITE_RETRY_COUNT. | ||
|
|
||
| FAILED_TESTS_KEY="${BUILDKITE_STEP_ID}${FTR_CONFIG_GROUP_KEY}_failed_tests" |
There was a problem hiding this comment.
Correctness/regression: this line runs at source-time (top of ftr_configs.sh line 6) which is before the defensive FTR_CONFIG_GROUP_KEY=${FTR_CONFIG_GROUP_KEY:-} assignment on line 9 of ftr_configs.sh. Because ftr_configs.sh enables set -euo pipefail before sourcing, expanding ${FTR_CONFIG_GROUP_KEY} (no :- default) under set -u will abort the entire step with FTR_CONFIG_GROUP_KEY: unbound variable in any run where that env var is not set on the agent — exactly the case the existing FTR_CONFIG_GROUP_KEY=${FTR_CONFIG_GROUP_KEY:-} line was written to guard against. The standard "FTR Configs" steps in .buildkite/pipeline-utils/ci-stats/pick_test_group_run_order/steps.ts:82 set it explicitly, but other callers (e.g. parallel-job paths handled at lines 10–13 of ftr_configs.sh) and local runs do not.
Easiest fix: add a :- default so source-time evaluation is safe.
| FAILED_TESTS_KEY="${BUILDKITE_STEP_ID}${FTR_CONFIG_GROUP_KEY}_failed_tests" | |
| FAILED_TESTS_KEY="${BUILDKITE_STEP_ID:-}${FTR_CONFIG_GROUP_KEY:-}_failed_tests" |
…E BEFORE MERGE"" This reverts commit 2193bfb.
|
Apologies, I think I am going back on this. But let's put this behind an ENV flag One fear is about what happens if we experience an issue with ES that causes every test to timeout. I think we need to figure out some guardrails for this and possibly even add an exit code to avoid a retry at all. |
There was a problem hiding this comment.
One concrete concern on the new BAIL_ARG gating in ftr_configs.sh — left inline.
Generated by Claude Reviewer for issue #269605 · ● 12.6M
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
💔 Build Failed
Failed CI StepsTest Failures
Metrics [docs]
History
|
|
latest build failed because I have the thing turned off by default. I got a test that deliberately fails on second retry. |


Why
Flaky tests are a recurring source of CI noise. When a test fails once but passes on the automatic retry, the whole thing is clearly a flake — yet today the retry still counts any failure (even a different test's flakiness) and keeps the step red. This wastes engineer time investigating failures that aren't real.
What this does
On the first automatic retry of a failing FTR step, instead of failing on any test failure, we now ask: did every test that failed in attempt 1 explicitly pass in attempt 2?
How it works
<skipped/>rather than failed, so they look "recovered" when they weren'tIf all previously-failing tests explicitly passed, the step exit code is overridden to 0.
A CLI (
scripts/ftr_check_retry_result.js) in@kbn/failed-test-reporter-clihandles the XML parsing and verification logic.Bail behaviour
--bailhas been removed from all FTR runs. Without it, every config runs to completion on both attempt 1 and the retry, so the smart-retry check always has a full picture of what's still failing rather than potentially missing the original test because a different failure caused an early exit.