Skip to content

feat(ci): mark FTR retry green when previously-failing tests recover#269605

Open
TamerlanG wants to merge 40 commits into
elastic:mainfrom
TamerlanG:ftr/smart-retry
Open

feat(ci): mark FTR retry green when previously-failing tests recover#269605
TamerlanG wants to merge 40 commits into
elastic:mainfrom
TamerlanG:ftr/smart-retry

Conversation

@TamerlanG
Copy link
Copy Markdown
Contributor

@TamerlanG TamerlanG commented May 16, 2026

Why

Flaky tests are a recurring source of CI noise. When a test fails once but passes on the automatic retry, the whole thing is clearly a flake — yet today the retry still counts any failure (even a different test's flakiness) and keeps the step red. This wastes engineer time investigating failures that aren't real.

What this does

On the first automatic retry of a failing FTR step, instead of failing on any test failure, we now ask: did every test that failed in attempt 1 explicitly pass in attempt 2?

  • If yes → the original failures recovered → CI goes green, even if a different unrelated test happened to be flaky on retry.
  • If no → at least one original failure did not pass → CI stays red.

How it works

  1. At the end of attempt 1, the failing test names are read from the JUnit XML and stored in Buildkite metadata.
  2. At the end of attempt 2, each previously-failing test is checked for an explicit pass in the retry JUnit output — not just absence of failure. This guards against three false-green scenarios:
    • Runner crash — the JUnit directory is empty so no failures appear, but nothing passed either
    • beforeAll hook failure — tests are reported as <skipped/> rather than failed, so they look "recovered" when they weren't
    • Stale XML files — on persistent-workspace agents, attempt-1 XML files can persist alongside attempt-2 files

If all previously-failing tests explicitly passed, the step exit code is overridden to 0.

A CLI (scripts/ftr_check_retry_result.js) in @kbn/failed-test-reporter-cli handles the XML parsing and verification logic.

Bail behaviour

--bail has been removed from all FTR runs. Without it, every config runs to completion on both attempt 1 and the retry, so the smart-retry check always has a full picture of what's still failing rather than potentially missing the original test because a different failure caused an early exit.

TamerlanG added 3 commits May 16, 2026 15:06
Adds `retry_result_checker.ts` to `@kbn/failed-test-reporter-cli` with
two CLI commands exposed via `scripts/ftr_check_retry_result.js`:

- `list-failures <junit-dir>` — prints the name of every failed test
  found in the JUnit XML files under a directory, one per line.
- `check-intersection --junit-dir <dir> --prev-failures-file <file>` —
  compares the current attempt's failures against a saved list from a
  previous attempt; exits 0 if the intersection is empty (all
  previously-failing tests recovered), exits 1 otherwise.

Also refactors `failed_tests_reporter_cli.ts` to export
`runFailedTestsReporterCli()` rather than executing on import, so the
package can export both CLIs from a single entry point without one
triggering when the other is called.
On the first automatic retry of a failing FTR step, the step is now
marked green if every test that failed in attempt 1 passes in attempt 2
— even if a different (previously-passing) test happens to fail on
retry, which would indicate a separate flake unrelated to the original
failure.

How it works:
- End of attempt 1: JUnit XML is parsed and the failing test names are
  stored in Buildkite metadata.
- End of attempt 2: the stored names are retrieved and intersected with
  the current attempt's failures. An empty intersection overrides the
  exit code to 0.

Skipped for flaky-test-runner runs (KIBANA_FLAKY_TEST_RUNNER_CONFIG).

Known limitation: if --bail causes attempt 2 to stop on a different
test before reaching the originally-failing test, the intersection will
appear empty and the step will be marked green even though the original
failing test was never verified.
@TamerlanG TamerlanG added Team:Operations Kibana-Operations Team release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting labels May 16, 2026
Plants a deliberately-flaky test pair inside the unused_urls_task FTR
config to validate the retry intersection logic introduced in this PR.

Setup (relies on FTR's --bail and BUILDKITE_RETRY_COUNT):
- Attempt 1: TEST_A fails, --bail stops the run. JUnit records TEST_A.
- Attempt 2: TEST_A passes (recovered). TEST_B now fails, --bail stops.
  JUnit records TEST_B.

Stored prev failures: {TEST_A}.  Current failures: {TEST_B}.
Intersection is empty → ftr_configs.sh overrides exit code to 0 and
the step turns green.

Expected outcome in CI: red attempt 1, red attempt 2 internally, but
the step ends green because no previously-failing test failed again.

DELETE before merging.
@TamerlanG
Copy link
Copy Markdown
Contributor Author

Seems to work according to this build.

I added a mock FTR suite in 54726fb

image

@TamerlanG
Copy link
Copy Markdown
Contributor Author

I think one other improvement I could do is utilize job annotations to show this information better. I'll play around with that idea next, maybe in some other PR.

@TamerlanG TamerlanG marked this pull request as ready for review May 16, 2026 16:32
@TamerlanG TamerlanG requested review from a team as code owners May 16, 2026 16:32
@infra-vault-gh-plugin-prod
Copy link
Copy Markdown

Pinging @elastic/kibana-operations (Team:Operations)

@TamerlanG TamerlanG removed the request for review from a team May 16, 2026 16:37
TamerlanG and others added 3 commits May 18, 2026 09:51
Surface per-config status, failures, and retry recovery on the job
detail page via a job-scoped Buildkite annotation, so that retry/pass
/fail outcomes aren't buried in the step logs.
@TamerlanG TamerlanG added the reviewer:claude PR review and comments with Claude label May 18, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice approach overall — the intersection logic is small and well-tested, and routing the new CLI through @kbn/failed-test-reporter-cli keeps the JUnit parsing in one place. Two concrete concerns left inline:

  1. The retry_validation_delete_before_merge.ts fixture and its loadTestFile wiring are still in the diff. The PR is no longer in draft, so this needs to come out (or be put behind an env flag) before merge.
  2. An empty failure-intersection on attempt 2 can mean recovered or never ran (the --bail open question, plus the FTR-crashed/missing-junit-dir case). Suggest tightening the green check to require that previously-failing tests are actually observed in attempt 2's JUnit (passed or failed) before promoting to green — that resolves the open question without dropping --bail.

Low-priority nits:

  • node scripts/ftr_check_retry_result list-failures "$junitDir" 2>/dev/null || true silently swallows any error from the script. Acceptable as a fail-open, but worth at least logging to stderr (without failing the step) so we can spot regressions in the helper itself.
  • Consider asserting in ftr_configs.sh that target/junit/$JOB actually exists / has at least one XML before declaring recovery, as a complementary safeguard to the executed-set check above.

Comment thread src/platform/test/api_integration/apis/unused_urls_task/index.ts
@TamerlanG
Copy link
Copy Markdown
Contributor Author

Update: We're using buildkite agent version 3.109.1 which doesn't have the job annotation feature. So created a PR to update that. Depending on how that PR plays out I'll see if I can add the job annotations or not. (if not then it can be on the build level but I'd prefer to not have that)

@TamerlanG
Copy link
Copy Markdown
Contributor Author

/ci

@TamerlanG
Copy link
Copy Markdown
Contributor Author

Confirmed that the job annotations work, I had to bump buildkite agent for kibana to the latest version.

image

TamerlanG and others added 10 commits May 26, 2026 16:29
Extract inline XML diff logic into collect_config_failures(), remove the
redundant failed-configs summary from write_job_annotation() (the table
rows already show still-failing/new-failure/recovered per config), and
simplify the failedConfigs concatenation.
Replace the intersection-of-failures check with collectPassedTestNames,
which requires each previously-failing test to appear as an explicit pass
in the retry JUnit output. This closes three false-green gaps: runner
crash (empty JUnit dir), beforeAll hook failure (tests reported as skipped
rather than failed), and stale XML files from attempt 1 persisting on
persistent-workspace agents.
…ify recovered message

- Wrap scout upload in set+e so a non-zero exit code does not abort the
  config loop; log the exit code and continue rather than silently swallowing it
- Log a clear message when smart-retry is inactive on a third-or-later manual retry
- Update the "recovered on retry" annotation to note that new failures on retry
  are not counted against recovery
Extract smart-retry logic into ftr_smart_retry.sh (store_failing_tests /
apply_smart_retry) and annotation helpers into ftr_job_annotation.sh
(collect_config_failures / write_job_annotation). The main script is now
a thin orchestrator that sources both and reads as a linear narrative.
…e CLI

Add snapshot/list-new-failures subcommands to ftr_check_retry_result so
the per-config JUnit attribution diff lives in TypeScript rather than bash
(find/sort/comm/cp plumbing). Add --prev-failures-stdin to check-intersection
so the temp-file handshake in apply_smart_retry becomes a plain pipe.

The bash loop now reads as two CLI calls per config instead of managing
temp files and directory diffs inline.
Moving annotation logic (write_job_annotation, per-config status table,
per-config failing test names, snapshot/list-new-failures CLI subcommands)
to a separate PR so this branch focuses purely on the retry mechanism.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior-review concerns from this PR have been addressed in the current diff: the temporary FTR fixture is gone, the silent-green hole is closed by requiring explicit passes via collectPassedTestNames, and the stale attempt-1 XML scenario is handled (and covered by tests in retry_result_checker.test.ts). Two small cleanup notes inline — dead retry_recovered shell variable + stale header comment, and the now-unused computeIntersection helper. Non-blocking.

Generated by Claude Reviewer for issue #269605 · ● 9.7M

Comment thread .buildkite/scripts/steps/test/ftr_smart_retry.sh
TamerlanG added 8 commits May 27, 2026 16:46
…etry_recovered

- Drop computeIntersection function and its tests — unused in production code
- Initialize retry_recovered=false in ftr_configs.sh so the variable is bound
  under set -u before ftr_smart_retry.sh sets it on recovery
- Revert Scout reporter error handling to original form
…t_retry.sh

Both variables are only used within ftr_smart_retry.sh, so they belong there.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concrete regression risk from the recent move of FAILED_TESTS_KEY into ftr_smart_retry.sh at module scope — inline. Other prior-review concerns (temp fixture, silent-green hole, stale XMLs, computeIntersection) look resolved in the current diff; the previously-flagged dead retry_recovered variable was kept intentionally per commit c3712f999 so I won't re-flag.

Generated by Claude Reviewer for issue #269605 · ● 9.3M

# Reads/writes globals: exitCode, failedConfigs,
# FAILED_CONFIGS_KEY, JOB, BUILDKITE_RETRY_COUNT.

FAILED_TESTS_KEY="${BUILDKITE_STEP_ID}${FTR_CONFIG_GROUP_KEY}_failed_tests"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness/regression: this line runs at source-time (top of ftr_configs.sh line 6) which is before the defensive FTR_CONFIG_GROUP_KEY=${FTR_CONFIG_GROUP_KEY:-} assignment on line 9 of ftr_configs.sh. Because ftr_configs.sh enables set -euo pipefail before sourcing, expanding ${FTR_CONFIG_GROUP_KEY} (no :- default) under set -u will abort the entire step with FTR_CONFIG_GROUP_KEY: unbound variable in any run where that env var is not set on the agent — exactly the case the existing FTR_CONFIG_GROUP_KEY=${FTR_CONFIG_GROUP_KEY:-} line was written to guard against. The standard "FTR Configs" steps in .buildkite/pipeline-utils/ci-stats/pick_test_group_run_order/steps.ts:82 set it explicitly, but other callers (e.g. parallel-job paths handled at lines 10–13 of ftr_configs.sh) and local runs do not.

Easiest fix: add a :- default so source-time evaluation is safe.

Suggested change
FAILED_TESTS_KEY="${BUILDKITE_STEP_ID}${FTR_CONFIG_GROUP_KEY}_failed_tests"
FAILED_TESTS_KEY="${BUILDKITE_STEP_ID:-}${FTR_CONFIG_GROUP_KEY:-}_failed_tests"

@tylersmalley
Copy link
Copy Markdown
Member

Apologies, I think I am going back on this. But let's put this behind an ENV flag FTR_SMART_RETRY_ENABLED. That way we can test specific commits, as well as disable the functionality if we find an issue without reverting the entire change.

One fear is about what happens if we experience an issue with ES that causes every test to timeout. I think we need to figure out some guardrails for this and possibly even add an exit code to avoid a retry at all.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concrete concern on the new BAIL_ARG gating in ftr_configs.sh — left inline.

Generated by Claude Reviewer for issue #269605 · ● 12.6M

Comment thread .buildkite/scripts/steps/test/ftr_configs.sh Outdated
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@kibanamachine
Copy link
Copy Markdown
Contributor

kibanamachine commented May 31, 2026

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #109 / unused_urls_task retry-validation TEST_A: intentionally fails on attempt 1, passes on attempt 2
  • [job] [logs] FTR Configs #109 / unused_urls_task retry-validation TEST_B: passes on attempt 1, intentionally fails on attempt 2

Metrics [docs]

✅ unchanged

History

@TamerlanG
Copy link
Copy Markdown
Contributor Author

latest build failed because I have the thing turned off by default. I got a test that deliberately fails on second retry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes reviewer:claude PR review and comments with Claude Team:Operations Kibana-Operations Team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants