| 1 | +--- |
| 2 | +name: flaky-test-investigator |
| 3 | +description: Investigate Scout and FTR flaky test failures in Kibana. Use when triaging a failed-test issue, a Buildkite-reported failure, a test path that has been failing intermittently, or any time the user asks to look at a flaky test, deflake a test, or stabilize a test. |
| 4 | +disable-model-invocation: true |
| 5 | +--- |
| 6 | + |
| 7 | +# Flaky Test Investigator |
| 8 | + |
| 9 | +Investigate a flaky Scout or FTR test failure and determine what should be done about it. |
| 10 | + |
| 11 | +- The outcome should be an accurate diagnosis, not a quick fix that treats the symptom. |
| 12 | +- Valid outcomes include "this is a real product bug, escalate to the owning team", "this is environmental and will likely self-resolve", or "there isn't enough data to draw a confident conclusion". |
| 13 | + |
| 14 | +## Required input |
| 15 | + |
| 16 | +A link to a GitHub issue with the `failed-test` label is required. If none is provided, ask for one before proceeding. |
| 17 | + |
| 18 | +Ignore any prior root-cause analyses or fix proposals posted by automations; treat them as if they weren't there and reach your own conclusion. Failure-notification comments from `kibanamachine` are still useful as a history signal. |
| 19 | + |
| 20 | +## Investigation |
| 21 | + |
| 22 | +### Understand the test environment |
| 23 | + |
| 24 | +- **Is it failing on `kibana-on-merge` (local pipeline), on Cloud pipelines, or both?** |
| 25 | + - _Why it matters:_ tells you whether the test is compatible with Elastic Cloud at all, and whether the failure is more likely environmental (more common on Cloud) or a defect in the test itself (more common when both fail). |
| 26 | +  - Recommended: learn more about local versus Elastic Cloud pipelines in `references/pipelines.md`. |
| 27 | +- **Did it fail in Buildkite builds with many other unrelated test failures?** |
| 28 | + - _Why it matters:_ broad failure across unrelated tests points to an environment or infrastructure problem, not a problem with this test. |
| 29 | +- **Understand the test server configuration.** Are Scout tests using the **default** or a **custom** test server configuration? Do FTR tests belong to a test config that defines custom server arguments that aren't supported everywhere (for example, on Elastic Cloud)? |
| 30 | + - _Why it matters:_ custom server configurations are a common source of flakiness — they diverge from the configurations used by the broader test suite, so issues affecting only them won't surface elsewhere. They also tend to be less actively maintained. |
| 31 | +- **For Scout, which lane and neighbors shared servers with the failure?** Scout configs in the same Playwright lane share Kibana/Elasticsearch test servers — state can leak between configs. Map the job's `step_key` (e.g. `scout_test_lane_4`) to its config list by downloading `.scout/test_lane_loads.json` from the build's `Scout Test Run Builder` job (`step_key: build_scout_tests`). The same key is scheduled separately per `<arch>/<server-config>`; the job `name` (e.g. "Scout Lane #4 - stateful-classic / default") disambiguates the physical lane. Parallel configs (`parallel.playwright.config.ts`, `workers > 1`) also have multiple workers competing for the same servers, which can surface as transient timeouts under load. |
| 32 | + - _Why it matters:_ if the same neighbor configs and arch/server-config combo recur across failing builds, suspect lane pollution rather than a test bug. Resource pressure in parallel configs can look test-specific but isn't on its own a reason to drop `workers` to 1. |
| 33 | + |
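The lane-pollution check above can be sketched as a simple tally, assuming you have already collected, for each failing build, the list of configs that ran in the same lane as the failure. The function name and the input shape are hypothetical; the real layout of `.scout/test_lane_loads.json` is not described here, so the sketch starts from plain lists of config paths.

```python
from collections import Counter


def recurring_neighbors(lane_configs_per_failure, failing_config):
    """Count how often each neighbor config shared a lane with the failing config.

    lane_configs_per_failure: one list of config paths per failing build
    (hypothetical input shape, extracted by hand or by another script).
    Returns (config, count) pairs, most frequent first.
    """
    counts = Counter()
    for configs in lane_configs_per_failure:
        # De-duplicate within a build and drop the failing config itself.
        counts.update(set(configs) - {failing_config})
    return counts.most_common()


# Example data: three failing builds, config paths shortened for illustration.
lanes = [
    ["a.config.ts", "b.config.ts", "c.config.ts"],
    ["a.config.ts", "b.config.ts", "d.config.ts"],
    ["a.config.ts", "b.config.ts"],
]
neighbors = recurring_neighbors(lanes, "a.config.ts")
```

A neighbor that appears in every failing build is worth a closer look, but the count alone is not proof of pollution; compare it against builds where the test passed.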
| 34 | +### Inspect the failure artifacts |
| 35 | + |
| 36 | +Before you go deep on scope or root-cause hypotheses, look at the artifacts the CI run produced. For UI tests in particular, the screenshot at the moment of failure often resolves the diagnosis in under a minute — and skipping this step is a common reason an investigation ends up in the wrong tier of fix. |
| 37 | + |
| 38 | +For every failure, try to retrieve: |
| 39 | + |
| 40 | +- **Screenshot at the failure point.** What is actually on the page? Is the awaited element present but the selector wrong? Is a loading indicator still visible? Is there an error toast or unexpected modal? Is the page blank (app crash) or on a different route than expected? |
| 41 | +- **DOM / HTML snapshot at the failure point.** Confirms whether the element the test was looking for actually existed in the DOM (selector issue vs. rendering issue vs. product missing the element entirely). |
| 42 | +- **Server logs** (`kibana.log`, `elasticsearch.log` when present). Cross-reference the failure timestamp with any errors in the logs — a server-side 500 or unexpected warning is strong evidence the failure is a product bug, not a test bug. |
| 43 | +- **Full session trace** when the framework supports it (Scout / Playwright). Lets you scrub through every step, locator query, network call, and DOM snapshot. |
| 44 | + |
| 45 | +How to actually find and download each artifact type is framework-specific — see "Retrieve failure artifacts" below. |
| 46 | + |
| 47 | +Things to specifically check in the artifacts before forming a root-cause hypothesis: |
| 48 | + |
| 49 | +- **Did the expected element render at all?** If yes and the selector missed it → flaky selector (Tier 2 fix territory). If no → real rendering / race / data issue (Tier 1 territory). |
| 50 | +- **Is there an error visible in the UI** (toast, banner, console error in the HTML report)? If yes → product side, not test side. |
| 51 | +- **Is the page in an unexpected state** (different URL, different user's data, different space)? → cleanup or isolation issue, often points at `afterEach` / `afterAll`. |
| 52 | +- **Does the screenshot timestamp match the failure timestamp**? Stale artifacts from a prior step can mislead. |
| 53 | + |
| 54 | +If artifacts are not available (expired, not uploaded, no `read_artifacts` token), say so in the report rather than fabricating a hypothesis. "Screenshot would have resolved this; not available" is a valid open question. |
| 55 | + |
| 56 | +### Retrieve failure artifacts |
| 57 | + |
| 58 | +The standard recipe is **list → filter by path → download by ID**, always scoped to the failed job's UUID. Two Buildkite gotchas to know about first: |
| 59 | + |
| 60 | +- **Failed-attempt jobs are hidden by default.** `/builds/<n>` returns only the latest attempt; append `?include_retried_jobs=true` to find the original failing job (the one cited in `failed-test` comments). `retried` and `retried_in_job_id` link the two. |
| 61 | +- **Per-job artifacts use a different endpoint than build-wide artifacts.** If a build retried to green, failure artifacts only live on the failed job's listing (`bk artifacts list <build> -p <pipeline> --job-uuid <jobId>`). Don't conclude "no screenshot uploaded" until you've checked there. |
| 62 | + |
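As a rough illustration of the first gotcha, the sketch below pairs each retried job with the job that replaced it. It assumes `build` is the parsed JSON of a build fetched with `?include_retried_jobs=true`, and that the build object exposes a `jobs` array whose entries carry `id`; `retried` and `retried_in_job_id` are the fields named above. The function name is hypothetical and no network call is made.

```python
def original_failures(build):
    """Return (original_job, retry_job) pairs for every retried job.

    retry_job is None when the retry is not present in the payload.
    """
    jobs = build.get("jobs", [])
    jobs_by_id = {job.get("id"): job for job in jobs}
    pairs = []
    for job in jobs:
        if job.get("retried"):
            retry = jobs_by_id.get(job.get("retried_in_job_id"))
            pairs.append((job, retry))
    return pairs


# Example payload with made-up IDs; real job IDs are UUIDs.
sample_build = {
    "jobs": [
        {"id": "job-1", "state": "failed", "retried": True, "retried_in_job_id": "job-2"},
        {"id": "job-2", "state": "passed", "retried": False, "retried_in_job_id": None},
        {"id": "job-3", "state": "passed", "retried": False, "retried_in_job_id": None},
    ]
}
pairs = original_failures(sample_build)
```

The `id` of the first element in each pair is the job UUID to use when listing per-job artifacts.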
| 63 | +**Scout** (`@kbn/scout-reporting`, not standard Playwright output — `playwright-report/`, `trace.zip`, and video are NOT published): |
| 64 | + |
| 65 | +- `.scout/reports/scout-playwright-test-failures-<runId>/test-failures-summary.json` — maps test name → HTML report. Start here. |
| 66 | +- `.scout/reports/scout-playwright-test-failures-<runId>/<testId>.html` — self-contained: error, stdout, embedded screenshot. Usually sufficient on its own. |
| 67 | +- `.scout/reports/scout-playwright-test-failures-<runId>/scout-failures-<runId>.ndjson` — one record per failure (`id` = `<testId>`, `owner`, `location`, `error.*`) for programmatic use. |
| 68 | +- `**/.scout/test-artifacts/<test-slug>/test-failed-<N>.png` — plain Playwright screenshot; the PNG doesn't carry `<testId>`, so correlate via spec path. |
| 69 | + |
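For programmatic use of the NDJSON file, a minimal sketch might look like the following. It relies only on the `id` field mentioned above and treats every other field as opaque, since the exact record shape is not spelled out here. Both function names are hypothetical, and the report path simply follows the layout listed above.

```python
import json


def load_failures(ndjson_text):
    """Parse NDJSON text into a dict keyed by each record's `id`."""
    failures = {}
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        failures[record["id"]] = record
    return failures


def report_path(run_id, test_id):
    """Build the per-test HTML report path from the layout listed above."""
    base = f".scout/reports/scout-playwright-test-failures-{run_id}"
    return f"{base}/{test_id}.html"


# Example input with made-up IDs and owners.
sample = '{"id": "abc", "owner": "team-a"}\n\n{"id": "def", "owner": "team-b"}\n'
failures = load_failures(sample)
```

With the failure IDs in hand, `report_path(run_id, test_id)` gives the artifact path to filter for when downloading the matching HTML report.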
| 70 | +**FTR** (a single content `<hash>` links every artifact for one failure): |
| 71 | + |
| 72 | +- `target/test_failures/<jobId>_<hash>.{json,log,html}` — `.json` is source of truth; full Kibana/ES stdout lives in `system-out` (there is no separate `kibana.log`). Pull this first. |
| 73 | +- `<test-root>/screenshots/failure/*-<hash>.png` and `<test-root>/failure_debug/html/*-<hash>.html` — UI tests only; fetch only when the failure is UI-side. |
| 74 | +- `.es/*.log` — transport/cluster-shaped failures. |
| 75 | + |
| 76 | +`target/test_failures/` is shared with Scout; filter by `.jobName` (e.g. `FTR Configs #90` vs `Scout Lane #12`) to keep only FTR. On Cloud FTR pipelines the layout differs: one self-contained HTML per failure at `<config-path-with-underscores>-<unix-timestamp>/html/<contentHash>.html` — no `target/test_failures/`, screenshot, or DOM artifacts. |
| 77 | + |
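The filtering described above could be sketched as follows. Two assumptions are baked in and may not hold: that neither `<jobId>` nor `<hash>` contains an underscore, and that FTR job names start with `FTR` (inferred only from the `FTR Configs #90` example). All function names are hypothetical.

```python
import re

# Assumption: <jobId> and <hash> contain no underscores, so a single "_" separates them.
_FAILURE_FILE = re.compile(
    r"^target/test_failures/(?P<job>[^/_]+)_(?P<hash>[^/_.]+)\.(?P<ext>json|log|html)$"
)


def parse_failure_path(path):
    """Split an artifact path into job ID, content hash, and extension."""
    match = _FAILURE_FILE.match(path)
    return match.groupdict() if match else None


def is_ftr_job(job_name):
    """Assumption: FTR job names start with 'FTR', as in 'FTR Configs #90'."""
    return job_name.startswith("FTR")


def keep_ftr(records):
    """Keep only parsed failure records whose `jobName` looks like an FTR job."""
    return [r for r in records if is_ftr_job(r.get("jobName", ""))]


# Example records with made-up names.
records = [
    {"jobName": "FTR Configs #90", "name": "some ftr test"},
    {"jobName": "Scout Lane #12", "name": "some scout test"},
]
ftr_only = keep_ftr(records)
parsed = parse_failure_path("target/test_failures/0190abcd-1234_deadbeef.json")
```

The `hash` returned by `parse_failure_path` is the value to look for in the screenshot and HTML snapshot filenames for the same failure.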
| 78 | +### Understand the scope |
| 79 | + |
| 80 | +Work through all of these questions: |
| 81 | + |
| 82 | +- **How often is the test failing? Are there time spans when it failed most often?** Place the test on the spectrum from "fails very occasionally" (e.g. twice this year) to "fails on every CI run". |
| 83 | + - _Why it matters:_ concentrated failures point to a specific cause tied to that window (a bad commit, an infrastructure incident, a dependency change). |
| 84 | +- **When did the test last fail?** |
| 85 | + - _Why it matters:_ if the last failure was 2–3 weeks ago and there are no new comments on the `failed-test` issue, the flakiness may have already resolved itself — intentionally or as a side effect of unrelated changes. |
| 86 | +- **Are other tests in the same suite or config failing with similar or identical errors?** |
| 87 | + - _Why it matters:_ shared failure modes point to shared building blocks (page objects, fixtures, setup) and usually call for a structural change rather than a per-test patch. |
| 88 | +- **Did it fail on a specific version branch?** |
| 89 | +  - _Why it matters:_ if the failure happens on some branches but not others (for example, not on `main`), compare the failing and passing branches to identify what's different. The passing branch shows what the failing branch is missing (or what it added). |
| 90 | +- **When did it first fail, and when did it last pass?** |
| 91 | + - _Why it matters:_ narrows down the Kibana commit or PR that may have introduced the flakiness. |
| 92 | +- **Has this issue or a related one been closed and reopened before?** |
| 93 | + - _Why it matters:_ a reopen is the single strongest signal that the previous diagnosis did not hold. Don't repeat the previous line of reasoning. |
| 94 | +- **Is there a chain of fix attempts on this test or test file?** Look for multiple PRs in the last 12 months whose titles mention this test or area (e.g. "address flaky X", "fix flaky X", "another attempt at X"). |
| 95 | + - _Why it matters:_ a significant share of "fix" PRs are followed within months by another fix PR on the same area. If you are about to be the third or fourth attempt, the previous shape was almost certainly wrong. Do not repeat it. |
| 96 | +- **What did the previous fix change, and what did it claim to address?** |
| 97 | + - _Why it matters:_ if it touched only test code and the test recurred, weigh the product side more heavily this time. If it touched product code and still recurred, the real bug is likely deeper than the previous diff captured. |
| 98 | + |
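The frequency and recency questions above can be answered with a small tally once you have the failure dates, for example collected by hand from the `kibanamachine` comments on the issue. This is only a sketch: both function names are hypothetical, and how you extract the dates is left open.

```python
from collections import Counter
from datetime import date


def weekly_failure_counts(failure_dates):
    """Count failures per ISO (year, week) to reveal concentrated windows."""
    counts = Counter()
    for d in failure_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def days_since_last_failure(failure_dates, today):
    """Number of days between the most recent failure and `today`."""
    return (today - max(failure_dates)).days


# Example dates: one isolated failure in January, a cluster in early March.
dates = [date(2024, 1, 15), date(2024, 3, 4), date(2024, 3, 5), date(2024, 3, 6)]
per_week = weekly_failure_counts(dates)
quiet_days = days_since_last_failure(dates, today=date(2024, 3, 27))
```

A week with a spike of failures is a prompt to look at what merged or what infrastructure changed in that window; a long quiet period suggests the flakiness may already have resolved.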
| 99 | +### Does the test follow best practices? |
| 100 | + |
| 101 | +Common best-practice violations that cause flakiness: |
| 102 | + |
| 103 | +- **Pick the right test type** (`docs/extend/scout/best-practices#pick-the-right-test-type`). UI tests are notoriously more flaky than component, API, and Jest unit/integration tests. |
| 104 | +- **Prefer APIs for setup and teardown** (`docs/extend/scout/ui-best-practices#prefer-kibana-apis-over-ui-for-setup-and-teardown`). Driving setup/teardown through the UI is slower and flakier. |
| 105 | +- **Wait for UI updates after actions** (`docs/extend/scout/ui-best-practices#wait-for-ui-updates-when-the-next-action-requires-it`). Confirm the action produced the expected result and the UI has rendered before continuing. |
| 106 | +- **Wait for complex UI to finish rendering** (`docs/extend/scout/ui-best-practices#wait-for-complex-components-to-fully-render`). |
| 107 | + |
| 108 | +Scout and FTR tests should also follow the general best practices in `docs/extend/scout/best-practices.md`, the UI best practices in `docs/extend/scout/ui-best-practices.md`, and the API best practices in `docs/extend/scout/api-best-practices.md`. |
| 109 | + |
| 110 | +### Investigation pitfalls |
| 111 | + |
| 112 | +Watch out for these pitfalls when investigating the failure: |
| 113 | + |
| 114 | +- **Ignoring the bigger picture**: ensure you have as much data as you can about the test environment and related failures (in the same test file, test config or elsewhere). |
| 115 | +- **Recommending a timeout bump as the primary fix**: timeout bumps consistently fail to hold in Kibana. Investigate what never happened — confirm the slow operation is intrinsic to the product (e.g. index creation, SLO calculation) rather than a missing `waitForResponse`/`waitForSelector` upstream. |
| 116 | +- **Never recommend wrapping the assertion in `retry()` as the fix.** When an assertion is wrapped in `retry()` without addressing the underlying cause, the failure frequently recurs. A `retry()` wrapper is acceptable only as a temporary unblock; in that case the recommendation must explicitly say "this is a stopgap" and link a follow-up issue for the real fix. |
| 117 | +- **Never recommend only test-side async hooks (`await`, `waitFor`, `waitUntil`) when there is evidence of a production-side race.** This is the most common pattern that looks like a fix but isn't — popular precisely because it appears principled, but it lets the test wait _longer_ without fixing the race. |
| 118 | +- **Weakening assertions**: don't recommend making assertions more lenient or narrowing their scope just to make the test pass — this hides regressions instead of catching them. Generic test-only refactors that loosen assertions often regress. |
| 119 | +- **Reducing coverage surface**: don't recommend stripping tags to skip the test in certain environments (e.g. Cloud) or project types (e.g. serverless Security) unless you have a real reason it shouldn't run there. "It's flaky here" is not a real reason. |
| 120 | +- **Trusting flaky-test-runner alone**: a green 30/30 or 60/60 run does not prove a fix held. The runner runs tests in isolation, which doesn't always match CI (Scout test runs share the same test servers across multiple test configs). |
| 121 | +- **Assuming "fix the test, not the product"**: always ask first whether the product could be at fault. Test-only fixes are meaningfully less durable than fixes that change production code. |
| 122 | +- **Reporting false certainty**: "I don't know, here are the two plausible explanations and what would distinguish them" is more useful to the owning team than a confident wrong answer. |
| 123 | + |
| 124 | +### Is a fix worth it? |
| 125 | + |
| 126 | +Once you have a diagnosis, the right next step is not always a code change. Consider these alternatives before recommending a code fix: |
| 127 | + |
| 128 | +- **Delete the test.** Do other tests already cover what this one is testing? |
| 129 | +- **Refactor or downgrade the test.** See "Pick the right test type" in `docs/extend/scout/best-practices.md`. A functional test can often become an API, component, or Jest unit/integration test. |
| 130 | +- **Update the tags.** Are the test's tags still appropriate? Should it run on Cloud? Should it be excluded from certain serverless solution types (e.g. Security)? |
| 131 | +- **Escalate to the owning team.** If this is a recurring offender or you suspect a product bug, the most useful conclusion may be a writeup handed to the owners, not a fix attempt. |
| 132 | + |
| 133 | +## Reporting |
| 134 | + |
| 135 | +When you report your conclusion, include these details: |
| 136 | + |
| 137 | +1. **What the test does** (one paragraph) |
| 138 | +2. **What failed and when** (most recent failure + count of failures over time) |
| 139 | +3. **Where it ran** (Cloud or local pipelines, or both) |
| 140 | +4. **Root cause hypothesis** (a few sentences describing the outcome of the investigation) |
| 141 | +5. **Evidence supporting that hypothesis**, and evidence against it (if you considered alternative hypotheses, include them alongside their confidence level) |
| 142 | +6. **Failure screenshot description**: what did you observe in the failure screenshot? |
| 143 | +7. **Recommended next step** (this won't always be a code change). If you are recommending a code fix, give an honest note on expected durability. |
| 144 | +8. **Open questions** the investigation could not resolve |