[Flaky Test Fixer] Use flaky-test-investigator skill in Failed Test Investigator workflow (elastic#271443)

csr · web-flow · commit eb743555eb8f · 2026-05-28T14:06:02.000+02:00
This PR updates the existing Failed Test Investigator workflow to use the [newly introduced](elastic#269071) `flaky-test-investigator` skill. The skill is being picked up correctly: see sample output [here](elastic/sdh-kibana-automated-issue-triage#18 (comment)).

### Local testing

You can test the workflow by invoking it locally:

```
gh aw trial ./.github/workflows/failed-test-investigator.md \
  --logical-repo elastic/kibana \
  --trigger-context elastic#270988
```

The command above creates a private repository and invokes the workflow for you. For the workflow to run, you first need to add a `LITELLM_API_KEY` secret to the private test repository that `gh aw trial` creates (generate the key via LiteLLM, for example).
diff --git a/.github/workflows/failed-test-investigator.md b/.github/workflows/failed-test-investigator.md
@@ -87,20 +87,9 @@ Investigate a failed-test issue, classify the failure, and propose a fix when ap
 - **`issues` trigger**: use the triggering issue (non-PR, labeled `failed-test`).
 - **`workflow_dispatch`**: use issue `${{ github.event.inputs.issue_number }}`. Fetch it explicitly before analysis, and post the final comment there.
 
-## Where did the test run?
-
-The test's **target** (e.g. `local-stateful-classic`, `cloud-serverless-security_complete`) tells you where it ran:
-
-- **`cloud-*`** — ran against a real Elastic Cloud project (serverless) or deployment (stateful). Pipeline names: `appex-qa-{serverless|stateful}-kibana-{ftr|scout}-tests`.
-- **`local-*`** — ran on the agent's local machine. `kibana-on-merge` and `kibana-pull-request` are local (no Elastic Cloud API calls), so the environment is more stable and less prone to network/env flakiness.
-
 ## Investigate
 
-1. Read the issue title, body, labels, and all comments.
-2. Parse test metadata if present: location (test file path), config path, code owners, target.
-3. Look at all the failures reported in the issue. The very same test could have been failing with different error messages, for different reasons, on different pipelines, and on different branches.
-4. Inspect the relevant test file and nearby helpers/fixtures. For Scout, start from the reported location; otherwise infer from the title.
-5. Check recent git history and blame on the test file and related product code.
+Investigate the test failure(s) using the `flaky-test-investigator` skill.
 
 Every conclusion must cite specific evidence. Do not guess.
 
@@ -150,53 +139,32 @@ No other side-effects beyond posting the comment and updating the label.
 
 ## Comment format
 
-Post exactly one comment with two main parts:
-
-- **Visible section**: a very concise summary that would inform a developer with a quick glance. Highlight main findings. Keep it high-signal and to the point.
-- **Collapsed `<details>` section**: full long-form context for the downstream auto-fix agent (and any human who wants to audit the call).
-
-The visible section is a _distillation_ of the collapsed one. Do not repeat content verbatim across both: the visible bullets summarize, the collapsed block holds the full evidence the summary was derived from.
-
-### Visible (top), in this order:
-
-1. **One-line bold headline** stating the result kind and one identifying detail. Consistent with `classification` but not templated. Example: `**Likely test-design fix** — missing waitForAlertsToPopulate() in building_block_alerts.spec.ts`.
-
-2. **A 3–5 sentence prose paragraph** (no headings, no bullets) covering: what broke and where (name the test file/name), the most likely root cause, and any evidence-backed author attribution with `@username` so they get notified on first read.
-
-3. **One-line action hint**: the proposed fix, recommended action, or missing evidence. Skip if the paragraph already covers it.
-
-4. **Findings bullets** — exactly these four, in this order, with one concrete value each. Downstream tooling parses these directly; preserve keys, casing, and `` - `key`: value `` shape:
-
-   - `classification`: `test-design` | `test-environment` | `application` | `external` | `inconclusive`
-   - `confidence`: `high` | `medium` | `low`
-   - `test.type`: `scout` (if `scout-playwright` label) | `ftr` | `jest` | `unknown`
-   - `test.file`: repo-relative path, or `unknown`
-
-5. **Suspected root cause** — 2–4 short bullets, each tied to a specific piece of evidence. Skip the section entirely when `classification` is `external` or `inconclusive` and there is nothing concrete to assert.
-
-6. **Key references** — at most 3 Markdown links: the failing test file, the failing CI run, and the implicated commit (when one exists). Skip any of the three that are not applicable; skip the section entirely when none apply.
-
-### Collapsed (`<details>`):
-
-This section is the full context for agents and humans to dive deep into the findings. Verify all information. Wrap it in a single `<details>` block. The blank lines around `</summary>` and `</details>` are required for the inner markdown to render.
-
-```
-<details>
-<summary>See full details</summary>
+Post exactly one comment. Keep the visible portion very short and easy to read:
 
-#### Full root-cause analysis
+1. **One-line bold headline** stating the result kind and one identifying detail.
+2. **Diagnosis** (≤5 concise bullet points): what broke and where, the most likely root cause.
+3. **Next steps** (≤5 concise bullet points).
 
-The long-form version of the visible "Suspected root cause" bullets. Walk through the evidence chain step by step. Cite the specific log lines, stack frames, blame results, or related PRs that led to the conclusion.
+Put the full `flaky-test-investigator` skill output inside a collapsed `<details><summary>Investigation details</summary> ... </details>` block (not in the visible portion). Open the block with a `#### Findings` subsection containing exactly these four bullets in this order — downstream tooling parses them, so preserve keys, casing, and `` - `key`: value `` shape. These bullets must live **inside `<details>`**, never in the visible portion:
 
-#### Evidence used
+- `classification`: `test-design` | `test-environment` | `application` | `external` | `inconclusive`
+- `confidence`: `high` | `medium` | `low`
+- `test.type`: `scout` (if `scout-playwright` label) | `ftr` | `jest` | `unknown`
+- `test.file`: repo-relative path, or `unknown`
 
-A complete list of the evidence consulted: issue comments, file paths, commits, CI runs, blame output, related PRs. Each item should be a Markdown link, not a bare path or SHA.
+The skill's "Reporting" subsections should also be inside the collapsible section:
 
-#### Suggested patch
+- What the test does
+- What failed and when
+- Where it ran
+- Root cause hypothesis
+- Evidence
+- Failure screenshot
+- Recommended next step
+- Open questions
 
-Only when justified by the evidence: a small diff-style snippet showing the suggested edit. Include the exact file, function, assertion, wait condition, fixture, selector, API, or behavior to change. Omit this section entirely when no defensible patch can be proposed.
+Blank lines around `</summary>` and `</details>` are required for the inner markdown to render.
 
-</details>
-```
+End the comment with this footer line (verbatim, on its own line after the `</details>` block):
 
-Use `####` headings inside the details block (not `###`) so they nest below the comment's own structure. Any of the three subsections may be omitted when there is nothing meaningful to put in it.
+`<sup>AI-generated, share feedback in [#appex-qa](https://elastic.slack.com/archives/C04HT4P1YS3)</sup>`
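
The new comment format says downstream tooling parses the four `#### Findings` bullets inside the `<details>` block. This PR does not show that tooling, so the following is only a minimal sketch of how such a parser might check the contract. The function name `parse_findings`, the regular expressions, and the sample comment are all hypothetical; only the keys and allowed values are taken from the workflow text.

```python
import re

# Keys and allowed values copied from the workflow text.
# Everything else in this sketch (names, regexes, sample) is an assumption.
KEYS = ["classification", "confidence", "test.type", "test.file"]
ALLOWED = {
    "classification": {"test-design", "test-environment", "application",
                       "external", "inconclusive"},
    "confidence": {"high", "medium", "low"},
    "test.type": {"scout", "ftr", "jest", "unknown"},
}

# Matches lines shaped like:  - `key`: value   (value optionally in backticks)
BULLET = re.compile(r"^- `([^`]+)`: `?([^`\n]+?)`?\s*$", re.MULTILINE)


def parse_findings(comment: str) -> dict:
    """Extract the four findings bullets from the collapsed <details> block."""
    details = re.search(r"<details>(.*?)</details>", comment, re.DOTALL)
    if details is None:
        raise ValueError("comment has no <details> block")

    # Keep only the contract keys, in the order they appear.
    found = [(k, v) for k, v in BULLET.findall(details.group(1)) if k in KEYS]
    if [k for k, _ in found] != KEYS:
        raise ValueError("findings bullets are missing or out of order")

    findings = dict(found)
    for key, allowed in ALLOWED.items():
        if findings[key] not in allowed:
            raise ValueError(f"unexpected value for {key}: {findings[key]}")
    return findings


# Hypothetical comment body following the format described in the workflow.
SAMPLE = """**Likely test-design fix**

<details>
<summary>Investigation details</summary>

#### Findings

- `classification`: `test-design`
- `confidence`: `medium`
- `test.type`: `ftr`
- `test.file`: `unknown`

</details>
"""

if __name__ == "__main__":
    print(parse_findings(SAMPLE))
```

Because the sketch only searches inside `<details>`, it rejects a comment whose findings bullets appear solely in the visible portion, which mirrors the rule that they must live inside the collapsed block.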