feat: add e2e triage CI workflow with Slack integration #741
alishakawaguchi merged 32 commits into main
Make sha and failed_agents optional for workflow_dispatch triggers. When omitted, these values are derived from the run URL via the GitHub API, reducing friction when triggering triage from the UI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 4a44db7b807d
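To illustrate the first half of that derivation, here is a minimal sketch of pulling the repository and run ID out of a run URL. The function name `parse_run_url` and the URL shape are assumptions for illustration, not code from this PR; the actual lookup of the SHA and failed jobs would then go through the GitHub API (for example with `gh`), which is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: split an Actions run URL into "<owner>/<repo> <run_id>".
# Assumes URLs of the form https://github.com/<owner>/<repo>/actions/runs/<id>[/...].
parse_run_url() {
  local url="${1:-}"
  local re='^https://github\.com/([^/]+/[^/]+)/actions/runs/([0-9]+)'
  if [[ "$url" =~ $re ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    echo "error: not a GitHub Actions run URL: $url" >&2
    return 1
  fi
}
```

Rejecting anything that does not match the expected shape keeps a mistyped input from reaching the API call at all.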
- Consolidate two gh API calls into one (headSha + jobs in single request)
- Extract duplicated CSV-to-JSON jq pattern into csv_to_json function
- Add "null" guard to agents_json validation
- Use shallow clone (fetch-depth: 1) for triage jobs
- Add server-side error logging in HTTP handler
- Fix gosec nolint placement and noctx lint errors in tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 0f803598ba36
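For readers who have not seen the workflow, a `csv_to_json` helper along these lines would cover the CSV-to-JSON step. This is a sketch under assumptions, not the function from the commit: the trimming and empty-entry handling are guesses at reasonable behavior, and it requires `jq` on the PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical reconstruction: convert "a, b,,c" into the compact JSON array ["a","b","c"].
csv_to_json() {
  jq -cn --arg csv "${1:-}" '
    $csv
    | split(",")
    | map(sub("^\\s+"; "") | sub("\\s+$"; ""))  # trim whitespace around each entry
    | map(select(length > 0))                   # drop empty entries
  '
}
```

Calling `csv_to_json 'vogon, agent-b'` prints `["vogon","agent-b"]`, and an empty argument prints `[]`, which a later validation step can reject alongside `null`.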
Pull request overview
Adds a Slack-triggered E2E triage path that bridges Slack thread replies (triage e2e) to a new GitHub Actions triage workflow, so failing CI runs can be triaged and reported back to Slack with minimal manual steps.
Changes:
- Introduce `.github/workflows/e2e-triage.yml` (workflow_dispatch + repository_dispatch) to run `/e2e:triage-ci` per failed agent and post Slack thread updates.
- Add `cmd/e2e-triage-dispatch/` HTTP service plus `internal/slacktriage/` helpers to validate Slack events, parse parent alert metadata, and dispatch GitHub events.
- Add machine-readable `meta:` data to E2E Slack alerts, plus docs and a runner script for the triage workflow.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/run-e2e-triage.sh | Runner script invoked by the triage workflow to execute the Claude E2E triage command and tee logs to artifacts. |
| internal/slacktriage/parent_message.go | Parses the meta: line from Slack alerts into structured metadata for dispatch. |
| internal/slacktriage/normalize.go | Normalizes Slack reply text and checks for the exact triage trigger phrase. |
| internal/slacktriage/dispatch.go | Builds the repository_dispatch payload from parsed metadata + Slack thread info. |
| internal/slacktriage/*_test.go | Unit tests for trigger normalization, parent metadata parsing, and dispatch payload creation. |
| cmd/e2e-triage-dispatch/main.go | Slack event receiver: verifies signatures, fetches parent message, parses metadata, dispatches to GitHub. |
| cmd/e2e-triage-dispatch/main_test.go | Handler + dispatcher unit tests (signature verification, ignore cases, dispatch path). |
| .github/workflows/e2e.yml | Adds machine-readable meta: metadata to the Slack failure alert. |
| .github/workflows/e2e-triage.yml | New triage workflow that validates payload, derives sha/agents when needed, runs triage, posts Slack updates, uploads artifacts. |
| docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md | Design doc describing the Slack→GitHub triage system and contract. |
| docs/architecture/slack-e2e-triage.md | Architecture/runbook-style overview for operating the Slack-triggered triage. |
| README.md | Documents Slack-triggered E2E triage and points to the architecture doc. |
Adds push-triggered test mode that runs with the vogon canary agent (no API costs) when workflow-related files change on this branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: eec73e0fab92
This reverts commit f8a82d6. Entire-Checkpoint: 363c74b4a8c5
The triage workflow was checking out the failed run's SHA, which doesn't contain the triage script. Now checks out the workflow's own branch and passes the target SHA as an env var instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: d4ec0a1e350d
Use --allowedTools with explicit per-command scoping instead of --dangerously-skip-permissions. Each gh command is locked to the specific repo, workflow, and agent. No generic shell access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 4004148a4e05
Instead of giving Claude shell access to gh/scripts, download artifacts in the script before invoking Claude. Claude only gets Read, Grep, and Glob — pure analysis, no shell execution. Also improve job summary to show helpful message when log is empty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: ca4f43d851a5
…aries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 4fc3a7119ed8
…lack triage Replace the Go-based Slack Events API dispatch service with a lightweight Cloudflare Worker that bridges Slack links to GitHub workflow_dispatch. The e2e.yml alert now posts via bot token (chat.postMessage) to capture thread context, then includes a clickable "Run Triage" link. - Add workers/e2e-triage-trigger/ (Cloudflare Worker) - Switch e2e.yml Slack alert from webhook to bot token + curl - Remove repository_dispatch trigger from e2e-triage.yml - Delete cmd/e2e-triage-dispatch/ and internal/slacktriage/ - Update docs and README Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: b4154940fa73
The e2e-triage-trigger worker belongs in the infra repo (cloudflare/workers/e2e-triage-trigger/), not the CLI repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: c67902090531
bugbot run
…orkflow jq capture() crashes the pipeline when a failed job name doesn't match the expected (agent) pattern. Wrap in try-catch to gracefully skip non-matching jobs. Add concurrency group to e2e-triage workflow to prevent duplicate runs from Slack retries or re-dispatches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 6db4ca1d788d
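A minimal sketch of the guarded extraction described in that commit, assuming job names look like `e2e (agent)` and that the input is a jobs listing with `name` and `conclusion` fields. The function name `extract_failed_agents` and the exact regex are illustrative, not taken from the workflow; it requires `jq`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read a jobs listing on stdin and print the agents
# of failed jobs as a compact JSON array, skipping names that do not fit.
extract_failed_agents() {
  jq -c '
    [ .jobs[]
      | select(.conclusion == "failure")
      | .name
      # try/catch turns an error on an unexpected name into "no output"
      | (try (capture("\\((?<agent>[^)]+)\\)") | .agent) catch empty)
    ]
    | unique
  '
}
```

With this shape, a failed job whose name carries no parenthesized agent simply contributes nothing to the matrix instead of aborting the step.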
The tee to stdout made the "Run triage" step log unreadable since GitHub Actions logs render markdown as plain text. The rendered report is already written to $GITHUB_STEP_SUMMARY. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 748308902ac5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 9835d73e7909
Migrate triage to claude-code-action, add plan generation step after triage, post "Fix It" link to Slack, and create e2e-fix.yml workflow that applies plans and opens draft PRs via claude-code-action. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: b0e6e608e807
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: f8fee1d345e0
The claude-code-action can succeed without producing an execution_file. Add the same non-empty guard used by the triage extraction step, and gate downstream steps (Slack post, artifact upload) on plan_output succeeding rather than just the plan step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 7f2ce308957d
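The non-empty guard mentioned above can be as small as the following sketch. The function name `require_nonempty_file` is hypothetical, and the workflow's actual step may be written differently; the point is only that "the step succeeded" and "the file exists with content" are checked separately.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical guard: fail unless the given path is set and points to a non-empty file.
require_nonempty_file() {
  local path="${1:-}"
  if [[ -z "$path" || ! -s "$path" ]]; then
    echo "error: expected output file is missing or empty: ${path:-<unset>}" >&2
    return 1
  fi
}
```

Downstream steps can then be gated on the outcome of the step that runs this guard rather than on the action's own exit status.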
Revert Slack notification from chat.postMessage API back to the original incoming webhook approach for minimal changes. Remove premature architecture doc, design plan, and README section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: eeb2507079c5
…riage tools Revert e2e.yml jq change to match main. Delete unused run-e2e-triage.sh script. Add `rerun` boolean input to e2e-triage.yml that installs agent CLIs and enables Bash tools for flaky detection via test re-runs. Update plan generation step with EnterPlanMode, Write, and Bash(mise:*) tools. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 2f2c12c4092e
bugbot run
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Shell injection in Python URL-encoding command
  - Replaced bash variable interpolation `'$RUN_URL'` with `os.environ['RUN_URL']` in the Python command to avoid shell injection via crafted URL input.
- ✅ Fixed: Duplicated Slack helper script across two workflows
  - Extracted the duplicated inline Slack helper heredoc from both workflows into a shared `scripts/post-slack-message.sh` script, and moved checkout steps before Slack notification steps so the script is available.
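To make the first fix concrete, here is a small sketch of the safe pattern, wrapped in a hypothetical `urlencode_env` function (the wrapper is for illustration only; the workflow uses an inline command). It requires `python3`. The value travels through the environment, so a quote character inside it is encoded as data instead of terminating a string in the Python source.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Unsafe pattern, shown only as a comment: the URL becomes part of the Python source.
#   python3 -c "import urllib.parse; print(urllib.parse.quote('$RUN_URL', safe=''))"

# Safe pattern: pass the value via the environment and read it with os.environ.
urlencode_env() {
  RUN_URL="$1" python3 -c \
    'import os, urllib.parse; print(urllib.parse.quote(os.environ["RUN_URL"], safe=""))'
}
```

For example, `urlencode_env "x'); print('pwned"` prints `x%27%29%3B%20print%28%27pwned`, whereas the interpolated form would have executed the injected `print`.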
Or push these changes by commenting:
@cursor push dfe21a7917
Preview (dfe21a7917)
diff --git a/.github/workflows/e2e-fix.yml b/.github/workflows/e2e-fix.yml
--- a/.github/workflows/e2e-fix.yml
+++ b/.github/workflows/e2e-fix.yml
@@ -43,38 +43,11 @@
SLACK_CHANNEL: ${{ inputs.slack_channel }}
SLACK_THREAD_TS: ${{ inputs.slack_thread_ts }}
steps:
- - name: Write Slack helper
- if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
- shell: bash
- run: |
- set -euo pipefail
+ - name: Checkout repository
+ uses: actions/checkout@v6
+ with:
+ fetch-depth: 0
- helper="$RUNNER_TEMP/post-slack-message.sh"
- cat > "$helper" <<'EOF'
- #!/usr/bin/env bash
- set -euo pipefail
-
- text="${1:?message is required}"
- payload="$(jq -n \
- --arg channel "$SLACK_CHANNEL" \
- --arg thread_ts "$SLACK_THREAD_TS" \
- --arg text "$text" \
- '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
-
- if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
- -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
- -H 'Content-type: application/json; charset=utf-8' \
- --data "$payload")"; then
- echo "warning: slack notification failed" >&2
- exit 0
- fi
-
- if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
- echo "warning: slack notification returned non-ok response" >&2
- fi
- EOF
- chmod +x "$helper"
-
- name: Post fix started
if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
shell: bash
@@ -83,13 +56,8 @@
run: |
set -euo pipefail
- "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E fix for \`${FAILED_AGENTS}\`."
+ scripts/post-slack-message.sh "Starting E2E fix for \`${FAILED_AGENTS}\`."
- - name: Checkout repository
- uses: actions/checkout@v6
- with:
- fetch-depth: 0
-
- name: Setup mise
uses: jdx/mise-action@v4
@@ -157,7 +125,7 @@
message="E2E fix complete — changes applied but no PR was created. Check the <${RUN_URL}|workflow run> for details."
fi
- "$RUNNER_TEMP/post-slack-message.sh" "$message"
+ scripts/post-slack-message.sh "$message"
- name: Post failure to Slack
if: failure() && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != ''
@@ -169,4 +137,4 @@
message="E2E fix failed. Check the <${RUN_URL}|workflow run> for details."
- "$RUNNER_TEMP/post-slack-message.sh" "$message"
+ scripts/post-slack-message.sh "$message"
diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -129,51 +129,19 @@
matrix:
agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }}
steps:
- - name: Write Slack helper
- if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
- shell: bash
- run: |
- set -euo pipefail
+ - name: Checkout repository
+ uses: actions/checkout@v6
+ with:
+ fetch-depth: 1
- helper="$RUNNER_TEMP/post-slack-message.sh"
- cat > "$helper" <<'EOF'
- #!/usr/bin/env bash
- set -euo pipefail
-
- text="${1:?message is required}"
- payload="$(jq -n \
- --arg channel "$SLACK_CHANNEL" \
- --arg thread_ts "$SLACK_THREAD_TS" \
- --arg text "$text" \
- '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
-
- if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
- -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
- -H 'Content-type: application/json; charset=utf-8' \
- --data "$payload")"; then
- echo "warning: slack notification failed" >&2
- exit 0
- fi
-
- if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
- echo "warning: slack notification returned non-ok response" >&2
- fi
- EOF
- chmod +x "$helper"
-
- name: Post triage started
if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
shell: bash
run: |
set -euo pipefail
- "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>."
+ scripts/post-slack-message.sh "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>."
- - name: Checkout repository
- uses: actions/checkout@v6
- with:
- fetch-depth: 1
-
- name: Setup mise
uses: jdx/mise-action@v4
@@ -317,7 +285,7 @@
message="E2E triage failed for \`$E2E_AGENT\`."
fi
- "$RUNNER_TEMP/post-slack-message.sh" "$message"
+ scripts/post-slack-message.sh "$message"
- name: Post fix plan to Slack
if: steps.plan_output.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != ''
@@ -332,7 +300,7 @@
summary="$(head -20 "$PLAN_FILE" | sed '/^$/d' | head -5)"
# Construct Fix It URL
- encoded_run_url="$(python3 -c "import urllib.parse; print(urllib.parse.quote('$RUN_URL', safe=''))")"
+ encoded_run_url="$(python3 -c "import urllib.parse, os; print(urllib.parse.quote(os.environ['RUN_URL'], safe=''))")"
fix_url="https://e2e-triage.entireio.workers.dev/fix?triage_run_id=${TRIAGE_RUN_ID}&run_url=${encoded_run_url}&failed_agents=${E2E_AGENT}&slack_channel=${SLACK_CHANNEL}&slack_thread_ts=${SLACK_THREAD_TS}"
message="Fix plan ready for \`$E2E_AGENT\`:
@@ -340,7 +308,7 @@
<${fix_url}|Fix It> — applies the plan and creates a draft PR"
- "$RUNNER_TEMP/post-slack-message.sh" "$message"
+ scripts/post-slack-message.sh "$message"
- name: Upload triage output
if: always()
diff --git a/scripts/post-slack-message.sh b/scripts/post-slack-message.sh
new file mode 100644
--- /dev/null
+++ b/scripts/post-slack-message.sh
@@ -1,0 +1,21 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+text="${1:?message is required}"
+payload="$(jq -n \
+ --arg channel "$SLACK_CHANNEL" \
+ --arg thread_ts "$SLACK_THREAD_TS" \
+ --arg text "$text" \
+ '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
+
+if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
+ -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
+ -H 'Content-type: application/json; charset=utf-8' \
+ --data "$payload")"; then
+ echo "warning: slack notification failed" >&2
+ exit 0
+fi
+
+if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
+ echo "warning: slack notification returned non-ok response" >&2
+fi

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Extract duplicated Slack post-message heredoc from e2e-triage.yml and e2e-fix.yml into scripts/post-slack-message.sh. Fix shell injection in Python URL-encoding by reading RUN_URL from os.environ instead of interpolating into the code string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 91d990ef6fb1


Context
When E2E tests fail on `main`, a Slack alert is posted but there's no easy way to kick off triage. This adds a triage workflow that can be triggered via `workflow_dispatch` (manually or via a Cloudflare Worker). Triage results post back to the same Slack thread.

Testing limitations
The `claude-code-action` requires workflows to be on the default branch (`main`) to run. This means the triage and fix workflows cannot be tested end-to-end until this PR is merged. The workflow YAML has been validated for syntax and all unit/integration/canary tests pass.

Follow-up work
- `e2e.yml` to add a "Run Triage" link to the Slack failure alert (separate PR to keep this one focused)
- `e2e.yml` Slack notification is unchanged: still uses the incoming webhook

Summary
- `e2e-triage.yml`: triages E2E failures per agent using `claude-code-action` (read-only analysis by default)
- `e2e-fix.yml`: applies fix plans from triage, runs verification, creates draft PR
- `rerun` toggle to triage workflow: when enabled, installs agent CLIs and re-runs failing tests to detect flaky vs real bugs (costs API tokens)
- `EnterPlanMode`, `Write`, and `Bash(mise:*)` tools
- `chat.postMessage` for Slack thread replies (incoming webhooks don't support `thread_ts`)
- `/e2e:implement` skill

Secret/config changes needed
- `SLACK_BOT_TOKEN`: bot token with `chat:write` scope (for triage/fix thread replies)
- `ANTHROPIC_API_KEY`: already exists (used by claude-code-action)
- `E2E_SLACK_WEBHOOK_URL` is unchanged (used by `e2e.yml` for top-level alerts)

Test plan
- `mise run fmt && mise run lint && mise run test:ci`: all pass (51 canary tests)
- … `main`
- … `e2e-triage.yml` via `workflow_dispatch` with a failed run URL
- … `rerun: true` to verify flaky detection path
- … `SLACK_BOT_TOKEN` secret)

🤖 Generated with Claude Code
Note
Medium Risk
Adds new GitHub Actions workflows that invoke `claude-code-action`, download artifacts, and push branches/create PRs with `contents`/`pull-requests` write permissions; misconfiguration could cause unintended repo writes or noisy Slack posting.

Overview
Introduces an on-demand E2E triage → fix pipeline via new `workflow_dispatch` GitHub Actions.

`e2e-triage.yml` builds an agent matrix (auto-detecting failed agents/SHA from a run URL), runs `/e2e:triage-ci` (optionally re-running tests after installing agent CLIs), persists triage + plan artifacts, and posts threaded Slack updates including a generated Fix It link.

`e2e-fix.yml` consumes those plan artifacts, uses `claude-code-action` to apply the specified fixes, runs `fmt`/`lint`/canary verification, then pushes a `fix/e2e-*` branch and opens a draft PR, with Slack success/failure notifications. Also updates `/e2e:implement` guidance to require entering plan mode before making changes.

Written by Cursor Bugbot for commit 8551955.