From 7ed0393072c0bbdf44c657913770b51e83d54ddc Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:15:36 -0700 Subject: [PATCH 01/32] docs: add slack-triggered e2e triage design --- ...03-17-slack-triggered-e2e-triage-design.md | 221 ++++++++++++++++++ 1 file changed, 221 insertions(+) create mode 100644 docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md diff --git a/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md b/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md new file mode 100644 index 000000000..fe8b1a2e8 --- /dev/null +++ b/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md @@ -0,0 +1,221 @@ +# Slack-Triggered E2E Triage Design + +## Goal + +Allow a human to reply `triage e2e` in the Slack thread for an E2E failure alert and have GitHub Actions run the repo's existing Claude E2E triage workflow based on [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md). + +## Scope + +This design covers: + +- Slack trigger detection for a single exact-match phrase: `triage e2e` +- Hand-off from Slack to GitHub Actions +- A new GitHub Actions workflow that runs triage against an existing failed CI run +- Posting triage status and results back into the originating Slack thread + +This design does not cover: + +- Automatic remediation or code changes +- Running the full E2E fix pipeline +- General-purpose Slack command routing +- Local rerun verification beyond what the existing skill supports for CI run references + +## Existing Context + +The repo already has the core ingredients needed for the triage operation: + +- [`.github/workflows/e2e.yml`](../../.github/workflows/e2e.yml) posts Slack alerts when E2E runs on `main` fail +- [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md) defines the triage procedure +- [`.claude/plugins/e2e/commands/triage-ci.md`](../../.claude/plugins/e2e/commands/triage-ci.md) exposes the procedure as the `/e2e:triage-ci` command +- 
[`scripts/download-e2e-artifacts.sh`](../../scripts/download-e2e-artifacts.sh) already supports artifact download from a GitHub Actions run reference + +The missing piece is the Slack-to-GitHub bridge. + +## Architecture + +The system is composed of three narrow responsibilities: + +1. Slack app + - Listen for new thread replies + - Normalize reply text + - Trigger only when the reply text is exactly `triage e2e` + - Validate that the parent message is an E2E failure alert for this repository + +2. Dispatch bridge + - Read structured data from the parent Slack alert + - Build a `repository_dispatch` payload for this repository + - Send the dispatch event to GitHub + +3. GitHub Action + - Receive the dispatch payload + - Check out the repository at the failed commit SHA + - Install and authenticate Claude CLI + - Load the local plugin directory at [`.claude/plugins/e2e`](../../.claude/plugins/e2e) + - Invoke `/e2e:triage-ci` with the CI run URL and failed agent + - Upload artifacts and post results back to the Slack thread + +This keeps Slack focused on intent capture and routing while GitHub Actions remains the execution environment for triage. + +## Trigger Contract + +The Slack app should send a structured `repository_dispatch` event with custom type `slack_e2e_triage_requested`. 
+ +Recommended payload: + +```json +{ + "trigger_text": "triage e2e", + "repo": "entireio/cli", + "branch": "main", + "sha": "447cde1aeee938448c3edbae78242c950dc35cf0", + "run_url": "https://github.com/entireio/cli/actions/runs/123456789", + "run_id": "123456789", + "failed_agents": ["cursor-cli"], + "slack_channel": "C123456", + "slack_thread_ts": "1742230000.123456", + "slack_user": "U123456" +} +``` + +Workflow-side validation rules: + +- Reject if `trigger_text` is not exactly `triage e2e` +- Reject if `run_url` or `slack_thread_ts` is missing +- Reject if the target repo or branch is unexpected +- Treat `failed_agents` as the source of truth for which agent-specific triage jobs to run + +## Slack Message Requirements + +The current Slack failure notification in [`.github/workflows/e2e.yml`](../../.github/workflows/e2e.yml) already includes the run details link, commit SHA, actor, and failed agent list. That is enough for a first version if the Slack app parses the parent message. + +However, the safer design is to make the alert payload more machine-friendly so the Slack app does not need to scrape display text. Two acceptable options: + +- Add stable metadata in the Slack message text or blocks for `run_url`, `sha`, and `failed_agents` +- Store a compact JSON blob in a Slack block element or message metadata if the chosen Slack app framework supports it + +The first version can parse the existing message format, but the implementation should isolate that parsing into one small component because it is brittle compared to a structured payload. + +## GitHub Workflow Design + +Add a new workflow at [`.github/workflows/e2e-triage.yml`](../../.github/workflows/e2e-triage.yml). + +### Triggers + +- `repository_dispatch` with type `slack_e2e_triage_requested` +- `workflow_dispatch` for manual testing and debugging + +### High-Level Job Flow + +1. Validate dispatch payload +2. Post "triage started" reply to the Slack thread +3. 
Check out repository at the failed `sha` +4. Set up `mise` +5. Install Claude CLI and any required dependencies +6. Authenticate Claude using a GitHub Actions secret +7. Run the E2E triage command: + +```bash +claude --plugin-dir .claude/plugins/e2e -p "/e2e:triage-ci <run-url> --agent <agent>" +``` + +8. Capture output to files for artifact upload +9. Post a Slack thread reply with a concise summary and a link to the triage workflow run +10. Upload triage artifacts regardless of success or failure + +### Agent Fan-Out + +If the alert has multiple failed agents, the workflow should fan out one matrix job per failed agent. This keeps results isolated and simplifies failure attribution in Slack and in GitHub artifacts. + +### Concurrency + +Use concurrency keyed by CI `run_id` or Slack thread timestamp so repeated `triage e2e` replies do not start duplicate work for the same failure thread. + +## Invocation Model + +This design intentionally uses the existing CI-run path in [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md): + +- The workflow passes the original GitHub Actions run URL to `/e2e:triage-ci` +- The skill downloads artifacts via [`scripts/download-e2e-artifacts.sh`](../../scripts/download-e2e-artifacts.sh) +- The triage workflow analyzes the failed run's artifacts instead of starting fresh E2E reruns + +That keeps cost and runtime bounded for the first version. + +If local rerun verification is later required for Slack-triggered triage, that should be added as a deliberate extension to the workflow and possibly to the skill behavior for CI-driven contexts. + +## Slack Responses + +Recommended thread messages: + +- Start: + - `Starting E2E triage for cursor-cli from CI run <run-url>.` +- Success: + - Short classification summary per agent and a link to the GitHub triage workflow +- Failure: + - Short failure reason and a link to the GitHub triage workflow + +Slack replies should stay short. 
The full triage report belongs in workflow logs and uploaded artifacts. + +## Error Handling + +### Slack App + +- Ignore non-thread replies +- Ignore messages whose normalized text is not exactly `triage e2e` +- Refuse to trigger if the parent message is not recognized as an E2E failure alert from this repository +- Reply in-thread with a short failure message if dispatch fails + +### GitHub Workflow + +- Fail fast on malformed dispatch payloads +- Fail with a clear Slack reply if checkout or Claude setup fails +- Fail with a clear Slack reply if CI artifact download fails +- Always upload raw triage output as artifacts + +## Security + +The Slack app should use a GitHub token scoped only to dispatch workflows on this repository. + +The workflow should: + +- Use the minimum required GitHub permissions +- Store Claude authentication in GitHub Actions secrets +- Avoid echoing secrets or full auth state into logs + +The Slack app should validate Slack request signatures before processing events. + +## Testing Strategy + +### Slack App + +- Unit test normalization for exact-match `triage e2e` +- Unit test parent-message validation +- Unit test extraction of `run_url`, `sha`, and `failed_agents` +- Unit test dispatch payload construction + +### GitHub Workflow + +- Add `workflow_dispatch` inputs mirroring the dispatch payload for manual testing +- Smoke test against a known failed E2E run URL +- Verify success path posts to Slack thread +- Verify invalid payload path exits early and reports clearly + +### Non-Goals for Testing + +- Do not run real E2E reruns as part of this workflow +- Do not test code-fixing behavior in this first version + +## Recommended Implementation Order + +1. Add the new GitHub Actions workflow with manual `workflow_dispatch` +2. Prove the workflow can run `/e2e:triage-ci` against a known failed CI run URL +3. Add Slack thread notification hooks for started/succeeded/failed states +4. 
Build the Slack app that validates the thread reply and sends `repository_dispatch` +5. Tighten the original E2E Slack alert format if parsing proves brittle + +## Open Decisions Resolved + +- Trigger phrase: exact match `triage e2e` +- Execution environment: GitHub Actions +- Triage source of truth: [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md) +- Invocation surface: [`.claude/plugins/e2e/commands/triage-ci.md`](../../.claude/plugins/e2e/commands/triage-ci.md) +- Initial scope: artifact-based triage of an existing failed CI run, not automatic fixing From 7acb6cd2025a6dde6a36eaba59ec3ad25a896665 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:22:29 -0700 Subject: [PATCH 02/32] ci: add structured metadata to e2e slack alerts --- .github/workflows/e2e.yml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index 18d39a5ff..98246ea14 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -125,7 +125,9 @@ jobs: run: | failed=$(gh api repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs \ --jq '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent] | join(", ")') + failed_csv="${failed//, /,}" echo "agents=$failed" >> "$GITHUB_OUTPUT" + echo "agents_csv=$failed_csv" >> "$GITHUB_OUTPUT" - name: Notify Slack of E2E failure uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a # v2.1.1 @@ -145,6 +147,10 @@ jobs: { "type": "context", "elements": [ + { + "type": "mrkdwn", + "text": "meta: repo=${{ github.repository }} branch=${{ github.ref_name }} run_id=${{ github.run_id }} run_url=${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} sha=${{ github.sha }} agents=${{ steps.failed.outputs.agents_csv }}" + }, { "type": "mrkdwn", "text": "Commit: <${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}> by 
${{ github.actor }}" From c06183f904e8d3c62fcaacdceaf3043b7d4efbe6 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:29:58 -0700 Subject: [PATCH 03/32] ci: add e2e triage runner script --- scripts/run-e2e-triage.sh | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100755 scripts/run-e2e-triage.sh diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh new file mode 100755 index 000000000..d64aefd5a --- /dev/null +++ b/scripts/run-e2e-triage.sh @@ -0,0 +1,13 @@ +#!/usr/bin/env bash + +set -euo pipefail + +: "${RUN_URL:?RUN_URL is required}" +: "${E2E_AGENT:?E2E_AGENT is required}" +: "${TRIAGE_OUTPUT_FILE:?TRIAGE_OUTPUT_FILE is required}" + +mkdir -p "$(dirname "$TRIAGE_OUTPUT_FILE")" + +claude --plugin-dir .claude/plugins/e2e \ + -p "/e2e:triage-ci ${RUN_URL} --agent ${E2E_AGENT}" \ + 2>&1 | tee "$TRIAGE_OUTPUT_FILE" From 0f8a8350b888cef92ad313b34bd2327e5dddd0cf Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:32:00 -0700 Subject: [PATCH 04/32] ci: add e2e triage workflow --- .github/workflows/e2e-triage.yml | 133 +++++++++++++++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 .github/workflows/e2e-triage.yml diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml new file mode 100644 index 000000000..54be3ca4a --- /dev/null +++ b/.github/workflows/e2e-triage.yml @@ -0,0 +1,133 @@ +name: E2E Triage + +on: + repository_dispatch: + types: + - slack_e2e_triage_requested + workflow_dispatch: + inputs: + run_url: + description: GitHub Actions run URL to triage + required: true + type: string + sha: + description: Commit SHA that failed + required: true + type: string + failed_agents: + description: Comma-separated list of failed agents + required: true + type: string + slack_channel: + description: Slack channel ID for the originating thread + required: false + type: string + slack_thread_ts: + description: Slack thread timestamp for replies + 
required: false + type: string + +permissions: + actions: read + contents: read + +jobs: + matrix-setup: + runs-on: ubuntu-latest + outputs: + agents: ${{ steps.set.outputs.agents }} + run_url: ${{ steps.set.outputs.run_url }} + sha: ${{ steps.set.outputs.sha }} + slack_channel: ${{ steps.set.outputs.slack_channel }} + slack_thread_ts: ${{ steps.set.outputs.slack_thread_ts }} + steps: + - name: Validate payload and build matrix + id: set + shell: bash + env: + EVENT_NAME: ${{ github.event_name }} + RUN_URL_INPUT: ${{ inputs.run_url }} + SHA_INPUT: ${{ inputs.sha }} + FAILED_AGENTS_INPUT: ${{ inputs.failed_agents }} + SLACK_CHANNEL_INPUT: ${{ inputs.slack_channel }} + SLACK_THREAD_TS_INPUT: ${{ inputs.slack_thread_ts }} + run: | + set -euo pipefail + + if [ "$EVENT_NAME" = "workflow_dispatch" ]; then + run_url="$RUN_URL_INPUT" + sha="$SHA_INPUT" + slack_channel="${SLACK_CHANNEL_INPUT:-}" + slack_thread_ts="${SLACK_THREAD_TS_INPUT:-}" + failed_agents_raw="$FAILED_AGENTS_INPUT" + agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" + else + run_url="$(jq -r '.client_payload.run_url // empty' "$GITHUB_EVENT_PATH")" + sha="$(jq -r '.client_payload.sha // empty' "$GITHUB_EVENT_PATH")" + slack_channel="$(jq -r '.client_payload.slack_channel // empty' "$GITHUB_EVENT_PATH")" + slack_thread_ts="$(jq -r '.client_payload.slack_thread_ts // empty' "$GITHUB_EVENT_PATH")" + if jq -e '.client_payload.failed_agents | type == "array"' "$GITHUB_EVENT_PATH" >/dev/null 2>&1; then + agents_json="$(jq -c '.client_payload.failed_agents' "$GITHUB_EVENT_PATH")" + else + failed_agents_raw="$(jq -r '.client_payload.failed_agents // empty' "$GITHUB_EVENT_PATH")" + agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" + fi + fi + + if [ -z "$run_url" ]; then + echo "run_url is required" >&2 + exit 1 + fi + if [ -z "$sha" ]; then + echo 
"sha is required" >&2 + exit 1 + fi + if [ -z "$agents_json" ] || [ "$agents_json" = "[]" ]; then + echo "failed_agents is required" >&2 + exit 1 + fi + + echo "run_url=$run_url" >> "$GITHUB_OUTPUT" + echo "sha=$sha" >> "$GITHUB_OUTPUT" + echo "slack_channel=$slack_channel" >> "$GITHUB_OUTPUT" + echo "slack_thread_ts=$slack_thread_ts" >> "$GITHUB_OUTPUT" + echo "agents=$agents_json" >> "$GITHUB_OUTPUT" + + triage: + needs: [matrix-setup] + runs-on: ubuntu-latest + timeout-minutes: 45 + strategy: + fail-fast: false + matrix: + agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }} + steps: + - name: Checkout repository + uses: actions/checkout@v6 + with: + ref: ${{ needs.matrix-setup.outputs.sha }} + + - name: Setup mise + uses: jdx/mise-action@v4 + + - name: Install Claude CLI + run: | + curl -fsSL https://claude.ai/install.sh | bash + echo "$HOME/.local/bin" >> "$GITHUB_PATH" + + - name: Run triage + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + GH_TOKEN: ${{ github.token }} + RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} + E2E_AGENT: ${{ matrix.agent }} + TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log + run: scripts/run-e2e-triage.sh + + - name: Upload triage output + if: always() + uses: actions/upload-artifact@v7 + with: + name: e2e-triage-${{ matrix.agent }} + path: e2e-triage-artifacts/ + retention-days: 7 From 16bd3518429cd2a15cd77b084f0eb49d2d608363 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:35:14 -0700 Subject: [PATCH 05/32] ci: harden e2e triage dispatch validation --- .github/workflows/e2e-triage.yml | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 54be3ca4a..1128ae7bc 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -46,6 +46,7 @@ jobs: shell: bash env: EVENT_NAME: ${{ github.event_name }} + REPO_NAME: ${{ 
github.repository }} RUN_URL_INPUT: ${{ inputs.run_url }} SHA_INPUT: ${{ inputs.sha }} FAILED_AGENTS_INPUT: ${{ inputs.failed_agents }} @@ -62,6 +63,9 @@ jobs: failed_agents_raw="$FAILED_AGENTS_INPUT" agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" else + trigger_text="$(jq -r '.client_payload.trigger_text // empty' "$GITHUB_EVENT_PATH")" + repo="$(jq -r '.client_payload.repo // empty' "$GITHUB_EVENT_PATH")" + branch="$(jq -r '.client_payload.branch // empty' "$GITHUB_EVENT_PATH")" run_url="$(jq -r '.client_payload.run_url // empty' "$GITHUB_EVENT_PATH")" sha="$(jq -r '.client_payload.sha // empty' "$GITHUB_EVENT_PATH")" slack_channel="$(jq -r '.client_payload.slack_channel // empty' "$GITHUB_EVENT_PATH")" @@ -72,6 +76,19 @@ jobs: failed_agents_raw="$(jq -r '.client_payload.failed_agents // empty' "$GITHUB_EVENT_PATH")" agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" fi + + if [ "$trigger_text" != "triage e2e" ]; then + echo "trigger_text must be exactly 'triage e2e'" >&2 + exit 1 + fi + if [ "$repo" != "$REPO_NAME" ]; then + echo "repo must match $REPO_NAME" >&2 + exit 1 + fi + if [ "$branch" != "main" ]; then + echo "branch must be main" >&2 + exit 1 + fi fi if [ -z "$run_url" ]; then @@ -106,6 +123,7 @@ jobs: uses: actions/checkout@v6 with: ref: ${{ needs.matrix-setup.outputs.sha }} + fetch-depth: 0 - name: Setup mise uses: jdx/mise-action@v4 From 9666eadcc62a6b6ae0b71b21f130717fe74f80b4 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:38:05 -0700 Subject: [PATCH 06/32] ci: add slack replies to e2e triage workflow --- .github/workflows/e2e-triage.yml | 146 +++++++++++++++++++++++++++++++ 1 file changed, 146 insertions(+) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 1128ae7bc..d5523ecfd 100644 --- 
a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -114,11 +114,48 @@ jobs: needs: [matrix-setup] runs-on: ubuntu-latest timeout-minutes: 45 + env: + SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} + SLACK_CHANNEL: ${{ needs.matrix-setup.outputs.slack_channel }} + SLACK_THREAD_TS: ${{ needs.matrix-setup.outputs.slack_thread_ts }} strategy: fail-fast: false matrix: agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }} steps: + - name: Post triage started + if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + env: + RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} + run: | + set -euo pipefail + + post_slack_message() { + local text="$1" + local payload response + + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack start notification failed" >&2 + return 0 + fi + + if ! jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack start notification returned non-ok response" >&2 + fi + } + + post_slack_message "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>." 
+ - name: Checkout repository uses: actions/checkout@v6 with: @@ -134,6 +171,7 @@ jobs: echo "$HOME/.local/bin" >> "$GITHUB_PATH" - name: Run triage + id: triage env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GH_TOKEN: ${{ github.token }} @@ -142,6 +180,114 @@ jobs: TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log run: scripts/run-e2e-triage.sh + - name: Summarize triage output + id: summary + if: always() + shell: bash + env: + TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log + run: | + set -euo pipefail + + summary="" + if [ -f "$TRIAGE_OUTPUT_FILE" ]; then + summary="$( + (grep '^## ' "$TRIAGE_OUTPUT_FILE" 2>/dev/null | head -n 3 | sed 's/^## //' | awk ' + NF { + if (out != "") { + out = out " | " + } + out = out $0 + } + END { + print out + } + ') || true + )" + fi + + { + echo 'summary<<EOF' + echo "$summary" + echo EOF + } >> "$GITHUB_OUTPUT" + + - name: Post triage success + if: ${{ always() && steps.triage.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + env: + TRIAGE_SUMMARY: ${{ steps.summary.outputs.summary }} + run: | + set -euo pipefail + + post_slack_message() { + local text="$1" + local payload response + + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack success notification failed" >&2 + return 0 + fi + + if ! jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack success notification returned non-ok response" >&2 + fi + } + + message="E2E triage complete for \`$E2E_AGENT\`." 
+ if [ -n "$TRIAGE_SUMMARY" ]; then + message="$message $TRIAGE_SUMMARY" + fi + + post_slack_message "$message" + + - name: Post triage failure + if: ${{ always() && steps.triage.outcome == 'failure' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + env: + TRIAGE_SUMMARY: ${{ steps.summary.outputs.summary }} + run: | + set -euo pipefail + + post_slack_message() { + local text="$1" + local payload response + + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack failure notification failed" >&2 + return 0 + fi + + if ! jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack failure notification returned non-ok response" >&2 + fi + } + + message="E2E triage failed for \`$E2E_AGENT\`." 
+ if [ -n "$TRIAGE_SUMMARY" ]; then + message="$message $TRIAGE_SUMMARY" + fi + + post_slack_message "$message" + - name: Upload triage output if: always() uses: actions/upload-artifact@v7 From f8a13d920e89dade4019e171c1b3c50c572f7599 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:39:07 -0700 Subject: [PATCH 07/32] ci: fix slack triage start notification env --- .github/workflows/e2e-triage.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index d5523ecfd..024c3591a 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -128,6 +128,7 @@ jobs: shell: bash env: RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} + E2E_AGENT: ${{ matrix.agent }} run: | set -euo pipefail From b9194721d1871819b095ee54be9d005af882fb4d Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:41:06 -0700 Subject: [PATCH 08/32] ci: dedupe e2e triage slack notifications --- .github/workflows/e2e-triage.yml | 130 ++++++++++--------------------- 1 file changed, 40 insertions(+), 90 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 024c3591a..aa3637894 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -115,6 +115,8 @@ jobs: runs-on: ubuntu-latest timeout-minutes: 45 env: + RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} + E2E_AGENT: ${{ matrix.agent }} SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} SLACK_CHANNEL: ${{ needs.matrix-setup.outputs.slack_channel }} SLACK_THREAD_TS: ${{ needs.matrix-setup.outputs.slack_thread_ts }} @@ -123,39 +125,45 @@ jobs: matrix: agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }} steps: - - name: Post triage started + - name: Write Slack helper if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} shell: bash - env: - RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} - E2E_AGENT: 
${{ matrix.agent }} run: | set -euo pipefail - post_slack_message() { - local text="$1" - local payload response + helper="$RUNNER_TEMP/post-slack-message.sh" + cat > "$helper" <<'EOF' + #!/usr/bin/env bash + set -euo pipefail - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg thread_ts "$SLACK_THREAD_TS" \ - --arg text "$text" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + text="${1:?message is required}" + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack notification failed" >&2 + exit 0 + fi - if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")"; then - echo "warning: slack start notification failed" >&2 - return 0 - fi + if ! jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack notification returned non-ok response" >&2 + fi + EOF + chmod +x "$helper" - if ! jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "warning: slack start notification returned non-ok response" >&2 - fi - } + - name: Post triage started + if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + run: | + set -euo pipefail - post_slack_message "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>." + "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>." 
- name: Checkout repository uses: actions/checkout@v6 @@ -176,8 +184,6 @@ jobs: env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GH_TOKEN: ${{ github.token }} - RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} - E2E_AGENT: ${{ matrix.agent }} TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log run: scripts/run-e2e-triage.sh @@ -213,81 +219,25 @@ jobs: echo EOF } >> "$GITHUB_OUTPUT" - - name: Post triage success - if: ${{ always() && steps.triage.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + - name: Post triage completion + if: ${{ always() && (steps.triage.outcome == 'success' || steps.triage.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} shell: bash env: + TRIAGE_OUTCOME: ${{ steps.triage.outcome }} TRIAGE_SUMMARY: ${{ steps.summary.outputs.summary }} run: | set -euo pipefail - post_slack_message() { - local text="$1" - local payload response - - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg thread_ts "$SLACK_THREAD_TS" \ - --arg text "$text" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" - - if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")"; then - echo "warning: slack success notification failed" >&2 - return 0 - fi - - if ! jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "warning: slack success notification returned non-ok response" >&2 - fi - } - - message="E2E triage complete for \`$E2E_AGENT\`." - if [ -n "$TRIAGE_SUMMARY" ]; then - message="$message $TRIAGE_SUMMARY" + if [ "$TRIAGE_OUTCOME" = "success" ]; then + message="E2E triage complete for \`$E2E_AGENT\`." + else + message="E2E triage failed for \`$E2E_AGENT\`." 
fi - - post_slack_message "$message" - - - name: Post triage failure - if: ${{ always() && steps.triage.outcome == 'failure' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} - shell: bash - env: - TRIAGE_SUMMARY: ${{ steps.summary.outputs.summary }} - run: | - set -euo pipefail - - post_slack_message() { - local text="$1" - local payload response - - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg thread_ts "$SLACK_THREAD_TS" \ - --arg text "$text" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" - - if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")"; then - echo "warning: slack failure notification failed" >&2 - return 0 - fi - - if ! jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "warning: slack failure notification returned non-ok response" >&2 - fi - } - - message="E2E triage failed for \`$E2E_AGENT\`." 
if [ -n "$TRIAGE_SUMMARY" ]; then message="$message $TRIAGE_SUMMARY" fi - post_slack_message "$message" + "$RUNNER_TEMP/post-slack-message.sh" "$message" - name: Upload triage output if: always() From 04541ff12be782e9b60d1f72df1734c5da1302c1 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:44:51 -0700 Subject: [PATCH 09/32] feat: add slack triage parsing helpers --- internal/slacktriage/dispatch.go | 34 ++++++++ internal/slacktriage/dispatch_test.go | 42 ++++++++++ internal/slacktriage/normalize.go | 15 ++++ internal/slacktriage/normalize_test.go | 57 +++++++++++++ internal/slacktriage/parent_message.go | 91 +++++++++++++++++++++ internal/slacktriage/parent_message_test.go | 51 ++++++++++++ 6 files changed, 290 insertions(+) create mode 100644 internal/slacktriage/dispatch.go create mode 100644 internal/slacktriage/dispatch_test.go create mode 100644 internal/slacktriage/normalize.go create mode 100644 internal/slacktriage/normalize_test.go create mode 100644 internal/slacktriage/parent_message.go create mode 100644 internal/slacktriage/parent_message_test.go diff --git a/internal/slacktriage/dispatch.go b/internal/slacktriage/dispatch.go new file mode 100644 index 000000000..a70ee0e30 --- /dev/null +++ b/internal/slacktriage/dispatch.go @@ -0,0 +1,34 @@ +package slacktriage + +// DispatchPayload is the structured payload sent to GitHub repository_dispatch. +type DispatchPayload struct { + TriggerText string `json:"trigger_text"` + Repo string `json:"repo"` + Branch string `json:"branch"` + SHA string `json:"sha"` + RunURL string `json:"run_url"` + RunID string `json:"run_id"` + FailedAgents []string `json:"failed_agents"` + SlackChannel string `json:"slack_channel"` + SlackThreadTS string `json:"slack_thread_ts"` + SlackUser string `json:"slack_user"` +} + +// NewDispatchPayload creates a pure data payload for the repository_dispatch bridge. 
+func NewDispatchPayload(meta ParentMessageMetadata, slackChannel, slackThreadTS, slackUser string) DispatchPayload { + failedAgents := make([]string, len(meta.FailedAgents)) + copy(failedAgents, meta.FailedAgents) + + return DispatchPayload{ + TriggerText: TriageTriggerText, + Repo: meta.Repo, + Branch: meta.Branch, + SHA: meta.SHA, + RunURL: meta.RunURL, + RunID: meta.RunID, + FailedAgents: failedAgents, + SlackChannel: slackChannel, + SlackThreadTS: slackThreadTS, + SlackUser: slackUser, + } +} diff --git a/internal/slacktriage/dispatch_test.go b/internal/slacktriage/dispatch_test.go new file mode 100644 index 000000000..06cced40b --- /dev/null +++ b/internal/slacktriage/dispatch_test.go @@ -0,0 +1,42 @@ +package slacktriage + +import "testing" + +func TestNewDispatchPayload(t *testing.T) { + t.Parallel() + + meta := ParentMessageMetadata{ + Repo: "entireio/cli", + Branch: "main", + RunID: "123", + RunURL: "https://github.com/entireio/cli/actions/runs/123", + SHA: "abc123", + FailedAgents: []string{"cursor-cli", "copilot-cli"}, + } + + got := NewDispatchPayload(meta, "C123", "1742230000.123456", "U456") + + if got.TriggerText != TriageTriggerText { + t.Fatalf("TriggerText = %q, want %q", got.TriggerText, TriageTriggerText) + } + if got.Repo != meta.Repo || got.Branch != meta.Branch || got.RunID != meta.RunID || got.RunURL != meta.RunURL || got.SHA != meta.SHA { + t.Fatalf("payload metadata mismatch: got %+v want %+v", got, meta) + } + if got.SlackChannel != "C123" { + t.Fatalf("SlackChannel = %q, want %q", got.SlackChannel, "C123") + } + if got.SlackThreadTS != "1742230000.123456" { + t.Fatalf("SlackThreadTS = %q, want %q", got.SlackThreadTS, "1742230000.123456") + } + if got.SlackUser != "U456" { + t.Fatalf("SlackUser = %q, want %q", got.SlackUser, "U456") + } + if len(got.FailedAgents) != len(meta.FailedAgents) { + t.Fatalf("FailedAgents len = %d, want %d", len(got.FailedAgents), len(meta.FailedAgents)) + } + for i := range meta.FailedAgents { + if 
got.FailedAgents[i] != meta.FailedAgents[i] { + t.Fatalf("FailedAgents[%d] = %q, want %q", i, got.FailedAgents[i], meta.FailedAgents[i]) + } + } +} diff --git a/internal/slacktriage/normalize.go b/internal/slacktriage/normalize.go new file mode 100644 index 000000000..2469fcda5 --- /dev/null +++ b/internal/slacktriage/normalize.go @@ -0,0 +1,15 @@ +package slacktriage + +import "strings" + +const TriageTriggerText = "triage e2e" + +// NormalizeTrigger lowercases, trims, and collapses internal whitespace. +func NormalizeTrigger(text string) string { + return strings.Join(strings.Fields(strings.ToLower(text)), " ") +} + +// IsTriageTrigger reports whether text normalizes to the triage trigger phrase. +func IsTriageTrigger(text string) bool { + return NormalizeTrigger(text) == TriageTriggerText +} diff --git a/internal/slacktriage/normalize_test.go b/internal/slacktriage/normalize_test.go new file mode 100644 index 000000000..a9e3d205e --- /dev/null +++ b/internal/slacktriage/normalize_test.go @@ -0,0 +1,57 @@ +package slacktriage + +import "testing" + +func TestNormalizeTrigger(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + in string + want string + }{ + { + name: "preserves_exact_trigger", + in: "triage e2e", + want: "triage e2e", + }, + { + name: "lowercases_and_trims", + in: " Triage E2E ", + want: "triage e2e", + }, + { + name: "collapses_internal_whitespace", + in: "triage e2e", + want: "triage e2e", + }, + } + + for _, tt := range tests { + tt := tt + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + if got := NormalizeTrigger(tt.in); got != tt.want { + t.Fatalf("NormalizeTrigger(%q) = %q, want %q", tt.in, got, tt.want) + } + }) + } +} + +func TestIsTriageTrigger(t *testing.T) { + t.Parallel() + + if !IsTriageTrigger(" Triage E2E ") { + t.Fatal("expected normalized trigger to match") + } + + for _, in := range []string{"triage", "triage e2e now", "triage-e2e"} { + in := in + t.Run(in, func(t *testing.T) { + t.Parallel() + if 
IsTriageTrigger(in) { + t.Fatalf("IsTriageTrigger(%q) = true, want false", in) + } + }) + } +} diff --git a/internal/slacktriage/parent_message.go b/internal/slacktriage/parent_message.go new file mode 100644 index 000000000..23d3b2483 --- /dev/null +++ b/internal/slacktriage/parent_message.go @@ -0,0 +1,91 @@ +package slacktriage + +import ( + "errors" + "fmt" + "strings" +) + +// ParentMessageMetadata captures the parsed machine-readable Slack alert metadata. +type ParentMessageMetadata struct { + Repo string + Branch string + RunID string + RunURL string + SHA string + FailedAgents []string +} + +// ParseParentMessageMetadata extracts the stable meta line from a Slack failure alert. +func ParseParentMessageMetadata(body string) (ParentMessageMetadata, error) { + const metaPrefix = "meta:" + + for _, line := range strings.Split(body, "\n") { + trimmed := strings.TrimSpace(line) + if !strings.HasPrefix(trimmed, metaPrefix) { + continue + } + + fields := strings.Fields(strings.TrimSpace(strings.TrimPrefix(trimmed, metaPrefix))) + values := make(map[string]string, len(fields)) + for _, field := range fields { + key, value, ok := strings.Cut(field, "=") + if !ok || key == "" || value == "" { + return ParentMessageMetadata{}, fmt.Errorf("invalid meta field %q", field) + } + if _, exists := values[key]; exists { + return ParentMessageMetadata{}, fmt.Errorf("duplicate meta field %q", key) + } + values[key] = value + } + + metadata := ParentMessageMetadata{ + Repo: values["repo"], + Branch: values["branch"], + RunID: values["run_id"], + RunURL: values["run_url"], + SHA: values["sha"], + } + if agents, ok := values["agents"]; ok && agents != "" { + metadata.FailedAgents = splitAndTrimCSV(agents) + } + + if err := metadata.validate(); err != nil { + return ParentMessageMetadata{}, err + } + return metadata, nil + } + + return ParentMessageMetadata{}, errors.New("meta line not found") +} + +func (m ParentMessageMetadata) validate() error { + switch { + case m.Repo == "": + 
return errors.New("repo is required") + case m.Branch == "": + return errors.New("branch is required") + case m.RunID == "": + return errors.New("run_id is required") + case m.RunURL == "": + return errors.New("run_url is required") + case m.SHA == "": + return errors.New("sha is required") + case len(m.FailedAgents) == 0: + return errors.New("failed_agents is required") + default: + return nil + } +} + +func splitAndTrimCSV(value string) []string { + parts := strings.Split(value, ",") + out := make([]string, 0, len(parts)) + for _, part := range parts { + trimmed := strings.TrimSpace(part) + if trimmed != "" { + out = append(out, trimmed) + } + } + return out +} diff --git a/internal/slacktriage/parent_message_test.go b/internal/slacktriage/parent_message_test.go new file mode 100644 index 000000000..1c42766fb --- /dev/null +++ b/internal/slacktriage/parent_message_test.go @@ -0,0 +1,51 @@ +package slacktriage + +import ( + "testing" +) + +func TestParseParentMessageMetadata(t *testing.T) { + t.Parallel() + + body := "E2E Tests Failed on `main`\n\nFailed agents: *cursor-cli*\n\nmeta: repo=entireio/cli branch=main run_id=123 run_url=https://github.com/entireio/cli/actions/runs/123 sha=abc123 agents=cursor-cli,copilot-cli\nCommit: by alisha" + + got, err := ParseParentMessageMetadata(body) + if err != nil { + t.Fatalf("ParseParentMessageMetadata() error = %v", err) + } + + wantAgents := []string{"cursor-cli", "copilot-cli"} + if got.Repo != "entireio/cli" { + t.Fatalf("Repo = %q, want %q", got.Repo, "entireio/cli") + } + if got.Branch != "main" { + t.Fatalf("Branch = %q, want %q", got.Branch, "main") + } + if got.RunID != "123" { + t.Fatalf("RunID = %q, want %q", got.RunID, "123") + } + if got.RunURL != "https://github.com/entireio/cli/actions/runs/123" { + t.Fatalf("RunURL = %q, want %q", got.RunURL, "https://github.com/entireio/cli/actions/runs/123") + } + if got.SHA != "abc123" { + t.Fatalf("SHA = %q, want %q", got.SHA, "abc123") + } + if len(got.FailedAgents) != 
len(wantAgents) { + t.Fatalf("FailedAgents len = %d, want %d", len(got.FailedAgents), len(wantAgents)) + } + for i := range wantAgents { + if got.FailedAgents[i] != wantAgents[i] { + t.Fatalf("FailedAgents[%d] = %q, want %q", i, got.FailedAgents[i], wantAgents[i]) + } + } +} + +func TestParseParentMessageMetadata_IgnoresHumanReadableBody(t *testing.T) { + t.Parallel() + + body := "E2E Tests Failed on `main`\n\nFailed agents: *cursor-cli*\n\nCommit: by alisha" + + if _, err := ParseParentMessageMetadata(body); err == nil { + t.Fatal("ParseParentMessageMetadata() error = nil, want error") + } +} From 5f0adc96bfff8fd897a547448ccb8b09a113007b Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:55:49 -0700 Subject: [PATCH 10/32] feat: add slack app for e2e triage dispatch --- cmd/e2e-triage-dispatch/main.go | 419 +++++++++++++++++++++++++++ cmd/e2e-triage-dispatch/main_test.go | 350 ++++++++++++++++++++++ 2 files changed, 769 insertions(+) create mode 100644 cmd/e2e-triage-dispatch/main.go create mode 100644 cmd/e2e-triage-dispatch/main_test.go diff --git a/cmd/e2e-triage-dispatch/main.go b/cmd/e2e-triage-dispatch/main.go new file mode 100644 index 000000000..a1d210b8c --- /dev/null +++ b/cmd/e2e-triage-dispatch/main.go @@ -0,0 +1,419 @@ +package main + +import ( + "bytes" + "context" + "crypto/hmac" + "crypto/sha256" + "encoding/hex" + "encoding/json" + "errors" + "fmt" + "io" + "log" + "net/http" + "net/url" + "os" + "strconv" + "strings" + "time" + + "github.com/entireio/cli/internal/slacktriage" +) + +const ( + defaultAddr = ":8080" + defaultGitHubAPIBaseURL = "https://api.github.com" + defaultSlackAPIBaseURL = "https://slack.com/api" + defaultSlackEventType = "slack_e2e_triage_requested" + defaultRequestTolerance = 5 * time.Minute + slackTimestampHeader = "X-Slack-Request-Timestamp" + slackSignatureHeader = "X-Slack-Signature" + slackEventTypeURLVerify = "url_verification" + slackEventTypeCallback = "event_callback" + 
slackInnerEventTypeMessage = "message" +) + +// Config holds runtime settings loaded from the environment. +type Config struct { + Addr string + SigningSecret string + SlackBotToken string + GitHubToken string + AllowedRepo string + GitHubEventType string + SlackAPIBaseURL string + GitHubAPIBaseURL string + RequestTolerance time.Duration +} + +func main() { + cfg, err := loadConfigFromEnv() + if err != nil { + log.Fatal(err) + } + + handler := newHandler( + cfg, + newSlackHTTPClient(cfg.SlackBotToken, cfg.SlackAPIBaseURL), + newGitHubHTTPDispatcher(cfg.GitHubToken, cfg.GitHubAPIBaseURL, cfg.GitHubEventType, cfg.AllowedRepo), + time.Now, + ) + + mux := http.NewServeMux() + mux.Handle("/slack/events", handler) + + log.Fatal(http.ListenAndServe(cfg.Addr, mux)) +} + +func loadConfigFromEnv() (Config, error) { + cfg := Config{ + Addr: getEnvDefault("ADDR", defaultAddr), + SigningSecret: os.Getenv("SLACK_SIGNING_SECRET"), + SlackBotToken: os.Getenv("SLACK_BOT_TOKEN"), + GitHubToken: os.Getenv("GITHUB_TOKEN"), + AllowedRepo: getEnvFirst("ALLOWED_REPOSITORY", "GITHUB_REPOSITORY"), + GitHubEventType: getEnvDefault("GITHUB_EVENT_TYPE", defaultSlackEventType), + SlackAPIBaseURL: getEnvDefault("SLACK_API_BASE_URL", defaultSlackAPIBaseURL), + GitHubAPIBaseURL: getEnvDefault("GITHUB_API_BASE_URL", defaultGitHubAPIBaseURL), + RequestTolerance: defaultRequestTolerance, + } + + if tolerance := os.Getenv("SLACK_REQUEST_TOLERANCE"); tolerance != "" { + d, err := time.ParseDuration(tolerance) + if err != nil { + return Config{}, fmt.Errorf("parse SLACK_REQUEST_TOLERANCE: %w", err) + } + cfg.RequestTolerance = d + } + + switch { + case cfg.SigningSecret == "": + return Config{}, errors.New("SLACK_SIGNING_SECRET is required") + case cfg.SlackBotToken == "": + return Config{}, errors.New("SLACK_BOT_TOKEN is required") + case cfg.GitHubToken == "": + return Config{}, errors.New("GITHUB_TOKEN is required") + case cfg.AllowedRepo == "": + return Config{}, errors.New("ALLOWED_REPOSITORY or 
GITHUB_REPOSITORY is required") + default: + return cfg, nil + } +} + +func getEnvDefault(key, fallback string) string { + if value := os.Getenv(key); value != "" { + return value + } + return fallback +} + +func getEnvFirst(keys ...string) string { + for _, key := range keys { + if value := os.Getenv(key); value != "" { + return value + } + } + return "" +} + +type SlackMessageFetcher interface { + FetchParentMessage(ctx context.Context, channel, threadTS string) (string, error) +} + +type GitHubDispatcher interface { + DispatchRepositoryEvent(ctx context.Context, payload slacktriage.DispatchPayload) error +} + +type triageHandler struct { + cfg Config + slack SlackMessageFetcher + github GitHubDispatcher + now func() time.Time + maxBodyLen int64 +} + +func newHandler(cfg Config, slack SlackMessageFetcher, github GitHubDispatcher, now func() time.Time) *triageHandler { + if now == nil { + now = time.Now + } + if cfg.RequestTolerance <= 0 { + cfg.RequestTolerance = defaultRequestTolerance + } + + return &triageHandler{ + cfg: cfg, + slack: slack, + github: github, + now: now, + maxBodyLen: 1 << 20, + } +} + +func (h *triageHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodPost { + http.Error(w, http.StatusText(http.StatusMethodNotAllowed), http.StatusMethodNotAllowed) + return + } + defer r.Body.Close() + + body, err := io.ReadAll(io.LimitReader(r.Body, h.maxBodyLen)) + if err != nil { + http.Error(w, "read request body", http.StatusBadRequest) + return + } + + if err := h.verifyRequest(r, body); err != nil { + http.Error(w, http.StatusText(http.StatusUnauthorized), http.StatusUnauthorized) + return + } + + var envelope slackEnvelope + if err := json.Unmarshal(body, &envelope); err != nil { + http.Error(w, "decode slack payload", http.StatusBadRequest) + return + } + + switch envelope.Type { + case slackEventTypeURLVerify: + writeJSON(w, http.StatusOK, map[string]string{"challenge": envelope.Challenge}) + case 
slackEventTypeCallback: + if err := h.handleEvent(r.Context(), envelope.Event); err != nil { + http.Error(w, "process slack event", http.StatusInternalServerError) + return + } + w.WriteHeader(http.StatusOK) + default: + w.WriteHeader(http.StatusOK) + } +} + +func (h *triageHandler) verifyRequest(r *http.Request, body []byte) error { + timestamp := r.Header.Get(slackTimestampHeader) + signature := r.Header.Get(slackSignatureHeader) + if timestamp == "" || signature == "" { + return errors.New("missing slack signature headers") + } + + parsed, err := strconv.ParseInt(timestamp, 10, 64) + if err != nil { + return fmt.Errorf("invalid slack timestamp: %w", err) + } + + requestTime := time.Unix(parsed, 0) + now := h.now() + if absDuration(now.Sub(requestTime)) > h.cfg.RequestTolerance { + return errors.New("stale slack request") + } + + mac := hmac.New(sha256.New, []byte(h.cfg.SigningSecret)) + if _, err := mac.Write([]byte("v0:" + timestamp + ":" + string(body))); err != nil { + return fmt.Errorf("sign slack request: %w", err) + } + expected := "v0=" + hex.EncodeToString(mac.Sum(nil)) + if !hmac.Equal([]byte(expected), []byte(signature)) { + return errors.New("invalid slack signature") + } + + return nil +} + +func (h *triageHandler) handleEvent(ctx context.Context, event slackEvent) error { + if event.Type != slackInnerEventTypeMessage { + return nil + } + if event.Subtype != "" || event.BotID != "" { + return nil + } + if event.ThreadTS == "" || event.ThreadTS == event.Ts { + return nil + } + if !slacktriage.IsTriageTrigger(event.Text) { + return nil + } + if h.slack == nil { + return errors.New("slack fetcher is not configured") + } + if h.github == nil { + return errors.New("github dispatcher is not configured") + } + if event.Channel == "" { + return errors.New("channel is required") + } + + parentBody, err := h.slack.FetchParentMessage(ctx, event.Channel, event.ThreadTS) + if err != nil { + return err + } + + metadata, err := 
slacktriage.ParseParentMessageMetadata(parentBody) + if err != nil { + return err + } + if h.cfg.AllowedRepo != "" && metadata.Repo != h.cfg.AllowedRepo { + return nil + } + + payload := slacktriage.NewDispatchPayload(metadata, event.Channel, event.ThreadTS, event.User) + return h.github.DispatchRepositoryEvent(ctx, payload) +} + +type slackEnvelope struct { + Type string `json:"type"` + Challenge string `json:"challenge,omitempty"` + Event slackEvent `json:"event,omitempty"` +} + +type slackEvent struct { + Type string `json:"type,omitempty"` + Subtype string `json:"subtype,omitempty"` + BotID string `json:"bot_id,omitempty"` + Channel string `json:"channel,omitempty"` + User string `json:"user,omitempty"` + Text string `json:"text,omitempty"` + Ts string `json:"ts,omitempty"` + ThreadTS string `json:"thread_ts,omitempty"` +} + +func writeJSON(w http.ResponseWriter, status int, value any) { + w.Header().Set("Content-Type", "application/json; charset=utf-8") + w.WriteHeader(status) + _ = json.NewEncoder(w).Encode(value) +} + +func absDuration(d time.Duration) time.Duration { + if d < 0 { + return -d + } + return d +} + +type slackHTTPClient struct { + token string + baseURL string + client *http.Client +} + +func newSlackHTTPClient(token, baseURL string) *slackHTTPClient { + if baseURL == "" { + baseURL = defaultSlackAPIBaseURL + } + return &slackHTTPClient{ + token: token, + baseURL: strings.TrimRight(baseURL, "/"), + client: &http.Client{Timeout: 10 * time.Second}, + } +} + +func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threadTS string) (string, error) { + endpoint, err := url.Parse(c.baseURL + "/conversations.replies") + if err != nil { + return "", err + } + + query := endpoint.Query() + query.Set("channel", channel) + query.Set("ts", threadTS) + query.Set("inclusive", "true") + query.Set("limit", "1") + endpoint.RawQuery = query.Encode() + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint.String(), nil) + if 
err != nil { + return "", err + } + req.Header.Set("Authorization", "Bearer "+c.token) + + resp, err := c.client.Do(req) + if err != nil { + return "", err + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("slack conversations.replies returned %s", resp.Status) + } + + var payload struct { + OK bool `json:"ok"` + Error string `json:"error,omitempty"` + Messages []struct { + Text string `json:"text"` + } `json:"messages"` + } + if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil { + return "", err + } + if !payload.OK { + if payload.Error == "" { + payload.Error = "unknown error" + } + return "", fmt.Errorf("slack conversations.replies error: %s", payload.Error) + } + if len(payload.Messages) == 0 { + return "", errors.New("slack conversations.replies returned no messages") + } + return payload.Messages[0].Text, nil +} + +type githubHTTPDispatcher struct { + token string + baseURL string + eventType string + repository string + client *http.Client +} + +func newGitHubHTTPDispatcher(token, baseURL, eventType, repository string) *githubHTTPDispatcher { + if baseURL == "" { + baseURL = defaultGitHubAPIBaseURL + } + if eventType == "" { + eventType = defaultSlackEventType + } + return &githubHTTPDispatcher{ + token: token, + baseURL: strings.TrimRight(baseURL, "/"), + eventType: eventType, + repository: repository, + client: &http.Client{Timeout: 10 * time.Second}, + } +} + +func (d *githubHTTPDispatcher) DispatchRepositoryEvent(ctx context.Context, payload slacktriage.DispatchPayload) error { + body, err := json.Marshal(struct { + EventType string `json:"event_type"` + ClientPayload slacktriage.DispatchPayload `json:"client_payload"` + }{ + EventType: d.eventType, + ClientPayload: payload, + }) + if err != nil { + return err + } + + endpoint := fmt.Sprintf("%s/repos/%s/dispatches", d.baseURL, d.repository) + req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body)) + if err 
!= nil { + return err + } + req.Header.Set("Authorization", "Bearer "+d.token) + req.Header.Set("Accept", "application/vnd.github+json") + req.Header.Set("Content-Type", "application/json") + + resp, err := d.client.Do(req) + if err != nil { + return err + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusNoContent { + responseBody, _ := io.ReadAll(io.LimitReader(resp.Body, 4096)) + if len(responseBody) > 0 { + return fmt.Errorf("github dispatch failed: %s: %s", resp.Status, strings.TrimSpace(string(responseBody))) + } + return fmt.Errorf("github dispatch failed: %s", resp.Status) + } + + return nil +} diff --git a/cmd/e2e-triage-dispatch/main_test.go b/cmd/e2e-triage-dispatch/main_test.go new file mode 100644 index 000000000..3c06bae02 --- /dev/null +++ b/cmd/e2e-triage-dispatch/main_test.go @@ -0,0 +1,350 @@ +package main + +import ( + "context" + "crypto/hmac" + "crypto/sha256" + "encoding/hex" + "encoding/json" + "fmt" + "io" + "net/http" + "net/http/httptest" + "os" + "strings" + "testing" + "time" + + "github.com/entireio/cli/internal/slacktriage" +) + +const testSigningSecret = "test-signing-secret" + +func TestLoadConfigFromEnv_LoadsAllowedRepo(t *testing.T) { + os.Clearenv() + t.Setenv("SLACK_SIGNING_SECRET", "secret") + t.Setenv("SLACK_BOT_TOKEN", "bot") + t.Setenv("GITHUB_TOKEN", "gh") + t.Setenv("ALLOWED_REPOSITORY", "entireio/cli") + + cfg, err := loadConfigFromEnv() + if err != nil { + t.Fatalf("loadConfigFromEnv() error = %v", err) + } + if cfg.AllowedRepo != "entireio/cli" { + t.Fatalf("AllowedRepo = %q, want %q", cfg.AllowedRepo, "entireio/cli") + } +} + +func TestLoadConfigFromEnv_RequiresAllowedRepo(t *testing.T) { + os.Clearenv() + t.Setenv("SLACK_SIGNING_SECRET", "secret") + t.Setenv("SLACK_BOT_TOKEN", "bot") + t.Setenv("GITHUB_TOKEN", "gh") + + if _, err := loadConfigFromEnv(); err == nil { + t.Fatal("loadConfigFromEnv() error = nil, want error") + } +} + +func TestHandler_URLVerification(t *testing.T) { + t.Parallel() + + 
handler := newTestHandler(t) + body := `{"type":"url_verification","challenge":"abc123"}` + req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + + var got struct { + Challenge string `json:"challenge"` + } + if err := json.Unmarshal(rr.Body.Bytes(), &got); err != nil { + t.Fatalf("unmarshal response: %v", err) + } + if got.Challenge != "abc123" { + t.Fatalf("challenge = %q, want %q", got.Challenge, "abc123") + } +} + +func TestHandler_RejectsBadSignature(t *testing.T) { + t.Parallel() + + handler := newTestHandler(t) + req := httptest.NewRequest(http.MethodPost, "/slack/events", strings.NewReader(`{"type":"event_callback"}`)) + req.Header.Set("X-Slack-Request-Timestamp", fmt.Sprintf("%d", fixedNow().Unix())) + req.Header.Set("X-Slack-Signature", "v0=deadbeef") + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusUnauthorized { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusUnauthorized) + } +} + +func TestHandler_IgnoresNonThreadReplies(t *testing.T) { + t.Parallel() + + fetcher := &fakeSlackFetcher{} + dispatcher := &fakeGitHubDispatcher{} + handler := newHandlerForTest(t, fetcher, dispatcher) + + body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.222"}}` + req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + if fetcher.calls != 0 { + t.Fatalf("fetch calls = %d, want 0", fetcher.calls) + } + if dispatcher.calls != 0 { + t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) + } +} + +func TestHandler_IgnoresNonTriggerReplies(t 
*testing.T) { + t.Parallel() + + fetcher := &fakeSlackFetcher{} + dispatcher := &fakeGitHubDispatcher{} + handler := newHandlerForTest(t, fetcher, dispatcher) + + body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"hello world","ts":"111.222","thread_ts":"111.111"}}` + req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + if fetcher.calls != 0 { + t.Fatalf("fetch calls = %d, want 0", fetcher.calls) + } + if dispatcher.calls != 0 { + t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) + } +} + +func TestHandler_DispatchesValidTriggerReply(t *testing.T) { + t.Parallel() + + fetcher := &fakeSlackFetcher{ + body: "E2E Tests Failed\nmeta: repo=entireio/cli branch=main run_id=123 run_url=https://github.com/entireio/cli/actions/runs/123 sha=abc123 agents=cursor-cli,copilot-cli", + } + dispatcher := &fakeGitHubDispatcher{} + handler := newHandlerForTest(t, fetcher, dispatcher) + + body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}` + req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + if fetcher.calls != 1 { + t.Fatalf("fetch calls = %d, want 1", fetcher.calls) + } + if dispatcher.calls != 1 { + t.Fatalf("dispatch calls = %d, want 1", dispatcher.calls) + } + + if fetcher.channel != "C123" || fetcher.threadTS != "111.111" { + t.Fatalf("fetch args = (%q, %q), want (%q, %q)", fetcher.channel, fetcher.threadTS, "C123", "111.111") + } + + got := dispatcher.payloads[0] + if got.TriggerText != slacktriage.TriageTriggerText { + 
t.Fatalf("trigger = %q, want %q", got.TriggerText, slacktriage.TriageTriggerText) + } + if got.Repo != "entireio/cli" || got.Branch != "main" || got.RunID != "123" || got.RunURL != "https://github.com/entireio/cli/actions/runs/123" || got.SHA != "abc123" { + t.Fatalf("unexpected payload metadata: %+v", got) + } + if got.SlackChannel != "C123" || got.SlackThreadTS != "111.111" || got.SlackUser != "U123" { + t.Fatalf("unexpected slack metadata: %+v", got) + } +} + +func TestHandler_IgnoresBotAndSystemMessages(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + body string + }{ + { + name: "subtype", + body: `{"type":"event_callback","event":{"type":"message","subtype":"bot_message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}`, + }, + { + name: "bot_id", + body: `{"type":"event_callback","event":{"type":"message","bot_id":"B123","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}`, + }, + } + + for _, tt := range tests { + tt := tt + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + fetcher := &fakeSlackFetcher{} + dispatcher := &fakeGitHubDispatcher{} + handler := newHandlerForTest(t, fetcher, dispatcher) + + req := signedRequest(t, http.MethodPost, "/slack/events", tt.body, testSigningSecret, fixedNow()) + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + if fetcher.calls != 0 { + t.Fatalf("fetch calls = %d, want 0", fetcher.calls) + } + if dispatcher.calls != 0 { + t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) + } + }) + } +} + +func TestGitHubDispatcher_UsesConfiguredRepository(t *testing.T) { + t.Parallel() + + var gotPath string + dispatcher := newGitHubHTTPDispatcher("token", "https://api.github.com", defaultSlackEventType, "entireio/cli") + dispatcher.client = &http.Client{ + Transport: roundTripFunc(func(r *http.Request) 
(*http.Response, error) { + gotPath = r.URL.Path + if r.Method != http.MethodPost { + t.Fatalf("method = %s, want POST", r.Method) + } + return &http.Response{ + StatusCode: http.StatusNoContent, + Body: io.NopCloser(strings.NewReader("")), + Header: make(http.Header), + }, nil + }), + } + payload := slacktriage.DispatchPayload{ + Repo: "other/repo", + TriggerText: slacktriage.TriageTriggerText, + Branch: "main", + SHA: "abc123", + RunURL: "https://github.com/entireio/cli/actions/runs/123", + RunID: "123", + FailedAgents: []string{"cursor-cli"}, + } + + if err := dispatcher.DispatchRepositoryEvent(context.Background(), payload); err != nil { + t.Fatalf("DispatchRepositoryEvent() error = %v", err) + } + if gotPath != "/repos/entireio/cli/dispatches" { + t.Fatalf("request path = %q, want %q", gotPath, "/repos/entireio/cli/dispatches") + } +} + +func TestHandler_RejectsMismatchedParentRepo(t *testing.T) { + t.Parallel() + + fetcher := &fakeSlackFetcher{ + body: "E2E Tests Failed\nmeta: repo=other/repo branch=main run_id=123 run_url=https://github.com/other/repo/actions/runs/123 sha=abc123 agents=cursor-cli", + } + dispatcher := &fakeGitHubDispatcher{} + handler := newHandlerForTest(t, fetcher, dispatcher) + + body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}` + req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + + rr := httptest.NewRecorder() + handler.ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) + } + if dispatcher.calls != 0 { + t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) + } +} + +func fixedNow() time.Time { + return time.Unix(1_700_000_000, 0).UTC() +} + +func signedRequest(t *testing.T, method, target, body, secret string, now time.Time) *http.Request { + t.Helper() + + req := httptest.NewRequest(method, target, strings.NewReader(body)) + 
req.Header.Set("X-Slack-Request-Timestamp", fmt.Sprintf("%d", now.Unix())) + req.Header.Set("X-Slack-Signature", slackSignature(secret, fmt.Sprintf("%d", now.Unix()), body)) + return req +} + +func slackSignature(secret, timestamp, body string) string { + mac := hmac.New(sha256.New, []byte(secret)) + _, _ = mac.Write([]byte("v0:" + timestamp + ":" + body)) + return "v0=" + hex.EncodeToString(mac.Sum(nil)) +} + +func newTestHandler(t *testing.T) http.Handler { + t.Helper() + return newHandler(Config{ + SigningSecret: testSigningSecret, + AllowedRepo: "entireio/cli", + }, &fakeSlackFetcher{}, &fakeGitHubDispatcher{}, func() time.Time { + return fixedNow() + }) +} + +func newHandlerForTest(t *testing.T, slack *fakeSlackFetcher, github *fakeGitHubDispatcher) http.Handler { + t.Helper() + return newHandler(Config{ + SigningSecret: testSigningSecret, + AllowedRepo: "entireio/cli", + }, slack, github, func() time.Time { + return fixedNow() + }) +} + +type fakeSlackFetcher struct { + calls int + channel string + threadTS string + body string +} + +func (f *fakeSlackFetcher) FetchParentMessage(_ context.Context, channel, threadTS string) (string, error) { + f.calls++ + f.channel = channel + f.threadTS = threadTS + return f.body, nil +} + +type fakeGitHubDispatcher struct { + calls int + payloads []slacktriage.DispatchPayload +} + +func (f *fakeGitHubDispatcher) DispatchRepositoryEvent(_ context.Context, payload slacktriage.DispatchPayload) error { + f.calls++ + f.payloads = append(f.payloads, payload) + return nil +} + +type roundTripFunc func(*http.Request) (*http.Response, error) + +func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { + return f(r) +} From 2d278f62d4583546141a324bedf7ff9ddab03e86 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 15:57:18 -0700 Subject: [PATCH 11/32] docs: add slack-triggered e2e triage runbook --- README.md | 6 +++ docs/architecture/slack-e2e-triage.md | 58 +++++++++++++++++++++++++++ 2 files 
changed, 64 insertions(+) create mode 100644 docs/architecture/slack-e2e-triage.md diff --git a/README.md b/README.md index 7d104a051..4fb6dfa22 100644 --- a/README.md +++ b/README.md @@ -104,6 +104,12 @@ entire disable Removes the git hooks. Your code and commit history remain untouched. +## E2E Triage + +E2E failure alerts can be triaged from Slack by replying `triage e2e` in the failure thread. The workflow is documented in [docs/architecture/slack-e2e-triage.md](docs/architecture/slack-e2e-triage.md). + +The Slack bridge is handled by `cmd/e2e-triage-dispatch`, and the triage job itself runs in [`.github/workflows/e2e-triage.yml`](.github/workflows/e2e-triage.yml). If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL, commit SHA, and failed agents. + ## Key Concepts ### Sessions diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md new file mode 100644 index 000000000..ca7c36b1e --- /dev/null +++ b/docs/architecture/slack-e2e-triage.md @@ -0,0 +1,58 @@ +# Slack-Triggered E2E Triage + +This flow lets a human reply `triage e2e` in the thread of an E2E failure alert and have GitHub Actions run the existing triage workflow. + +## Flow + +1. `.github/workflows/e2e.yml` posts the failure alert and includes machine-readable metadata. +2. `cmd/e2e-triage-dispatch` listens for Slack thread replies, validates the reply text, fetches the parent alert, and sends a `repository_dispatch` event to GitHub. +3. `.github/workflows/e2e-triage.yml` checks out the failed SHA, runs the Claude triage skill, and posts results back to the Slack thread. + +The trigger fires only when the reply text, after lowercasing, trimming, and collapsing whitespace, is exactly `triage e2e`.
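The trigger check described in the Flow section can be sketched as below. This is a minimal standalone illustration modeled on the `internal/slacktriage` helpers in this series; the names `triggerText`, `normalize`, and `isTrigger` are local to the sketch rather than the package's exported API.

```go
package main

import (
	"fmt"
	"strings"
)

// triggerText is the phrase the bridge looks for after normalization.
const triggerText = "triage e2e"

// normalize lowercases the reply, trims it, and collapses runs of whitespace
// into single spaces.
func normalize(text string) string {
	return strings.Join(strings.Fields(strings.ToLower(text)), " ")
}

// isTrigger reports whether a reply normalizes to exactly the trigger phrase.
func isTrigger(text string) bool {
	return normalize(text) == triggerText
}

func main() {
	// Sample replies are illustrative only.
	for _, reply := range []string{"  Triage   E2E ", "triage e2e now", "triage-e2e"} {
		fmt.Printf("%q -> %v\n", reply, isTrigger(reply))
	}
}
```

Because the comparison is an exact match on the normalized text, a reply with extra words or different punctuation does not start a triage run.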
+ +## Slack Setup + +Slack app requirements: + +- Event subscription for `message.channels` so the app receives public channel thread replies +- `channels:history` so the app can read the parent E2E alert message +- `chat:write` so the app can post status updates back into the thread + +For private-channel support, also subscribe to the `message.groups` event and add the `groups:history` scope. + +## GitHub And Runtime Config + +`cmd/e2e-triage-dispatch` uses these environment variables: + +- `SLACK_SIGNING_SECRET` +- `SLACK_BOT_TOKEN` +- `GITHUB_TOKEN` +- `ALLOWED_REPOSITORY` or `GITHUB_REPOSITORY` +- `ADDR` optional, defaults to `:8080` +- `GITHUB_EVENT_TYPE` optional, defaults to `slack_e2e_triage_requested` +- `SLACK_API_BASE_URL` optional, defaults to `https://slack.com/api` +- `GITHUB_API_BASE_URL` optional, defaults to `https://api.github.com` +- `SLACK_REQUEST_TOLERANCE` optional, defaults to `5m` + +The GitHub Actions workflow uses these secrets: + +- `ANTHROPIC_API_KEY` for Claude triage +- `SLACK_BOT_TOKEN` for start and completion replies +- `GITHUB_TOKEN` for the repository dispatch and repository checkout + +## Manual Fallback + +If Slack dispatch is unavailable, you can run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`. + +Required inputs: + +- `run_url` +- `sha` +- `failed_agents` + +Optional inputs: + +- `slack_channel` +- `slack_thread_ts` + +This is the fallback path for ad hoc triage when you already have the failed run URL and commit SHA.
From 575b6b7a75d616cafd31e68587e1050d249d1aa1 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 16:01:46 -0700 Subject: [PATCH 12/32] chore: fix slack triage lint findings --- cmd/e2e-triage-dispatch/main.go | 41 ++++++++++++++++++-------- cmd/e2e-triage-dispatch/main_test.go | 34 ++++++++++----------- internal/slacktriage/normalize_test.go | 2 -- 3 files changed, 43 insertions(+), 34 deletions(-) diff --git a/cmd/e2e-triage-dispatch/main.go b/cmd/e2e-triage-dispatch/main.go index a1d210b8c..295b5dfd6 100644 --- a/cmd/e2e-triage-dispatch/main.go +++ b/cmd/e2e-triage-dispatch/main.go @@ -63,7 +63,12 @@ func main() { mux := http.NewServeMux() mux.Handle("/slack/events", handler) - log.Fatal(http.ListenAndServe(cfg.Addr, mux)) + srv := &http.Server{ + Addr: cfg.Addr, + Handler: mux, + ReadHeaderTimeout: 5 * time.Second, + } + log.Fatal(srv.ListenAndServe()) } func loadConfigFromEnv() (Config, error) { @@ -243,19 +248,22 @@ func (h *triageHandler) handleEvent(ctx context.Context, event slackEvent) error parentBody, err := h.slack.FetchParentMessage(ctx, event.Channel, event.ThreadTS) if err != nil { - return err + return fmt.Errorf("fetch parent message: %w", err) } metadata, err := slacktriage.ParseParentMessageMetadata(parentBody) if err != nil { - return err + return fmt.Errorf("parse parent message metadata: %w", err) } if h.cfg.AllowedRepo != "" && metadata.Repo != h.cfg.AllowedRepo { return nil } payload := slacktriage.NewDispatchPayload(metadata, event.Channel, event.ThreadTS, event.User) - return h.github.DispatchRepositoryEvent(ctx, payload) + if err := h.github.DispatchRepositoryEvent(ctx, payload); err != nil { + return fmt.Errorf("dispatch repository event: %w", err) + } + return nil } type slackEnvelope struct { @@ -278,7 +286,9 @@ type slackEvent struct { func writeJSON(w http.ResponseWriter, status int, value any) { w.Header().Set("Content-Type", "application/json; charset=utf-8") w.WriteHeader(status) - _ = 
json.NewEncoder(w).Encode(value) + if err := json.NewEncoder(w).Encode(value); err != nil { + log.Printf("write json response: %v", err) + } } func absDuration(d time.Duration) time.Duration { @@ -308,7 +318,7 @@ func newSlackHTTPClient(token, baseURL string) *slackHTTPClient { func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threadTS string) (string, error) { endpoint, err := url.Parse(c.baseURL + "/conversations.replies") if err != nil { - return "", err + return "", fmt.Errorf("parse slack conversations.replies url: %w", err) } query := endpoint.Query() @@ -320,13 +330,14 @@ func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threa req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint.String(), nil) if err != nil { - return "", err + return "", fmt.Errorf("build slack conversations.replies request: %w", err) } req.Header.Set("Authorization", "Bearer "+c.token) + //nolint:gosec // Slack API base URL is operator-configured. 
resp, err := c.client.Do(req) if err != nil { - return "", err + return "", fmt.Errorf("call slack conversations.replies: %w", err) } defer resp.Body.Close() @@ -342,7 +353,7 @@ func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threa } `json:"messages"` } if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil { - return "", err + return "", fmt.Errorf("decode slack conversations.replies response: %w", err) } if !payload.OK { if payload.Error == "" { @@ -389,26 +400,30 @@ func (d *githubHTTPDispatcher) DispatchRepositoryEvent(ctx context.Context, payl ClientPayload: payload, }) if err != nil { - return err + return fmt.Errorf("marshal github dispatch payload: %w", err) } endpoint := fmt.Sprintf("%s/repos/%s/dispatches", d.baseURL, d.repository) req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body)) if err != nil { - return err + return fmt.Errorf("build github dispatch request: %w", err) } req.Header.Set("Authorization", "Bearer "+d.token) req.Header.Set("Accept", "application/vnd.github+json") req.Header.Set("Content-Type", "application/json") + //nolint:gosec // GitHub API base URL and repository are operator-configured. 
resp, err := d.client.Do(req) if err != nil { - return err + return fmt.Errorf("call github dispatch endpoint: %w", err) } defer resp.Body.Close() if resp.StatusCode != http.StatusNoContent { - responseBody, _ := io.ReadAll(io.LimitReader(resp.Body, 4096)) + responseBody, readErr := io.ReadAll(io.LimitReader(resp.Body, 4096)) + if readErr != nil { + return fmt.Errorf("github dispatch failed: %s (read response): %w", resp.Status, readErr) + } if len(responseBody) > 0 { return fmt.Errorf("github dispatch failed: %s: %s", resp.Status, strings.TrimSpace(string(responseBody))) } diff --git a/cmd/e2e-triage-dispatch/main_test.go b/cmd/e2e-triage-dispatch/main_test.go index 3c06bae02..d3891a52d 100644 --- a/cmd/e2e-triage-dispatch/main_test.go +++ b/cmd/e2e-triage-dispatch/main_test.go @@ -6,11 +6,11 @@ import ( "crypto/sha256" "encoding/hex" "encoding/json" - "fmt" "io" "net/http" "net/http/httptest" "os" + "strconv" "strings" "testing" "time" @@ -52,7 +52,7 @@ func TestHandler_URLVerification(t *testing.T) { handler := newTestHandler(t) body := `{"type":"url_verification","challenge":"abc123"}` - req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + req := signedRequest(t, body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -77,7 +77,7 @@ func TestHandler_RejectsBadSignature(t *testing.T) { handler := newTestHandler(t) req := httptest.NewRequest(http.MethodPost, "/slack/events", strings.NewReader(`{"type":"event_callback"}`)) - req.Header.Set("X-Slack-Request-Timestamp", fmt.Sprintf("%d", fixedNow().Unix())) + req.Header.Set("X-Slack-Request-Timestamp", strconv.FormatInt(fixedNow().Unix(), 10)) req.Header.Set("X-Slack-Signature", "v0=deadbeef") rr := httptest.NewRecorder() @@ -96,7 +96,7 @@ func TestHandler_IgnoresNonThreadReplies(t *testing.T) { handler := newHandlerForTest(t, fetcher, dispatcher) body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage 
e2e","ts":"111.222","thread_ts":"111.222"}}` - req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + req := signedRequest(t, body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -120,7 +120,7 @@ func TestHandler_IgnoresNonTriggerReplies(t *testing.T) { handler := newHandlerForTest(t, fetcher, dispatcher) body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"hello world","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + req := signedRequest(t, body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -146,7 +146,7 @@ func TestHandler_DispatchesValidTriggerReply(t *testing.T) { handler := newHandlerForTest(t, fetcher, dispatcher) body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + req := signedRequest(t, body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -195,7 +195,6 @@ func TestHandler_IgnoresBotAndSystemMessages(t *testing.T) { } for _, tt := range tests { - tt := tt t.Run(tt.name, func(t *testing.T) { t.Parallel() @@ -203,7 +202,7 @@ func TestHandler_IgnoresBotAndSystemMessages(t *testing.T) { dispatcher := &fakeGitHubDispatcher{} handler := newHandlerForTest(t, fetcher, dispatcher) - req := signedRequest(t, http.MethodPost, "/slack/events", tt.body, testSigningSecret, fixedNow()) + req := signedRequest(t, tt.body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -266,7 +265,7 @@ func TestHandler_RejectsMismatchedParentRepo(t *testing.T) { handler := newHandlerForTest(t, fetcher, dispatcher) body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage 
e2e","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, http.MethodPost, "/slack/events", body, testSigningSecret, fixedNow()) + req := signedRequest(t, body, fixedNow()) rr := httptest.NewRecorder() handler.ServeHTTP(rr, req) @@ -283,12 +282,13 @@ func fixedNow() time.Time { return time.Unix(1_700_000_000, 0).UTC() } -func signedRequest(t *testing.T, method, target, body, secret string, now time.Time) *http.Request { +func signedRequest(t *testing.T, body string, now time.Time) *http.Request { t.Helper() - req := httptest.NewRequest(method, target, strings.NewReader(body)) - req.Header.Set("X-Slack-Request-Timestamp", fmt.Sprintf("%d", now.Unix())) - req.Header.Set("X-Slack-Signature", slackSignature(secret, fmt.Sprintf("%d", now.Unix()), body)) + req := httptest.NewRequest(http.MethodPost, "/slack/events", strings.NewReader(body)) + timestamp := strconv.FormatInt(now.Unix(), 10) + req.Header.Set("X-Slack-Request-Timestamp", timestamp) + req.Header.Set("X-Slack-Signature", slackSignature(testSigningSecret, timestamp, body)) return req } @@ -303,9 +303,7 @@ func newTestHandler(t *testing.T) http.Handler { return newHandler(Config{ SigningSecret: testSigningSecret, AllowedRepo: "entireio/cli", - }, &fakeSlackFetcher{}, &fakeGitHubDispatcher{}, func() time.Time { - return fixedNow() - }) + }, &fakeSlackFetcher{}, &fakeGitHubDispatcher{}, fixedNow) } func newHandlerForTest(t *testing.T, slack *fakeSlackFetcher, github *fakeGitHubDispatcher) http.Handler { @@ -313,9 +311,7 @@ func newHandlerForTest(t *testing.T, slack *fakeSlackFetcher, github *fakeGitHub return newHandler(Config{ SigningSecret: testSigningSecret, AllowedRepo: "entireio/cli", - }, slack, github, func() time.Time { - return fixedNow() - }) + }, slack, github, fixedNow) } type fakeSlackFetcher struct { diff --git a/internal/slacktriage/normalize_test.go b/internal/slacktriage/normalize_test.go index a9e3d205e..3767533e0 100644 --- a/internal/slacktriage/normalize_test.go +++ 
b/internal/slacktriage/normalize_test.go @@ -28,7 +28,6 @@ func TestNormalizeTrigger(t *testing.T) { } for _, tt := range tests { - tt := tt t.Run(tt.name, func(t *testing.T) { t.Parallel() if got := NormalizeTrigger(tt.in); got != tt.want { @@ -46,7 +45,6 @@ func TestIsTriageTrigger(t *testing.T) { } for _, in := range []string{"triage", "triage e2e now", "triage-e2e"} { - in := in t.Run(in, func(t *testing.T) { t.Parallel() if IsTriageTrigger(in) { From 21edabecc979974306d0f73822e14ad4d56e23aa Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Tue, 17 Mar 2026 16:05:03 -0700 Subject: [PATCH 13/32] docs: fix slack triage documentation links --- README.md | 4 ++-- docs/architecture/slack-e2e-triage.md | 4 +++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 4fb6dfa22..355aa5663 100644 --- a/README.md +++ b/README.md @@ -106,9 +106,9 @@ Removes the git hooks. Your code and commit history remain untouched. ## E2E Triage -E2E failure alerts can be triaged from Slack by replying `triage e2e` in the failure thread. The workflow is documented in [docs/architecture/slack-e2e-triage.md](/Users/alisha/Projects/wt/e2e-triage-ci-job/docs/architecture/slack-e2e-triage.md). +E2E failure alerts can be triaged from Slack by replying `triage e2e` in the failure thread. The workflow is documented in [docs/architecture/slack-e2e-triage.md](docs/architecture/slack-e2e-triage.md). -The Slack bridge is handled by `cmd/e2e-triage-dispatch`, and the triage job itself runs in [`.github/workflows/e2e-triage.yml`](/Users/alisha/Projects/wt/e2e-triage-ci-job/.github/workflows/e2e-triage.yml). If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL, commit SHA, and failed agents. +The Slack bridge is handled by `cmd/e2e-triage-dispatch`, and the triage job itself runs in [`.github/workflows/e2e-triage.yml`](.github/workflows/e2e-triage.yml). 
If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL, commit SHA, and failed agents. ## Key Concepts diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md index ca7c36b1e..54842dbd9 100644 --- a/docs/architecture/slack-e2e-triage.md +++ b/docs/architecture/slack-e2e-triage.md @@ -38,7 +38,9 @@ The GitHub Actions workflow uses these secrets: - `ANTHROPIC_API_KEY` for Claude triage - `SLACK_BOT_TOKEN` for start and completion replies -- `GITHUB_TOKEN` for the repository dispatch and repository checkout +- The built-in `${{ github.token }}` for repository dispatch and repository checkout + +The externally deployed `cmd/e2e-triage-dispatch` service uses `GITHUB_TOKEN` to call the GitHub dispatch API. ## Manual Fallback From 6080e9bf9fc8f412c7d79355e60b8697f5374e5d Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 09:00:32 -0700 Subject: [PATCH 14/32] ci: auto-detect sha and failed_agents from run URL in e2e triage Make sha and failed_agents optional for workflow_dispatch triggers. When omitted, these values are derived from the run URL via the GitHub API, reducing friction when triggering triage from the UI. 
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 4a44db7b807d --- .github/workflows/e2e-triage.yml | 36 +++++++++++++++++++++++++------- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index aa3637894..fd2feefa1 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -11,12 +11,12 @@ on: required: true type: string sha: - description: Commit SHA that failed - required: true + description: Commit SHA (auto-detected from run if omitted) + required: false type: string failed_agents: - description: Comma-separated list of failed agents - required: true + description: Comma-separated list of failed agents (auto-detected from run if omitted) + required: false type: string slack_channel: description: Slack channel ID for the originating thread @@ -45,6 +45,7 @@ jobs: id: set shell: bash env: + GH_TOKEN: ${{ github.token }} EVENT_NAME: ${{ github.event_name }} REPO_NAME: ${{ github.repository }} RUN_URL_INPUT: ${{ inputs.run_url }} @@ -57,11 +58,32 @@ jobs: if [ "$EVENT_NAME" = "workflow_dispatch" ]; then run_url="$RUN_URL_INPUT" - sha="$SHA_INPUT" + sha="${SHA_INPUT:-}" slack_channel="${SLACK_CHANNEL_INPUT:-}" slack_thread_ts="${SLACK_THREAD_TS_INPUT:-}" - failed_agents_raw="$FAILED_AGENTS_INPUT" - agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" + + # Extract run ID from URL for API calls (if needed) + run_id="" + if [ -z "$sha" ] || [ -z "$FAILED_AGENTS_INPUT" ]; then + run_id=$(echo "$run_url" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+') + if [ -z "$run_id" ]; then + echo "Could not extract run ID from run_url: $run_url" >&2 + exit 1 + fi + fi + + # If sha not provided, fetch from run + if [ -z "$sha" ]; then + sha=$(gh run view "$run_id" --repo "$REPO_NAME" --json headSha --jq '.headSha') + fi + + # If failed_agents not provided, fetch from run's failed 
jobs + if [ -n "$FAILED_AGENTS_INPUT" ]; then + agents_json="$(printf '%s' "$FAILED_AGENTS_INPUT" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" + else + agents_json=$(gh api "repos/$REPO_NAME/actions/runs/$run_id/jobs" \ + --jq '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') + fi else trigger_text="$(jq -r '.client_payload.trigger_text // empty' "$GITHUB_EVENT_PATH")" repo="$(jq -r '.client_payload.repo // empty' "$GITHUB_EVENT_PATH")" From 8452c5b7ee59ef6e424810f345866dde5815f747 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 09:12:44 -0700 Subject: [PATCH 15/32] chore: simplify e2e triage workflow and fix lint issues - Consolidate two gh API calls into one (headSha + jobs in single request) - Extract duplicated CSV-to-JSON jq pattern into csv_to_json function - Add "null" guard to agents_json validation - Use shallow clone (fetch-depth: 1) for triage jobs - Add server-side error logging in HTTP handler - Fix gosec nolint placement and noctx lint errors in tests Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 0f803598ba36 --- .github/workflows/e2e-triage.yml | 31 ++++++++++++++-------------- cmd/e2e-triage-dispatch/main.go | 12 +++++++---- cmd/e2e-triage-dispatch/main_test.go | 4 ++-- 3 files changed, 26 insertions(+), 21 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index fd2feefa1..2e5128982 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -56,33 +56,34 @@ jobs: run: | set -euo pipefail + csv_to_json() { + printf '%s' "$1" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))' + } + if [ "$EVENT_NAME" = "workflow_dispatch" ]; then run_url="$RUN_URL_INPUT" sha="${SHA_INPUT:-}" slack_channel="${SLACK_CHANNEL_INPUT:-}" slack_thread_ts="${SLACK_THREAD_TS_INPUT:-}" - # Extract run ID from URL for API calls (if needed) - run_id=""
+ # Derive missing values from run URL via GitHub API if [ -z "$sha" ] || [ -z "$FAILED_AGENTS_INPUT" ]; then run_id=$(echo "$run_url" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+') if [ -z "$run_id" ]; then echo "Could not extract run ID from run_url: $run_url" >&2 exit 1 fi + run_data=$(gh run view "$run_id" --repo "$REPO_NAME" --json headSha,jobs) + if [ -z "$sha" ]; then + sha=$(echo "$run_data" | jq -r '.headSha') + fi + if [ -z "$FAILED_AGENTS_INPUT" ]; then + agents_json=$(echo "$run_data" | jq -c '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') + fi fi - # If sha not provided, fetch from run - if [ -z "$sha" ]; then - sha=$(gh run view "$run_id" --repo "$REPO_NAME" --json headSha --jq '.headSha') - fi - - # If failed_agents not provided, fetch from run's failed jobs if [ -n "$FAILED_AGENTS_INPUT" ]; then - agents_json="$(printf '%s' "$FAILED_AGENTS_INPUT" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" - else - agents_json=$(gh api "repos/$REPO_NAME/actions/runs/$run_id/jobs" \ - --jq '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') + agents_json="$(csv_to_json "$FAILED_AGENTS_INPUT")" fi else trigger_text="$(jq -r '.client_payload.trigger_text // empty' "$GITHUB_EVENT_PATH")" @@ -96,7 +97,7 @@ jobs: agents_json="$(jq -c '.client_payload.failed_agents' "$GITHUB_EVENT_PATH")" else failed_agents_raw="$(jq -r '.client_payload.failed_agents // empty' "$GITHUB_EVENT_PATH")" - agents_json="$(printf '%s' "$failed_agents_raw" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))')" + agents_json="$(csv_to_json "$failed_agents_raw")" fi if [ "$trigger_text" != "triage e2e" ]; then @@ -121,7 +122,7 @@ jobs: echo "sha is required" >&2 exit 1 fi - if [ -z "$agents_json" ] || [ "$agents_json" = "[]" ]; then + if [ -z "$agents_json" ] || [ "$agents_json" = "[]" ] || [ "$agents_json" = "null" ]; then echo "failed_agents
is required" >&2 exit 1 fi @@ -191,7 +192,7 @@ jobs: uses: actions/checkout@v6 with: ref: ${{ needs.matrix-setup.outputs.sha }} - fetch-depth: 0 + fetch-depth: 1 - name: Setup mise uses: jdx/mise-action@v4 diff --git a/cmd/e2e-triage-dispatch/main.go b/cmd/e2e-triage-dispatch/main.go index 295b5dfd6..820b5e05f 100644 --- a/cmd/e2e-triage-dispatch/main.go +++ b/cmd/e2e-triage-dispatch/main.go @@ -164,17 +164,20 @@ func (h *triageHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { body, err := io.ReadAll(io.LimitReader(r.Body, h.maxBodyLen)) if err != nil { + log.Printf("read request body: %v", err) http.Error(w, "read request body", http.StatusBadRequest) return } if err := h.verifyRequest(r, body); err != nil { + log.Printf("verify slack request: %v", err) http.Error(w, http.StatusText(http.StatusUnauthorized), http.StatusUnauthorized) return } var envelope slackEnvelope if err := json.Unmarshal(body, &envelope); err != nil { + log.Printf("decode slack payload: %v", err) http.Error(w, "decode slack payload", http.StatusBadRequest) return } @@ -184,6 +187,7 @@ func (h *triageHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { writeJSON(w, http.StatusOK, map[string]string{"challenge": envelope.Challenge}) case slackEventTypeCallback: if err := h.handleEvent(r.Context(), envelope.Event); err != nil { + log.Printf("process slack event: %v", err) http.Error(w, "process slack event", http.StatusInternalServerError) return } @@ -328,14 +332,14 @@ func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threa query.Set("limit", "1") endpoint.RawQuery = query.Encode() + //nolint:gosec // Slack API base URL is operator-configured. req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint.String(), nil) if err != nil { return "", fmt.Errorf("build slack conversations.replies request: %w", err) } req.Header.Set("Authorization", "Bearer "+c.token) - //nolint:gosec // Slack API base URL is operator-configured. 
- resp, err := c.client.Do(req) + resp, err := c.client.Do(req) //nolint:gosec // taint tracked from operator-configured base URL above if err != nil { return "", fmt.Errorf("call slack conversations.replies: %w", err) } @@ -404,6 +408,7 @@ func (d *githubHTTPDispatcher) DispatchRepositoryEvent(ctx context.Context, payl } endpoint := fmt.Sprintf("%s/repos/%s/dispatches", d.baseURL, d.repository) + //nolint:gosec // GitHub API base URL and repository are operator-configured. req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body)) if err != nil { return fmt.Errorf("build github dispatch request: %w", err) @@ -412,8 +417,7 @@ func (d *githubHTTPDispatcher) DispatchRepositoryEvent(ctx context.Context, payl req.Header.Set("Accept", "application/vnd.github+json") req.Header.Set("Content-Type", "application/json") - //nolint:gosec // GitHub API base URL and repository are operator-configured. - resp, err := d.client.Do(req) + resp, err := d.client.Do(req) //nolint:gosec // taint tracked from operator-configured base URL above if err != nil { return fmt.Errorf("call github dispatch endpoint: %w", err) } diff --git a/cmd/e2e-triage-dispatch/main_test.go b/cmd/e2e-triage-dispatch/main_test.go index d3891a52d..9f2eb4e5e 100644 --- a/cmd/e2e-triage-dispatch/main_test.go +++ b/cmd/e2e-triage-dispatch/main_test.go @@ -76,7 +76,7 @@ func TestHandler_RejectsBadSignature(t *testing.T) { t.Parallel() handler := newTestHandler(t) - req := httptest.NewRequest(http.MethodPost, "/slack/events", strings.NewReader(`{"type":"event_callback"}`)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/slack/events", strings.NewReader(`{"type":"event_callback"}`)) req.Header.Set("X-Slack-Request-Timestamp", strconv.FormatInt(fixedNow().Unix(), 10)) req.Header.Set("X-Slack-Signature", "v0=deadbeef") @@ -285,7 +285,7 @@ func fixedNow() time.Time { func signedRequest(t *testing.T, body string, now time.Time) *http.Request { 
t.Helper() - req := httptest.NewRequest(http.MethodPost, "/slack/events", strings.NewReader(body)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/slack/events", strings.NewReader(body)) timestamp := strconv.FormatInt(now.Unix(), 10) req.Header.Set("X-Slack-Request-Timestamp", timestamp) req.Header.Set("X-Slack-Signature", slackSignature(testSigningSecret, timestamp, body)) From f8a82d65f64ae59f5d8f141b3b01dd4e19b75ec1 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 10:25:53 -0700 Subject: [PATCH 16/32] ci: add push trigger for testing e2e triage on feature branch Adds push-triggered test mode that runs with the vogon canary agent (no API costs) when workflow-related files change on this branch. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: eec73e0fab92 --- .github/workflows/e2e-triage.yml | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 2e5128982..cb073f690 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -1,6 +1,14 @@ name: E2E Triage on: + push: + branches: + - alisha/e2e-triage-ci-job + paths: + - .github/workflows/e2e-triage.yml + - scripts/run-e2e-triage.sh + - cmd/e2e-triage-dispatch/** + - internal/slacktriage/** repository_dispatch: types: - slack_e2e_triage_requested @@ -60,7 +68,14 @@ jobs: printf '%s' "$1" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))' } - if [ "$EVENT_NAME" = "workflow_dispatch" ]; then + if [ "$EVENT_NAME" = "push" ]; then + # Push-triggered test mode: use current SHA and vogon canary agent + run_url="${GITHUB_SERVER_URL}/${REPO_NAME}/actions/runs/${GITHUB_RUN_ID}" + sha="${GITHUB_SHA}" + agents_json='["vogon"]' + slack_channel="" + slack_thread_ts="" + elif [ "$EVENT_NAME" = "workflow_dispatch" ]; then run_url="$RUN_URL_INPUT" sha="${SHA_INPUT:-}" 
slack_channel="${SLACK_CHANNEL_INPUT:-}" From 5a20065835377b44b67ba2880f1f096f87f85a8f Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 10:34:07 -0700 Subject: [PATCH 17/32] Revert "ci: add push trigger for testing e2e triage on feature branch" This reverts commit f8a82d65f64ae59f5d8f141b3b01dd4e19b75ec1. Entire-Checkpoint: 363c74b4a8c5 --- .github/workflows/e2e-triage.yml | 17 +---------------- 1 file changed, 1 insertion(+), 16 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index cb073f690..2e5128982 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -1,14 +1,6 @@ name: E2E Triage on: - push: - branches: - - alisha/e2e-triage-ci-job - paths: - - .github/workflows/e2e-triage.yml - - scripts/run-e2e-triage.sh - - cmd/e2e-triage-dispatch/** - - internal/slacktriage/** repository_dispatch: types: - slack_e2e_triage_requested @@ -68,14 +60,7 @@ jobs: printf '%s' "$1" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))' } - if [ "$EVENT_NAME" = "push" ]; then - # Push-triggered test mode: use current SHA and vogon canary agent - run_url="${GITHUB_SERVER_URL}/${REPO_NAME}/actions/runs/${GITHUB_RUN_ID}" - sha="${GITHUB_SHA}" - agents_json='["vogon"]' - slack_channel="" - slack_thread_ts="" - elif [ "$EVENT_NAME" = "workflow_dispatch" ]; then + if [ "$EVENT_NAME" = "workflow_dispatch" ]; then run_url="$RUN_URL_INPUT" sha="${SHA_INPUT:-}" slack_channel="${SLACK_CHANNEL_INPUT:-}" From 8c4586c67e1866bf7e37dfb2884a033042e91ed2 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 10:47:17 -0700 Subject: [PATCH 18/32] fix: checkout workflow branch instead of target SHA in e2e triage The triage workflow was checking out the failed run's SHA, which doesn't contain the triage script. Now checks out the workflow's own branch and passes the target SHA as an env var instead. 
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: d4ec0a1e350d --- .github/workflows/e2e-triage.yml | 2 +- scripts/run-e2e-triage.sh | 7 ++++++- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 2e5128982..ec11006d7 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -191,7 +191,6 @@ jobs: - name: Checkout repository uses: actions/checkout@v6 with: - ref: ${{ needs.matrix-setup.outputs.sha }} fetch-depth: 1 - name: Setup mise @@ -208,6 +207,7 @@ jobs: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GH_TOKEN: ${{ github.token }} TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log + TRIAGE_SHA: ${{ needs.matrix-setup.outputs.sha }} run: scripts/run-e2e-triage.sh - name: Summarize triage output diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh index d64aefd5a..c19a3c3c7 100755 --- a/scripts/run-e2e-triage.sh +++ b/scripts/run-e2e-triage.sh @@ -8,6 +8,11 @@ set -euo pipefail mkdir -p "$(dirname "$TRIAGE_OUTPUT_FILE")" +triage_args="/e2e:triage-ci ${RUN_URL} --agent ${E2E_AGENT}" +if [ -n "${TRIAGE_SHA:-}" ]; then + triage_args="${triage_args} --sha ${TRIAGE_SHA}" +fi + claude --plugin-dir .claude/plugins/e2e \ - -p "/e2e:triage-ci ${RUN_URL} --agent ${E2E_AGENT}" \ + -p "$triage_args" \ 2>&1 | tee "$TRIAGE_OUTPUT_FILE" From 379aa6e7855d776c823457265b27250506c48ecf Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 11:00:08 -0700 Subject: [PATCH 19/32] fix: add strict tool permissions for claude in CI triage Use --allowedTools with explicit per-command scoping instead of --dangerously-skip-permissions. Each gh command is locked to the specific repo, workflow, and agent. No generic shell access. 
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 4004148a4e05 --- .github/workflows/e2e-triage.yml | 11 +++++++++++ scripts/run-e2e-triage.sh | 14 +++++++++++++- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index ec11006d7..df3f2e6da 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -242,6 +242,17 @@ jobs: echo EOF } >> "$GITHUB_OUTPUT" + # Print triage output to job summary + if [ -f "$TRIAGE_OUTPUT_FILE" ]; then + { + echo "## E2E Triage: ${E2E_AGENT}" + echo "" + cat "$TRIAGE_OUTPUT_FILE" + } >> "$GITHUB_STEP_SUMMARY" + else + echo "No triage output file found." >> "$GITHUB_STEP_SUMMARY" + fi + - name: Post triage completion if: ${{ always() && (steps.triage.outcome == 'success' || steps.triage.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} shell: bash diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh index c19a3c3c7..1bca4798f 100755 --- a/scripts/run-e2e-triage.sh +++ b/scripts/run-e2e-triage.sh @@ -13,6 +13,18 @@ if [ -n "${TRIAGE_SHA:-}" ]; then triage_args="${triage_args} --sha ${TRIAGE_SHA}" fi -claude --plugin-dir .claude/plugins/e2e \ +repo="${GITHUB_REPOSITORY:-entireio/cli}" + +claude \ + --plugin-dir .claude/plugins/e2e \ + --allowedTools \ + "Bash(scripts/download-e2e-artifacts.sh ${RUN_URL})" \ + "Bash(gh run view * --repo ${repo} --json *)" \ + "Bash(gh run download * --repo ${repo} --dir *)" \ + "Bash(gh run list --workflow e2e.yml --repo ${repo} *)" \ + "Bash(mise run test:e2e --agent ${E2E_AGENT} *)" \ + "Read" \ + "Grep" \ + "Glob" \ -p "$triage_args" \ 2>&1 | tee "$TRIAGE_OUTPUT_FILE" From 9e64c529e1842a498032413851aee0fa0270beda Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 11:11:11 -0700 Subject: [PATCH 20/32] fix: pre-download artifacts and restrict claude to read-only tools MIME-Version: 1.0 
Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Instead of giving Claude shell access to gh/scripts, download artifacts in the script before invoking Claude. Claude only gets Read, Grep, and Glob — pure analysis, no shell execution. Also improve job summary to show helpful message when log is empty. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: ca4f43d851a5 --- .github/workflows/e2e-triage.yml | 19 +++++++++++-------- scripts/run-e2e-triage.sh | 12 ++++-------- 2 files changed, 15 insertions(+), 16 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index df3f2e6da..ea3a24890 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -243,15 +243,18 @@ jobs: } >> "$GITHUB_OUTPUT" # Print triage output to job summary - if [ -f "$TRIAGE_OUTPUT_FILE" ]; then - { - echo "## E2E Triage: ${E2E_AGENT}" - echo "" + { + echo "## E2E Triage: ${E2E_AGENT}" + echo "" + if [ -f "$TRIAGE_OUTPUT_FILE" ] && [ -s "$TRIAGE_OUTPUT_FILE" ]; then cat "$TRIAGE_OUTPUT_FILE" - } >> "$GITHUB_STEP_SUMMARY" - else - echo "No triage output file found." >> "$GITHUB_STEP_SUMMARY" - fi + else + echo "No triage output was produced." + echo "" + echo "The triage log file was either not created or is empty." + echo "Check the 'Run triage' step logs for details." 
+ fi + } >> "$GITHUB_STEP_SUMMARY" - name: Post triage completion if: ${{ always() && (steps.triage.outcome == 'success' || steps.triage.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh index 1bca4798f..8939db4e0 100755 --- a/scripts/run-e2e-triage.sh +++ b/scripts/run-e2e-triage.sh @@ -8,21 +8,17 @@ set -euo pipefail mkdir -p "$(dirname "$TRIAGE_OUTPUT_FILE")" -triage_args="/e2e:triage-ci ${RUN_URL} --agent ${E2E_AGENT}" +# Download artifacts before invoking Claude so it only needs read-only access +artifact_path="$(scripts/download-e2e-artifacts.sh "$RUN_URL")" + +triage_args="/e2e:triage-ci ${artifact_path} --agent ${E2E_AGENT}" if [ -n "${TRIAGE_SHA:-}" ]; then triage_args="${triage_args} --sha ${TRIAGE_SHA}" fi -repo="${GITHUB_REPOSITORY:-entireio/cli}" - claude \ --plugin-dir .claude/plugins/e2e \ --allowedTools \ - "Bash(scripts/download-e2e-artifacts.sh ${RUN_URL})" \ - "Bash(gh run view * --repo ${repo} --json *)" \ - "Bash(gh run download * --repo ${repo} --dir *)" \ - "Bash(gh run list --workflow e2e.yml --repo ${repo} *)" \ - "Bash(mise run test:e2e --agent ${E2E_AGENT} *)" \ "Read" \ "Grep" \ "Glob" \ From ba6611a36134e2bc5da95bc3379ae8272912ed24 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 11:35:36 -0700 Subject: [PATCH 21/32] fix: strip ANSI escape codes from triage output for clean GitHub summaries Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 4fc3a7119ed8 --- .github/workflows/e2e-triage.yml | 2 +- scripts/run-e2e-triage.sh | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index ea3a24890..cc7c61471 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -247,7 +247,7 @@ jobs: echo "## E2E Triage: ${E2E_AGENT}" echo "" if [ -f "$TRIAGE_OUTPUT_FILE" ] && [ -s 
"$TRIAGE_OUTPUT_FILE" ]; then - cat "$TRIAGE_OUTPUT_FILE" + sed 's/\x1b\[[0-9;]*[a-zA-Z]//g; s/\x1b\[[?][0-9]*[a-zA-Z]//g' "$TRIAGE_OUTPUT_FILE" else echo "No triage output was produced." echo "" diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh index 8939db4e0..96bd4b0f9 100755 --- a/scripts/run-e2e-triage.sh +++ b/scripts/run-e2e-triage.sh @@ -18,9 +18,11 @@ fi claude \ --plugin-dir .claude/plugins/e2e \ + --output-format text \ --allowedTools \ "Read" \ "Grep" \ "Glob" \ -p "$triage_args" \ - 2>&1 | tee "$TRIAGE_OUTPUT_FILE" + 2>&1 | sed 's/\x1b\[[0-9;]*[a-zA-Z]//g; s/\x1b\[[?][0-9]*[a-zA-Z]//g' \ + | tee "$TRIAGE_OUTPUT_FILE" From 19cbefefc0e3ed6df1e8e7a26221e175907b322d Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 13:24:18 -0700 Subject: [PATCH 22/32] feat: replace dispatch service with Cloudflare Worker for one-click Slack triage Replace the Go-based Slack Events API dispatch service with a lightweight Cloudflare Worker that bridges Slack links to GitHub workflow_dispatch. The e2e.yml alert now posts via bot token (chat.postMessage) to capture thread context, then includes a clickable "Run Triage" link. 
- Add workers/e2e-triage-trigger/ (Cloudflare Worker) - Switch e2e.yml Slack alert from webhook to bot token + curl - Remove repository_dispatch trigger from e2e-triage.yml - Delete cmd/e2e-triage-dispatch/ and internal/slacktriage/ - Update docs and README Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: b4154940fa73 --- .github/workflows/e2e-triage.yml | 69 +-- .github/workflows/e2e.yml | 77 ++-- README.md | 4 +- cmd/e2e-triage-dispatch/main.go | 438 -------------------- cmd/e2e-triage-dispatch/main_test.go | 346 ---------------- docs/architecture/slack-e2e-triage.md | 76 ++-- internal/slacktriage/dispatch.go | 34 -- internal/slacktriage/dispatch_test.go | 42 -- internal/slacktriage/normalize.go | 15 - internal/slacktriage/normalize_test.go | 55 --- internal/slacktriage/parent_message.go | 91 ---- internal/slacktriage/parent_message_test.go | 51 --- workers/e2e-triage-trigger/package.json | 12 + workers/e2e-triage-trigger/src/index.ts | 59 +++ workers/e2e-triage-trigger/wrangler.toml | 3 + 15 files changed, 181 insertions(+), 1191 deletions(-) delete mode 100644 cmd/e2e-triage-dispatch/main.go delete mode 100644 cmd/e2e-triage-dispatch/main_test.go delete mode 100644 internal/slacktriage/dispatch.go delete mode 100644 internal/slacktriage/dispatch_test.go delete mode 100644 internal/slacktriage/normalize.go delete mode 100644 internal/slacktriage/normalize_test.go delete mode 100644 internal/slacktriage/parent_message.go delete mode 100644 internal/slacktriage/parent_message_test.go create mode 100644 workers/e2e-triage-trigger/package.json create mode 100644 workers/e2e-triage-trigger/src/index.ts create mode 100644 workers/e2e-triage-trigger/wrangler.toml diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index cc7c61471..f51c1497c 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -1,9 +1,6 @@ name: E2E Triage on: - repository_dispatch: - types: - - slack_e2e_triage_requested 
workflow_dispatch: inputs: run_url: @@ -46,7 +43,6 @@ jobs: shell: bash env: GH_TOKEN: ${{ github.token }} - EVENT_NAME: ${{ github.event_name }} REPO_NAME: ${{ github.repository }} RUN_URL_INPUT: ${{ inputs.run_url }} SHA_INPUT: ${{ inputs.sha }} @@ -60,60 +56,31 @@ jobs: printf '%s' "$1" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; "")) | map(select(length > 0))' } - if [ "$EVENT_NAME" = "workflow_dispatch" ]; then - run_url="$RUN_URL_INPUT" - sha="${SHA_INPUT:-}" - slack_channel="${SLACK_CHANNEL_INPUT:-}" - slack_thread_ts="${SLACK_THREAD_TS_INPUT:-}" + run_url="$RUN_URL_INPUT" + sha="${SHA_INPUT:-}" + slack_channel="${SLACK_CHANNEL_INPUT:-}" + slack_thread_ts="${SLACK_THREAD_TS_INPUT:-}" - # Derive missing values from run URL via GitHub API - if [ -z "$sha" ] || [ -z "$FAILED_AGENTS_INPUT" ]; then - run_id=$(echo "$run_url" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+') - if [ -z "$run_id" ]; then - echo "Could not extract run ID from run_url: $run_url" >&2 - exit 1 - fi - run_data=$(gh run view "$run_id" --repo "$REPO_NAME" --json headSha,jobs) - if [ -z "$sha" ]; then - sha=$(echo "$run_data" | jq -r '.headSha') - fi - if [ -z "$FAILED_AGENTS_INPUT" ]; then - agents_json=$(echo "$run_data" | jq -c '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') - fi - fi - - if [ -n "$FAILED_AGENTS_INPUT" ]; then - agents_json="$(csv_to_json "$FAILED_AGENTS_INPUT")" - fi - else - trigger_text="$(jq -r '.client_payload.trigger_text // empty' "$GITHUB_EVENT_PATH")" - repo="$(jq -r '.client_payload.repo // empty' "$GITHUB_EVENT_PATH")" - branch="$(jq -r '.client_payload.branch // empty' "$GITHUB_EVENT_PATH")" - run_url="$(jq -r '.client_payload.run_url // empty' "$GITHUB_EVENT_PATH")" - sha="$(jq -r '.client_payload.sha // empty' "$GITHUB_EVENT_PATH")" - slack_channel="$(jq -r '.client_payload.slack_channel // empty' "$GITHUB_EVENT_PATH")" - slack_thread_ts="$(jq -r '.client_payload.slack_thread_ts // empty'
"$GITHUB_EVENT_PATH")" - if jq -e '.client_payload.failed_agents | type == "array"' "$GITHUB_EVENT_PATH" >/dev/null 2>&1; then - agents_json="$(jq -c '.client_payload.failed_agents' "$GITHUB_EVENT_PATH")" - else - failed_agents_raw="$(jq -r '.client_payload.failed_agents // empty' "$GITHUB_EVENT_PATH")" - agents_json="$(csv_to_json "$failed_agents_raw")" - fi - - if [ "$trigger_text" != "triage e2e" ]; then - echo "trigger_text must be exactly 'triage e2e'" >&2 + # Derive missing values from run URL via GitHub API + if [ -z "$sha" ] || [ -z "$FAILED_AGENTS_INPUT" ]; then + run_id=$(echo "$run_url" | grep -oE '/runs/[0-9]+' | grep -oE '[0-9]+') + if [ -z "$run_id" ]; then + echo "Could not extract run ID from run_url: $run_url" >&2 exit 1 fi - if [ "$repo" != "$REPO_NAME" ]; then - echo "repo must match $REPO_NAME" >&2 - exit 1 + run_data=$(gh run view "$run_id" --repo "$REPO_NAME" --json headSha,jobs) + if [ -z "$sha" ]; then + sha=$(echo "$run_data" | jq -r '.headSha') fi - if [ "$branch" != "main" ]; then - echo "branch must be main" >&2 - exit 1 + if [ -z "$FAILED_AGENTS_INPUT" ]; then + agents_json=$(echo "$run_data" | jq -c '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') fi fi + if [ -n "$FAILED_AGENTS_INPUT" ]; then + agents_json="$(csv_to_json "$FAILED_AGENTS_INPUT")" + fi + if [ -z "$run_url" ]; then echo "run_url is required" >&2 exit 1 diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index 98246ea14..12f3c5fcb 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -130,32 +130,51 @@ jobs: echo "agents_csv=$failed_csv" >> "$GITHUB_OUTPUT" - name: Notify Slack of E2E failure - uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a # v2.1.1 - with: - webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }} - webhook-type: incoming-webhook - payload: | - { - "blocks": [ - { - "type": "section", - "text": { - "type": "mrkdwn", - "text": ":red_circle: *E2E Tests Failed*
on `main`\n\nFailed agents: *${{ steps.failed.outputs.agents }}*\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run details>" - } - }, - { - "type": "context", - "elements": [ - { - "type": "mrkdwn", - "text": "meta: repo=${{ github.repository }} branch=${{ github.ref_name }} run_id=${{ github.run_id }} run_url=${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} sha=${{ github.sha }} agents=${{ steps.failed.outputs.agents_csv }}" - }, - { - "type": "mrkdwn", - "text": "Commit: <${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}> by ${{ github.actor }}" - } - ] - } - ] - } + env: + SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} + SLACK_CHANNEL: ${{ vars.E2E_SLACK_CHANNEL }} + FAILED_AGENTS: ${{ steps.failed.outputs.agents }} + RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} + COMMIT_URL: ${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }} + COMMIT_SHA: ${{ github.sha }} + ACTOR: ${{ github.actor }} + shell: bash + run: | + set -euo pipefail + + text=":red_circle: *E2E Tests Failed* on \`main\` + + Failed agents: *${FAILED_AGENTS}* + <${RUN_URL}|View run details> + Commit: <${COMMIT_URL}|${COMMIT_SHA}> by ${ACTOR}" + + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg text "$text" \ + '{channel: $channel, text: $text}')" + + response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")" + + if ! 
jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "::error::Slack API returned non-ok response: $(jq -r '.error // "unknown"' <<<"$response")" + exit 1 + fi + + channel="$(jq -r '.channel' <<<"$response")" + thread_ts="$(jq -r '.ts' <<<"$response")" + + triage_url="https://e2e-triage.entireio.workers.dev/triage?run_url=$(jq -rn --arg u "$RUN_URL" '$u | @uri')&slack_channel=${channel}&slack_thread_ts=${thread_ts}" + + followup="$(jq -n \ + --arg channel "$channel" \ + --arg thread_ts "$thread_ts" \ + --arg text ":mag: <${triage_url}|Run Triage>" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$followup" > /dev/null diff --git a/README.md b/README.md index 355aa5663..48387ae0c 100644 --- a/README.md +++ b/README.md @@ -106,9 +106,9 @@ Removes the git hooks. Your code and commit history remain untouched. ## E2E Triage -E2E failure alerts can be triaged from Slack by replying `triage e2e` in the failure thread. The workflow is documented in [docs/architecture/slack-e2e-triage.md](docs/architecture/slack-e2e-triage.md). +E2E failure alerts post a "Run Triage" link to Slack. Clicking it triggers the triage workflow via a Cloudflare Worker. See [docs/architecture/slack-e2e-triage.md](docs/architecture/slack-e2e-triage.md) for the full architecture. -The Slack bridge is handled by `cmd/e2e-triage-dispatch`, and the triage job itself runs in [`.github/workflows/e2e-triage.yml`](.github/workflows/e2e-triage.yml). If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL, commit SHA, and failed agents. +The triage job runs in [`.github/workflows/e2e-triage.yml`](.github/workflows/e2e-triage.yml). If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL. 
## Key Concepts diff --git a/cmd/e2e-triage-dispatch/main.go b/cmd/e2e-triage-dispatch/main.go deleted file mode 100644 index 820b5e05f..000000000 --- a/cmd/e2e-triage-dispatch/main.go +++ /dev/null @@ -1,438 +0,0 @@ -package main - -import ( - "bytes" - "context" - "crypto/hmac" - "crypto/sha256" - "encoding/hex" - "encoding/json" - "errors" - "fmt" - "io" - "log" - "net/http" - "net/url" - "os" - "strconv" - "strings" - "time" - - "github.com/entireio/cli/internal/slacktriage" -) - -const ( - defaultAddr = ":8080" - defaultGitHubAPIBaseURL = "https://api.github.com" - defaultSlackAPIBaseURL = "https://slack.com/api" - defaultSlackEventType = "slack_e2e_triage_requested" - defaultRequestTolerance = 5 * time.Minute - slackTimestampHeader = "X-Slack-Request-Timestamp" - slackSignatureHeader = "X-Slack-Signature" - slackEventTypeURLVerify = "url_verification" - slackEventTypeCallback = "event_callback" - slackInnerEventTypeMessage = "message" -) - -// Config holds runtime settings loaded from the environment. 
-type Config struct { - Addr string - SigningSecret string - SlackBotToken string - GitHubToken string - AllowedRepo string - GitHubEventType string - SlackAPIBaseURL string - GitHubAPIBaseURL string - RequestTolerance time.Duration -} - -func main() { - cfg, err := loadConfigFromEnv() - if err != nil { - log.Fatal(err) - } - - handler := newHandler( - cfg, - newSlackHTTPClient(cfg.SlackBotToken, cfg.SlackAPIBaseURL), - newGitHubHTTPDispatcher(cfg.GitHubToken, cfg.GitHubAPIBaseURL, cfg.GitHubEventType, cfg.AllowedRepo), - time.Now, - ) - - mux := http.NewServeMux() - mux.Handle("/slack/events", handler) - - srv := &http.Server{ - Addr: cfg.Addr, - Handler: mux, - ReadHeaderTimeout: 5 * time.Second, - } - log.Fatal(srv.ListenAndServe()) -} - -func loadConfigFromEnv() (Config, error) { - cfg := Config{ - Addr: getEnvDefault("ADDR", defaultAddr), - SigningSecret: os.Getenv("SLACK_SIGNING_SECRET"), - SlackBotToken: os.Getenv("SLACK_BOT_TOKEN"), - GitHubToken: os.Getenv("GITHUB_TOKEN"), - AllowedRepo: getEnvFirst("ALLOWED_REPOSITORY", "GITHUB_REPOSITORY"), - GitHubEventType: getEnvDefault("GITHUB_EVENT_TYPE", defaultSlackEventType), - SlackAPIBaseURL: getEnvDefault("SLACK_API_BASE_URL", defaultSlackAPIBaseURL), - GitHubAPIBaseURL: getEnvDefault("GITHUB_API_BASE_URL", defaultGitHubAPIBaseURL), - RequestTolerance: defaultRequestTolerance, - } - - if tolerance := os.Getenv("SLACK_REQUEST_TOLERANCE"); tolerance != "" { - d, err := time.ParseDuration(tolerance) - if err != nil { - return Config{}, fmt.Errorf("parse SLACK_REQUEST_TOLERANCE: %w", err) - } - cfg.RequestTolerance = d - } - - switch { - case cfg.SigningSecret == "": - return Config{}, errors.New("SLACK_SIGNING_SECRET is required") - case cfg.SlackBotToken == "": - return Config{}, errors.New("SLACK_BOT_TOKEN is required") - case cfg.GitHubToken == "": - return Config{}, errors.New("GITHUB_TOKEN is required") - case cfg.AllowedRepo == "": - return Config{}, errors.New("ALLOWED_REPOSITORY or GITHUB_REPOSITORY is 
required") - default: - return cfg, nil - } -} - -func getEnvDefault(key, fallback string) string { - if value := os.Getenv(key); value != "" { - return value - } - return fallback -} - -func getEnvFirst(keys ...string) string { - for _, key := range keys { - if value := os.Getenv(key); value != "" { - return value - } - } - return "" -} - -type SlackMessageFetcher interface { - FetchParentMessage(ctx context.Context, channel, threadTS string) (string, error) -} - -type GitHubDispatcher interface { - DispatchRepositoryEvent(ctx context.Context, payload slacktriage.DispatchPayload) error -} - -type triageHandler struct { - cfg Config - slack SlackMessageFetcher - github GitHubDispatcher - now func() time.Time - maxBodyLen int64 -} - -func newHandler(cfg Config, slack SlackMessageFetcher, github GitHubDispatcher, now func() time.Time) *triageHandler { - if now == nil { - now = time.Now - } - if cfg.RequestTolerance <= 0 { - cfg.RequestTolerance = defaultRequestTolerance - } - - return &triageHandler{ - cfg: cfg, - slack: slack, - github: github, - now: now, - maxBodyLen: 1 << 20, - } -} - -func (h *triageHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { - if r.Method != http.MethodPost { - http.Error(w, http.StatusText(http.StatusMethodNotAllowed), http.StatusMethodNotAllowed) - return - } - defer r.Body.Close() - - body, err := io.ReadAll(io.LimitReader(r.Body, h.maxBodyLen)) - if err != nil { - log.Printf("read request body: %v", err) - http.Error(w, "read request body", http.StatusBadRequest) - return - } - - if err := h.verifyRequest(r, body); err != nil { - log.Printf("verify slack request: %v", err) - http.Error(w, http.StatusText(http.StatusUnauthorized), http.StatusUnauthorized) - return - } - - var envelope slackEnvelope - if err := json.Unmarshal(body, &envelope); err != nil { - log.Printf("decode slack payload: %v", err) - http.Error(w, "decode slack payload", http.StatusBadRequest) - return - } - - switch envelope.Type { - case 
slackEventTypeURLVerify: - writeJSON(w, http.StatusOK, map[string]string{"challenge": envelope.Challenge}) - case slackEventTypeCallback: - if err := h.handleEvent(r.Context(), envelope.Event); err != nil { - log.Printf("process slack event: %v", err) - http.Error(w, "process slack event", http.StatusInternalServerError) - return - } - w.WriteHeader(http.StatusOK) - default: - w.WriteHeader(http.StatusOK) - } -} - -func (h *triageHandler) verifyRequest(r *http.Request, body []byte) error { - timestamp := r.Header.Get(slackTimestampHeader) - signature := r.Header.Get(slackSignatureHeader) - if timestamp == "" || signature == "" { - return errors.New("missing slack signature headers") - } - - parsed, err := strconv.ParseInt(timestamp, 10, 64) - if err != nil { - return fmt.Errorf("invalid slack timestamp: %w", err) - } - - requestTime := time.Unix(parsed, 0) - now := h.now() - if absDuration(now.Sub(requestTime)) > h.cfg.RequestTolerance { - return errors.New("stale slack request") - } - - mac := hmac.New(sha256.New, []byte(h.cfg.SigningSecret)) - if _, err := mac.Write([]byte("v0:" + timestamp + ":" + string(body))); err != nil { - return fmt.Errorf("sign slack request: %w", err) - } - expected := "v0=" + hex.EncodeToString(mac.Sum(nil)) - if !hmac.Equal([]byte(expected), []byte(signature)) { - return errors.New("invalid slack signature") - } - - return nil -} - -func (h *triageHandler) handleEvent(ctx context.Context, event slackEvent) error { - if event.Type != slackInnerEventTypeMessage { - return nil - } - if event.Subtype != "" || event.BotID != "" { - return nil - } - if event.ThreadTS == "" || event.ThreadTS == event.Ts { - return nil - } - if !slacktriage.IsTriageTrigger(event.Text) { - return nil - } - if h.slack == nil { - return errors.New("slack fetcher is not configured") - } - if h.github == nil { - return errors.New("github dispatcher is not configured") - } - if event.Channel == "" { - return errors.New("channel is required") - } - - parentBody, err 
:= h.slack.FetchParentMessage(ctx, event.Channel, event.ThreadTS) - if err != nil { - return fmt.Errorf("fetch parent message: %w", err) - } - - metadata, err := slacktriage.ParseParentMessageMetadata(parentBody) - if err != nil { - return fmt.Errorf("parse parent message metadata: %w", err) - } - if h.cfg.AllowedRepo != "" && metadata.Repo != h.cfg.AllowedRepo { - return nil - } - - payload := slacktriage.NewDispatchPayload(metadata, event.Channel, event.ThreadTS, event.User) - if err := h.github.DispatchRepositoryEvent(ctx, payload); err != nil { - return fmt.Errorf("dispatch repository event: %w", err) - } - return nil -} - -type slackEnvelope struct { - Type string `json:"type"` - Challenge string `json:"challenge,omitempty"` - Event slackEvent `json:"event,omitempty"` -} - -type slackEvent struct { - Type string `json:"type,omitempty"` - Subtype string `json:"subtype,omitempty"` - BotID string `json:"bot_id,omitempty"` - Channel string `json:"channel,omitempty"` - User string `json:"user,omitempty"` - Text string `json:"text,omitempty"` - Ts string `json:"ts,omitempty"` - ThreadTS string `json:"thread_ts,omitempty"` -} - -func writeJSON(w http.ResponseWriter, status int, value any) { - w.Header().Set("Content-Type", "application/json; charset=utf-8") - w.WriteHeader(status) - if err := json.NewEncoder(w).Encode(value); err != nil { - log.Printf("write json response: %v", err) - } -} - -func absDuration(d time.Duration) time.Duration { - if d < 0 { - return -d - } - return d -} - -type slackHTTPClient struct { - token string - baseURL string - client *http.Client -} - -func newSlackHTTPClient(token, baseURL string) *slackHTTPClient { - if baseURL == "" { - baseURL = defaultSlackAPIBaseURL - } - return &slackHTTPClient{ - token: token, - baseURL: strings.TrimRight(baseURL, "/"), - client: &http.Client{Timeout: 10 * time.Second}, - } -} - -func (c *slackHTTPClient) FetchParentMessage(ctx context.Context, channel, threadTS string) (string, error) { - endpoint, err 
:= url.Parse(c.baseURL + "/conversations.replies") - if err != nil { - return "", fmt.Errorf("parse slack conversations.replies url: %w", err) - } - - query := endpoint.Query() - query.Set("channel", channel) - query.Set("ts", threadTS) - query.Set("inclusive", "true") - query.Set("limit", "1") - endpoint.RawQuery = query.Encode() - - //nolint:gosec // Slack API base URL is operator-configured. - req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint.String(), nil) - if err != nil { - return "", fmt.Errorf("build slack conversations.replies request: %w", err) - } - req.Header.Set("Authorization", "Bearer "+c.token) - - resp, err := c.client.Do(req) //nolint:gosec // taint tracked from operator-configured base URL above - if err != nil { - return "", fmt.Errorf("call slack conversations.replies: %w", err) - } - defer resp.Body.Close() - - if resp.StatusCode != http.StatusOK { - return "", fmt.Errorf("slack conversations.replies returned %s", resp.Status) - } - - var payload struct { - OK bool `json:"ok"` - Error string `json:"error,omitempty"` - Messages []struct { - Text string `json:"text"` - } `json:"messages"` - } - if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil { - return "", fmt.Errorf("decode slack conversations.replies response: %w", err) - } - if !payload.OK { - if payload.Error == "" { - payload.Error = "unknown error" - } - return "", fmt.Errorf("slack conversations.replies error: %s", payload.Error) - } - if len(payload.Messages) == 0 { - return "", errors.New("slack conversations.replies returned no messages") - } - return payload.Messages[0].Text, nil -} - -type githubHTTPDispatcher struct { - token string - baseURL string - eventType string - repository string - client *http.Client -} - -func newGitHubHTTPDispatcher(token, baseURL, eventType, repository string) *githubHTTPDispatcher { - if baseURL == "" { - baseURL = defaultGitHubAPIBaseURL - } - if eventType == "" { - eventType = defaultSlackEventType - } - return 
&githubHTTPDispatcher{ - token: token, - baseURL: strings.TrimRight(baseURL, "/"), - eventType: eventType, - repository: repository, - client: &http.Client{Timeout: 10 * time.Second}, - } -} - -func (d *githubHTTPDispatcher) DispatchRepositoryEvent(ctx context.Context, payload slacktriage.DispatchPayload) error { - body, err := json.Marshal(struct { - EventType string `json:"event_type"` - ClientPayload slacktriage.DispatchPayload `json:"client_payload"` - }{ - EventType: d.eventType, - ClientPayload: payload, - }) - if err != nil { - return fmt.Errorf("marshal github dispatch payload: %w", err) - } - - endpoint := fmt.Sprintf("%s/repos/%s/dispatches", d.baseURL, d.repository) - //nolint:gosec // GitHub API base URL and repository are operator-configured. - req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body)) - if err != nil { - return fmt.Errorf("build github dispatch request: %w", err) - } - req.Header.Set("Authorization", "Bearer "+d.token) - req.Header.Set("Accept", "application/vnd.github+json") - req.Header.Set("Content-Type", "application/json") - - resp, err := d.client.Do(req) //nolint:gosec // taint tracked from operator-configured base URL above - if err != nil { - return fmt.Errorf("call github dispatch endpoint: %w", err) - } - defer resp.Body.Close() - - if resp.StatusCode != http.StatusNoContent { - responseBody, readErr := io.ReadAll(io.LimitReader(resp.Body, 4096)) - if readErr != nil { - return fmt.Errorf("github dispatch failed: %s (read response): %w", resp.Status, readErr) - } - if len(responseBody) > 0 { - return fmt.Errorf("github dispatch failed: %s: %s", resp.Status, strings.TrimSpace(string(responseBody))) - } - return fmt.Errorf("github dispatch failed: %s", resp.Status) - } - - return nil -} diff --git a/cmd/e2e-triage-dispatch/main_test.go b/cmd/e2e-triage-dispatch/main_test.go deleted file mode 100644 index 9f2eb4e5e..000000000 --- a/cmd/e2e-triage-dispatch/main_test.go +++ /dev/null @@ -1,346 
+0,0 @@ -package main - -import ( - "context" - "crypto/hmac" - "crypto/sha256" - "encoding/hex" - "encoding/json" - "io" - "net/http" - "net/http/httptest" - "os" - "strconv" - "strings" - "testing" - "time" - - "github.com/entireio/cli/internal/slacktriage" -) - -const testSigningSecret = "test-signing-secret" - -func TestLoadConfigFromEnv_LoadsAllowedRepo(t *testing.T) { - os.Clearenv() - t.Setenv("SLACK_SIGNING_SECRET", "secret") - t.Setenv("SLACK_BOT_TOKEN", "bot") - t.Setenv("GITHUB_TOKEN", "gh") - t.Setenv("ALLOWED_REPOSITORY", "entireio/cli") - - cfg, err := loadConfigFromEnv() - if err != nil { - t.Fatalf("loadConfigFromEnv() error = %v", err) - } - if cfg.AllowedRepo != "entireio/cli" { - t.Fatalf("AllowedRepo = %q, want %q", cfg.AllowedRepo, "entireio/cli") - } -} - -func TestLoadConfigFromEnv_RequiresAllowedRepo(t *testing.T) { - os.Clearenv() - t.Setenv("SLACK_SIGNING_SECRET", "secret") - t.Setenv("SLACK_BOT_TOKEN", "bot") - t.Setenv("GITHUB_TOKEN", "gh") - - if _, err := loadConfigFromEnv(); err == nil { - t.Fatal("loadConfigFromEnv() error = nil, want error") - } -} - -func TestHandler_URLVerification(t *testing.T) { - t.Parallel() - - handler := newTestHandler(t) - body := `{"type":"url_verification","challenge":"abc123"}` - req := signedRequest(t, body, fixedNow()) - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - - var got struct { - Challenge string `json:"challenge"` - } - if err := json.Unmarshal(rr.Body.Bytes(), &got); err != nil { - t.Fatalf("unmarshal response: %v", err) - } - if got.Challenge != "abc123" { - t.Fatalf("challenge = %q, want %q", got.Challenge, "abc123") - } -} - -func TestHandler_RejectsBadSignature(t *testing.T) { - t.Parallel() - - handler := newTestHandler(t) - req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/slack/events", strings.NewReader(`{"type":"event_callback"}`)) - 
req.Header.Set("X-Slack-Request-Timestamp", strconv.FormatInt(fixedNow().Unix(), 10)) - req.Header.Set("X-Slack-Signature", "v0=deadbeef") - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusUnauthorized { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusUnauthorized) - } -} - -func TestHandler_IgnoresNonThreadReplies(t *testing.T) { - t.Parallel() - - fetcher := &fakeSlackFetcher{} - dispatcher := &fakeGitHubDispatcher{} - handler := newHandlerForTest(t, fetcher, dispatcher) - - body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.222"}}` - req := signedRequest(t, body, fixedNow()) - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - if fetcher.calls != 0 { - t.Fatalf("fetch calls = %d, want 0", fetcher.calls) - } - if dispatcher.calls != 0 { - t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) - } -} - -func TestHandler_IgnoresNonTriggerReplies(t *testing.T) { - t.Parallel() - - fetcher := &fakeSlackFetcher{} - dispatcher := &fakeGitHubDispatcher{} - handler := newHandlerForTest(t, fetcher, dispatcher) - - body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"hello world","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, body, fixedNow()) - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - if fetcher.calls != 0 { - t.Fatalf("fetch calls = %d, want 0", fetcher.calls) - } - if dispatcher.calls != 0 { - t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) - } -} - -func TestHandler_DispatchesValidTriggerReply(t *testing.T) { - t.Parallel() - - fetcher := &fakeSlackFetcher{ - body: "E2E Tests Failed\nmeta: repo=entireio/cli branch=main run_id=123 
run_url=https://github.com/entireio/cli/actions/runs/123 sha=abc123 agents=cursor-cli,copilot-cli", - } - dispatcher := &fakeGitHubDispatcher{} - handler := newHandlerForTest(t, fetcher, dispatcher) - - body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, body, fixedNow()) - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - if fetcher.calls != 1 { - t.Fatalf("fetch calls = %d, want 1", fetcher.calls) - } - if dispatcher.calls != 1 { - t.Fatalf("dispatch calls = %d, want 1", dispatcher.calls) - } - - if fetcher.channel != "C123" || fetcher.threadTS != "111.111" { - t.Fatalf("fetch args = (%q, %q), want (%q, %q)", fetcher.channel, fetcher.threadTS, "C123", "111.111") - } - - got := dispatcher.payloads[0] - if got.TriggerText != slacktriage.TriageTriggerText { - t.Fatalf("trigger = %q, want %q", got.TriggerText, slacktriage.TriageTriggerText) - } - if got.Repo != "entireio/cli" || got.Branch != "main" || got.RunID != "123" || got.RunURL != "https://github.com/entireio/cli/actions/runs/123" || got.SHA != "abc123" { - t.Fatalf("unexpected payload metadata: %+v", got) - } - if got.SlackChannel != "C123" || got.SlackThreadTS != "111.111" || got.SlackUser != "U123" { - t.Fatalf("unexpected slack metadata: %+v", got) - } -} - -func TestHandler_IgnoresBotAndSystemMessages(t *testing.T) { - t.Parallel() - - tests := []struct { - name string - body string - }{ - { - name: "subtype", - body: `{"type":"event_callback","event":{"type":"message","subtype":"bot_message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}`, - }, - { - name: "bot_id", - body: `{"type":"event_callback","event":{"type":"message","bot_id":"B123","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}`, - 
}, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - t.Parallel() - - fetcher := &fakeSlackFetcher{} - dispatcher := &fakeGitHubDispatcher{} - handler := newHandlerForTest(t, fetcher, dispatcher) - - req := signedRequest(t, tt.body, fixedNow()) - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - if fetcher.calls != 0 { - t.Fatalf("fetch calls = %d, want 0", fetcher.calls) - } - if dispatcher.calls != 0 { - t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) - } - }) - } -} - -func TestGitHubDispatcher_UsesConfiguredRepository(t *testing.T) { - t.Parallel() - - var gotPath string - dispatcher := newGitHubHTTPDispatcher("token", "https://api.github.com", defaultSlackEventType, "entireio/cli") - dispatcher.client = &http.Client{ - Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) { - gotPath = r.URL.Path - if r.Method != http.MethodPost { - t.Fatalf("method = %s, want POST", r.Method) - } - return &http.Response{ - StatusCode: http.StatusNoContent, - Body: io.NopCloser(strings.NewReader("")), - Header: make(http.Header), - }, nil - }), - } - payload := slacktriage.DispatchPayload{ - Repo: "other/repo", - TriggerText: slacktriage.TriageTriggerText, - Branch: "main", - SHA: "abc123", - RunURL: "https://github.com/entireio/cli/actions/runs/123", - RunID: "123", - FailedAgents: []string{"cursor-cli"}, - } - - if err := dispatcher.DispatchRepositoryEvent(context.Background(), payload); err != nil { - t.Fatalf("DispatchRepositoryEvent() error = %v", err) - } - if gotPath != "/repos/entireio/cli/dispatches" { - t.Fatalf("request path = %q, want %q", gotPath, "/repos/entireio/cli/dispatches") - } -} - -func TestHandler_RejectsMismatchedParentRepo(t *testing.T) { - t.Parallel() - - fetcher := &fakeSlackFetcher{ - body: "E2E Tests Failed\nmeta: repo=other/repo branch=main run_id=123 
run_url=https://github.com/other/repo/actions/runs/123 sha=abc123 agents=cursor-cli", - } - dispatcher := &fakeGitHubDispatcher{} - handler := newHandlerForTest(t, fetcher, dispatcher) - - body := `{"type":"event_callback","event":{"type":"message","channel":"C123","user":"U123","text":"triage e2e","ts":"111.222","thread_ts":"111.111"}}` - req := signedRequest(t, body, fixedNow()) - - rr := httptest.NewRecorder() - handler.ServeHTTP(rr, req) - - if rr.Code != http.StatusOK { - t.Fatalf("status = %d, want %d", rr.Code, http.StatusOK) - } - if dispatcher.calls != 0 { - t.Fatalf("dispatch calls = %d, want 0", dispatcher.calls) - } -} - -func fixedNow() time.Time { - return time.Unix(1_700_000_000, 0).UTC() -} - -func signedRequest(t *testing.T, body string, now time.Time) *http.Request { - t.Helper() - - req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/slack/events", strings.NewReader(body)) - timestamp := strconv.FormatInt(now.Unix(), 10) - req.Header.Set("X-Slack-Request-Timestamp", timestamp) - req.Header.Set("X-Slack-Signature", slackSignature(testSigningSecret, timestamp, body)) - return req -} - -func slackSignature(secret, timestamp, body string) string { - mac := hmac.New(sha256.New, []byte(secret)) - _, _ = mac.Write([]byte("v0:" + timestamp + ":" + body)) - return "v0=" + hex.EncodeToString(mac.Sum(nil)) -} - -func newTestHandler(t *testing.T) http.Handler { - t.Helper() - return newHandler(Config{ - SigningSecret: testSigningSecret, - AllowedRepo: "entireio/cli", - }, &fakeSlackFetcher{}, &fakeGitHubDispatcher{}, fixedNow) -} - -func newHandlerForTest(t *testing.T, slack *fakeSlackFetcher, github *fakeGitHubDispatcher) http.Handler { - t.Helper() - return newHandler(Config{ - SigningSecret: testSigningSecret, - AllowedRepo: "entireio/cli", - }, slack, github, fixedNow) -} - -type fakeSlackFetcher struct { - calls int - channel string - threadTS string - body string -} - -func (f *fakeSlackFetcher) FetchParentMessage(_ 
context.Context, channel, threadTS string) (string, error) { - f.calls++ - f.channel = channel - f.threadTS = threadTS - return f.body, nil -} - -type fakeGitHubDispatcher struct { - calls int - payloads []slacktriage.DispatchPayload -} - -func (f *fakeGitHubDispatcher) DispatchRepositoryEvent(_ context.Context, payload slacktriage.DispatchPayload) error { - f.calls++ - f.payloads = append(f.payloads, payload) - return nil -} - -type roundTripFunc func(*http.Request) (*http.Response, error) - -func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { - return f(r) -} diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md index 54842dbd9..01549acc2 100644 --- a/docs/architecture/slack-e2e-triage.md +++ b/docs/architecture/slack-e2e-triage.md @@ -1,60 +1,62 @@ # Slack-Triggered E2E Triage -This flow lets a human reply `triage e2e` in the thread of an E2E failure alert and have GitHub Actions run the existing triage workflow. +When E2E tests fail on `main`, a Slack alert is posted with a clickable "Run Triage" link. Clicking it triggers the triage workflow via a Cloudflare Worker. ## Flow -1. `.github/workflows/e2e.yml` posts the failure alert and includes machine-readable metadata. -2. `cmd/e2e-triage-dispatch` listens for Slack thread replies, validates the reply text, fetches the parent alert, and dispatches GitHub. -3. `.github/workflows/e2e-triage.yml` checks out the failed SHA, runs the Claude triage skill, and posts results back to the Slack thread. +1. `.github/workflows/e2e.yml` posts a failure alert to Slack using the bot token (via `chat.postMessage`), then posts a threaded "Run Triage" link that encodes the run URL and Slack thread context. +2. A user clicks the link, which hits the Cloudflare Worker at `e2e-triage.entireio.workers.dev`. +3. The Worker validates the `run_url` and calls `workflow_dispatch` on `.github/workflows/e2e-triage.yml` via the GitHub API. +4. 
The triage workflow checks out the failed SHA, runs the Claude triage skill per failed agent, and posts results back to the Slack thread. -The trigger is the exact normalized text `triage e2e`. +``` +E2E fails -> bot posts alert to Slack (with "Run Triage" link) + -> user clicks link -> Cloudflare Worker -> GitHub API (workflow_dispatch) + -> e2e-triage.yml runs -> posts results back to Slack thread +``` -## Slack Setup +## Cloudflare Worker -Slack app requirements: +Located in `workers/e2e-triage-trigger/`. -- Event subscription for `message.channels` so the app receives public channel thread replies -- `channels:history` so the app can read the parent E2E alert message -- `chat:write` so the app can post status updates back into the thread +Accepts a GET request at `/triage` with query parameters: -If you want private-channel support, add the equivalent `groups:history` event and scope as well. +- `run_url` (required) — must match `https://github.com/entireio/cli/actions/runs/\d+` +- `slack_channel` — Slack channel ID for thread replies +- `slack_thread_ts` — Slack thread timestamp for thread replies -## GitHub And Runtime Config +The Worker dispatches `e2e-triage.yml` with these values as `workflow_dispatch` inputs. -`cmd/e2e-triage-dispatch` uses these environment variables: +**Secret:** `GITHUB_TOKEN` — a PAT with `actions:write` scope, stored in Cloudflare secrets (`wrangler secret put GITHUB_TOKEN`). 
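The `/triage` contract described above can be sketched as a small link builder. This is a minimal illustration, not the code that `e2e.yml` actually uses to post the link: the host, path, parameter names, and the anchored run-URL pattern come from the doc and the Worker source in this patch, while `WORKER_ORIGIN`, `isValidRunUrl`, and `buildTriageLink` are hypothetical names introduced here.

```typescript
// Sketch only. Assumes the Worker is reachable at the host named in the doc
// and that it accepts the three query parameters listed there.
// WORKER_ORIGIN, isValidRunUrl, and buildTriageLink are hypothetical names.
const RUN_URL_PATTERN = /^https:\/\/github\.com\/entireio\/cli\/actions\/runs\/\d+$/;
const WORKER_ORIGIN = "https://e2e-triage.entireio.workers.dev";

// Mirrors the Worker's check: only run URLs for entireio/cli are accepted.
function isValidRunUrl(runUrl: string): boolean {
  return RUN_URL_PATTERN.test(runUrl);
}

// Builds the "Run Triage" link. The Slack parameters are optional; when they
// are omitted the Worker falls back to empty strings.
function buildTriageLink(runUrl: string, slackChannel?: string, slackThreadTs?: string): string {
  if (!isValidRunUrl(runUrl)) {
    throw new Error(`run_url does not match the expected pattern: ${runUrl}`);
  }
  const params: string[] = [`run_url=${encodeURIComponent(runUrl)}`];
  if (slackChannel) {
    params.push(`slack_channel=${encodeURIComponent(slackChannel)}`);
  }
  if (slackThreadTs) {
    params.push(`slack_thread_ts=${encodeURIComponent(slackThreadTs)}`);
  }
  return `${WORKER_ORIGIN}/triage?${params.join("&")}`;
}
```

Validating before building the link means a malformed run URL fails in the caller rather than surfacing later as a 400 from the Worker. The percent-encoding matters because `run_url` is itself a URL; the Worker reads it back through `url.searchParams.get`, which decodes it.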
-- `SLACK_SIGNING_SECRET` -- `SLACK_BOT_TOKEN` -- `GITHUB_TOKEN` -- `ALLOWED_REPOSITORY` or `GITHUB_REPOSITORY` -- `ADDR` optional, defaults to `:8080` -- `GITHUB_EVENT_TYPE` optional, defaults to `slack_e2e_triage_requested` -- `SLACK_API_BASE_URL` optional, defaults to `https://slack.com/api` -- `GITHUB_API_BASE_URL` optional, defaults to `https://api.github.com` -- `SLACK_REQUEST_TOLERANCE` optional, defaults to `5m` +## Slack Setup -The GitHub Actions workflow uses these secrets: +The Slack app needs: -- `ANTHROPIC_API_KEY` for Claude triage -- `SLACK_BOT_TOKEN` for start and completion replies -- The built-in `${{ github.token }}` for repository dispatch and repository checkout +- `chat:write` scope — to post alerts and triage results +- Bot must be invited to the alert channel -The externally deployed `cmd/e2e-triage-dispatch` service uses `GITHUB_TOKEN` to call the GitHub dispatch API. +No event subscriptions or incoming webhooks are needed. -## Manual Fallback +## GitHub Config -If Slack dispatch is unavailable, you can run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`. +**Repository variables:** -Required inputs: +- `E2E_SLACK_CHANNEL` — Slack channel ID where failure alerts are posted -- `run_url` -- `sha` -- `failed_agents` +**Repository secrets:** -Optional inputs: +- `SLACK_BOT_TOKEN` — Slack bot token with `chat:write` scope +- `ANTHROPIC_API_KEY` — for Claude triage + +The built-in `${{ github.token }}` is used for GitHub API calls within workflows. + +## Manual Fallback -- `slack_channel` -- `slack_thread_ts` +Run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`: -This is the fallback path for ad hoc triage when you already have the failed run URL and commit SHA. 
+- `run_url` (required) — the failed run URL +- `sha` — commit SHA (auto-detected from run if omitted) +- `failed_agents` — comma-separated list (auto-detected from run if omitted) +- `slack_channel` — for Slack thread replies +- `slack_thread_ts` — for Slack thread replies diff --git a/internal/slacktriage/dispatch.go b/internal/slacktriage/dispatch.go deleted file mode 100644 index a70ee0e30..000000000 --- a/internal/slacktriage/dispatch.go +++ /dev/null @@ -1,34 +0,0 @@ -package slacktriage - -// DispatchPayload is the structured payload sent to GitHub repository_dispatch. -type DispatchPayload struct { - TriggerText string `json:"trigger_text"` - Repo string `json:"repo"` - Branch string `json:"branch"` - SHA string `json:"sha"` - RunURL string `json:"run_url"` - RunID string `json:"run_id"` - FailedAgents []string `json:"failed_agents"` - SlackChannel string `json:"slack_channel"` - SlackThreadTS string `json:"slack_thread_ts"` - SlackUser string `json:"slack_user"` -} - -// NewDispatchPayload creates a pure data payload for the repository_dispatch bridge. 
-func NewDispatchPayload(meta ParentMessageMetadata, slackChannel, slackThreadTS, slackUser string) DispatchPayload { - failedAgents := make([]string, len(meta.FailedAgents)) - copy(failedAgents, meta.FailedAgents) - - return DispatchPayload{ - TriggerText: TriageTriggerText, - Repo: meta.Repo, - Branch: meta.Branch, - SHA: meta.SHA, - RunURL: meta.RunURL, - RunID: meta.RunID, - FailedAgents: failedAgents, - SlackChannel: slackChannel, - SlackThreadTS: slackThreadTS, - SlackUser: slackUser, - } -} diff --git a/internal/slacktriage/dispatch_test.go b/internal/slacktriage/dispatch_test.go deleted file mode 100644 index 06cced40b..000000000 --- a/internal/slacktriage/dispatch_test.go +++ /dev/null @@ -1,42 +0,0 @@ -package slacktriage - -import "testing" - -func TestNewDispatchPayload(t *testing.T) { - t.Parallel() - - meta := ParentMessageMetadata{ - Repo: "entireio/cli", - Branch: "main", - RunID: "123", - RunURL: "https://github.com/entireio/cli/actions/runs/123", - SHA: "abc123", - FailedAgents: []string{"cursor-cli", "copilot-cli"}, - } - - got := NewDispatchPayload(meta, "C123", "1742230000.123456", "U456") - - if got.TriggerText != TriageTriggerText { - t.Fatalf("TriggerText = %q, want %q", got.TriggerText, TriageTriggerText) - } - if got.Repo != meta.Repo || got.Branch != meta.Branch || got.RunID != meta.RunID || got.RunURL != meta.RunURL || got.SHA != meta.SHA { - t.Fatalf("payload metadata mismatch: got %+v want %+v", got, meta) - } - if got.SlackChannel != "C123" { - t.Fatalf("SlackChannel = %q, want %q", got.SlackChannel, "C123") - } - if got.SlackThreadTS != "1742230000.123456" { - t.Fatalf("SlackThreadTS = %q, want %q", got.SlackThreadTS, "1742230000.123456") - } - if got.SlackUser != "U456" { - t.Fatalf("SlackUser = %q, want %q", got.SlackUser, "U456") - } - if len(got.FailedAgents) != len(meta.FailedAgents) { - t.Fatalf("FailedAgents len = %d, want %d", len(got.FailedAgents), len(meta.FailedAgents)) - } - for i := range meta.FailedAgents { - if 
got.FailedAgents[i] != meta.FailedAgents[i] { - t.Fatalf("FailedAgents[%d] = %q, want %q", i, got.FailedAgents[i], meta.FailedAgents[i]) - } - } -} diff --git a/internal/slacktriage/normalize.go b/internal/slacktriage/normalize.go deleted file mode 100644 index 2469fcda5..000000000 --- a/internal/slacktriage/normalize.go +++ /dev/null @@ -1,15 +0,0 @@ -package slacktriage - -import "strings" - -const TriageTriggerText = "triage e2e" - -// NormalizeTrigger lowercases, trims, and collapses internal whitespace. -func NormalizeTrigger(text string) string { - return strings.Join(strings.Fields(strings.ToLower(text)), " ") -} - -// IsTriageTrigger reports whether text normalizes to the triage trigger phrase. -func IsTriageTrigger(text string) bool { - return NormalizeTrigger(text) == TriageTriggerText -} diff --git a/internal/slacktriage/normalize_test.go b/internal/slacktriage/normalize_test.go deleted file mode 100644 index 3767533e0..000000000 --- a/internal/slacktriage/normalize_test.go +++ /dev/null @@ -1,55 +0,0 @@ -package slacktriage - -import "testing" - -func TestNormalizeTrigger(t *testing.T) { - t.Parallel() - - tests := []struct { - name string - in string - want string - }{ - { - name: "preserves_exact_trigger", - in: "triage e2e", - want: "triage e2e", - }, - { - name: "lowercases_and_trims", - in: " Triage E2E ", - want: "triage e2e", - }, - { - name: "collapses_internal_whitespace", - in: "triage e2e", - want: "triage e2e", - }, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - t.Parallel() - if got := NormalizeTrigger(tt.in); got != tt.want { - t.Fatalf("NormalizeTrigger(%q) = %q, want %q", tt.in, got, tt.want) - } - }) - } -} - -func TestIsTriageTrigger(t *testing.T) { - t.Parallel() - - if !IsTriageTrigger(" Triage E2E ") { - t.Fatal("expected normalized trigger to match") - } - - for _, in := range []string{"triage", "triage e2e now", "triage-e2e"} { - t.Run(in, func(t *testing.T) { - t.Parallel() - if IsTriageTrigger(in) { - 
t.Fatalf("IsTriageTrigger(%q) = true, want false", in) - } - }) - } -} diff --git a/internal/slacktriage/parent_message.go b/internal/slacktriage/parent_message.go deleted file mode 100644 index 23d3b2483..000000000 --- a/internal/slacktriage/parent_message.go +++ /dev/null @@ -1,91 +0,0 @@ -package slacktriage - -import ( - "errors" - "fmt" - "strings" -) - -// ParentMessageMetadata captures the parsed machine-readable Slack alert metadata. -type ParentMessageMetadata struct { - Repo string - Branch string - RunID string - RunURL string - SHA string - FailedAgents []string -} - -// ParseParentMessageMetadata extracts the stable meta line from a Slack failure alert. -func ParseParentMessageMetadata(body string) (ParentMessageMetadata, error) { - const metaPrefix = "meta:" - - for _, line := range strings.Split(body, "\n") { - trimmed := strings.TrimSpace(line) - if !strings.HasPrefix(trimmed, metaPrefix) { - continue - } - - fields := strings.Fields(strings.TrimSpace(strings.TrimPrefix(trimmed, metaPrefix))) - values := make(map[string]string, len(fields)) - for _, field := range fields { - key, value, ok := strings.Cut(field, "=") - if !ok || key == "" || value == "" { - return ParentMessageMetadata{}, fmt.Errorf("invalid meta field %q", field) - } - if _, exists := values[key]; exists { - return ParentMessageMetadata{}, fmt.Errorf("duplicate meta field %q", key) - } - values[key] = value - } - - metadata := ParentMessageMetadata{ - Repo: values["repo"], - Branch: values["branch"], - RunID: values["run_id"], - RunURL: values["run_url"], - SHA: values["sha"], - } - if agents, ok := values["agents"]; ok && agents != "" { - metadata.FailedAgents = splitAndTrimCSV(agents) - } - - if err := metadata.validate(); err != nil { - return ParentMessageMetadata{}, err - } - return metadata, nil - } - - return ParentMessageMetadata{}, errors.New("meta line not found") -} - -func (m ParentMessageMetadata) validate() error { - switch { - case m.Repo == "": - return 
errors.New("repo is required") - case m.Branch == "": - return errors.New("branch is required") - case m.RunID == "": - return errors.New("run_id is required") - case m.RunURL == "": - return errors.New("run_url is required") - case m.SHA == "": - return errors.New("sha is required") - case len(m.FailedAgents) == 0: - return errors.New("failed_agents is required") - default: - return nil - } -} - -func splitAndTrimCSV(value string) []string { - parts := strings.Split(value, ",") - out := make([]string, 0, len(parts)) - for _, part := range parts { - trimmed := strings.TrimSpace(part) - if trimmed != "" { - out = append(out, trimmed) - } - } - return out -} diff --git a/internal/slacktriage/parent_message_test.go b/internal/slacktriage/parent_message_test.go deleted file mode 100644 index 1c42766fb..000000000 --- a/internal/slacktriage/parent_message_test.go +++ /dev/null @@ -1,51 +0,0 @@ -package slacktriage - -import ( - "testing" -) - -func TestParseParentMessageMetadata(t *testing.T) { - t.Parallel() - - body := "E2E Tests Failed on `main`\n\nFailed agents: *cursor-cli*\n\nmeta: repo=entireio/cli branch=main run_id=123 run_url=https://github.com/entireio/cli/actions/runs/123 sha=abc123 agents=cursor-cli,copilot-cli\nCommit: by alisha" - - got, err := ParseParentMessageMetadata(body) - if err != nil { - t.Fatalf("ParseParentMessageMetadata() error = %v", err) - } - - wantAgents := []string{"cursor-cli", "copilot-cli"} - if got.Repo != "entireio/cli" { - t.Fatalf("Repo = %q, want %q", got.Repo, "entireio/cli") - } - if got.Branch != "main" { - t.Fatalf("Branch = %q, want %q", got.Branch, "main") - } - if got.RunID != "123" { - t.Fatalf("RunID = %q, want %q", got.RunID, "123") - } - if got.RunURL != "https://github.com/entireio/cli/actions/runs/123" { - t.Fatalf("RunURL = %q, want %q", got.RunURL, "https://github.com/entireio/cli/actions/runs/123") - } - if got.SHA != "abc123" { - t.Fatalf("SHA = %q, want %q", got.SHA, "abc123") - } - if len(got.FailedAgents) != 
len(wantAgents) { - t.Fatalf("FailedAgents len = %d, want %d", len(got.FailedAgents), len(wantAgents)) - } - for i := range wantAgents { - if got.FailedAgents[i] != wantAgents[i] { - t.Fatalf("FailedAgents[%d] = %q, want %q", i, got.FailedAgents[i], wantAgents[i]) - } - } -} - -func TestParseParentMessageMetadata_IgnoresHumanReadableBody(t *testing.T) { - t.Parallel() - - body := "E2E Tests Failed on `main`\n\nFailed agents: *cursor-cli*\n\nCommit: by alisha" - - if _, err := ParseParentMessageMetadata(body); err == nil { - t.Fatal("ParseParentMessageMetadata() error = nil, want error") - } -} diff --git a/workers/e2e-triage-trigger/package.json b/workers/e2e-triage-trigger/package.json new file mode 100644 index 000000000..cef026d61 --- /dev/null +++ b/workers/e2e-triage-trigger/package.json @@ -0,0 +1,12 @@ +{ + "name": "e2e-triage-trigger", + "private": true, + "scripts": { + "dev": "wrangler dev", + "deploy": "wrangler deploy" + }, + "devDependencies": { + "@cloudflare/workers-types": "^4.20250312.0", + "wrangler": "^4.0.0" + } +} diff --git a/workers/e2e-triage-trigger/src/index.ts b/workers/e2e-triage-trigger/src/index.ts new file mode 100644 index 000000000..03608938e --- /dev/null +++ b/workers/e2e-triage-trigger/src/index.ts @@ -0,0 +1,59 @@ +export interface Env { + GITHUB_TOKEN: string; +} + +const RUN_URL_PATTERN = /^https:\/\/github\.com\/entireio\/cli\/actions\/runs\/\d+$/; +const WORKFLOW_ID = "e2e-triage.yml"; +const REPO = "entireio/cli"; + +export default { + async fetch(request: Request, env: Env): Promise<Response> { + const url = new URL(request.url); + if (url.pathname !== "/triage") { + return new Response("Not found", { status: 404 }); + } + + const runURL = url.searchParams.get("run_url"); + const slackChannel = url.searchParams.get("slack_channel") ?? ""; + const slackThreadTS = url.searchParams.get("slack_thread_ts") ??
""; + + if (!runURL || !RUN_URL_PATTERN.test(runURL)) { + return new Response("Invalid or missing run_url parameter", { status: 400 }); + } + + const resp = await fetch( + `https://api.github.com/repos/${REPO}/actions/workflows/${WORKFLOW_ID}/dispatches`, + { + method: "POST", + headers: { + Authorization: `Bearer ${env.GITHUB_TOKEN}`, + Accept: "application/vnd.github+json", + "User-Agent": "e2e-triage-trigger-worker", + }, + body: JSON.stringify({ + ref: "main", + inputs: { + run_url: runURL, + slack_channel: slackChannel, + slack_thread_ts: slackThreadTS, + }, + }), + }, + ); + + if (!resp.ok) { + const body = await resp.text(); + return new Response(`GitHub API error: ${resp.status} ${body}`, { status: 502 }); + } + + return new Response( + ` + +

Triage started

+

The E2E triage workflow has been dispatched. Check Slack for results.

+

View original run

+`, + { headers: { "Content-Type": "text/html; charset=utf-8" } }, + ); + }, +} satisfies ExportedHandler<Env>; diff --git a/workers/e2e-triage-trigger/wrangler.toml b/workers/e2e-triage-trigger/wrangler.toml new file mode 100644 index 000000000..5ca7cdc73 --- /dev/null +++ b/workers/e2e-triage-trigger/wrangler.toml @@ -0,0 +1,3 @@ +name = "e2e-triage-trigger" +main = "src/index.ts" +compatibility_date = "2025-01-01" From fa9e760a4befb2b21210e1c01d90feddd52cc7df Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 13:45:03 -0700 Subject: [PATCH 23/32] refactor: move Cloudflare Worker to infra repo The e2e-triage-trigger worker belongs in the infra repo (cloudflare/workers/e2e-triage-trigger/), not the CLI repo. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: c67902090531 --- docs/architecture/slack-e2e-triage.md | 2 +- workers/e2e-triage-trigger/package.json | 12 ----- workers/e2e-triage-trigger/src/index.ts | 59 ------------------------ workers/e2e-triage-trigger/wrangler.toml | 3 -- 4 files changed, 1 insertion(+), 75 deletions(-) delete mode 100644 workers/e2e-triage-trigger/package.json delete mode 100644 workers/e2e-triage-trigger/src/index.ts delete mode 100644 workers/e2e-triage-trigger/wrangler.toml diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md index 01549acc2..d7e8caff4 100644 --- a/docs/architecture/slack-e2e-triage.md +++ b/docs/architecture/slack-e2e-triage.md @@ -17,7 +17,7 @@ E2E fails -> bot posts alert to Slack (with "Run Triage" link) ## Cloudflare Worker -Located in `workers/e2e-triage-trigger/`. +Source lives in the infra repo at `cloudflare/workers/e2e-triage-trigger/`.
Accepts a GET request at `/triage` with query parameters: diff --git a/workers/e2e-triage-trigger/package.json b/workers/e2e-triage-trigger/package.json deleted file mode 100644 index cef026d61..000000000 --- a/workers/e2e-triage-trigger/package.json +++ /dev/null @@ -1,12 +0,0 @@ -{ - "name": "e2e-triage-trigger", - "private": true, - "scripts": { - "dev": "wrangler dev", - "deploy": "wrangler deploy" - }, - "devDependencies": { - "@cloudflare/workers-types": "^4.20250312.0", - "wrangler": "^4.0.0" - } -} diff --git a/workers/e2e-triage-trigger/src/index.ts b/workers/e2e-triage-trigger/src/index.ts deleted file mode 100644 index 03608938e..000000000 --- a/workers/e2e-triage-trigger/src/index.ts +++ /dev/null @@ -1,59 +0,0 @@ -export interface Env { - GITHUB_TOKEN: string; -} - -const RUN_URL_PATTERN = /^https:\/\/github\.com\/entireio\/cli\/actions\/runs\/\d+$/; -const WORKFLOW_ID = "e2e-triage.yml"; -const REPO = "entireio/cli"; - -export default { - async fetch(request: Request, env: Env): Promise<Response> { - const url = new URL(request.url); - if (url.pathname !== "/triage") { - return new Response("Not found", { status: 404 }); - } - - const runURL = url.searchParams.get("run_url"); - const slackChannel = url.searchParams.get("slack_channel") ?? ""; - const slackThreadTS = url.searchParams.get("slack_thread_ts") ??
""; - - if (!runURL || !RUN_URL_PATTERN.test(runURL)) { - return new Response("Invalid or missing run_url parameter", { status: 400 }); - } - - const resp = await fetch( - `https://api.github.com/repos/${REPO}/actions/workflows/${WORKFLOW_ID}/dispatches`, - { - method: "POST", - headers: { - Authorization: `Bearer ${env.GITHUB_TOKEN}`, - Accept: "application/vnd.github+json", - "User-Agent": "e2e-triage-trigger-worker", - }, - body: JSON.stringify({ - ref: "main", - inputs: { - run_url: runURL, - slack_channel: slackChannel, - slack_thread_ts: slackThreadTS, - }, - }), - }, - ); - - if (!resp.ok) { - const body = await resp.text(); - return new Response(`GitHub API error: ${resp.status} ${body}`, { status: 502 }); - } - - return new Response( - ` - -

Triage started

-

The E2E triage workflow has been dispatched. Check Slack for results.

-

View original run

-`, - { headers: { "Content-Type": "text/html; charset=utf-8" } }, - ); - }, -} satisfies ExportedHandler<Env>; diff --git a/workers/e2e-triage-trigger/wrangler.toml b/workers/e2e-triage-trigger/wrangler.toml deleted file mode 100644 index 5ca7cdc73..000000000 --- a/workers/e2e-triage-trigger/wrangler.toml +++ /dev/null @@ -1,3 +0,0 @@ -name = "e2e-triage-trigger" -main = "src/index.ts" -compatibility_date = "2025-01-01" From 22df6767d44c12a609aa098b55892121e13e56c0 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 15:20:27 -0700 Subject: [PATCH 24/32] fix: harden jq agent extraction and add concurrency guard to triage workflow jq capture() crashes the pipeline when a failed job name doesn't match the expected (agent) pattern. Wrap in try-catch to gracefully skip non-matching jobs. Add concurrency group to e2e-triage workflow to prevent duplicate runs from Slack retries or re-dispatches. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 6db4ca1d788d --- .github/workflows/e2e-triage.yml | 12 ++++++++++-- .github/workflows/e2e.yml | 6 +++++- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index f51c1497c..6be5da813 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -28,6 +28,10 @@ permissions: actions: read contents: read +concurrency: + group: e2e-triage-${{ inputs.run_url || github.run_id }} + cancel-in-progress: true + jobs: matrix-setup: runs-on: ubuntu-latest @@ -73,7 +77,11 @@ jobs: sha=$(echo "$run_data" | jq -r '.headSha') fi if [ -z "$FAILED_AGENTS_INPUT" ]; then - agents_json=$(echo "$run_data" | jq -c '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent]') + agents_json=$(echo "$run_data" | jq -c '[.jobs[] + | select(.conclusion == "failure") + | (.name | (try capture("\\((?<agent>[^)]+)\\)").agent catch null)) + | select(.
!= null) + ]') fi fi @@ -90,7 +98,7 @@ jobs: exit 1 fi if [ -z "$agents_json" ] || [ "$agents_json" = "[]" ] || [ "$agents_json" = "null" ]; then - echo "failed_agents is required" >&2 + echo "agents is required (provide failed_agents input or ensure failed job names contain '(agent-name)')" >&2 exit 1 fi diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index 12f3c5fcb..60122120b 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -124,7 +124,11 @@ jobs: GH_TOKEN: ${{ github.token }} run: | failed=$(gh api repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs \ - --jq '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent] | join(", ")') + --jq '[.jobs[] + | select(.conclusion == "failure") + | (.name | (try capture("\\((?<agent>[^)]+)\\)").agent catch null)) + | select(. != null) + ] | join(", ")') failed_csv="${failed//, /,}" echo "agents=$failed" >> "$GITHUB_OUTPUT" echo "agents_csv=$failed_csv" >> "$GITHUB_OUTPUT" From e2288ad781b440357cb84f5b6e4d2b45321c138f Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 15:34:18 -0700 Subject: [PATCH 25/32] fix: stop dumping raw markdown into step log, direct to job summary The tee to stdout made the "Run triage" step log unreadable since GitHub Actions logs render markdown as plain text. The rendered report is already written to $GITHUB_STEP_SUMMARY. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 748308902ac5 --- scripts/run-e2e-triage.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh index 96bd4b0f9..04b6250ed 100755 --- a/scripts/run-e2e-triage.sh +++ b/scripts/run-e2e-triage.sh @@ -25,4 +25,6 @@ claude \ "Glob" \ -p "$triage_args" \ 2>&1 | sed 's/\x1b\[[0-9;]*[a-zA-Z]//g; s/\x1b\[[?][0-9]*[a-zA-Z]//g' \ - | tee "$TRIAGE_OUTPUT_FILE" + > "$TRIAGE_OUTPUT_FILE" + +echo "Triage complete for ${E2E_AGENT}.
See the Job Summary tab for the rendered report." From 63396e2b24b726e8c591c1feab49b705a7305814 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 15:36:38 -0700 Subject: [PATCH 26/32] fix: rename triage artifact from .log to .md for proper rendering Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 9835d73e7909 --- .github/workflows/e2e-triage.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 6be5da813..22443d33a 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -181,7 +181,7 @@ jobs: env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GH_TOKEN: ${{ github.token }} - TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log + TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md TRIAGE_SHA: ${{ needs.matrix-setup.outputs.sha }} run: scripts/run-e2e-triage.sh @@ -190,7 +190,7 @@ jobs: if: always() shell: bash env: - TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.log + TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md run: | set -euo pipefail From 1544f18f2a1a33f7247c6b2784ba51613ec48cbc Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 17:13:15 -0700 Subject: [PATCH 27/32] feat: add plan generation and fix pipeline to e2e triage CI Migrate triage to claude-code-action, add plan generation step after triage, post "Fix It" link to Slack, and create e2e-fix.yml workflow that applies plans and opens draft PRs via claude-code-action. 
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: b0e6e608e807 --- .claude/skills/e2e/implement.md | 4 + .github/workflows/e2e-fix.yml | 172 ++++++++++++++++++++++++++ .github/workflows/e2e-triage.yml | 129 +++++++++++-------- docs/architecture/slack-e2e-triage.md | 93 ++++++++++++-- 4 files changed, 334 insertions(+), 64 deletions(-) create mode 100644 .github/workflows/e2e-fix.yml diff --git a/.claude/skills/e2e/implement.md b/.claude/skills/e2e/implement.md index 342c87213..db9ba8e93 100644 --- a/.claude/skills/e2e/implement.md +++ b/.claude/skills/e2e/implement.md @@ -2,6 +2,10 @@ Apply fixes for E2E test failures, verify with scoped E2E tests. +> **Before implementing any fixes, enter plan mode by invoking /plan.** +> Analyze the findings (Steps 1-2 below), produce a complete fix plan with +> specific file paths and code changes, and get user approval before executing. + > **IMPORTANT: Running real E2E tests is a HARD REQUIREMENT of this procedure.** > Every fix MUST be verified with real E2E tests before the summary step. > Canary tests use the Vogon fake agent and cannot catch agent-specific issues. 
diff --git a/.github/workflows/e2e-fix.yml b/.github/workflows/e2e-fix.yml new file mode 100644 index 000000000..860e491b8 --- /dev/null +++ b/.github/workflows/e2e-fix.yml @@ -0,0 +1,172 @@ +name: E2E Fix + +on: + workflow_dispatch: + inputs: + triage_run_id: + description: Run ID of the triage workflow (for downloading plan artifacts) + required: true + type: string + run_url: + description: Original failed E2E run URL + required: true + type: string + failed_agents: + description: Comma-separated list of agents to fix + required: true + type: string + slack_channel: + description: Slack channel ID for thread replies + required: false + type: string + slack_thread_ts: + description: Slack thread timestamp for replies + required: false + type: string + +permissions: + actions: read + contents: write + pull-requests: write + id-token: write + +concurrency: + group: e2e-fix-${{ inputs.run_url || github.run_id }} + cancel-in-progress: true + +jobs: + fix: + runs-on: ubuntu-latest + timeout-minutes: 30 + env: + SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} + SLACK_CHANNEL: ${{ inputs.slack_channel }} + SLACK_THREAD_TS: ${{ inputs.slack_thread_ts }} + steps: + - name: Write Slack helper + if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + run: | + set -euo pipefail + + helper="$RUNNER_TEMP/post-slack-message.sh" + cat > "$helper" <<'EOF' + #!/usr/bin/env bash + set -euo pipefail + + text="${1:?message is required}" + payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + + if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack notification failed" >&2 + exit 0 + fi + + if ! 
jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack notification returned non-ok response" >&2 + fi + EOF + chmod +x "$helper" + + - name: Post fix started + if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + shell: bash + env: + FAILED_AGENTS: ${{ inputs.failed_agents }} + run: | + set -euo pipefail + + "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E fix for \`${FAILED_AGENTS}\`." + + - name: Checkout repository + uses: actions/checkout@v6 + with: + fetch-depth: 0 + + - name: Setup mise + uses: jdx/mise-action@v4 + + - name: Download plan artifacts + env: + GH_TOKEN: ${{ github.token }} + TRIAGE_RUN_ID: ${{ inputs.triage_run_id }} + FAILED_AGENTS: ${{ inputs.failed_agents }} + shell: bash + run: | + set -euo pipefail + + mkdir -p triage-plans + + IFS=',' read -ra agents <<< "$FAILED_AGENTS" + for agent in "${agents[@]}"; do + agent="$(echo "$agent" | xargs)" # trim whitespace + echo "Downloading plan for $agent..." + gh run download "$TRIAGE_RUN_ID" \ + --name "e2e-plan-${agent}" \ + --dir "triage-plans/${agent}" || { + echo "warning: no plan artifact found for $agent" >&2 + continue + } + done + + echo "Downloaded plans:" + find triage-plans -name '*.md' -type f + + - name: Apply fixes + id: fix + uses: anthropics/claude-code-action@v1 + with: + prompt: | + Read the fix plans in the triage-plans/ directory. Each subdirectory contains a plan.md for one agent. + + Execute all fixes exactly as specified in the plans. After applying fixes, run: + 1. mise run fmt + 2. mise run lint + 3. mise run test:e2e:canary + + If verification passes, create a git branch fix/e2e-${{ github.run_id }}, commit all changes, + push, and create a draft PR with a summary of what was fixed. + + If verification fails, fix the issues and retry. Do not give up without attempting to fix lint/format errors. 
+ anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} + claude_args: "--allowedTools 'Edit,Write,Read,Glob,Grep,Bash(git:*),Bash(mise:*),Bash(gh:*)'" + + - name: Post success to Slack + if: success() && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' + shell: bash + env: + GH_TOKEN: ${{ github.token }} + FIX_BRANCH: fix/e2e-${{ github.run_id }} + RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }} + run: | + set -euo pipefail + + # Find the draft PR URL from the fix step output + pr_url="$(gh pr list --head "$FIX_BRANCH" --json url -q '.[0].url' 2>/dev/null || true)" + + if [ -n "$pr_url" ]; then + message="E2E fix complete — draft PR ready: <${pr_url}|Review PR>" + else + message="E2E fix complete — changes applied but no PR was created. Check the <${RUN_URL}|workflow run> for details." + fi + + "$RUNNER_TEMP/post-slack-message.sh" "$message" + + - name: Post failure to Slack + if: failure() && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' + shell: bash + env: + RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }} + run: | + set -euo pipefail + + message="E2E fix failed. Check the <${RUN_URL}|workflow run> for details." 
+ + "$RUNNER_TEMP/post-slack-message.sh" "$message" diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 22443d33a..53e8c1187 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -27,6 +27,7 @@ on: permissions: actions: read contents: read + id-token: write concurrency: group: e2e-triage-${{ inputs.run_url || github.run_id }} @@ -171,72 +172,70 @@ jobs: - name: Setup mise uses: jdx/mise-action@v4 - - name: Install Claude CLI + - name: Download E2E artifacts + id: artifacts + env: + GH_TOKEN: ${{ github.token }} run: | - curl -fsSL https://claude.ai/install.sh | bash - echo "$HOME/.local/bin" >> "$GITHUB_PATH" + artifact_path="$(scripts/download-e2e-artifacts.sh "$RUN_URL")" + echo "path=$artifact_path" >> "$GITHUB_OUTPUT" - name: Run triage id: triage - env: - ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} - GH_TOKEN: ${{ github.token }} - TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md - TRIAGE_SHA: ${{ needs.matrix-setup.outputs.sha }} - run: scripts/run-e2e-triage.sh + uses: anthropics/claude-code-action@v1 + with: + prompt: | + /e2e:triage-ci ${{ steps.artifacts.outputs.path }} --agent ${{ matrix.agent }} --sha ${{ needs.matrix-setup.outputs.sha }} + anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} + claude_args: "--allowedTools 'Read,Grep,Glob'" + display_report: true - - name: Summarize triage output - id: summary - if: always() + - name: Extract triage output + id: triage_output + if: always() && steps.triage.outputs.execution_file != '' shell: bash env: + EXECUTION_FILE: ${{ steps.triage.outputs.execution_file }} TRIAGE_OUTPUT_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md run: | set -euo pipefail - summary="" - if [ -f "$TRIAGE_OUTPUT_FILE" ]; then - summary="$( - (grep '^## ' "$TRIAGE_OUTPUT_FILE" 2>/dev/null | head -n 3 | sed 's/^## //' | awk ' - NF { - if (out != "") { - out = out " | " - } - out = 
out $0 - } - END { - print out - } - ') || true - )" - fi + mkdir -p "$(dirname "$TRIAGE_OUTPUT_FILE")" + # Extract assistant text content from execution JSON + jq -r '[.[] | select(.type == "assistant") | .message.content[] + | select(.type == "text") | .text] | join("\n")' \ + "$EXECUTION_FILE" > "$TRIAGE_OUTPUT_FILE" - { - echo 'summary<> "$GITHUB_OUTPUT" - - # Print triage output to job summary - { - echo "## E2E Triage: ${E2E_AGENT}" - echo "" - if [ -f "$TRIAGE_OUTPUT_FILE" ] && [ -s "$TRIAGE_OUTPUT_FILE" ]; then - sed 's/\x1b\[[0-9;]*[a-zA-Z]//g; s/\x1b\[[?][0-9]*[a-zA-Z]//g' "$TRIAGE_OUTPUT_FILE" - else - echo "No triage output was produced." - echo "" - echo "The triage log file was either not created or is empty." - echo "Check the 'Run triage' step logs for details." - fi - } >> "$GITHUB_STEP_SUMMARY" + - name: Generate fix plan + id: plan + if: steps.triage.outcome == 'success' + uses: anthropics/claude-code-action@v1 + with: + prompt: | + /e2e:implement Read the triage findings at ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md for agent ${{ matrix.agent }}. 
+ anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} + claude_args: "--allowedTools 'Read,Grep,Glob'" + display_report: true + + - name: Extract plan output + id: plan_output + if: steps.plan.outcome == 'success' + shell: bash + env: + EXECUTION_FILE: ${{ steps.plan.outputs.execution_file }} + PLAN_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/plan.md + run: | + set -euo pipefail + + jq -r '[.[] | select(.type == "assistant") | .message.content[] + | select(.type == "text") | .text] | join("\n")' \ + "$EXECUTION_FILE" > "$PLAN_FILE" - name: Post triage completion if: ${{ always() && (steps.triage.outcome == 'success' || steps.triage.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} shell: bash env: TRIAGE_OUTCOME: ${{ steps.triage.outcome }} - TRIAGE_SUMMARY: ${{ steps.summary.outputs.summary }} run: | set -euo pipefail @@ -245,9 +244,29 @@ jobs: else message="E2E triage failed for \`$E2E_AGENT\`." fi - if [ -n "$TRIAGE_SUMMARY" ]; then - message="$message $TRIAGE_SUMMARY" - fi + + "$RUNNER_TEMP/post-slack-message.sh" "$message" + + - name: Post fix plan to Slack + if: steps.plan.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' + shell: bash + env: + PLAN_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/plan.md + TRIAGE_RUN_ID: ${{ github.run_id }} + run: | + set -euo pipefail + + # Extract first few lines as summary + summary="$(head -20 "$PLAN_FILE" | sed '/^$/d' | head -5)" + + # Construct Fix It URL + encoded_run_url="$(python3 -c "import urllib.parse; print(urllib.parse.quote('$RUN_URL', safe=''))")" + fix_url="https://e2e-triage.entireio.workers.dev/fix?triage_run_id=${TRIAGE_RUN_ID}&run_url=${encoded_run_url}&failed_agents=${E2E_AGENT}&slack_channel=${SLACK_CHANNEL}&slack_thread_ts=${SLACK_THREAD_TS}" + + message="Fix plan ready for \`$E2E_AGENT\`: + ${summary} + + <${fix_url}|Fix It> — applies 
the plan and creates a draft PR" "$RUNNER_TEMP/post-slack-message.sh" "$message" @@ -258,3 +277,11 @@ jobs: name: e2e-triage-${{ matrix.agent }} path: e2e-triage-artifacts/ retention-days: 7 + + - name: Upload plan artifact + if: always() && steps.plan.outcome == 'success' + uses: actions/upload-artifact@v7 + with: + name: e2e-plan-${{ matrix.agent }} + path: e2e-triage-artifacts/${{ matrix.agent }}/plan.md + retention-days: 7 diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md index d7e8caff4..8c22dbb79 100644 --- a/docs/architecture/slack-e2e-triage.md +++ b/docs/architecture/slack-e2e-triage.md @@ -1,34 +1,89 @@ -# Slack-Triggered E2E Triage +# Slack-Triggered E2E Triage & Fix Pipeline -When E2E tests fail on `main`, a Slack alert is posted with a clickable "Run Triage" link. Clicking it triggers the triage workflow via a Cloudflare Worker. +When E2E tests fail on `main`, a Slack alert is posted with a clickable "Run Triage" link. Clicking it triggers the triage workflow via a Cloudflare Worker, which triages failures, generates a fix plan, and offers a "Fix It" link that auto-applies fixes and opens a draft PR. ## Flow -1. `.github/workflows/e2e.yml` posts a failure alert to Slack using the bot token (via `chat.postMessage`), then posts a threaded "Run Triage" link that encodes the run URL and Slack thread context. -2. A user clicks the link, which hits the Cloudflare Worker at `e2e-triage.entireio.workers.dev`. -3. The Worker validates the `run_url` and calls `workflow_dispatch` on `.github/workflows/e2e-triage.yml` via the GitHub API. -4. The triage workflow checks out the failed SHA, runs the Claude triage skill per failed agent, and posts results back to the Slack thread. 
- ``` -E2E fails -> bot posts alert to Slack (with "Run Triage" link) - -> user clicks link -> Cloudflare Worker -> GitHub API (workflow_dispatch) - -> e2e-triage.yml runs -> posts results back to Slack thread +E2E fails → Slack alert → "Run Triage" → e2e-triage.yml + → triage (claude-code-action, read-only) → plan generation (claude-code-action, read-only) + → plan in GH summary + Slack "Fix It" link + → user clicks "Fix It" → Worker /fix → e2e-fix.yml + → claude-code-action (write tools) → applies fixes + creates draft PR → Slack ``` +1. `.github/workflows/e2e.yml` posts a failure alert to Slack using the bot token (via `chat.postMessage`), then posts a threaded "Run Triage" link that encodes the run URL and Slack thread context. +2. A user clicks the link, which hits the Cloudflare Worker at `e2e-triage.entireio.workers.dev/triage`. +3. The Worker validates the `run_url` and calls `workflow_dispatch` on `.github/workflows/e2e-triage.yml` via the GitHub API. +4. The triage workflow: + - Downloads E2E artifacts via `scripts/download-e2e-artifacts.sh` + - Runs the `/e2e:triage-ci` skill via `anthropics/claude-code-action@v1` (read-only tools) + - Extracts triage output from the execution file and writes it to `triage.md` + - Runs the `/e2e:implement` skill via `claude-code-action` (read-only tools) to generate a fix plan + - Posts the plan to GitHub step summary (via `display_report: true`) and uploads it as an artifact + - Posts a "Fix It" link to the Slack thread +5. A user clicks the "Fix It" link, which hits the Worker at `e2e-triage.entireio.workers.dev/fix`. +6. The Worker dispatches `.github/workflows/e2e-fix.yml` with the triage run ID and agent list. +7. 
The fix workflow: + - Downloads plan artifacts from the triage run + - Runs `claude-code-action` with write tools to apply fixes, run verification (`fmt`, `lint`, `test:e2e:canary`), and create a draft PR + - Posts the PR link (or failure details) to the Slack thread + ## Cloudflare Worker Source lives in the infra repo at `cloudflare/workers/e2e-triage-trigger/`. -Accepts a GET request at `/triage` with query parameters: +### `/triage` endpoint + +Accepts a GET request with query parameters: - `run_url` (required) — must match `https://github.com/entireio/cli/actions/runs/\d+` - `slack_channel` — Slack channel ID for thread replies - `slack_thread_ts` — Slack thread timestamp for thread replies -The Worker dispatches `e2e-triage.yml` with these values as `workflow_dispatch` inputs. +Dispatches `e2e-triage.yml` with these values as `workflow_dispatch` inputs. + +### `/fix` endpoint + +Accepts a GET request with query parameters: + +- `triage_run_id` (required) — numeric run ID of the triage workflow +- `run_url` (required) — original failed E2E run URL +- `failed_agents` (required) — comma-separated list of agents to fix +- `slack_channel` — Slack channel ID for thread replies +- `slack_thread_ts` — Slack thread timestamp for thread replies + +Dispatches `e2e-fix.yml` with these values as `workflow_dispatch` inputs. **Secret:** `GITHUB_TOKEN` — a PAT with `actions:write` scope, stored in Cloudflare secrets (`wrangler secret put GITHUB_TOKEN`). +## Workflows + +### `e2e-triage.yml` + +Triages E2E failures and generates fix plans. Uses a matrix strategy (one job per failed agent). 
+ +**Claude invocations** (both via `anthropics/claude-code-action@v1`): + +| Step | Skill | Tools | Output | +|------|-------|-------|--------| +| Run triage | `/e2e:triage-ci` | Read, Grep, Glob | triage.md (artifact + GH summary) | +| Generate fix plan | `/e2e:implement` | Read, Grep, Glob | plan.md (artifact + GH summary) | + +The `/e2e:implement` skill enters plan mode first (read-only tools prevent actual file changes), producing a detailed fix plan without applying anything. + +### `e2e-fix.yml` + +Applies fix plans and creates a draft PR. Single job (not matrix) since fixes may touch shared test infrastructure. + +**Claude invocation** (via `anthropics/claude-code-action@v1`): + +| Step | Tools | Output | +|------|-------|--------| +| Apply fixes | Edit, Write, Read, Glob, Grep, Bash(git:\*), Bash(mise:\*), Bash(gh:\*) | Branch + draft PR | + +Claude reads the plan artifacts, applies fixes, runs `mise run fmt && mise run lint && mise run test:e2e:canary`, then creates a `fix/e2e-` branch and opens a draft PR. + ## Slack Setup The Slack app needs: @@ -47,12 +102,14 @@ No event subscriptions or incoming webhooks are needed. **Repository secrets:** - `SLACK_BOT_TOKEN` — Slack bot token with `chat:write` scope -- `ANTHROPIC_API_KEY` — for Claude triage +- `ANTHROPIC_API_KEY` — for Claude triage and fix steps The built-in `${{ github.token }}` is used for GitHub API calls within workflows. 
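The `/triage` endpoint contract described in the Cloudflare Worker section (a required `run_url` matching `https://github.com/entireio/cli/actions/runs/\d+`, plus optional Slack thread context) can be sketched as below. This is only an illustration: the real Worker lives in the infra repo and is not shown in this patch, so `build_triage_inputs` is a hypothetical helper, and anchoring the pattern with `fullmatch` and dropping empty Slack values are assumptions rather than documented behavior.

```python
import re

# Pattern quoted from the doc for the `/triage` endpoint's `run_url` parameter.
# The dots are escaped here; whether the real Worker does so is an assumption.
RUN_URL_PATTERN = re.compile(r"https://github\.com/entireio/cli/actions/runs/\d+")


def build_triage_inputs(params: dict) -> dict:
    """Validate query parameters and return workflow_dispatch inputs.

    Hypothetical helper for illustration; not the Worker's actual code.
    """
    run_url = params.get("run_url", "")
    # fullmatch rejects trailing path segments or query strings (assumed behavior).
    if not RUN_URL_PATTERN.fullmatch(run_url):
        raise ValueError("run_url must be a GitHub Actions run URL for entireio/cli")

    inputs = {"run_url": run_url}
    # Slack context is optional; forward only the values that were supplied.
    for key in ("slack_channel", "slack_thread_ts"):
        value = params.get(key)
        if value:
            inputs[key] = value
    return inputs
```

For example, `build_triage_inputs({"run_url": "https://github.com/entireio/cli/actions/runs/123456789", "slack_channel": "C123456"})` returns a dict holding those two keys, while a `run_url` pointing at any other host raises `ValueError` before anything is dispatched.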
## Manual Fallback +### Triage + Run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`: - `run_url` (required) — the failed run URL @@ -60,3 +117,13 @@ Run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`: - `failed_agents` — comma-separated list (auto-detected from run if omitted) - `slack_channel` — for Slack thread replies - `slack_thread_ts` — for Slack thread replies + +### Fix + +Run `.github/workflows/e2e-fix.yml` manually with `workflow_dispatch`: + +- `triage_run_id` (required) — run ID of the triage workflow +- `run_url` (required) — original failed E2E run URL +- `failed_agents` (required) — comma-separated list of agents to fix +- `slack_channel` — for Slack thread replies +- `slack_thread_ts` — for Slack thread replies From eb354bd4676c7e6c683c2dd81d441dc83407d884 Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 17:26:51 -0700 Subject: [PATCH 28/32] fix: create plan output directory before writing plan.md Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: f8fee1d345e0 --- .github/workflows/e2e-triage.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 53e8c1187..87e68dc02 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -227,6 +227,7 @@ jobs: run: | set -euo pipefail + mkdir -p "$(dirname "$PLAN_FILE")" jq -r '[.[] | select(.type == "assistant") | .message.content[] | select(.type == "text") | .text] | join("\n")' \ "$EXECUTION_FILE" > "$PLAN_FILE" From e093f7eb95c54130c3aee4be402c1f37d23b3acd Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Fri, 20 Mar 2026 17:47:00 -0700 Subject: [PATCH 29/32] fix: guard plan extraction against empty execution_file output The claude-code-action can succeed without producing an execution_file. 
Add the same non-empty guard used by the triage extraction step, and gate downstream steps (Slack post, artifact upload) on plan_output succeeding rather than just the plan step. Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 7f2ce308957d --- .github/workflows/e2e-triage.yml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index 87e68dc02..faea38f52 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -219,7 +219,7 @@ jobs: - name: Extract plan output id: plan_output - if: steps.plan.outcome == 'success' + if: steps.plan.outcome == 'success' && steps.plan.outputs.execution_file != '' shell: bash env: EXECUTION_FILE: ${{ steps.plan.outputs.execution_file }} @@ -249,7 +249,7 @@ jobs: "$RUNNER_TEMP/post-slack-message.sh" "$message" - name: Post fix plan to Slack - if: steps.plan.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' + if: steps.plan_output.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' shell: bash env: PLAN_FILE: ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/plan.md @@ -280,7 +280,7 @@ jobs: retention-days: 7 - name: Upload plan artifact - if: always() && steps.plan.outcome == 'success' + if: always() && steps.plan_output.outcome == 'success' uses: actions/upload-artifact@v7 with: name: e2e-plan-${{ matrix.agent }} From 3eca00acfedbb476262dba13c0568187a06ec39e Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Sat, 21 Mar 2026 09:18:08 -0700 Subject: [PATCH 30/32] fix: restore Slack webhook notification and remove unused docs Revert Slack notification from chat.postMessage API back to the original incoming webhook approach for minimal changes. Remove premature architecture doc, design plan, and README section. 
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: eeb2507079c5 --- .github/workflows/e2e.yml | 75 ++---- README.md | 6 - docs/architecture/slack-e2e-triage.md | 129 ---------- ...03-17-slack-triggered-e2e-triage-design.md | 221 ------------------ 4 files changed, 25 insertions(+), 406 deletions(-) delete mode 100644 docs/architecture/slack-e2e-triage.md delete mode 100644 docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index 60122120b..a0d9c9e84 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -129,56 +129,31 @@ jobs: | (.name | (try capture("\\((?<agent>[^)]+)\\)").agent catch null)) | select(. != null) ] | join(", ")') - failed_csv="${failed//, /,}" echo "agents=$failed" >> "$GITHUB_OUTPUT" - echo "agents_csv=$failed_csv" >> "$GITHUB_OUTPUT" - name: Notify Slack of E2E failure - env: - SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }} - SLACK_CHANNEL: ${{ vars.E2E_SLACK_CHANNEL }} - FAILED_AGENTS: ${{ steps.failed.outputs.agents }} - RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} - COMMIT_URL: ${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }} - COMMIT_SHA: ${{ github.sha }} - ACTOR: ${{ github.actor }} - shell: bash - run: | - set -euo pipefail - - text=":red_circle: *E2E Tests Failed* on \`main\` - - Failed agents: *${FAILED_AGENTS}* - <${RUN_URL}|View run details> - Commit: <${COMMIT_URL}|${COMMIT_SHA}> by ${ACTOR}" - - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg text "$text" \ - '{channel: $channel, text: $text}')" - - response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")" - - if !
jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "::error::Slack API returned non-ok response: $(jq -r '.error // "unknown"' <<<"$response")" - exit 1 - fi - - channel="$(jq -r '.channel' <<<"$response")" - thread_ts="$(jq -r '.ts' <<<"$response")" - - triage_url="https://e2e-triage.entireio.workers.dev/triage?run_url=$(jq -rn --arg u "$RUN_URL" '$u | @uri')&slack_channel=${channel}&slack_thread_ts=${thread_ts}" - - followup="$(jq -n \ - --arg channel "$channel" \ - --arg thread_ts "$thread_ts" \ - --arg text ":mag: <${triage_url}|Run Triage>" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" - - curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$followup" > /dev/null + uses: slackapi/slack-github-action@91efab103c0de0a537f72a35f6b8cda0ee76bf0a # v2.1.1 + with: + webhook: ${{ secrets.E2E_SLACK_WEBHOOK_URL }} + webhook-type: incoming-webhook + payload: | + { + "blocks": [ + { + "type": "section", + "text": { + "type": "mrkdwn", + "text": ":red_circle: *E2E Tests Failed* on `main`\n\nFailed agents: *${{ steps.failed.outputs.agents }}*\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run details>" + } + }, + { + "type": "context", + "elements": [ + { + "type": "mrkdwn", + "text": "Commit: <${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}> by ${{ github.actor }}" + } + ] + } + ] + } diff --git a/README.md b/README.md index 48387ae0c..7d104a051 100644 --- a/README.md +++ b/README.md @@ -104,12 +104,6 @@ entire disable Removes the git hooks. Your code and commit history remain untouched. -## E2E Triage - -E2E failure alerts post a "Run Triage" link to Slack. Clicking it triggers the triage workflow via a Cloudflare Worker. See [docs/architecture/slack-e2e-triage.md](docs/architecture/slack-e2e-triage.md) for the full architecture. 
- -The triage job runs in [`.github/workflows/e2e-triage.yml`](.github/workflows/e2e-triage.yml). If Slack is unavailable, you can trigger the workflow manually with `workflow_dispatch` using the failed run URL. - ## Key Concepts ### Sessions diff --git a/docs/architecture/slack-e2e-triage.md b/docs/architecture/slack-e2e-triage.md deleted file mode 100644 index 8c22dbb79..000000000 --- a/docs/architecture/slack-e2e-triage.md +++ /dev/null @@ -1,129 +0,0 @@ -# Slack-Triggered E2E Triage & Fix Pipeline - -When E2E tests fail on `main`, a Slack alert is posted with a clickable "Run Triage" link. Clicking it triggers the triage workflow via a Cloudflare Worker, which triages failures, generates a fix plan, and offers a "Fix It" link that auto-applies fixes and opens a draft PR. - -## Flow - -``` -E2E fails → Slack alert → "Run Triage" → e2e-triage.yml - → triage (claude-code-action, read-only) → plan generation (claude-code-action, read-only) - → plan in GH summary + Slack "Fix It" link - → user clicks "Fix It" → Worker /fix → e2e-fix.yml - → claude-code-action (write tools) → applies fixes + creates draft PR → Slack -``` - -1. `.github/workflows/e2e.yml` posts a failure alert to Slack using the bot token (via `chat.postMessage`), then posts a threaded "Run Triage" link that encodes the run URL and Slack thread context. -2. A user clicks the link, which hits the Cloudflare Worker at `e2e-triage.entireio.workers.dev/triage`. -3. The Worker validates the `run_url` and calls `workflow_dispatch` on `.github/workflows/e2e-triage.yml` via the GitHub API. -4. 
The triage workflow: - - Downloads E2E artifacts via `scripts/download-e2e-artifacts.sh` - - Runs the `/e2e:triage-ci` skill via `anthropics/claude-code-action@v1` (read-only tools) - - Extracts triage output from the execution file and writes it to `triage.md` - - Runs the `/e2e:implement` skill via `claude-code-action` (read-only tools) to generate a fix plan - - Posts the plan to GitHub step summary (via `display_report: true`) and uploads it as an artifact - - Posts a "Fix It" link to the Slack thread -5. A user clicks the "Fix It" link, which hits the Worker at `e2e-triage.entireio.workers.dev/fix`. -6. The Worker dispatches `.github/workflows/e2e-fix.yml` with the triage run ID and agent list. -7. The fix workflow: - - Downloads plan artifacts from the triage run - - Runs `claude-code-action` with write tools to apply fixes, run verification (`fmt`, `lint`, `test:e2e:canary`), and create a draft PR - - Posts the PR link (or failure details) to the Slack thread - -## Cloudflare Worker - -Source lives in the infra repo at `cloudflare/workers/e2e-triage-trigger/`. - -### `/triage` endpoint - -Accepts a GET request with query parameters: - -- `run_url` (required) — must match `https://github.com/entireio/cli/actions/runs/\d+` -- `slack_channel` — Slack channel ID for thread replies -- `slack_thread_ts` — Slack thread timestamp for thread replies - -Dispatches `e2e-triage.yml` with these values as `workflow_dispatch` inputs. - -### `/fix` endpoint - -Accepts a GET request with query parameters: - -- `triage_run_id` (required) — numeric run ID of the triage workflow -- `run_url` (required) — original failed E2E run URL -- `failed_agents` (required) — comma-separated list of agents to fix -- `slack_channel` — Slack channel ID for thread replies -- `slack_thread_ts` — Slack thread timestamp for thread replies - -Dispatches `e2e-fix.yml` with these values as `workflow_dispatch` inputs. 
- -**Secret:** `GITHUB_TOKEN` — a PAT with `actions:write` scope, stored in Cloudflare secrets (`wrangler secret put GITHUB_TOKEN`). - -## Workflows - -### `e2e-triage.yml` - -Triages E2E failures and generates fix plans. Uses a matrix strategy (one job per failed agent). - -**Claude invocations** (both via `anthropics/claude-code-action@v1`): - -| Step | Skill | Tools | Output | -|------|-------|-------|--------| -| Run triage | `/e2e:triage-ci` | Read, Grep, Glob | triage.md (artifact + GH summary) | -| Generate fix plan | `/e2e:implement` | Read, Grep, Glob | plan.md (artifact + GH summary) | - -The `/e2e:implement` skill enters plan mode first (read-only tools prevent actual file changes), producing a detailed fix plan without applying anything. - -### `e2e-fix.yml` - -Applies fix plans and creates a draft PR. Single job (not matrix) since fixes may touch shared test infrastructure. - -**Claude invocation** (via `anthropics/claude-code-action@v1`): - -| Step | Tools | Output | -|------|-------|--------| -| Apply fixes | Edit, Write, Read, Glob, Grep, Bash(git:\*), Bash(mise:\*), Bash(gh:\*) | Branch + draft PR | - -Claude reads the plan artifacts, applies fixes, runs `mise run fmt && mise run lint && mise run test:e2e:canary`, then creates a `fix/e2e-` branch and opens a draft PR. - -## Slack Setup - -The Slack app needs: - -- `chat:write` scope — to post alerts and triage results -- Bot must be invited to the alert channel - -No event subscriptions or incoming webhooks are needed. - -## GitHub Config - -**Repository variables:** - -- `E2E_SLACK_CHANNEL` — Slack channel ID where failure alerts are posted - -**Repository secrets:** - -- `SLACK_BOT_TOKEN` — Slack bot token with `chat:write` scope -- `ANTHROPIC_API_KEY` — for Claude triage and fix steps - -The built-in `${{ github.token }}` is used for GitHub API calls within workflows. 
- -## Manual Fallback - -### Triage - -Run `.github/workflows/e2e-triage.yml` manually with `workflow_dispatch`: - -- `run_url` (required) — the failed run URL -- `sha` — commit SHA (auto-detected from run if omitted) -- `failed_agents` — comma-separated list (auto-detected from run if omitted) -- `slack_channel` — for Slack thread replies -- `slack_thread_ts` — for Slack thread replies - -### Fix - -Run `.github/workflows/e2e-fix.yml` manually with `workflow_dispatch`: - -- `triage_run_id` (required) — run ID of the triage workflow -- `run_url` (required) — original failed E2E run URL -- `failed_agents` (required) — comma-separated list of agents to fix -- `slack_channel` — for Slack thread replies -- `slack_thread_ts` — for Slack thread replies diff --git a/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md b/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md deleted file mode 100644 index fe8b1a2e8..000000000 --- a/docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md +++ /dev/null @@ -1,221 +0,0 @@ -# Slack-Triggered E2E Triage Design - -## Goal - -Allow a human to reply `triage e2e` in the Slack thread for an E2E failure alert and have GitHub Actions run the repo's existing Claude E2E triage workflow based on [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md). 
- -## Scope - -This design covers: - -- Slack trigger detection for a single exact-match phrase: `triage e2e` -- Hand-off from Slack to GitHub Actions -- A new GitHub Actions workflow that runs triage against an existing failed CI run -- Posting triage status and results back into the originating Slack thread - -This design does not cover: - -- Automatic remediation or code changes -- Running the full E2E fix pipeline -- General-purpose Slack command routing -- Local rerun verification beyond what the existing skill supports for CI run references - -## Existing Context - -The repo already has the core ingredients needed for the triage operation: - -- [`.github/workflows/e2e.yml`](../../.github/workflows/e2e.yml) posts Slack alerts when E2E runs on `main` fail -- [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md) defines the triage procedure -- [`.claude/plugins/e2e/commands/triage-ci.md`](../../.claude/plugins/e2e/commands/triage-ci.md) exposes the procedure as the `/e2e:triage-ci` command -- [`scripts/download-e2e-artifacts.sh`](../../scripts/download-e2e-artifacts.sh) already supports artifact download from a GitHub Actions run reference - -The missing piece is the Slack-to-GitHub bridge. - -## Architecture - -The system is composed of three narrow responsibilities: - -1. Slack app - - Listen for new thread replies - - Normalize reply text - - Trigger only when the reply text is exactly `triage e2e` - - Validate that the parent message is an E2E failure alert for this repository - -2. Dispatch bridge - - Read structured data from the parent Slack alert - - Build a `repository_dispatch` payload for this repository - - Send the dispatch event to GitHub - -3. 
GitHub Action - - Receive the dispatch payload - - Check out the repository at the failed commit SHA - - Install and authenticate Claude CLI - - Load the local plugin directory at [`.claude/plugins/e2e`](../../.claude/plugins/e2e) - - Invoke `/e2e:triage-ci` with the CI run URL and failed agent - - Upload artifacts and post results back to the Slack thread - -This keeps Slack focused on intent capture and routing while GitHub Actions remains the execution environment for triage. - -## Trigger Contract - -The Slack app should send a structured `repository_dispatch` event with custom type `slack_e2e_triage_requested`. - -Recommended payload: - -```json -{ - "trigger_text": "triage e2e", - "repo": "entireio/cli", - "branch": "main", - "sha": "447cde1aeee938448c3edbae78242c950dc35cf0", - "run_url": "https://github.com/entireio/cli/actions/runs/123456789", - "run_id": "123456789", - "failed_agents": ["cursor-cli"], - "slack_channel": "C123456", - "slack_thread_ts": "1742230000.123456", - "slack_user": "U123456" -} -``` - -Workflow-side validation rules: - -- Reject if `trigger_text` is not exactly `triage e2e` -- Reject if `run_url` or `slack_thread_ts` is missing -- Reject if the target repo or branch is unexpected -- Treat `failed_agents` as the source of truth for which agent-specific triage jobs to run - -## Slack Message Requirements - -The current Slack failure notification in [`.github/workflows/e2e.yml`](../../.github/workflows/e2e.yml) already includes the run details link, commit SHA, actor, and failed agent list. That is enough for a first version if the Slack app parses the parent message. - -However, the safer design is to make the alert payload more machine-friendly so the Slack app does not need to scrape display text. 
Two acceptable options: - -- Add stable metadata in the Slack message text or blocks for `run_url`, `sha`, and `failed_agents` -- Store a compact JSON blob in a Slack block element or message metadata if the chosen Slack app framework supports it - -The first version can parse the existing message format, but the implementation should isolate that parsing into one small component because it is brittle compared to a structured payload. - -## GitHub Workflow Design - -Add a new workflow at [`.github/workflows/e2e-triage.yml`](../../.github/workflows/e2e-triage.yml). - -### Triggers - -- `repository_dispatch` with type `slack_e2e_triage_requested` -- `workflow_dispatch` for manual testing and debugging - -### High-Level Job Flow - -1. Validate dispatch payload -2. Post "triage started" reply to the Slack thread -3. Check out repository at the failed `sha` -4. Set up `mise` -5. Install Claude CLI and any required dependencies -6. Authenticate Claude using a GitHub Actions secret -7. Run the E2E triage command: - -```bash -claude --plugin-dir .claude/plugins/e2e -p "/e2e:triage-ci --agent " -``` - -8. Capture output to files for artifact upload -9. Post a Slack thread reply with a concise summary and a link to the triage workflow run -10. Upload triage artifacts regardless of success or failure - -### Agent Fan-Out - -If the alert has multiple failed agents, the workflow should fan out one matrix job per failed agent. This keeps results isolated and simplifies failure attribution in Slack and in GitHub artifacts. - -### Concurrency - -Use concurrency keyed by CI `run_id` or Slack thread timestamp so repeated `triage e2e` replies do not start duplicate work for the same failure thread. 
- -## Invocation Model - -This design intentionally uses the existing CI-run path in [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md): - -- The workflow passes the original GitHub Actions run URL to `/e2e:triage-ci` -- The skill downloads artifacts via [`scripts/download-e2e-artifacts.sh`](../../scripts/download-e2e-artifacts.sh) -- The triage workflow analyzes the failed run's artifacts instead of starting fresh E2E reruns - -That keeps cost and runtime bounded for the first version. - -If local rerun verification is later required for Slack-triggered triage, that should be added as a deliberate extension to the workflow and possibly to the skill behavior for CI-driven contexts. - -## Slack Responses - -Recommended thread messages: - -- Start: - - `Starting E2E triage for cursor-cli from CI run .` -- Success: - - Short classification summary per agent and a link to the GitHub triage workflow -- Failure: - - Short failure reason and a link to the GitHub triage workflow - -Slack replies should stay short. The full triage report belongs in workflow logs and uploaded artifacts. - -## Error Handling - -### Slack App - -- Ignore non-thread replies -- Ignore messages whose normalized text is not exactly `triage e2e` -- Refuse to trigger if the parent message is not recognized as an E2E failure alert from this repository -- Reply in-thread with a short failure message if dispatch fails - -### GitHub Workflow - -- Fail fast on malformed dispatch payloads -- Fail with a clear Slack reply if checkout or Claude setup fails -- Fail with a clear Slack reply if CI artifact download fails -- Always upload raw triage output as artifacts - -## Security - -The Slack app should use a GitHub token scoped only to dispatch workflows on this repository. 
- -The workflow should: - -- Use the minimum required GitHub permissions -- Store Claude authentication in GitHub Actions secrets -- Avoid echoing secrets or full auth state into logs - -The Slack app should validate Slack request signatures before processing events. - -## Testing Strategy - -### Slack App - -- Unit test normalization for exact-match `triage e2e` -- Unit test parent-message validation -- Unit test extraction of `run_url`, `sha`, and `failed_agents` -- Unit test dispatch payload construction - -### GitHub Workflow - -- Add `workflow_dispatch` inputs mirroring the dispatch payload for manual testing -- Smoke test against a known failed E2E run URL -- Verify success path posts to Slack thread -- Verify invalid payload path exits early and reports clearly - -### Non-Goals for Testing - -- Do not run real E2E reruns as part of this workflow -- Do not test code-fixing behavior in this first version - -## Recommended Implementation Order - -1. Add the new GitHub Actions workflow with manual `workflow_dispatch` -2. Prove the workflow can run `/e2e:triage-ci` against a known failed CI run URL -3. Add Slack thread notification hooks for started/succeeded/failed states -4. Build the Slack app that validates the thread reply and sends `repository_dispatch` -5. 
Tighten the original E2E Slack alert format if parsing proves brittle - -## Open Decisions Resolved - -- Trigger phrase: exact match `triage e2e` -- Execution environment: GitHub Actions -- Triage source of truth: [`.claude/skills/e2e/triage-ci.md`](../../.claude/skills/e2e/triage-ci.md) -- Invocation surface: [`.claude/plugins/e2e/commands/triage-ci.md`](../../.claude/plugins/e2e/commands/triage-ci.md) -- Initial scope: artifact-based triage of an existing failed CI run, not automatic fixing From 8551955a3a69609c6af8e0b7c9830a386a2f77ce Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Sat, 21 Mar 2026 09:47:46 -0700 Subject: [PATCH 31/32] =?UTF-8?q?fix:=20address=20code=20review=20?= =?UTF-8?q?=E2=80=94=20revert=20e2e.yml,=20add=20rerun=20toggle,=20update?= =?UTF-8?q?=20triage=20tools?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Revert e2e.yml jq change to match main. Delete unused run-e2e-triage.sh script. Add `rerun` boolean input to e2e-triage.yml that installs agent CLIs and enables Bash tools for flaky detection via test re-runs. Update plan generation step with EnterPlanMode, Write, and Bash(mise:*) tools. 
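
For illustration: with the `rerun` toggle, only one of the two triage steps runs in a given job, so a later step has to pick whichever one was not skipped. The following is a minimal standalone sketch of that selection, mirroring the `Resolve triage outcome` step in this patch. The function name `resolve_triage_outcome` and its positional arguments are assumptions made for the example; the workflow itself reads these values from environment variables.

```shell
# Hypothetical helper (not part of the workflow): choose between the
# analysis-only step and the rerun step, preferring whichever actually ran.
# Args: analysis_outcome analysis_exec_file rerun_outcome rerun_exec_file
resolve_triage_outcome() (
  # Subshell body keeps these variables from leaking into the caller.
  outcome="${1:-skipped}"   # empty or unset is treated as "skipped"
  exec_file="${2:-}"
  if [ "$outcome" = "skipped" ]; then
    # The analysis step did not run, so fall back to the rerun step.
    outcome="${3:-skipped}"
    exec_file="${4:-}"
  fi
  printf 'outcome=%s\nexecution_file=%s\n' "$outcome" "$exec_file"
)

# Example: the analysis step was skipped because rerun was enabled.
resolve_triage_outcome skipped "" success /tmp/rerun-execution.json
```

In the workflow the two printed lines would be appended to `$GITHUB_OUTPUT`, which is why later steps read `steps.triage.outputs.outcome` rather than `steps.triage.outcome`.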
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 2f2c12c4092e --- .github/workflows/e2e-triage.yml | 85 +++++++++++++++++++++++++++++--- .github/workflows/e2e.yml | 6 +-- scripts/run-e2e-triage.sh | 30 ----------- 3 files changed, 79 insertions(+), 42 deletions(-) delete mode 100755 scripts/run-e2e-triage.sh diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index faea38f52..f53088be5 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -23,6 +23,11 @@ on: description: Slack thread timestamp for replies required: false type: string + rerun: + description: 'Re-run failing tests locally to detect flaky (costs API tokens)' + required: false + type: boolean + default: false permissions: actions: read @@ -112,7 +117,7 @@ jobs: triage: needs: [matrix-setup] runs-on: ubuntu-latest - timeout-minutes: 45 + timeout-minutes: ${{ inputs.rerun == true && 90 || 45 }} env: RUN_URL: ${{ needs.matrix-setup.outputs.run_url }} E2E_AGENT: ${{ matrix.agent }} @@ -174,14 +179,44 @@ jobs: - name: Download E2E artifacts id: artifacts + if: inputs.rerun != true env: GH_TOKEN: ${{ github.token }} run: | artifact_path="$(scripts/download-e2e-artifacts.sh "$RUN_URL")" echo "path=$artifact_path" >> "$GITHUB_OUTPUT" - - name: Run triage - id: triage + - name: Install system dependencies + if: inputs.rerun == true + run: sudo apt-get update && sudo apt-get install -y tmux + + - name: Install agent CLI + if: inputs.rerun == true + run: | + case "${{ matrix.agent }}" in + claude-code) curl -fsSL https://claude.ai/install.sh | bash ;; + opencode) curl -fsSL https://opencode.ai/install | bash ;; + gemini-cli) npm install -g @google/gemini-cli ;; + cursor-cli) curl https://cursor.com/install -fsS | bash ;; + factoryai-droid) curl -fsSL https://app.factory.ai/cli | sh ;; + copilot-cli) npm install -g @github/copilot ;; + roger-roger) ;; # installed by mise (see mise.toml) + esac + echo "$HOME/.local/bin" >> $GITHUB_PATH + 
+ - name: Bootstrap agent + if: inputs.rerun == true && matrix.agent != 'roger-roger' + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} + CURSOR_API_KEY: ${{ secrets.CURSOR_API_KEY }} + FACTORY_API_KEY: ${{ secrets.FACTORY_API_KEY }} + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} + run: go run ./e2e/bootstrap + + - name: Run triage (analysis only) + id: triage_analysis + if: inputs.rerun != true uses: anthropics/claude-code-action@v1 with: prompt: | @@ -190,6 +225,42 @@ jobs: claude_args: "--allowedTools 'Read,Grep,Glob'" display_report: true + - name: Run triage (with re-runs) + id: triage_rerun + if: inputs.rerun == true + uses: anthropics/claude-code-action@v1 + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} + CURSOR_API_KEY: ${{ secrets.CURSOR_API_KEY }} + FACTORY_API_KEY: ${{ secrets.FACTORY_API_KEY }} + COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }} + with: + prompt: | + /e2e:triage-ci ${{ env.RUN_URL }} --agent ${{ matrix.agent }} --sha ${{ needs.matrix-setup.outputs.sha }} + anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} + claude_args: "--allowedTools 'Read,Grep,Glob,Bash(mise:*),Bash(scripts:*)'" + display_report: true + + - name: Resolve triage outcome + id: triage + if: always() + shell: bash + env: + ANALYSIS_OUTCOME: ${{ steps.triage_analysis.outcome }} + ANALYSIS_EXEC: ${{ steps.triage_analysis.outputs.execution_file }} + RERUN_OUTCOME: ${{ steps.triage_rerun.outcome }} + RERUN_EXEC: ${{ steps.triage_rerun.outputs.execution_file }} + run: | + outcome="${ANALYSIS_OUTCOME:-skipped}" + exec_file="${ANALYSIS_EXEC}" + if [ "$outcome" = "skipped" ]; then + outcome="${RERUN_OUTCOME:-skipped}" + exec_file="${RERUN_EXEC}" + fi + echo "outcome=$outcome" >> "$GITHUB_OUTPUT" + echo "execution_file=$exec_file" >> "$GITHUB_OUTPUT" + - name: Extract triage output id: triage_output if: always() && 
steps.triage.outputs.execution_file != '' @@ -208,13 +279,13 @@ jobs: - name: Generate fix plan id: plan - if: steps.triage.outcome == 'success' + if: steps.triage.outputs.outcome == 'success' uses: anthropics/claude-code-action@v1 with: prompt: | /e2e:implement Read the triage findings at ${{ github.workspace }}/e2e-triage-artifacts/${{ matrix.agent }}/triage.md for agent ${{ matrix.agent }}. anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} - claude_args: "--allowedTools 'Read,Grep,Glob'" + claude_args: "--allowedTools 'Read,Grep,Glob,Write,EnterPlanMode,ExitPlanMode,Bash(mise:*)'" display_report: true - name: Extract plan output @@ -233,10 +304,10 @@ jobs: "$EXECUTION_FILE" > "$PLAN_FILE" - name: Post triage completion - if: ${{ always() && (steps.triage.outcome == 'success' || steps.triage.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} + if: ${{ always() && (steps.triage.outputs.outcome == 'success' || steps.triage.outputs.outcome == 'failure') && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} shell: bash env: - TRIAGE_OUTCOME: ${{ steps.triage.outcome }} + TRIAGE_OUTCOME: ${{ steps.triage.outputs.outcome }} run: | set -euo pipefail diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml index a0d9c9e84..18d39a5ff 100644 --- a/.github/workflows/e2e.yml +++ b/.github/workflows/e2e.yml @@ -124,11 +124,7 @@ jobs: GH_TOKEN: ${{ github.token }} run: | failed=$(gh api repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs \ - --jq '[.jobs[] - | select(.conclusion == "failure") - | (.name | (try capture("\\((?<agent>[^)]+)\\)").agent catch null)) - | select(.
!= null) - ] | join(", ")') + --jq '[.jobs[] | select(.conclusion == "failure") | .name | capture("\\((?<agent>[^)]+)\\)") | .agent] | join(", ")') echo "agents=$failed" >> "$GITHUB_OUTPUT" - name: Notify Slack of E2E failure diff --git a/scripts/run-e2e-triage.sh b/scripts/run-e2e-triage.sh deleted file mode 100755 index 04b6250ed..000000000 --- a/scripts/run-e2e-triage.sh +++ /dev/null @@ -1,30 +0,0 @@ -#!/usr/bin/env bash - -set -euo pipefail - -: "${RUN_URL:?RUN_URL is required}" -: "${E2E_AGENT:?E2E_AGENT is required}" -: "${TRIAGE_OUTPUT_FILE:?TRIAGE_OUTPUT_FILE is required}" - -mkdir -p "$(dirname "$TRIAGE_OUTPUT_FILE")" - -# Download artifacts before invoking Claude so it only needs read-only access -artifact_path="$(scripts/download-e2e-artifacts.sh "$RUN_URL")" - -triage_args="/e2e:triage-ci ${artifact_path} --agent ${E2E_AGENT}" -if [ -n "${TRIAGE_SHA:-}" ]; then - triage_args="${triage_args} --sha ${TRIAGE_SHA}" -fi - -claude \ - --plugin-dir .claude/plugins/e2e \ - --output-format text \ - --allowedTools \ - "Read" \ - "Grep" \ - "Glob" \ - -p "$triage_args" \ - 2>&1 | sed 's/\x1b\[[0-9;]*[a-zA-Z]//g; s/\x1b\[[?][0-9]*[a-zA-Z]//g' \ - > "$TRIAGE_OUTPUT_FILE" - -echo "Triage complete for ${E2E_AGENT}. See the Job Summary tab for the rendered report." From 9ddf6ccb773339c622b63fa6f52dd29d21939dcb Mon Sep 17 00:00:00 2001 From: Alisha Kawaguchi Date: Mon, 23 Mar 2026 10:00:56 -0700 Subject: [PATCH 32/32] fix: extract shared Slack helper script and fix shell injection Extract duplicated Slack post-message heredoc from e2e-triage.yml and e2e-fix.yml into scripts/post-slack-message.sh. Fix shell injection in Python URL-encoding by reading RUN_URL from os.environ instead of interpolating into the code string.
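
To illustrate the injection fix: when the URL is spliced into the Python source, a single quote in the value ends the string literal and whatever follows is parsed as code. Passed through the environment, the value stays data. A minimal sketch, assuming `python3` is on the PATH; the function name `urlencode_run_url` is hypothetical and only wraps the one-liner used in the workflow.

```shell
# Hypothetical wrapper around the fixed one-liner: the URL reaches Python
# via the environment, so it is never parsed as part of the program text.
urlencode_run_url() {
  RUN_URL="$1" python3 -c \
    'import os, urllib.parse; print(urllib.parse.quote(os.environ["RUN_URL"], safe=""))'
}

# A value containing a quote is simply percent-encoded (the quote becomes %27).
urlencode_run_url "https://github.com/entireio/cli/actions/runs/1'x"
```

With the earlier form, `quote('$RUN_URL', safe='')`, the same input would have closed the Python string at the embedded quote.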
Co-Authored-By: Claude Opus 4.6 (1M context) Entire-Checkpoint: 91d990ef6fb1 --- .github/workflows/e2e-fix.yml | 46 +++++------------------------- .github/workflows/e2e-triage.yml | 48 ++++++-------------------------- scripts/post-slack-message.sh | 24 ++++++++++++++++ 3 files changed, 39 insertions(+), 79 deletions(-) create mode 100755 scripts/post-slack-message.sh diff --git a/.github/workflows/e2e-fix.yml b/.github/workflows/e2e-fix.yml index 860e491b8..c110c948b 100644 --- a/.github/workflows/e2e-fix.yml +++ b/.github/workflows/e2e-fix.yml @@ -43,37 +43,10 @@ jobs: SLACK_CHANNEL: ${{ inputs.slack_channel }} SLACK_THREAD_TS: ${{ inputs.slack_thread_ts }} steps: - - name: Write Slack helper - if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} - shell: bash - run: | - set -euo pipefail - - helper="$RUNNER_TEMP/post-slack-message.sh" - cat > "$helper" <<'EOF' - #!/usr/bin/env bash - set -euo pipefail - - text="${1:?message is required}" - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg thread_ts "$SLACK_THREAD_TS" \ - --arg text "$text" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" - - if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")"; then - echo "warning: slack notification failed" >&2 - exit 0 - fi - - if ! jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "warning: slack notification returned non-ok response" >&2 - fi - EOF - chmod +x "$helper" + - name: Checkout repository + uses: actions/checkout@v6 + with: + fetch-depth: 0 - name: Post fix started if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} @@ -83,12 +56,7 @@ jobs: run: | set -euo pipefail - "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E fix for \`${FAILED_AGENTS}\`." 
- - - name: Checkout repository - uses: actions/checkout@v6 - with: - fetch-depth: 0 + scripts/post-slack-message.sh "Starting E2E fix for \`${FAILED_AGENTS}\`." - name: Setup mise uses: jdx/mise-action@v4 @@ -157,7 +125,7 @@ jobs: message="E2E fix complete — changes applied but no PR was created. Check the <${RUN_URL}|workflow run> for details." fi - "$RUNNER_TEMP/post-slack-message.sh" "$message" + scripts/post-slack-message.sh "$message" - name: Post failure to Slack if: failure() && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' @@ -169,4 +137,4 @@ jobs: message="E2E fix failed. Check the <${RUN_URL}|workflow run> for details." - "$RUNNER_TEMP/post-slack-message.sh" "$message" + scripts/post-slack-message.sh "$message" diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml index f53088be5..e26a2bb23 100644 --- a/.github/workflows/e2e-triage.yml +++ b/.github/workflows/e2e-triage.yml @@ -129,37 +129,10 @@ jobs: matrix: agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }} steps: - - name: Write Slack helper - if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} - shell: bash - run: | - set -euo pipefail - - helper="$RUNNER_TEMP/post-slack-message.sh" - cat > "$helper" <<'EOF' - #!/usr/bin/env bash - set -euo pipefail - - text="${1:?message is required}" - payload="$(jq -n \ - --arg channel "$SLACK_CHANNEL" \ - --arg thread_ts "$SLACK_THREAD_TS" \ - --arg text "$text" \ - '{channel: $channel, thread_ts: $thread_ts, text: $text}')" - - if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ - -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ - -H 'Content-type: application/json; charset=utf-8' \ - --data "$payload")"; then - echo "warning: slack notification failed" >&2 - exit 0 - fi - - if ! 
jq -e '.ok == true' >/dev/null <<<"$response"; then - echo "warning: slack notification returned non-ok response" >&2 - fi - EOF - chmod +x "$helper" + - name: Checkout repository + uses: actions/checkout@v6 + with: + fetch-depth: 1 - name: Post triage started if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }} @@ -167,12 +140,7 @@ jobs: run: | set -euo pipefail - "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>." - - - name: Checkout repository - uses: actions/checkout@v6 - with: - fetch-depth: 1 + scripts/post-slack-message.sh "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>." - name: Setup mise uses: jdx/mise-action@v4 @@ -317,7 +285,7 @@ jobs: message="E2E triage failed for \`$E2E_AGENT\`." fi - "$RUNNER_TEMP/post-slack-message.sh" "$message" + scripts/post-slack-message.sh "$message" - name: Post fix plan to Slack if: steps.plan_output.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' @@ -332,7 +300,7 @@ jobs: summary="$(head -20 "$PLAN_FILE" | sed '/^$/d' | head -5)" # Construct Fix It URL - encoded_run_url="$(python3 -c "import urllib.parse; print(urllib.parse.quote('$RUN_URL', safe=''))")" + encoded_run_url="$(python3 -c "import urllib.parse, os; print(urllib.parse.quote(os.environ['RUN_URL'], safe=''))")" fix_url="https://e2e-triage.entireio.workers.dev/fix?triage_run_id=${TRIAGE_RUN_ID}&run_url=${encoded_run_url}&failed_agents=${E2E_AGENT}&slack_channel=${SLACK_CHANNEL}&slack_thread_ts=${SLACK_THREAD_TS}" message="Fix plan ready for \`$E2E_AGENT\`: @@ -340,7 +308,7 @@ jobs: <${fix_url}|Fix It> — applies the plan and creates a draft PR" - "$RUNNER_TEMP/post-slack-message.sh" "$message" + scripts/post-slack-message.sh "$message" - name: Upload triage output if: always() diff --git a/scripts/post-slack-message.sh b/scripts/post-slack-message.sh new file mode 100755 index 000000000..76546dc0b --- 
/dev/null +++ b/scripts/post-slack-message.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Post a message to a Slack thread using the chat.postMessage API. +# Requires SLACK_BOT_TOKEN, SLACK_CHANNEL, and SLACK_THREAD_TS env vars. + +text="${1:?message is required}" +payload="$(jq -n \ + --arg channel "$SLACK_CHANNEL" \ + --arg thread_ts "$SLACK_THREAD_TS" \ + --arg text "$text" \ + '{channel: $channel, thread_ts: $thread_ts, text: $text}')" + +if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \ + -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \ + -H 'Content-type: application/json; charset=utf-8' \ + --data "$payload")"; then + echo "warning: slack notification failed" >&2 + exit 0 +fi + +if ! jq -e '.ok == true' >/dev/null <<<"$response"; then + echo "warning: slack notification returned non-ok response" >&2 +fi
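
A sketch of the payload-building step of `scripts/post-slack-message.sh` in isolation, which may be useful for checking quoting locally without calling Slack. It assumes `jq` is installed; the function name `build_slack_payload` is hypothetical, and the channel and timestamp values are placeholders borrowed from the design doc's example payload.

```shell
# Hypothetical helper: build the same JSON body the script sends to
# chat.postMessage, but print it instead of posting it.
# Reads SLACK_CHANNEL and SLACK_THREAD_TS from the environment.
build_slack_payload() {
  text="${1:?message is required}"
  # --arg passes each value as a JSON string, so quotes and backticks in the
  # message are escaped by jq rather than by hand. -c prints compact JSON.
  jq -cn \
    --arg channel "$SLACK_CHANNEL" \
    --arg thread_ts "$SLACK_THREAD_TS" \
    --arg text "$text" \
    '{channel: $channel, thread_ts: $thread_ts, text: $text}'
}

# Example with placeholder values; nothing is sent over the network.
SLACK_CHANNEL=C123456 SLACK_THREAD_TS=1742230000.123456 \
  build_slack_payload 'Starting E2E triage for `cursor-cli`.'
```

Because the real script exits 0 on a failed `curl` and only warns on a non-ok response, a Slack outage does not fail the triage job; a dry run like this one checks the message body, not delivery.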