feat: add functional test framework for agent pipelines by ralphbean · Pull Request #1682 · fullsend-ai/fullsend

ralphbean · 2026-05-29T00:26:28Z

Summary

Adds functional test framework that exercises the full agent pipeline (pre-script → agent → post-script) by observing side effects on GitHub fixtures
Ephemeral repos with UUID suffixes ensure concurrent test runs don't interfere
Uses agent-eval-harness for LLM-graded scoring via CLI runner interface
First test case (001-bug-url-encoding) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description
Behavioral thresholds (max_turns, max_cost_usd) enforced universally by the orchestrator — agents that pass quality judges but loop or burn tokens fail the test

What's in the box

File	Purpose
`eval/fullsend-runner.sh`	CLI runner: ephemeral repo → fixture → fullsend run → capture state → teardown
`eval/run-functional.sh`	Orchestrator: iterate cases, validate thresholds, call runner, score with agent-eval-harness
`eval/triage/eval.yaml`	Test config: LLM judge (1-5 rubric) + deterministic label check
`eval/triage/cases/001-*`	Test case with tricky scenario (regex accepts `+` but issue claims it doesn't)
`eval/triage/repos/python-webapp/`	Shared repo content, symlinked by cases
`.github/workflows/functional-tests.yml`	CI workflow triggered by `eval/` or `internal/scaffold/` changes
`Makefile`	`make functional-tests` target
`internal/cli/progress.go`	Captures `num_turns`, `total_cost_usd`, token counts from Claude Code stream events
`internal/cli/run.go`	Writes `metrics.json` with aggregated behavioral metrics after all iterations
`docs/testing/functional-tests.md`	Documents threshold requirements for test authors

Behavioral thresholds

Every test case declares max_turns and max_cost_usd in annotations.yaml. The orchestrator validates these before running and compares against metrics.json after each run. This catches efficiency regressions (looping agents, model cost spikes) that quality judges can't see.

Thresholds are universal invariants — not per-skill judges — so every skill gets them automatically and new skills can't opt out.

Current results

The LLM judge scores the triage agent at 3/5 — correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles +. Threshold set to 2.5 with a TODO to raise it.

Still TODO

Set up evals GitHub environment with secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) and vars (EVAL_ORG, ANTHROPIC_VERTEX_PROJECT_ID)
Refactor case layout to match agent-eval-harness execute.py expectations so we can use their parallelism
Add more test cases (feature request, vague bug, PR review)
Per-agent path filtering in CI (only run triage tests when triage-related files change)

Refs #499, #73
Closes #346

🤖 Generated with Claude Code

github-actions · 2026-05-29T00:27:54Z

Site preview

Preview: https://8567a3ef-site.fullsend-ai.workers.dev

Commit: 722af5c700bd493fa7f0883a7494ceb68db10721

fullsend-ai-review · 2026-05-29T00:32:22Z

Review

Findings

Medium

[protected-path] .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[logic-error] eval/run-functional.sh:27 — The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.
[internal-inconsistency] docs/ADRs/0051-agent-eval-harness-for-test-infrastructure.md — The YAML frontmatter title says "47. agent-eval-harness" and the H1 heading says "# 47." but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052. The four-digit gap means cross-references by number are ambiguous — readers cannot tell whether "47" refers to ADR file 0047 or this document.
Remediation: Update the YAML title and H1 heading in 0051-*.md to "51." and in 0052-*.md to "52." to match the filenames.
[edge-case] eval/scripts/setup-fixture.sh:97 — The ephemeral repo is created with gh repo create (line 97) but .hook-outputs.yaml (which propagates EPHEMERAL_REPO to the teardown hook) is written at the end of the script (line 178). If any step between repo creation and the YAML write fails under set -e, the teardown hook has no EPHEMERAL_REPO to delete, leaving an orphaned repo.
Remediation: Register a trap in setup-fixture.sh to delete the ephemeral repo on failure, or write .hook-outputs.yaml (with at least EPHEMERAL_REPO) immediately after repo creation.
[submodule-dependency] .gitmodules — The git submodule dependency points to ralphbean/agent-eval-harness (a personal fork) on branch worktree-execution-hooks, not the upstream opendatahub-io/agent-eval-harness. ADR 0051 references the upstream. A personal fork can be force-pushed, deleted, or transferred, creating a maintenance risk for the test infrastructure.
Remediation: Point the submodule URL to the upstream or an org fork (fullsend-ai/agent-eval-harness). If the fork contains unreleased features, document the dependency and a migration plan.

Low

[technical-doc-inaccuracy] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:66 — The design spec references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The "Files changed" table also references internal/cli/progress_test.go instead of internal/runtime/claude_progress_test.go.
[yaml-injection] eval/scripts/setup-fixture.sh:146 — The .hook-outputs.yaml heredoc interpolates shell variables without YAML escaping. The attack surface is limited to developers with commit access crafting malicious test fixtures, but quoting values would be defense-in-depth.
[edge-case] eval/scripts/run-fullsend.sh:89 — The metrics.json copy logic uses find ... | head -1 with non-deterministic ordering. In practice the right file should be found since run.go writes directly to runDir, but a direct path would be more robust.
[test-adequacy] internal/runtime/claude_progress_test.go — Tests cover normal result events and absent result events but do not test malformed JSON in the usage sub-object. json.Unmarshal silently zero-defaults missing fields.
[ephemeral-repo-visibility] eval/scripts/setup-fixture.sh:97 — Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist indefinitely. Content is not sensitive, but --private would be defense-in-depth.
[go-error-wrapping] internal/cli/run.go:87 — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
Remediation: Wrap with fmt.Errorf("marshaling metrics: %w", err).
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing eval/ or internal/scaffold/ should know about this test target.
Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.
[missing-doc] docs/guides/README.md:37 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
Remediation: Add a link to functional-tests.md in the Development section.

Previous run

Review

Findings

Medium

[protected-path] .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[logic-error] eval/run-functional.sh:27 — The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.
[stale-reference] docs/ADRs/0052-functional-tests-for-agent-pipelines.md:99 — The directory layout diagram shows fullsend-runner.sh as a top-level file under eval/, but the actual file created by this PR is eval/scripts/run-fullsend.sh. The docs/testing/functional-tests.md doc correctly shows the scripts/ subdirectory layout.
Remediation: Update the ADR diagram from fullsend-runner.sh to scripts/run-fullsend.sh.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The files-changed table also references internal/cli/progress_test.go instead of internal/runtime/claude_progress_test.go.
Remediation: Update all references to use the correct file paths.
[internal-inconsistency] docs/ADRs/0051-agent-eval-harness-for-test-infrastructure.md — The YAML frontmatter title says "47. agent-eval-harness" and the H1 heading says "# 47." but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052.
Remediation: Align the title/heading numbers with the filename numbers (51 and 52).

Low

[secrets-in-artifacts] eval/scripts/run-fullsend.sh:83 — The .eval-env file containing credentials is written to $OUTPUT_DIR. Three cleanup layers exist (trap, explicit rm, workflow scrub with if: always()), and the artifact upload excludes .eval-env. Risk is low but writing the env file outside the artifact upload path (e.g., /tmp) would eliminate it entirely.
[forge-abstraction-violation] eval/scripts/setup-fixture.sh — Eval scripts use gh CLI directly for GitHub API operations. The AGENTS.md forge abstraction requirement targets Go application code (cmd/, internal/), not shell-based test infrastructure. Documenting this exception would clarify intent.
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing eval/ or internal/scaffold/ should know about this test target.
Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.
[missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
Remediation: Add a link to functional-tests.md in the Development section.
[go-error-wrapping] internal/cli/run.go:87 — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
Remediation: Wrap with fmt.Errorf("marshaling metrics: %w", err).

Labels: PR adds git submodule for agent-eval-harness

Previous run (2)

Review

Findings

Medium

[protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

Low

[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). The test guidance is scoped to Go code changes, but developers changing eval/ or internal/scaffold/ should know about this test target.
Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.
[silent-pass-on-failure] eval/triage/eval.yaml — If the Claude Code result event is malformed, progressParser silently ignores the parse error and leaves metrics at zero. Zero values pass positive thresholds (0 <= 30), meaning a broken metrics pipeline silently passes behavioral checks. No test covers this path. See also: [test-adequacy] finding at internal/runtime/claude_progress_test.go.
[go-error-wrapping] internal/cli/run.go — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible but defense-in-depth would sanitize before interpolation.
[missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
[forge-abstraction-violation] eval/scripts/setup-fixture.sh — Eval scripts (setup-fixture.sh, capture-fixture.sh, teardown-fixture.sh) use gh CLI directly. The AGENTS.md forge abstraction rule targets Go application code, not shell-based test infrastructure, but documenting this exception would clarify intent.

Info

[permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.
[secrets-handling] eval/scripts/run-fullsend.sh — Env file handling is well-layered: mode 0600, trap cleanup, explicit delete, workflow scrub, artifact exclusion.
[authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.

Previous run (3)

Review

Findings

Medium

[protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.
[go-error-wrapping] internal/cli/run.go — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible but defense-in-depth would sanitize before interpolation.
[missing-doc] docs/guides/README.md:38 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

Info

[test-adequacy] internal/runtime/claude_progress_test.go — Tests cover the happy path and absent-result-event cases but do not test malformed result events. The code handles this gracefully via json.Unmarshal failing silently.
[permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.
[prompt-injection-defense] eval/triage/eval.yaml — The LLM judge prompt includes injection defense instructions for evaluating untrusted agent output. Reasonable for the eval context.
[authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.

Labels: PR adds functional test CI workflow and modifies agent runner metrics emission in Go code.

Previous run (4)

Review

Findings

Critical

[adr-numbering-conflict] docs/ADRs/0049-agent-eval-harness-for-test-infrastructure.md — ADR numbering conflict. The PR introduces ADR 0049 (agent-eval-harness) and ADR 0050 (functional tests), but ADR 0049 already exists on main as 0049-agent-configuration-env-var-convention.md. This creates a collision — two files with the same ADR number. The highest ADR on main is 0049, so new ADRs should start at 0050.
Remediation: Renumber to 0050 and 0051. Update all cross-references in docs/architecture.md, docs/problems/testing-agents.md, the design spec, and the ADR cross-references (ADR 0049 references ADR 0050, which would become 0050→0051).

Medium

[protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.
[edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path, so this is latent.
[secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion). Leakage is unlikely but creating the file outside the artifact upload path would be additional defense-in-depth.
[go-error-wrapping] internal/cli/run.go — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
[missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

Info

[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible.
[test-adequacy] internal/runtime/claude_progress_test.go — Tests cover the happy path and absent-result-event cases but do not test malformed result events. The code handles this gracefully via json.Unmarshal failing silently.
[permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.
[prompt-injection-defense] eval/triage/eval.yaml — The LLM judge prompt includes injection defense instructions for evaluating untrusted agent output. Reasonable for the eval context.
[authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.

Previous run (5)

Review

Findings

Medium

[protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The file paths table was corrected but this heading was not updated.
Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), EVAL_ORG is user-provided and not sanitized. Defense-in-depth would sanitize before interpolation.
[secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion). Leakage is unlikely but creating the file outside the artifact upload path would be additional defense-in-depth.
[edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path, so this is latent.
[go-error-wrapping] internal/cli/run.go — writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
[scope-coherence] eval/scripts/setup-fixture.sh — Test infrastructure hardcodes GitHub-specific operations but input.yaml declares a forge field suggesting multi-forge intent. This is standard YAGNI-compliant development (build the extension point, implement what you need now), but documenting that functional tests are currently GitHub-only would clarify intent.
[shell-script-output-redirection] eval/scripts/teardown-fixture.sh:18 — Error redirection pattern 2>&1 merges stderr into stdout without capturing the output. The error is still visible in CI logs, but capturing it would produce clearer diagnostics.
[missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
[naming-alignment] docs/ADRs/0050-functional-tests-for-agent-pipelines.md — "Functional tests" naming creates potential ambiguity with traditional integration tests. The ADR addresses this with a dedicated naming note, which is a reasonable documented choice.
[abstraction-fit] eval/.agent-eval-harness — Introduces Python dependency in a primarily Go codebase. ADR 0049 documents the rationale. Accepted trade-off.

Info

[test-adequacy] internal/runtime/claude_progress_test.go — Tests cover the happy path and absent-result-event cases but do not test malformed result events. The code handles this gracefully via json.Unmarshal failing silently.
[permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.
[prompt-injection-defense] eval/triage/eval.yaml — The LLM judge prompt includes injection defense instructions for evaluating untrusted agent output. Reasonable for the eval context.
[authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.

Previous run (6)

Review

Findings

Medium

[protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (centralizing OpenShell version management, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[adr-numbering-inconsistency] docs/ADRs/0049-agent-eval-harness-for-test-infrastructure.md:12 — The H1 heading says # 47. but the filename is 0049 and the frontmatter title says 49.. The heading number is wrong.
Remediation: Change the H1 from # 47. to # 49..
[adr-numbering-inconsistency] docs/ADRs/0050-functional-tests-for-agent-pipelines.md:12 — The H1 heading says # 48. but the filename is 0050 and the frontmatter title says 50.. The heading number is wrong.
Remediation: Change the H1 from # 48. to # 50..
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:79 — The 'Files changed' table references wrong file paths: eval/fullsend-runner.sh (actual: eval/scripts/run-fullsend.sh), internal/cli/progress.go (actual: internal/runtime/claude_progress.go), internal/cli/progress_test.go (actual: internal/runtime/claude_progress_test.go).
Remediation: Update the file paths in the table to match actual locations.
[stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:109 — Section 5 references 'ADR 0048' but ADR 0048 is 0048-automatic-updates.md (unrelated). The intended reference is ADR 0050 (functional tests). The H1 numbering bugs in the ADRs above likely caused this confusion.
Remediation: Change 'ADR 0048' to 'ADR 0050' in section 5.
[missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

[documentation-accuracy] docs/testing/functional-tests.md:207 — The CI integration section states tests require the evals GitHub environment but the workflow specifies environment: functional-tests.
Remediation: Change evals to functional-tests.
[documentation-accuracy] docs/testing/functional-tests.md:207 — The documented secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) don't match the workflow, which uses E2E_GCP_WIF_PROVIDER, E2E_GCP_SERVICE_ACCOUNT, and E2E_GCP_PROJECT_ID.
Remediation: Update the documented secret names to match the workflow.
[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:14 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), sanitizing before interpolation would be defense-in-depth.
[secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. The five defense layers (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion) make leakage very unlikely, but creating the file outside the artifact upload path would be additional defense-in-depth.
[edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), the script would attempt gh pr create with no diff from main, which would fail. No current test case exercises this path, so this is latent.

Info

[permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.
[test-adequacy] internal/runtime/claude_progress_test.go — Tests cover the happy path and absent-result-event cases but do not test malformed result events. The code handles this gracefully via json.Unmarshal failing silently.
[authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.
[prompt-injection-defense] eval/triage/eval.yaml — The LLM judge prompt includes injection defense instructions for evaluating untrusted agent output. Reasonable for the eval context.

Previous run (7)

Review

Reason: stale-head

The review agent reviewed commit b1d8849d91502488e596ad415469b1474068201c but the PR HEAD is now 2fc80f184de40de4a9db3c03d7512451ef2b863f. This review was discarded to avoid approving unreviewed code.

Previous run (8)

Review

Findings

Medium

[protected-path] .github/scripts/install-openshell.sh, .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (centralizing OpenShell version management, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

Low

[tier-classification] PR title uses feat: but the change is testing infrastructure, not user-facing functionality. Per COMMITS.md, feat populates the Features section of release notes and is reserved for changes end users would recognize as new capability. test: or ci: would be more accurate and would correctly exclude this from release notes under Features.
Remediation: Change PR title prefix from feat: to test:.
[supply-chain] .github/scripts/install-openshell.sh:15 — The install script downloads and pipes a remote shell script to sh from a commit-SHA-pinned URL. The SHA pin prevents silent upstream changes, but there is no checksum verification of the downloaded content.
[supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:17 — The GHA workflow command ::warning::Ephemeral repo $EPHEMERAL_REPO was not deleted interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), sanitizing before interpolation would be defense-in-depth.
[edge-case] eval/scripts/setup-fixture.sh:107 — In the pull_request path, if FIXTURE_FILES is empty and repo/ content was already committed on the base branch, the diff --cached guard correctly skips the commit. However, gh pr create will then fail because the PR branch has no new commits pushed to the remote. This is a test authoring error (PR fixture with no file changes), so the failure is appropriate but the error message will be opaque.
[documentation-accuracy] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md — The design spec's Files Changed table references internal/cli/progress.go and internal/cli/progress_test.go, but the actual files are internal/runtime/claude_progress.go and internal/runtime/claude_progress_test.go.
[stale-version-reference] docs/plans/agent-execution-environment.md:49 — Example Dockerfile contains outdated OpenShell version (0.0.37-dev) and does not reference the new single source of truth at .github/scripts/openshell-version.sh (currently 0.0.54). Pre-existing staleness, but the new centralized version management makes it more visible.

Info

[prior-finding-resolved] eval/scripts/setup-fixture.sh:119 — Prior medium edge-case finding about unconditional git commit in the pull_request path has been resolved. The if ! git diff --cached --quiet guard was added, matching the pattern used for the initial content commit.
[prior-finding-resolved] eval/scripts/setup-fixture.sh:60 — Prior low data-exposure finding about ephemeral repos created with --public has been resolved. Repos are now created with --private.
[prior-finding-resolved] action.yml:284 — Prior critical finding about bare relative path in composite action has been resolved. The updated code uses $GITHUB_ACTION_PATH/.github/scripts/install-openshell.sh.
[prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path remains resolved. Metrics are written before the early return on error.
[potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.
[go-test-assertion-style] internal/cli/run_test.go — TestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.
[makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.
[permissions] .github/workflows/functional-tests.yml — Workflow permissions are appropriately scoped to contents: read and id-token: write. Uses pull_request (not pull_request_target), so fork PRs cannot access secrets. Secrets are gated behind the functional-tests environment.

Previous run (9)

Review

Findings

Medium

[protected-path] .github/scripts/install-openshell.sh, .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (centralizing OpenShell version management, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.
[edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet. No current test case exercises the pull_request path, but the first PR-type test case with empty fixture.files will hit this.
Remediation: Add the same guard: git -C "$TARGET_DIR" add -A && if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi
[tier-classification] PR title uses feat: but the change is testing infrastructure, not user-facing functionality. Per COMMITS.md, test: or ci: would be more accurate. End users of fullsend don't interact with the functional test framework — it's for contributors testing agent behavior.
Remediation: Change PR title prefix from feat: to test:.

Low

[supply-chain] .github/scripts/install-openshell.sh:15 — The install script downloads and pipes a remote shell script to sh from a commit-SHA-pinned URL. The SHA pin prevents silent upstream changes, but there is no checksum verification of the downloaded content.
[supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

Info

[prior-finding-resolved] action.yml:289 — Prior critical finding about bare relative path in composite action has been resolved. The updated code uses $GITHUB_ACTION_PATH/.github/scripts/install-openshell.sh, and install-openshell.sh resolves its own directory via SCRIPT_DIR before sourcing openshell-version.sh.
[potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.
[go-test-assertion-style] internal/cli/run_test.go — TestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.
[makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.
[prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check remains resolved.
[prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path remains resolved. Metrics are now written before the early return on error.

Previous run (10)

Review

Findings

Critical

[logic-error] action.yml:269 — The refactored "Install OpenShell CLI" step uses source .github/scripts/openshell-version.sh with a bare relative path. action.yml is a composite GitHub Action (using: composite). In composite actions, run: steps execute in $GITHUB_WORKSPACE (the consumer repo's checkout), not in the action's directory. The file openshell-version.sh only exists in the fullsend repo, so any consumer workflow using uses: fullsend-ai/fullsend@... will fail with "file not found". The pre-existing code used inline echo statements that didn't depend on any file.
Remediation: Use ${{ github.action_path }} to reference the action's own directory: source "${{ github.action_path }}/.github/scripts/openshell-version.sh".

Medium

[protected-path] .github/workflows/functional-tests.yml, .github/scripts/openshell-version.sh — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.
[edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

[supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[supply-chain] .github/workflows/functional-tests.yml:122 — pip install --quiet "jsonschema>=4.18.0" uses a floor pin rather than an exact version pin.
[data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.
[stale-reference] docs/plans/agent-execution-environment.md:49 — Dockerfile example hardcodes OPENSHELL_VERSION=0.0.37-dev, but this PR introduces .github/scripts/openshell-version.sh as the single source of truth (currently 0.0.54). Pre-existing staleness, but the new centralized version management makes it more visible.

Info

[potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.
[go-test-assertion-style] internal/cli/run_test.go — TestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.
[makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.
[tier-classification] PR title uses feat: but adds testing infrastructure. test: or ci: would be more accurate per Conventional Commits.
[prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check has been resolved. The existence check now correctly precedes all yq invocations.
[prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. Metrics are now written before the early return on error.

Previous run (11)

Review

Findings

Critical

[logic-error] action.yml:269 — The refactored "Install OpenShell CLI" step uses source .github/scripts/openshell-version.sh with a bare relative path. action.yml is a composite GitHub Action (using: composite). In composite actions, run: steps execute in $GITHUB_WORKSPACE (the consumer repo's checkout), not in the action's directory. The file openshell-version.sh only exists in the fullsend repo, so any consumer workflow using uses: fullsend-ai/fullsend@... will fail with "file not found". The pre-existing code used inline echo statements that didn't depend on any file.
Remediation: Use ${{ github.action_path }} to reference the action's own directory: source "${{ github.action_path }}/.github/scripts/openshell-version.sh".

Medium

[protected-path] .github/workflows/functional-tests.yml, .github/scripts/openshell-version.sh — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.
[edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

[supply-chain] .github/workflows/functional-tests.yml:81 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.
[data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.
[supply-chain] .github/workflows/functional-tests.yml:121 — pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.
[go-test-assertion-style] internal/cli/run_test.go — TestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.
[makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.

Info

[potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.
[tier-classification] PR title uses feat: but adds testing infrastructure. test: or ci: would be more accurate per Conventional Commits.
[permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. The documentation states tests require the evals GitHub environment, but the workflow does not declare it.
[prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check has been resolved. The existence check now correctly precedes all yq invocations.
[prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. Metrics are now written before the early return on error.

Previous run (12)

Review

Findings

High

[supply-chain] .gitmodules:5 — The agent-eval-harness git submodule points to a personal fork (https://github.com/ralphbean/agent-eval-harness.git) on branch worktree-execution-hooks, not the upstream repository (https://github.com/opendatahub-io/agent-eval-harness). A personal fork can be force-pushed, deleted, or transferred at any time. The subproject commit SHA provides pinning for the current checkout, but future git submodule update --remote or re-clones will fetch from the personal fork. ADR 0047 documents the upstream as opendatahub-io/agent-eval-harness, creating a discrepancy.
Remediation: Point the submodule URL to the upstream repository or an organization fork (fullsend-ai/agent-eval-harness). If the worktree-execution-hooks branch has unreleased features, merge them upstream first.

Medium

[protected-path] .github/workflows/functional-tests.yml — This PR adds a new GitHub Actions workflow under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.
[logic-error] eval/run-functional.sh:24 — The yq rewrite of EVAL_YAML_SRC (line 24) runs before the existence check (line 27). If the file does not exist, yq fails with a confusing error under set -euo pipefail before the friendly error message is shown.
Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp and yq rewrite block.
[edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

[supply-chain] .github/workflows/functional-tests.yml:56 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.
[data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.
[supply-chain] .github/workflows/functional-tests.yml:148 — pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.
[stale-framework-reference] docs/problems/testing-agents.md:278 — The "Practical architecture" section recommends Inspect AI as the execution framework, but ADR 0047 adopts agent-eval-harness. The file's own caveat ("any choice should be held lightly") reduces impact, but a forward reference to ADR 0047/0048 would help readers.

Info

[permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. The documentation states tests require the evals GitHub environment, but the workflow does not declare it.
[injection] eval/scripts/setup-fixture.sh:68 — Hook-outputs.yaml is generated via heredoc with shell variable interpolation. All interpolated values are controlled by the test framework, so there is no realistic injection vector.
[potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.
[prior-finding-resolved] internal/cli/run.go:838 — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. The author now calls writeMetricsJSON before the early return in the error path.

Previous run (13)

Review

Findings

High

[supply-chain] .gitmodules:5 — The agent-eval-harness git submodule points to a personal fork (https://github.com/ralphbean/agent-eval-harness.git) on branch worktree-execution-hooks, not the upstream repository (https://github.com/opendatahub-io/agent-eval-harness). A personal fork can be force-pushed, deleted, or transferred at any time. The subproject commit SHA provides pinning for the current checkout, but future git submodule update --remote or re-clones will fetch from the personal fork. ADR 0047 documents the upstream as opendatahub-io/agent-eval-harness, creating a discrepancy.
Remediation: Point the submodule URL to the upstream repository or an organization fork (fullsend-ai/agent-eval-harness). If the worktree-execution-hooks branch has unreleased features, merge them upstream first.

Medium

[protected-path] .github/workflows/functional-tests.yml — This PR adds a new GitHub Actions workflow under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.
[logic-error] internal/cli/run.go:838 — When rt.Run() returns an error, runAgent returns early and writeMetricsJSON is never called. The metrics accumulation code correctly runs before the error check, but the write at line ~958 is unreachable on error paths. Downstream consumers (eval behavioral threshold judges) will see "metrics.json not found" instead of partial metrics that would distinguish timeout from zero-turn failures.
Remediation: Move writeMetricsJSON(runDir, aggMetrics) into a defer after aggMetrics is declared, guarded with if aggMetrics.Iterations > 0.
[logic-error] eval/run-functional.sh:24 — The yq rewrite of EVAL_YAML_SRC (line 24) runs before the existence check (line 27). If the file doesn't exist, yq fails with a confusing error under set -euo pipefail before the friendly error message is shown.
Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp and yq rewrite block.
[edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi
[potential-data-race] internal/runtime/runtime.go:12 — The PR adds non-atomic fields (NumTurns int, TotalCostUSD float64, InputTokens int, OutputTokens int) to RunMetrics, while the existing ToolCalls field uses atomic.Int32. The new fields are written by the progressParser goroutine and read from the main goroutine after rt.Run() returns. The existing code's use of atomic.Int32 for ToolCalls suggests the authors were aware of the concurrency concern, making plain types for new fields an inconsistency that should be verified for synchronization safety.
Remediation: Verify the happens-before relationship through the goroutine join. If synchronization is not guaranteed on all paths (including timeout/cancellation), use atomic types or a mutex.

Low

[supply-chain] .github/workflows/functional-tests.yml:56 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.
[stale-framework-reference] docs/problems/testing-agents.md:278 — The "Practical architecture" section recommends Inspect AI as the execution framework, but ADR 0047 adopts agent-eval-harness. The file's own caveat ("any choice should be held lightly") reduces impact, but a forward reference to ADR 0047/0048 would help readers.
[secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.
[data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.
[supply-chain] .github/workflows/functional-tests.yml:148 — pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.
[injection] eval/scripts/setup-fixture.sh:68 — Hook-outputs.yaml is generated via heredoc with shell variable interpolation. The interpolated values are all controlled by the test framework, but using a YAML-aware tool would be more robust.
[edge-case] internal/runtime/claude_progress.go:82 — New result-event parsing correctly follows the defensive pattern (silent skip on malformed JSON via if err == nil guard), consistent with existing assistant-event handling.

Info

[permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. An explicit environment declaration would be defense-in-depth.
[prior-finding-resolved] docs/ADRs/0047-agent-eval-harness-for-test-infrastructure.md — Prior critical findings about ADR numbering collision (0046→0047) and header mismatch have been resolved. ADRs correctly renumbered to 0047 and 0048.

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review

See the review comment for full details.

maruiz93

Nothing blocking. Left a couple of comments.

Either way, I am not a fan of doing LLM judgment in the functional tests, but let's see how it goes.

fullsend-ai-review · 2026-06-18T15:10:38Z

🤖 Finished Review · ✅ Success · Started 3:10 PM UTC · Completed 3:26 PM UTC
Commit: 4b2120a · View workflow run →

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review · 2026-06-18T15:26:23Z

+}
+```
+
+**Implementation:** The `progressParser` in `internal/cli/progress.go`


[medium] stale-reference

The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go.

Suggested fix: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

fullsend-ai-review · 2026-06-18T15:26:23Z

+judges, and baselined roughly until statistical evals provide observed
+distributions.
+
+### 6. `fullsend-runner.sh` propagates `metrics.json`


[low] stale-reference

Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh.

Suggested fix: Change heading from fullsend-runner.sh to run-fullsend.sh.

fullsend-ai-review · 2026-06-18T15:26:23Z

+  git -C "$TARGET_DIR" commit -m "eval: initial content"
+  git -C "$TARGET_DIR" push origin HEAD
+fi
+


[low] edge-case

For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path.

fullsend-ai-review · 2026-06-18T15:26:23Z

+  || rc=$?
+
+if [[ $rc -ne 0 ]]; then
+  echo "WARNING: fullsend run exited with status $rc" >&2


[low] secrets-handling

The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place. Leakage is unlikely.

fullsend-ai-review · 2026-06-18T15:26:23Z

+
+if ! gh repo delete "$EPHEMERAL_REPO" --yes 2>&1; then
+  echo "WARNING: failed to delete $EPHEMERAL_REPO — may need manual cleanup" >&2
+  echo "::warning::Ephemeral repo $EPHEMERAL_REPO was not deleted — manual cleanup required"


[info] gha-workflow-command-injection

The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is set by repo admins, not arbitrary user input. Risk is negligible.

fullsend-ai-review · 2026-06-18T17:07:53Z

🤖 Finished Review · ✅ Success · Started 5:07 PM UTC · Completed 5:20 PM UTC
Commit: 6349a8b · View workflow run →

fullsend-ai-review · 2026-06-18T18:06:14Z

🤖 Finished Review · ✅ Success · Started 6:06 PM UTC · Completed 6:18 PM UTC
Commit: e9e0dbe · View workflow run →

fullsend-ai-review · 2026-06-22T18:35:54Z

🤖 Review · ⚠️ Cancelled · Started 6:35 PM UTC · Ended 6:36 PM UTC
Commit: 4e21a60 · View workflow run →

fullsend-ai-review · 2026-06-22T18:39:28Z

🤖 Finished Review · ✅ Success · Started 6:39 PM UTC · Completed 6:53 PM UTC
Commit: 11cf91d · View workflow run →

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review · 2026-06-22T18:53:41Z

+
+AGENT="${1:?agent name required}"
+EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)"


[medium] logic-error

The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.

Suggested fix: Move the existence check to before the mktemp / yq rewrite block.

fullsend-ai-review · 2026-06-22T18:53:41Z

+```
+eval/
+  fullsend-runner.sh          # CLI runner: fixture setup -> fullsend run -> capture state
+  run-functional.sh           # Orchestrator: iterate cases, score


[medium] stale-reference

The directory layout diagram shows fullsend-runner.sh as a top-level file under eval/, but the actual file is eval/scripts/run-fullsend.sh.

Suggested fix: Update the ADR diagram from fullsend-runner.sh to scripts/run-fullsend.sh.

fullsend-ai-review · 2026-06-22T18:53:41Z

+}
+```
+
+**Implementation:** The `progressParser` in `internal/cli/progress.go`


[medium] stale-reference

The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The files-changed table also has incorrect paths.

Suggested fix: Update all references to use the correct file paths.

fullsend-ai-review · 2026-06-22T18:53:41Z

@@ -0,0 +1,72 @@
+---


[medium] internal-inconsistency

Title/heading numbers (47, 48) do not match filename numbers (0051, 0052).

Suggested fix: Align the title/heading numbers with the filename numbers (51 and 52).

fullsend-ai-review · 2026-06-22T18:53:41Z

+fi
+
+# Remove env file to prevent secrets from being uploaded as artifacts
+rm -f "$ENV_FILE"


[low] secrets-in-artifacts

The .eval-env file containing credentials is written to $OUTPUT_DIR. Three cleanup layers exist but writing to /tmp would eliminate the residual risk.

fullsend-ai-review · 2026-06-22T18:53:41Z

+	ToolCalls  int `json:"tool_calls"`
+}
+
+func writeMetricsJSON(dir string, m aggregateMetrics) error {


[low] go-error-wrapping

writeMetricsJSON returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

Suggested fix: Wrap with fmt.Errorf("marshaling metrics: %%w", err).

Add a functional test framework for agent pipelines using agent-eval-harness lifecycle hooks. The harness drives case iteration with before_each/after_each hooks for ephemeral repo management, while fullsend runs inside openshell sandboxes. Key components: - eval/scripts/setup-fixture.sh: before_each hook creates ephemeral GitHub repos and fixtures (issues/PRs) from input.yaml - eval/scripts/run-fullsend.sh: CLI runner invokes fullsend run - eval/scripts/capture-fixture.sh: after_each hook snapshots fixture state for judges - eval/scripts/teardown-fixture.sh: after_each hook deletes repos - eval/run-functional.sh: orchestrator calling workspace.py, execute.py, and score.py with behavioral threshold checks - eval/triage/: first eval suite with LLM judge and label checks Also includes CI workflow, behavioral thresholds (max_turns, max_cost_usd), metrics capture from Claude Code stream events, ADRs, and documentation. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

- Missing metrics.json is now a FAIL (not a warning), ensuring behavioral thresholds cannot be silently bypassed when the agent crashes or fullsend run fails. - Validate that jq output is numeric before threshold comparison, preventing null/malformed values from silently passing as 0. - Add shellcheck SC2317 disable directives for trap handler commands that shellcheck incorrectly flags as unreachable. Signed-off-by: Ralph Bean <rbean@redhat.com> Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

The harness was referenced twice: once as a git submodule and once as a pip install from the same git URL. Install from the already-checked-out submodule so the fork URL only appears in .gitmodules. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Move max_turns and max_cost_usd checks from custom shell code in run-functional.sh into deterministic check judges in eval.yaml. The harness's score.py now enforces these via min_pass_rate: 1.0 thresholds. Extract the pre-flight annotation validation into a standalone eval/lint-cases.sh linter, wired up as `make lint-eval-cases` and included in `make test`. This runs cheaply without executing agents. Net effect: ~90 lines removed from run-functional.sh. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Extend lint-cases.sh to verify that eval.yaml declares max_turns and max_cost judges, not just that annotations.yaml declares the thresholds. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

The CLI runner receives {output_dir} which is workspace/output. Writing metrics.json to $OUTPUT_DIR/output/ created a double-nested path that score.py couldn't find. Write to $OUTPUT_DIR/metrics.json instead so the file appears at the expected output/metrics.json key. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

The triage agent consistently takes ~23 turns on this case. The previous threshold of 15 was too tight. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Add three new triage eval cases covering the remaining major outcomes: - 002-needs-info-vague-crash: vague issue with no repro steps, expects action "insufficient" and needs-info label - 003-feature-request: clear feature request, expects action "sufficient" with category "feature" and triaged+feature labels (not ready-to-code) - 004-duplicate-issue: issue duplicating a seed issue, expects action "duplicate" with duplicate label Supporting changes: - setup-fixture.sh: support seed_issues in input.yaml for pre-populating issues before the main fixture (needed by duplicate test) - eval.yaml: add forbidden_labels judge to verify wrong labels are NOT applied (needs-info must not get ready-to-code, etc.) - lint-cases.sh: check for forbidden_labels judge presence Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Cases use isolated ephemeral repos with UUID suffixes, so there's no shared state. Sequential execution took ~15 min of agent time across 4 cases; parallel should bring wall-clock down to ~6 min. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

ADR 0045 was taken on main by forge-portable-harness-schema while this branch was out of date. Renumber both branch ADRs and update all cross-references. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

ADR 0046-host-side-api-server-design was merged to main while this branch was open, creating a duplicate number. Renumber 0046-functional-tests-for-agent-pipelines to 0048 and update all cross-references. Also fix stale heading numbers in ADRs 0047 and 0048. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

- Write partial metrics.json on agent failure so downstream judges can inspect behavioral data even when the agent errors out - Add CLOUD_ML_REGION to runner.env in eval.yaml (was only in execution.env) - Let teardown-fixture.sh fail visibly instead of swallowing errors with 2>/dev/null || true — the harness on_failure: continue already handles non-fatal teardown - Distinguish infrastructure failures from agent failures in run-functional.sh by checking for case output before proceeding to scoring - Fix stale reference to eval/fullsend-runner.sh in ADR 0047 (actual path is eval/scripts/run-fullsend.sh) Signed-off-by: Ralph Bean <rbean@redhat.com> Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

Main acquired ADR numbers 0047-0050 while this branch was open. Renumber the PR's ADRs and update all cross-references. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

fullsend-ai-review · 2026-06-22T19:12:09Z

🤖 Finished Review · ✅ Success · Started 7:12 PM UTC · Completed 7:27 PM UTC
Commit: 722af5c · View workflow run →

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review · 2026-06-22T19:27:26Z

+
+AGENT="${1:?agent name required}"
+EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)"


[medium] logic-error

The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.

Suggested fix: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.

fullsend-ai-review · 2026-06-22T19:27:26Z

@@ -0,0 +1,72 @@
+---
+title: "47. agent-eval-harness for test infrastructure"
+status: Accepted


[medium] internal-inconsistency

The YAML frontmatter title says 47. agent-eval-harness and the H1 heading says # 47. but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052. Cross-references by number are ambiguous.

Suggested fix: Update the YAML title and H1 heading in 0051-.md to 51. and in 0052-.md to 52. to match the filenames.

fullsend-ai-review · 2026-06-22T19:27:26Z

+
+# --- Create fixture ---
+FIXTURE_URL=""
+FIXTURE_NUMBER=""


[medium] edge-case

The ephemeral repo is created with gh repo create but .hook-outputs.yaml (which propagates EPHEMERAL_REPO to teardown) is written at the end of the script. A mid-script failure leaves an orphaned repo with no teardown path.

Suggested fix: Register a trap to delete the ephemeral repo on failure, or write .hook-outputs.yaml immediately after repo creation.

fullsend-ai-review · 2026-06-22T19:27:26Z

+}
+```
+
+When retries occur, all values are summed. The functional test cares about


[low] technical-doc-inaccuracy

The design spec references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The Files changed table also references wrong paths.

fullsend-ai-review · 2026-06-22T19:27:26Z

+  EPHEMERAL_REPO: "${EPHEMERAL_REPO}"
+  FIXTURE_URL: "${FIXTURE_URL}"
+  FIXTURE_NUMBER: "${FIXTURE_NUMBER}"
+  FIXTURE_TYPE: "${FIXTURE_TYPE}"


[low] yaml-injection

The .hook-outputs.yaml heredoc interpolates shell variables without YAML escaping. Attack surface is limited to developers with commit access, but quoting values would be defense-in-depth.

fullsend-ai-review · 2026-06-22T19:27:26Z

+# OUTPUT_DIR is {output_dir} from the harness, which is workspace/output.
+# score.py loads files relative to case_dir/output, so metrics.json needs
+# to be at OUTPUT_DIR/metrics.json (not OUTPUT_DIR/output/metrics.json).
+METRICS_FILE=$(find "$OUTPUT_DIR" -maxdepth 3 -name metrics.json -not -path "$OUTPUT_DIR/metrics.json" 2>/dev/null | head -1)


[low] edge-case

The metrics.json copy logic uses find ... | head -1 with non-deterministic ordering. A direct path would be more robust.

fullsend-ai-review · 2026-06-22T19:27:26Z

+
+# --- Create fixture ---
+FIXTURE_URL=""
+FIXTURE_NUMBER=""


[low] ephemeral-repo-visibility

Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist. Content is not sensitive but --private would be defense-in-depth.

fullsend-ai-review · 2026-06-22T19:27:26Z

+	ToolCalls  int `json:"tool_calls"`
+}
+
+func writeMetricsJSON(dir string, m aggregateMetrics) error {


[low] go-error-wrapping

writeMetricsJSON returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

Suggested fix: Wrap with fmt.Errorf for marshaling and writing errors.

fullsend-ai-retro · 2026-06-22T19:34:26Z

🤖 Finished Retro · ✅ Success · Started 7:34 PM UTC · Completed 7:49 PM UTC
Commit: 722af5c · View workflow run →

fullsend-ai-retro · 2026-06-22T19:49:51Z

Retro: PR #1682 — Functional test framework

PR type: Human-authored (Ralph Bean), branch ci/functional-evals, open ~24 days (partly waiting on upstream dependency).

Review agent behavior was the primary concern. The fullsend-ai-review[bot] posted 86 inline review comments across ~10 review rounds and issued 8 CHANGES_REQUESTED verdicts. The vast majority of comments were verbatim duplicates of findings from prior rounds:

yq checksum verification: flagged 8+ times (author acknowledged as non-blocking on May 29)
Personal fork in .gitmodules: flagged 6+ times (eventually fixed, bot kept flagging after fix)
Empty fixture.files edge case: flagged 7+ times
Existence check ordering in run-functional.sh: flagged 6+ times (code was completely rewritten, bot kept flagging the old pattern)
Debug code in run.go: flagged 4 times (cleaned up in early revisions)

Meanwhile, human reviewers (waynesun09's 9-agent review squad, maruiz93, ascerra) found ~8 genuinely novel high/medium findings the bot never detected across all 10 rounds:

Critical: composite action source uses bare relative path, breaks all external consumers (4/4 squad agents caught this)
High: OpenShell version drift (0.0.38 vs 0.0.54 on main)
High: partial metrics not written on runErr != nil early return
High: spec-implementation drift (design spec says orchestrator thresholds, implementation uses per-skill judges)
Medium: exit code masking in teardown, infrastructure failure masking by || true, doc-schema mismatches, grep -P portability on macOS

Existing issues already cover the main improvements:

#1013 — Deduplicate findings across iterations on the same PR
#1285 — Don't regenerate unchanged inline comments on re-reviews
#956 — Resolve inline comments from previous reviews on re-review
#1500 — Don't re-request changes for unchanged findings
#1583 — Recognize human-resolved findings and stop re-flagging
#992 / #1570 — ADR collision CI lint / auto-renumber

This PR is a strong data point for prioritizing #1013 and #1285. The 86 duplicate comments represent significant token waste and human attention cost. The bot consumed ~10x the review budget of the human reviewers while finding fewer unique issues.

No new proposals filed — all identified improvements are covered by existing open issues.

ralphbean had a problem deploying to evals May 29, 2026 00:26 — with GitHub Actions Failure

github-actions Bot deployed to site-preview May 29, 2026 00:27 View deployment

fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 00:36 View deployment

github-actions Bot deployed to site-preview May 29, 2026 00:38 View deployment

fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 01:38 View deployment

fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 01:48 View deployment

ralphbean force-pushed the ci/functional-evals branch from 6e40384 to 826ceaf Compare May 29, 2026 01:50

github-actions Bot deployed to site-preview May 29, 2026 01:52 View deployment

fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 10:40 View deployment

fullsend-ai-review Bot suggested changes May 29, 2026

View reviewed changes

Comment thread .github/workflows/functional-evals.yml Outdated

Comment thread .github/workflows/functional-tests.yml

fullsend-ai-review Bot removed the requires-manual-review Review requires human judgment label May 29, 2026

ralphbean marked this pull request as draft May 29, 2026 10:57

github-actions Bot deployed to site-preview May 29, 2026 11:02 View deployment

fullsend-ai-review Bot suggested changes May 29, 2026

View reviewed changes

Comment thread internal/cli/run.go Outdated

Comment thread eval/run-functional.sh

Comment thread eval/fullsend-runner.sh Outdated

github-actions Bot deployed to site-preview May 29, 2026 13:50 View deployment

fullsend-ai-review Bot suggested changes May 29, 2026

View reviewed changes

Comment thread internal/cli/run.go Outdated

Comment thread eval/run-functional.sh

Comment thread .github/workflows/functional-tests.yml

Comment thread eval/fullsend-runner.sh Outdated

github-actions Bot deployed to site-preview May 29, 2026 14:00 View deployment

fullsend-ai-review Bot suggested changes May 29, 2026

View reviewed changes

Comment thread internal/cli/run.go Outdated

Comment thread .github/workflows/functional-evals.yml Outdated

Comment thread eval/run-functional.sh

Comment thread .github/workflows/functional-tests.yml

Comment thread eval/fullsend-runner.sh Outdated

github-actions Bot deployed to site-preview May 29, 2026 14:21 View deployment

fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label May 29, 2026

github-actions Bot deployed to site-preview May 29, 2026 14:58 View deployment

maruiz93 approved these changes Jun 18, 2026

View reviewed changes

ralphbean mentioned this pull request Jun 18, 2026

eval: merge duplicate gh issue view calls in capture-fixture.sh #2427

Open

fullsend-ai-review Bot suggested changes Jun 18, 2026

View reviewed changes

ralphbean mentioned this pull request Jun 22, 2026

fix(skills): enqueue script polls until merge or ejection #2516

Closed

2 tasks

fullsend-ai-review Bot reviewed Jun 22, 2026

View reviewed changes

ralphbean added 14 commits June 22, 2026 15:06

fix(eval): raise max_turns threshold to 30 for triage case

8129761

The triage agent consistently takes ~23 turns on this case. The previous threshold of 15 was too tight. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

chore: fix gofmt in claude_progress.go

a1dfdc2

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

fix(docs): renumber ADRs 0047/0048 to 0051/0052 to avoid collision

722af5c

Main acquired ADR numbers 0047-0050 while this branch was open. Renumber the PR's ADRs and update all cross-references. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ralph Bean <rbean@redhat.com>

fullsend-ai-review Bot reviewed Jun 22, 2026

View reviewed changes

This was referenced Jun 22, 2026

feat: add standalone mint with custom role support #2537

Merged

feat(#2577): add Level 1 distributed tracing #2752

Merged

Uh oh!

Conversation

ralphbean commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in the box

Behavioral thresholds

Current results

Still TODO

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Site preview

Uh oh!

fullsend-ai-review Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Findings

Medium

Low

Review

Findings

Medium

Low

Review

Findings

Medium

Low

Info

Review

Findings

Medium

Low

Info

Review

Findings

Critical

Medium

Low

Info

Review

Findings

Medium

Low

Info

Review

Findings

Medium

Low

Info

Review

Review

Findings

Medium

Low

Info

Review

Findings

Medium

Low

Info

Review

Findings

Critical

Medium

Low

Info

Review

Findings

Critical

Medium

Low

Info

Review

Findings

High

Medium

Low

Info

Review

ralphbean commented May 29, 2026 •

edited

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading

fullsend-ai-review Bot commented May 29, 2026 •

edited

Loading

maruiz93 left a comment •

edited

Loading

fullsend-ai-review Bot commented Jun 18, 2026 •

edited

Loading

fullsend-ai-review Bot commented Jun 18, 2026 •

edited

Loading

fullsend-ai-review Bot commented Jun 18, 2026 •

edited

Loading