Skip to content

feat: add functional test framework for agent pipelines#1682

Merged
ralphbean merged 14 commits into
mainfrom
ci/functional-evals
Jun 22, 2026
Merged

feat: add functional test framework for agent pipelines#1682
ralphbean merged 14 commits into
mainfrom
ci/functional-evals

Conversation

@ralphbean

@ralphbean ralphbean commented May 29, 2026

Copy link
Copy Markdown
Member

Summary

  • Adds functional test framework that exercises the full agent pipeline (pre-script → agent → post-script) by observing side effects on GitHub fixtures
  • Ephemeral repos with UUID suffixes ensure concurrent test runs don't interfere
  • Uses agent-eval-harness for LLM-graded scoring via CLI runner interface
  • First test case (001-bug-url-encoding) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description
  • Behavioral thresholds (max_turns, max_cost_usd) enforced universally by the orchestrator — agents that pass quality judges but loop or burn tokens fail the test

What's in the box

File Purpose
eval/fullsend-runner.sh CLI runner: ephemeral repo → fixture → fullsend run → capture state → teardown
eval/run-functional.sh Orchestrator: iterate cases, validate thresholds, call runner, score with agent-eval-harness
eval/triage/eval.yaml Test config: LLM judge (1-5 rubric) + deterministic label check
eval/triage/cases/001-* Test case with tricky scenario (regex accepts + but issue claims it doesn't)
eval/triage/repos/python-webapp/ Shared repo content, symlinked by cases
.github/workflows/functional-tests.yml CI workflow triggered by eval/ or internal/scaffold/ changes
Makefile make functional-tests target
internal/cli/progress.go Captures num_turns, total_cost_usd, token counts from Claude Code stream events
internal/cli/run.go Writes metrics.json with aggregated behavioral metrics after all iterations
docs/testing/functional-tests.md Documents threshold requirements for test authors

Behavioral thresholds

Every test case declares max_turns and max_cost_usd in annotations.yaml. The orchestrator validates these before running and compares against metrics.json after each run. This catches efficiency regressions (looping agents, model cost spikes) that quality judges can't see.

Thresholds are universal invariants — not per-skill judges — so every skill gets them automatically and new skills can't opt out.

Current results

The LLM judge scores the triage agent at 3/5 — correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles +. Threshold set to 2.5 with a TODO to raise it.

Still TODO

  • Set up evals GitHub environment with secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) and vars (EVAL_ORG, ANTHROPIC_VERTEX_PROJECT_ID)
  • Refactor case layout to match agent-eval-harness execute.py expectations so we can use their parallelism
  • Add more test cases (feature request, vague bug, PR review)
  • Per-agent path filtering in CI (only run triage tests when triage-related files change)

Refs #499, #73
Closes #346

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown

Site preview

Preview: https://8567a3ef-site.fullsend-ai.workers.dev

Commit: 722af5c700bd493fa7f0883a7494ceb68db10721

@fullsend-ai-review

fullsend-ai-review Bot commented May 29, 2026

Copy link
Copy Markdown

Review

Findings

Medium

  • [protected-path] .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [logic-error] eval/run-functional.sh:27 — The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
    Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.

  • [internal-inconsistency] docs/ADRs/0051-agent-eval-harness-for-test-infrastructure.md — The YAML frontmatter title says "47. agent-eval-harness" and the H1 heading says "# 47." but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052. The four-digit gap means cross-references by number are ambiguous — readers cannot tell whether "47" refers to ADR file 0047 or this document.
    Remediation: Update the YAML title and H1 heading in 0051-*.md to "51." and in 0052-*.md to "52." to match the filenames.

  • [edge-case] eval/scripts/setup-fixture.sh:97 — The ephemeral repo is created with gh repo create (line 97) but .hook-outputs.yaml (which propagates EPHEMERAL_REPO to the teardown hook) is written at the end of the script (line 178). If any step between repo creation and the YAML write fails under set -e, the teardown hook has no EPHEMERAL_REPO to delete, leaving an orphaned repo.
    Remediation: Register a trap in setup-fixture.sh to delete the ephemeral repo on failure, or write .hook-outputs.yaml (with at least EPHEMERAL_REPO) immediately after repo creation.

  • [submodule-dependency] .gitmodules — The git submodule dependency points to ralphbean/agent-eval-harness (a personal fork) on branch worktree-execution-hooks, not the upstream opendatahub-io/agent-eval-harness. ADR 0051 references the upstream. A personal fork can be force-pushed, deleted, or transferred, creating a maintenance risk for the test infrastructure.
    Remediation: Point the submodule URL to the upstream or an org fork (fullsend-ai/agent-eval-harness). If the fork contains unreleased features, document the dependency and a migration plan.

Low

  • [technical-doc-inaccuracy] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:66 — The design spec references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The "Files changed" table also references internal/cli/progress_test.go instead of internal/runtime/claude_progress_test.go.

  • [yaml-injection] eval/scripts/setup-fixture.sh:146 — The .hook-outputs.yaml heredoc interpolates shell variables without YAML escaping. The attack surface is limited to developers with commit access crafting malicious test fixtures, but quoting values would be defense-in-depth.

  • [edge-case] eval/scripts/run-fullsend.sh:89 — The metrics.json copy logic uses find ... | head -1 with non-deterministic ordering. In practice the right file should be found since run.go writes directly to runDir, but a direct path would be more robust.

  • [test-adequacy] internal/runtime/claude_progress_test.go — Tests cover normal result events and absent result events but do not test malformed JSON in the usage sub-object. json.Unmarshal silently zero-defaults missing fields.

  • [ephemeral-repo-visibility] eval/scripts/setup-fixture.sh:97 — Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist indefinitely. Content is not sensitive, but --private would be defense-in-depth.

  • [go-error-wrapping] internal/cli/run.go:87writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
    Remediation: Wrap with fmt.Errorf("marshaling metrics: %w", err).

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing eval/ or internal/scaffold/ should know about this test target.
    Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.

  • [missing-doc] docs/guides/README.md:37 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
    Remediation: Add a link to functional-tests.md in the Development section.

Previous run

Review

Findings

Medium

  • [protected-path] .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [logic-error] eval/run-functional.sh:27 — The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
    Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.

  • [stale-reference] docs/ADRs/0052-functional-tests-for-agent-pipelines.md:99 — The directory layout diagram shows fullsend-runner.sh as a top-level file under eval/, but the actual file created by this PR is eval/scripts/run-fullsend.sh. The docs/testing/functional-tests.md doc correctly shows the scripts/ subdirectory layout.
    Remediation: Update the ADR diagram from fullsend-runner.sh to scripts/run-fullsend.sh.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The files-changed table also references internal/cli/progress_test.go instead of internal/runtime/claude_progress_test.go.
    Remediation: Update all references to use the correct file paths.

  • [internal-inconsistency] docs/ADRs/0051-agent-eval-harness-for-test-infrastructure.md — The YAML frontmatter title says "47. agent-eval-harness" and the H1 heading says "# 47." but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052.
    Remediation: Align the title/heading numbers with the filename numbers (51 and 52).

Low

  • [secrets-in-artifacts] eval/scripts/run-fullsend.sh:83 — The .eval-env file containing credentials is written to $OUTPUT_DIR. Three cleanup layers exist (trap, explicit rm, workflow scrub with if: always()), and the artifact upload excludes .eval-env. Risk is low but writing the env file outside the artifact upload path (e.g., /tmp) would eliminate it entirely.

  • [forge-abstraction-violation] eval/scripts/setup-fixture.sh — Eval scripts use gh CLI directly for GitHub API operations. The AGENTS.md forge abstraction requirement targets Go application code (cmd/, internal/), not shell-based test infrastructure. Documenting this exception would clarify intent.

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing eval/ or internal/scaffold/ should know about this test target.
    Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.

  • [missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.
    Remediation: Add a link to functional-tests.md in the Development section.

  • [go-error-wrapping] internal/cli/run.go:87writeMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
    Remediation: Wrap with fmt.Errorf("marshaling metrics: %w", err).


Labels: PR adds git submodule for agent-eval-harness

Previous run (2)

Review

Findings

Medium

  • [protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
    Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

Low

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). The test guidance is scoped to Go code changes, but developers changing eval/ or internal/scaffold/ should know about this test target.
    Remediation: Add a bullet point documenting make functional-tests alongside existing test targets.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
    Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.

  • [silent-pass-on-failure] eval/triage/eval.yaml — If the Claude Code result event is malformed, progressParser silently ignores the parse error and leaves metrics at zero. Zero values pass positive thresholds (0 <= 30), meaning a broken metrics pipeline silently passes behavioral checks. No test covers this path. See also: [test-adequacy] finding at internal/runtime/claude_progress_test.go.

  • [go-error-wrapping] internal/cli/run.gowriteMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible but defense-in-depth would sanitize before interpolation.

  • [missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

  • [forge-abstraction-violation] eval/scripts/setup-fixture.sh — Eval scripts (setup-fixture.sh, capture-fixture.sh, teardown-fixture.sh) use gh CLI directly. The AGENTS.md forge abstraction rule targets Go application code, not shell-based test infrastructure, but documenting this exception would clarify intent.

Info

Previous run (3)

Review

Findings

Medium

  • [protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
    Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
    Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
    Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.

  • [go-error-wrapping] internal/cli/run.gowriteMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible but defense-in-depth would sanitize before interpolation.

  • [missing-doc] docs/guides/README.md:38 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

Info


Labels: PR adds functional test CI workflow and modifies agent runner metrics emission in Go code.

Previous run (4)

Review

Findings

Critical

  • [adr-numbering-conflict] docs/ADRs/0049-agent-eval-harness-for-test-infrastructure.md — ADR numbering conflict. The PR introduces ADR 0049 (agent-eval-harness) and ADR 0050 (functional tests), but ADR 0049 already exists on main as 0049-agent-configuration-env-var-convention.md. This creates a collision — two files with the same ADR number. The highest ADR on main is 0049, so new ADRs should start at 0050.
    Remediation: Renumber to 0050 and 0051. Update all cross-references in docs/architecture.md, docs/problems/testing-agents.md, the design spec, and the ADR cross-references (ADR 0049 references ADR 0050, which would become 0050→0051).

Medium

  • [protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
    Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
    Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The files-changed table at the bottom correctly names the file, making the inconsistency self-evident.
    Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.

  • [edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path, so this is latent.

  • [secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion). Leakage is unlikely but creating the file outside the artifact upload path would be additional defense-in-depth.

  • [go-error-wrapping] internal/cli/run.gowriteMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

  • [missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

Info

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is a GitHub Actions variable set by repo admins, not arbitrary user input. Risk is negligible.

  • [test-adequacy] internal/runtime/claude_progress_test.go — Tests cover the happy path and absent-result-event cases but do not test malformed result events. The code handles this gracefully via json.Unmarshal failing silently.

  • [permissions] .github/workflows/functional-tests.yml — Workflow permissions (contents: read, id-token: write) are minimal and appropriate for WIF with GCP. The functional-tests environment gates fork PR access to secrets.

  • [prompt-injection-defense] eval/triage/eval.yaml — The LLM judge prompt includes injection defense instructions for evaluating untrusted agent output. Reasonable for the eval context.

  • [authorized-scope] Scope is well-aligned with authorized issues Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, and Add regression tests / evals for all agents and skills #73.

Previous run (5)

Review

Findings

Medium

  • [protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (adding OPENSHELL_WHEEL_SHA to centralized version script, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:47 — The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The file paths table in the same document was corrected in a prior revision but this inline reference was not updated.
    Remediation: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:138 — Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh. The file paths table was corrected but this heading was not updated.
    Remediation: Change heading from fullsend-runner.sh to run-fullsend.sh.

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
    Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:16 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), EVAL_ORG is user-provided and not sanitized. Defense-in-depth would sanitize before interpolation.

  • [secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion). Leakage is unlikely but creating the file outside the artifact upload path would be additional defense-in-depth.

  • [edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path, so this is latent.

  • [go-error-wrapping] internal/cli/run.gowriteMetricsJSON returns raw errors from json.MarshalIndent and os.WriteFile without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

  • [scope-coherence] eval/scripts/setup-fixture.sh — Test infrastructure hardcodes GitHub-specific operations but input.yaml declares a forge field suggesting multi-forge intent. This is standard YAGNI-compliant development (build the extension point, implement what you need now), but documenting that functional tests are currently GitHub-only would clarify intent.

  • [shell-script-output-redirection] eval/scripts/teardown-fixture.sh:18 — Error redirection pattern 2>&1 merges stderr into stdout without capturing the output. The error is still visible in CI logs, but capturing it would produce clearer diagnostics.

  • [missing-doc] docs/guides/README.md:36 — The Development section lists E2E testing but does not link to the new docs/testing/functional-tests.md.

  • [naming-alignment] docs/ADRs/0050-functional-tests-for-agent-pipelines.md — "Functional tests" naming creates potential ambiguity with traditional integration tests. The ADR addresses this with a dedicated naming note, which is a reasonable documented choice.

  • [abstraction-fit] eval/.agent-eval-harness — Introduces Python dependency in a primarily Go codebase. ADR 0049 documents the rationale. Accepted trade-off.

Info

Previous run (6)

Review

Findings

Medium

  • [protected-path] .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR modifies files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (centralizing OpenShell version management, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [adr-numbering-inconsistency] docs/ADRs/0049-agent-eval-harness-for-test-infrastructure.md:12 — The H1 heading says # 47. but the filename is 0049 and the frontmatter title says 49.. The heading number is wrong.
    Remediation: Change the H1 from # 47. to # 49..

  • [adr-numbering-inconsistency] docs/ADRs/0050-functional-tests-for-agent-pipelines.md:12 — The H1 heading says # 48. but the filename is 0050 and the frontmatter title says 50.. The heading number is wrong.
    Remediation: Change the H1 from # 48. to # 50..

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:79 — The 'Files changed' table references wrong file paths: eval/fullsend-runner.sh (actual: eval/scripts/run-fullsend.sh), internal/cli/progress.go (actual: internal/runtime/claude_progress.go), internal/cli/progress_test.go (actual: internal/runtime/claude_progress_test.go).
    Remediation: Update the file paths in the table to match actual locations.

  • [stale-reference] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md:109 — Section 5 references 'ADR 0048' but ADR 0048 is 0048-automatic-updates.md (unrelated). The intended reference is ADR 0050 (functional tests). The H1 numbering bugs in the ADRs above likely caused this confusion.
    Remediation: Change 'ADR 0048' to 'ADR 0050' in section 5.

  • [missing-doc] AGENTS.md:36 — AGENTS.md documents unit tests (make go-test) and e2e tests (make e2e-test) but does not mention the new functional tests (make functional-tests). Developers changing agent-related code under internal/scaffold/, internal/runtime/, or eval/ should be aware of this third test category.
    Remediation: Add a bullet point documenting make functional-tests alongside the existing test targets.

Low

  • [documentation-accuracy] docs/testing/functional-tests.md:207 — The CI integration section states tests require the evals GitHub environment but the workflow specifies environment: functional-tests.
    Remediation: Change evals to functional-tests.

  • [documentation-accuracy] docs/testing/functional-tests.md:207 — The documented secrets (EVAL_GH_TOKEN, GCP_CREDENTIALS) don't match the workflow, which uses E2E_GCP_WIF_PROVIDER, E2E_GCP_SERVICE_ACCOUNT, and E2E_GCP_PROJECT_ID.
    Remediation: Update the documented secret names to match the workflow.

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:14 — The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), sanitizing before interpolation would be defense-in-depth.

  • [secrets-handling] eval/scripts/run-fullsend.sh:79 — The .eval-env file contains plaintext secrets on disk during execution. The five defense layers (mode 0600, trap EXIT, explicit rm, workflow scrub, artifact exclusion) make leakage very unlikely, but creating the file outside the artifact upload path would be additional defense-in-depth.

  • [edge-case] eval/scripts/setup-fixture.sh:88 — For a pull_request fixture with empty fixture.files (the default), the script would attempt gh pr create with no diff from main, which would fail. No current test case exercises this path, so this is latent.

Info

Previous run (7)

Review

Reason: stale-head

The review agent reviewed commit b1d8849d91502488e596ad415469b1474068201c but the PR HEAD is now 2fc80f184de40de4a9db3c03d7512451ef2b863f. This review was discarded to avoid approving unreviewed code.

Previous run (8)

Review

Findings

Medium

Low

  • [tier-classification] PR title uses feat: but the change is testing infrastructure, not user-facing functionality. Per COMMITS.md, feat populates the Features section of release notes and is reserved for changes end users would recognize as new capability. test: or ci: would be more accurate and would correctly exclude this from release notes under Features.
    Remediation: Change PR title prefix from feat: to test:.

  • [supply-chain] .github/scripts/install-openshell.sh:15 — The install script downloads and pipes a remote shell script to sh from a commit-SHA-pinned URL. The SHA pin prevents silent upstream changes, but there is no checksum verification of the downloaded content.

  • [supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [gha-workflow-command-injection] eval/scripts/teardown-fixture.sh:17 — The GHA workflow command ::warning::Ephemeral repo $EPHEMERAL_REPO was not deleted interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. While the value is constructed from controlled inputs (EVAL_ORG + sanitized case ID + UUID), sanitizing before interpolation would be defense-in-depth.

  • [edge-case] eval/scripts/setup-fixture.sh:107 — In the pull_request path, if FIXTURE_FILES is empty and repo/ content was already committed on the base branch, the diff --cached guard correctly skips the commit. However, gh pr create will then fail because the PR branch has no new commits pushed to the remote. This is a test authoring error (PR fixture with no file changes), so the failure is appropriate but the error message will be opaque.

  • [documentation-accuracy] docs/superpowers/specs/2026-06-01-behavioral-thresholds-for-functional-tests-design.md — The design spec's Files Changed table references internal/cli/progress.go and internal/cli/progress_test.go, but the actual files are internal/runtime/claude_progress.go and internal/runtime/claude_progress_test.go.

  • [stale-version-reference] docs/plans/agent-execution-environment.md:49 — Example Dockerfile contains outdated OpenShell version (0.0.37-dev) and does not reference the new single source of truth at .github/scripts/openshell-version.sh (currently 0.0.54). Pre-existing staleness, but the new centralized version management makes it more visible.

Info

  • [prior-finding-resolved] eval/scripts/setup-fixture.sh:119 — Prior medium edge-case finding about unconditional git commit in the pull_request path has been resolved. The if ! git diff --cached --quiet guard was added, matching the pattern used for the initial content commit.

  • [prior-finding-resolved] eval/scripts/setup-fixture.sh:60 — Prior low data-exposure finding about ephemeral repos created with --public has been resolved. Repos are now created with --private.

  • [prior-finding-resolved] action.yml:284 — Prior critical finding about bare relative path in composite action has been resolved. The updated code uses $GITHUB_ACTION_PATH/.github/scripts/install-openshell.sh.

  • [prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path remains resolved. Metrics are written before the early return on error.

  • [potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.

  • [go-test-assertion-style] internal/cli/run_test.goTestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.

  • [makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.

  • [permissions] .github/workflows/functional-tests.yml — Workflow permissions are appropriately scoped to contents: read and id-token: write. Uses pull_request (not pull_request_target), so fork PRs cannot access secrets. Secrets are gated behind the functional-tests environment.

Previous run (9)

Review

Findings

Medium

  • [protected-path] .github/scripts/install-openshell.sh, .github/scripts/openshell-version.sh, .github/workflows/functional-tests.yml — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale (centralizing OpenShell version management, adding CI workflow for functional tests). Human approval is always required for protected-path changes regardless of context.

  • [edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet. No current test case exercises the pull_request path, but the first PR-type test case with empty fixture.files will hit this.
    Remediation: Add the same guard: git -C "$TARGET_DIR" add -A && if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

  • [tier-classification] PR title uses feat: but the change is testing infrastructure, not user-facing functionality. Per COMMITS.md, test: or ci: would be more accurate. End users of fullsend don't interact with the functional test framework — it's for contributors testing agent behavior.
    Remediation: Change PR title prefix from feat: to test:.

Low

  • [supply-chain] .github/scripts/install-openshell.sh:15 — The install script downloads and pipes a remote shell script to sh from a commit-SHA-pinned URL. The SHA pin prevents silent upstream changes, but there is no checksum verification of the downloaded content.

  • [supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

Info

  • [prior-finding-resolved] action.yml:289 — Prior critical finding about bare relative path in composite action has been resolved. The updated code uses $GITHUB_ACTION_PATH/.github/scripts/install-openshell.sh, and install-openshell.sh resolves its own directory via SCRIPT_DIR before sourcing openshell-version.sh.

  • [potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.

  • [go-test-assertion-style] internal/cli/run_test.goTestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.

  • [makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.

  • [prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check remains resolved.

  • [prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path remains resolved. Metrics are now written before the early return on error.

Previous run (10)

Review

Findings

Critical

  • [logic-error] action.yml:269 — The refactored "Install OpenShell CLI" step uses source .github/scripts/openshell-version.sh with a bare relative path. action.yml is a composite GitHub Action (using: composite). In composite actions, run: steps execute in $GITHUB_WORKSPACE (the consumer repo's checkout), not in the action's directory. The file openshell-version.sh only exists in the fullsend repo, so any consumer workflow using uses: fullsend-ai/fullsend@... will fail with "file not found". The pre-existing code used inline echo statements that didn't depend on any file.
    Remediation: Use ${{ github.action_path }} to reference the action's own directory: source "${{ github.action_path }}/.github/scripts/openshell-version.sh".

Medium

  • [protected-path] .github/workflows/functional-tests.yml, .github/scripts/openshell-version.sh — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.

  • [edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
    Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

  • [supply-chain] .github/workflows/functional-tests.yml:82 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [supply-chain] .github/workflows/functional-tests.yml:122pip install --quiet "jsonschema>=4.18.0" uses a floor pin rather than an exact version pin.

  • [data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

  • [stale-reference] docs/plans/agent-execution-environment.md:49 — Dockerfile example hardcodes OPENSHELL_VERSION=0.0.37-dev, but this PR introduces .github/scripts/openshell-version.sh as the single source of truth (currently 0.0.54). Pre-existing staleness, but the new centralized version management makes it more visible.

Info

  • [potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets is written to disk with mode 0600. Cleanup is handled via trap EXIT, explicit rm -f, and a CI workflow scrub step — thorough for the ephemeral CI context.

  • [go-test-assertion-style] internal/cli/run_test.goTestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.

  • [makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.

  • [tier-classification] PR title uses feat: but adds testing infrastructure. test: or ci: would be more accurate per Conventional Commits.

  • [prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check has been resolved. The existence check now correctly precedes all yq invocations.

  • [prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. Metrics are now written before the early return on error.

Previous run (11)

Review

Findings

Critical

  • [logic-error] action.yml:269 — The refactored "Install OpenShell CLI" step uses source .github/scripts/openshell-version.sh with a bare relative path. action.yml is a composite GitHub Action (using: composite). In composite actions, run: steps execute in $GITHUB_WORKSPACE (the consumer repo's checkout), not in the action's directory. The file openshell-version.sh only exists in the fullsend repo, so any consumer workflow using uses: fullsend-ai/fullsend@... will fail with "file not found". The pre-existing code used inline echo statements that didn't depend on any file.
    Remediation: Use ${{ github.action_path }} to reference the action's own directory: source "${{ github.action_path }}/.github/scripts/openshell-version.sh".

Medium

  • [protected-path] .github/workflows/functional-tests.yml, .github/scripts/openshell-version.sh — This PR adds new files under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.

  • [edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
    Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

  • [supply-chain] .github/workflows/functional-tests.yml:81 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.

  • [data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

  • [supply-chain] .github/workflows/functional-tests.yml:121pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.

  • [go-test-assertion-style] internal/cli/run_test.goTestWriteMetricsJSON uses manual if/t.Errorf pattern instead of the testify assertion library (require.NoError, assert.Equal) that the rest of the test file uses extensively.

  • [makefile-target-naming] Makefile — The new target functional-tests uses plural form, but existing test targets use singular: go-test, script-test, e2e-test.

Info

  • [potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.

  • [tier-classification] PR title uses feat: but adds testing infrastructure. test: or ci: would be more accurate per Conventional Commits.

  • [permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. The documentation states tests require the evals GitHub environment, but the workflow does not declare it.

  • [prior-finding-resolved] eval/run-functional.sh — Prior medium logic-error finding about yq running before the existence check has been resolved. The existence check now correctly precedes all yq invocations.

  • [prior-finding-resolved] internal/cli/run.go — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. Metrics are now written before the early return on error.

Previous run (12)

Review

Findings

High

  • [supply-chain] .gitmodules:5 — The agent-eval-harness git submodule points to a personal fork (https://github.com/ralphbean/agent-eval-harness.git) on branch worktree-execution-hooks, not the upstream repository (https://github.com/opendatahub-io/agent-eval-harness). A personal fork can be force-pushed, deleted, or transferred at any time. The subproject commit SHA provides pinning for the current checkout, but future git submodule update --remote or re-clones will fetch from the personal fork. ADR 0047 documents the upstream as opendatahub-io/agent-eval-harness, creating a discrepancy.
    Remediation: Point the submodule URL to the upstream repository or an organization fork (fullsend-ai/agent-eval-harness). If the worktree-execution-hooks branch has unreleased features, merge them upstream first.

Medium

  • [protected-path] .github/workflows/functional-tests.yml — This PR adds a new GitHub Actions workflow under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.

  • [logic-error] eval/run-functional.sh:24 — The yq rewrite of EVAL_YAML_SRC (line 24) runs before the existence check (line 27). If the file does not exist, yq fails with a confusing error under set -euo pipefail before the friendly error message is shown.
    Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp and yq rewrite block.

  • [edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
    Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

Low

  • [supply-chain] .github/workflows/functional-tests.yml:56 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.

  • [data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

  • [supply-chain] .github/workflows/functional-tests.yml:148pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.

  • [stale-framework-reference] docs/problems/testing-agents.md:278 — The "Practical architecture" section recommends Inspect AI as the execution framework, but ADR 0047 adopts agent-eval-harness. The file's own caveat ("any choice should be held lightly") reduces impact, but a forward reference to ADR 0047/0048 would help readers.

Info

  • [permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. The documentation states tests require the evals GitHub environment, but the workflow does not declare it.

  • [injection] eval/scripts/setup-fixture.sh:68 — Hook-outputs.yaml is generated via heredoc with shell variable interpolation. All interpolated values are controlled by the test framework, so there is no realistic injection vector.

  • [potential-data-race] internal/runtime/runtime.go:12 — New non-atomic fields added to RunMetrics while existing ToolCalls uses atomic.Int32. However, progressParser runs synchronously within Run(), so there is no concurrent access. The atomic.Int32 on ToolCalls is defensive/stylistic.

  • [prior-finding-resolved] internal/cli/run.go:838 — Prior medium logic-error finding about writeMetricsJSON not being called on error path has been resolved. The author now calls writeMetricsJSON before the early return in the error path.

Previous run (13)

Review

Findings

High

  • [supply-chain] .gitmodules:5 — The agent-eval-harness git submodule points to a personal fork (https://github.com/ralphbean/agent-eval-harness.git) on branch worktree-execution-hooks, not the upstream repository (https://github.com/opendatahub-io/agent-eval-harness). A personal fork can be force-pushed, deleted, or transferred at any time. The subproject commit SHA provides pinning for the current checkout, but future git submodule update --remote or re-clones will fetch from the personal fork. ADR 0047 documents the upstream as opendatahub-io/agent-eval-harness, creating a discrepancy.
    Remediation: Point the submodule URL to the upstream repository or an organization fork (fullsend-ai/agent-eval-harness). If the worktree-execution-hooks branch has unreleased features, merge them upstream first.

Medium

  • [protected-path] .github/workflows/functional-tests.yml — This PR adds a new GitHub Actions workflow under .github/, a protected path. The PR links to issues (Add functional tests for agents using local openshell sandbox #346, Investigate opendatahub-io/agent-eval-harness for skill evals #499, Add regression tests / evals for all agents and skills #73) and the description explains the rationale. Human approval is always required for protected-path changes regardless of context.

  • [logic-error] internal/cli/run.go:838 — When rt.Run() returns an error, runAgent returns early and writeMetricsJSON is never called. The metrics accumulation code correctly runs before the error check, but the write at line ~958 is unreachable on error paths. Downstream consumers (eval behavioral threshold judges) will see "metrics.json not found" instead of partial metrics that would distinguish timeout from zero-turn failures.
    Remediation: Move writeMetricsJSON(runDir, aggMetrics) into a defer after aggMetrics is declared, guarded with if aggMetrics.Iterations > 0.

  • [logic-error] eval/run-functional.sh:24 — The yq rewrite of EVAL_YAML_SRC (line 24) runs before the existence check (line 27). If the file doesn't exist, yq fails with a confusing error under set -euo pipefail before the friendly error message is shown.
    Remediation: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp and yq rewrite block.

  • [edge-case] eval/scripts/setup-fixture.sh:119 — When creating a pull_request fixture, git commit -m "eval: fixture changes" runs unconditionally after the fixture-files loop. If fixture.files is empty (the default []), file_count is 0, the loop body never executes, nothing is staged, and git commit fails with "nothing to commit" under set -euo pipefail. The initial-content commit at line 76 correctly guards with git diff --cached --quiet.
    Remediation: Add the same guard: if ! git -C "$TARGET_DIR" diff --cached --quiet; then git -C "$TARGET_DIR" commit -m "eval: fixture changes" && git -C "$TARGET_DIR" push origin "$PR_BRANCH"; fi

  • [potential-data-race] internal/runtime/runtime.go:12 — The PR adds non-atomic fields (NumTurns int, TotalCostUSD float64, InputTokens int, OutputTokens int) to RunMetrics, while the existing ToolCalls field uses atomic.Int32. The new fields are written by the progressParser goroutine and read from the main goroutine after rt.Run() returns. The existing code's use of atomic.Int32 for ToolCalls suggests the authors were aware of the concurrency concern, making plain types for new fields an inconsistency that should be verified for synchronization safety.
    Remediation: Verify the happens-before relationship through the goroutine join. If synchronization is not guaranteed on all paths (including timeout/cancellation), use atomic types or a mutex.

Low

  • [supply-chain] .github/workflows/functional-tests.yml:56 — The yq binary is downloaded from GitHub Releases via curl without checksum or signature verification.

  • [stale-framework-reference] docs/problems/testing-agents.md:278 — The "Practical architecture" section recommends Inspect AI as the execution framework, but ADR 0047 adopts agent-eval-harness. The file's own caveat ("any choice should be held lightly") reduces impact, but a forward reference to ADR 0047/0048 would help readers.

  • [secrets-handling] eval/scripts/run-fullsend.sh:73 — The .eval-env file containing secrets (GH_TOKEN, cloud credentials) is written to disk with mode 0600. Cleanup trap and explicit removal provide defense in depth, but SIGKILL bypasses traps in local runs.

  • [data-exposure] eval/scripts/setup-fixture.sh:87 — Ephemeral repos are created with --public. If teardown fails (CI runner killed, API outage), orphaned public repos persist indefinitely. Using --private would be defense-in-depth.

  • [supply-chain] .github/workflows/functional-tests.yml:148pip install --quiet "jsonschema>=4.18.0" is unpinned. Pin to a specific version for reproducibility.

  • [injection] eval/scripts/setup-fixture.sh:68 — Hook-outputs.yaml is generated via heredoc with shell variable interpolation. The interpolated values are all controlled by the test framework, but using a YAML-aware tool would be more robust.

  • [edge-case] internal/runtime/claude_progress.go:82 — New result-event parsing correctly follows the defensive pattern (silent skip on malformed JSON via if err == nil guard), consistent with existing assistant-event handling.

Info

  • [permissions] .github/workflows/functional-tests.yml:22 — No environment: declaration on the job to gate secret access. An explicit environment declaration would be defense-in-depth.

  • [prior-finding-resolved] docs/ADRs/0047-agent-eval-harness-for-test-infrastructure.md — Prior critical findings about ADR numbering collision (0046→0047) and header mismatch have been resolved. ADRs correctly renumbered to 0047 and 0048.

@fullsend-ai-review fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label May 29, 2026
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026
@ralphbean ralphbean force-pushed the ci/functional-evals branch from 6e40384 to 826ceaf Compare May 29, 2026 01:50
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels May 29, 2026

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread .github/workflows/functional-evals.yml Outdated
Comment thread .github/workflows/functional-tests.yml
@fullsend-ai-review fullsend-ai-review Bot removed the requires-manual-review Review requires human judgment label May 29, 2026
@ralphbean ralphbean marked this pull request as draft May 29, 2026 10:57

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread eval/run-functional.sh
Comment thread eval/fullsend-runner.sh Outdated

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread eval/run-functional.sh
Comment thread .github/workflows/functional-tests.yml
Comment thread eval/fullsend-runner.sh Outdated

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread internal/cli/run.go Outdated
Comment thread .github/workflows/functional-evals.yml Outdated
Comment thread eval/run-functional.sh
Comment thread .github/workflows/functional-tests.yml
Comment thread eval/fullsend-runner.sh Outdated

@maruiz93 maruiz93 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking. Left a couple of comments.

Either way, I am not a fan of doing LLM judgment in the functional tests, but let's see how it goes.

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 18, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 3:10 PM UTC · Completed 3:26 PM UTC
Commit: 4b2120a · View workflow run →

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

}
```

**Implementation:** The `progressParser` in `internal/cli/progress.go`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] stale-reference

The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go.

Suggested fix: Change internal/cli/progress.go to internal/runtime/claude_progress.go.

judges, and baselined roughly until statistical evals provide observed
distributions.

### 6. `fullsend-runner.sh` propagates `metrics.json`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] stale-reference

Section 6 heading references fullsend-runner.sh but the actual file is eval/scripts/run-fullsend.sh.

Suggested fix: Change heading from fullsend-runner.sh to run-fullsend.sh.

git -C "$TARGET_DIR" commit -m "eval: initial content"
git -C "$TARGET_DIR" push origin HEAD
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] edge-case

For a pull_request fixture with empty fixture.files (the default), gh pr create would fail because no diff exists from main. No current test case exercises this path.

|| rc=$?

if [[ $rc -ne 0 ]]; then
echo "WARNING: fullsend run exited with status $rc" >&2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] secrets-handling

The .eval-env file contains plaintext secrets on disk during execution. Five defense layers are in place. Leakage is unlikely.

Comment thread eval/scripts/teardown-fixture.sh Outdated

if ! gh repo delete "$EPHEMERAL_REPO" --yes 2>&1; then
echo "WARNING: failed to delete $EPHEMERAL_REPO — may need manual cleanup" >&2
echo "::warning::Ephemeral repo $EPHEMERAL_REPO was not deleted — manual cleanup required"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] gha-workflow-command-injection

The ::warning:: annotation interpolates $EPHEMERAL_REPO without sanitizing for :: sequences. EVAL_ORG is set by repo admins, not arbitrary user input. Risk is negligible.

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 18, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 5:07 PM UTC · Completed 5:20 PM UTC
Commit: 6349a8b · View workflow run →

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 18, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 6:06 PM UTC · Completed 6:18 PM UTC
Commit: e9e0dbe · View workflow run →

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 22, 2026

Copy link
Copy Markdown

🤖 Review · ⚠️ Cancelled · Started 6:35 PM UTC · Ended 6:36 PM UTC
Commit: 4e21a60 · View workflow run →

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 22, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 6:39 PM UTC · Completed 6:53 PM UTC
Commit: 11cf91d · View workflow run →

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread eval/run-functional.sh

AGENT="${1:?agent name required}"
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] logic-error

The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.

Suggested fix: Move the existence check to before the mktemp / yq rewrite block.

```
eval/
fullsend-runner.sh # CLI runner: fixture setup -> fullsend run -> capture state
run-functional.sh # Orchestrator: iterate cases, score

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] stale-reference

The directory layout diagram shows fullsend-runner.sh as a top-level file under eval/, but the actual file is eval/scripts/run-fullsend.sh.

Suggested fix: Update the ADR diagram from fullsend-runner.sh to scripts/run-fullsend.sh.

}
```

**Implementation:** The `progressParser` in `internal/cli/progress.go`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] stale-reference

The prose references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The files-changed table also has incorrect paths.

Suggested fix: Update all references to use the correct file paths.

@@ -0,0 +1,72 @@
---

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] internal-inconsistency

Title/heading numbers (47, 48) do not match filename numbers (0051, 0052).

Suggested fix: Align the title/heading numbers with the filename numbers (51 and 52).

fi

# Remove env file to prevent secrets from being uploaded as artifacts
rm -f "$ENV_FILE"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] secrets-in-artifacts

The .eval-env file containing credentials is written to $OUTPUT_DIR. Three cleanup layers exist but writing to /tmp would eliminate the residual risk.

Comment thread internal/cli/run.go
ToolCalls int `json:"tool_calls"`
}

func writeMetricsJSON(dir string, m aggregateMetrics) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] go-error-wrapping

writeMetricsJSON returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

Suggested fix: Wrap with fmt.Errorf("marshaling metrics: %%w", err).

ralphbean added 14 commits June 22, 2026 15:06
Add a functional test framework for agent pipelines using
agent-eval-harness lifecycle hooks. The harness drives case iteration
with before_each/after_each hooks for ephemeral repo management,
while fullsend runs inside openshell sandboxes.

Key components:
- eval/scripts/setup-fixture.sh: before_each hook creates ephemeral
  GitHub repos and fixtures (issues/PRs) from input.yaml
- eval/scripts/run-fullsend.sh: CLI runner invokes fullsend run
- eval/scripts/capture-fixture.sh: after_each hook snapshots fixture
  state for judges
- eval/scripts/teardown-fixture.sh: after_each hook deletes repos
- eval/run-functional.sh: orchestrator calling workspace.py,
  execute.py, and score.py with behavioral threshold checks
- eval/triage/: first eval suite with LLM judge and label checks

Also includes CI workflow, behavioral thresholds (max_turns,
max_cost_usd), metrics capture from Claude Code stream events,
ADRs, and documentation.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
- Missing metrics.json is now a FAIL (not a warning), ensuring
  behavioral thresholds cannot be silently bypassed when the agent
  crashes or fullsend run fails.
- Validate that jq output is numeric before threshold comparison,
  preventing null/malformed values from silently passing as 0.
- Add shellcheck SC2317 disable directives for trap handler commands
  that shellcheck incorrectly flags as unreachable.

Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
The harness was referenced twice: once as a git submodule and once as a
pip install from the same git URL. Install from the already-checked-out
submodule so the fork URL only appears in .gitmodules.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Move max_turns and max_cost_usd checks from custom shell code in
run-functional.sh into deterministic check judges in eval.yaml. The
harness's score.py now enforces these via min_pass_rate: 1.0 thresholds.

Extract the pre-flight annotation validation into a standalone
eval/lint-cases.sh linter, wired up as `make lint-eval-cases` and
included in `make test`. This runs cheaply without executing agents.

Net effect: ~90 lines removed from run-functional.sh.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Extend lint-cases.sh to verify that eval.yaml declares max_turns and
max_cost judges, not just that annotations.yaml declares the thresholds.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
The CLI runner receives {output_dir} which is workspace/output. Writing
metrics.json to $OUTPUT_DIR/output/ created a double-nested path that
score.py couldn't find. Write to $OUTPUT_DIR/metrics.json instead so
the file appears at the expected output/metrics.json key.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
The triage agent consistently takes ~23 turns on this case. The
previous threshold of 15 was too tight.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Add three new triage eval cases covering the remaining major outcomes:

- 002-needs-info-vague-crash: vague issue with no repro steps, expects
  action "insufficient" and needs-info label
- 003-feature-request: clear feature request, expects action "sufficient"
  with category "feature" and triaged+feature labels (not ready-to-code)
- 004-duplicate-issue: issue duplicating a seed issue, expects action
  "duplicate" with duplicate label

Supporting changes:
- setup-fixture.sh: support seed_issues in input.yaml for pre-populating
  issues before the main fixture (needed by duplicate test)
- eval.yaml: add forbidden_labels judge to verify wrong labels are NOT
  applied (needs-info must not get ready-to-code, etc.)
- lint-cases.sh: check for forbidden_labels judge presence

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Cases use isolated ephemeral repos with UUID suffixes, so there's no
shared state. Sequential execution took ~15 min of agent time across
4 cases; parallel should bring wall-clock down to ~6 min.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
ADR 0045 was taken on main by forge-portable-harness-schema while this
branch was out of date. Renumber both branch ADRs and update all
cross-references.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
ADR 0046-host-side-api-server-design was merged to main while this
branch was open, creating a duplicate number. Renumber
0046-functional-tests-for-agent-pipelines to 0048 and update all
cross-references. Also fix stale heading numbers in ADRs 0047 and 0048.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
- Write partial metrics.json on agent failure so downstream judges can
  inspect behavioral data even when the agent errors out
- Add CLOUD_ML_REGION to runner.env in eval.yaml (was only in
  execution.env)
- Let teardown-fixture.sh fail visibly instead of swallowing errors
  with 2>/dev/null || true — the harness on_failure: continue already
  handles non-fatal teardown
- Distinguish infrastructure failures from agent failures in
  run-functional.sh by checking for case output before proceeding to
  scoring
- Fix stale reference to eval/fullsend-runner.sh in ADR 0047 (actual
  path is eval/scripts/run-fullsend.sh)

Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
Main acquired ADR numbers 0047-0050 while this branch was open.
Renumber the PR's ADRs and update all cross-references.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
@fullsend-ai-review

fullsend-ai-review Bot commented Jun 22, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 7:12 PM UTC · Completed 7:27 PM UTC
Commit: 722af5c · View workflow run →

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread eval/run-functional.sh

AGENT="${1:?agent name required}"
EVAL_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] logic-error

The existence check for $EVAL_YAML_SRC (line 27) occurs after yq has already consumed it on line 24. Under set -euo pipefail, yq aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.

Suggested fix: Move the if [[ ! -f "$EVAL_YAML_SRC" ]] check to before the mktemp / yq rewrite block.

@@ -0,0 +1,72 @@
---
title: "47. agent-eval-harness for test infrastructure"
status: Accepted

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] internal-inconsistency

The YAML frontmatter title says 47. agent-eval-harness and the H1 heading says # 47. but the filename is 0051-. Similarly, 0052-functional-tests-for-agent-pipelines.md has title/heading number 48 but filename 0052. Cross-references by number are ambiguous.

Suggested fix: Update the YAML title and H1 heading in 0051-.md to 51. and in 0052-.md to 52. to match the filenames.


# --- Create fixture ---
FIXTURE_URL=""
FIXTURE_NUMBER=""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[medium] edge-case

The ephemeral repo is created with gh repo create but .hook-outputs.yaml (which propagates EPHEMERAL_REPO to teardown) is written at the end of the script. A mid-script failure leaves an orphaned repo with no teardown path.

Suggested fix: Register a trap to delete the ephemeral repo on failure, or write .hook-outputs.yaml immediately after repo creation.

}
```

When retries occur, all values are summed. The functional test cares about

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] technical-doc-inaccuracy

The design spec references internal/cli/progress.go but the actual file is internal/runtime/claude_progress.go. The Files changed table also references wrong paths.

EPHEMERAL_REPO: "${EPHEMERAL_REPO}"
FIXTURE_URL: "${FIXTURE_URL}"
FIXTURE_NUMBER: "${FIXTURE_NUMBER}"
FIXTURE_TYPE: "${FIXTURE_TYPE}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] yaml-injection

The .hook-outputs.yaml heredoc interpolates shell variables without YAML escaping. Attack surface is limited to developers with commit access, but quoting values would be defense-in-depth.

# OUTPUT_DIR is {output_dir} from the harness, which is workspace/output.
# score.py loads files relative to case_dir/output, so metrics.json needs
# to be at OUTPUT_DIR/metrics.json (not OUTPUT_DIR/output/metrics.json).
METRICS_FILE=$(find "$OUTPUT_DIR" -maxdepth 3 -name metrics.json -not -path "$OUTPUT_DIR/metrics.json" 2>/dev/null | head -1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] edge-case

The metrics.json copy logic uses find ... | head -1 with non-deterministic ordering. A direct path would be more robust.


# --- Create fixture ---
FIXTURE_URL=""
FIXTURE_NUMBER=""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] ephemeral-repo-visibility

Ephemeral repos are created with --public. If teardown fails, orphaned public repos persist. Content is not sensitive but --private would be defense-in-depth.

Comment thread internal/cli/run.go
ToolCalls int `json:"tool_calls"`
}

func writeMetricsJSON(dir string, m aggregateMetrics) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[low] go-error-wrapping

writeMetricsJSON returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.

Suggested fix: Wrap with fmt.Errorf for marshaling and writing errors.

@fullsend-ai-retro

fullsend-ai-retro Bot commented Jun 22, 2026

Copy link
Copy Markdown

🤖 Finished Retro · ✅ Success · Started 7:34 PM UTC · Completed 7:49 PM UTC
Commit: 722af5c · View workflow run →

@fullsend-ai-retro

Copy link
Copy Markdown

Retro: PR #1682 — Functional test framework

PR type: Human-authored (Ralph Bean), branch ci/functional-evals, open ~24 days (partly waiting on upstream dependency).

Review agent behavior was the primary concern. The fullsend-ai-review[bot] posted 86 inline review comments across ~10 review rounds and issued 8 CHANGES_REQUESTED verdicts. The vast majority of comments were verbatim duplicates of findings from prior rounds:

  • yq checksum verification: flagged 8+ times (author acknowledged as non-blocking on May 29)
  • Personal fork in .gitmodules: flagged 6+ times (eventually fixed, bot kept flagging after fix)
  • Empty fixture.files edge case: flagged 7+ times
  • Existence check ordering in run-functional.sh: flagged 6+ times (code was completely rewritten, bot kept flagging the old pattern)
  • Debug code in run.go: flagged 4 times (cleaned up in early revisions)

Meanwhile, human reviewers (waynesun09's 9-agent review squad, maruiz93, ascerra) found ~8 genuinely novel high/medium findings the bot never detected across all 10 rounds:

  • Critical: composite action source uses bare relative path, breaks all external consumers (4/4 squad agents caught this)
  • High: OpenShell version drift (0.0.38 vs 0.0.54 on main)
  • High: partial metrics not written on runErr != nil early return
  • High: spec-implementation drift (design spec says orchestrator thresholds, implementation uses per-skill judges)
  • Medium: exit code masking in teardown, infrastructure failure masking by || true, doc-schema mismatches, grep -P portability on macOS

Existing issues already cover the main improvements:

  • #1013 — Deduplicate findings across iterations on the same PR
  • #1285 — Don't regenerate unchanged inline comments on re-reviews
  • #956 — Resolve inline comments from previous reviews on re-review
  • #1500 — Don't re-request changes for unchanged findings
  • #1583 — Recognize human-resolved findings and stop re-flagging
  • #992 / #1570 — ADR collision CI lint / auto-renumber

This PR is a strong data point for prioritizing #1013 and #1285. The 86 duplicate comments represent significant token waste and human attention cost. The bot consumed ~10x the review budget of the human reviewers while finding fewer unique issues.

No new proposals filed — all identified improvements are covered by existing open issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/ci CI pipelines and checks component/runner Agent runner behavior and lifecycle go Pull requests that update go code requires-manual-review Review requires human judgment submodules Pull requests that update submodules code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add functional tests for agents using local openshell sandbox

5 participants