feat: add functional test framework for agent pipelines#1682
Site preview: https://8567a3ef-site.fullsend-ai.workers.dev
Review
Findings: Medium, Low
Previous run: Findings: Medium, Low. Labels: PR adds git submodule for agent-eval-harness
Previous run (2): Findings: Medium, Low, Info
Previous run (3): Findings: Medium, Low, Info. Labels: PR adds functional test CI workflow and modifies agent runner metrics emission in Go code.
Previous run (4): Findings: Critical, Medium, Low, Info
Previous run (5): Findings: Medium, Low, Info
Previous run (6): Findings: Medium, Low, Info
Previous run (7): Reason: stale-head. The review agent reviewed commit [...]
Previous run (8): Findings: Medium, Low, Info
Previous run (9): Findings: Medium, Low, Info
Previous run (10): Findings: Critical, Medium, Low, Info
Previous run (11): Findings: Critical, Medium, Low, Info
Previous run (12): Findings: High, Medium, Low, Info
Previous run (13): Findings: High, Medium, Low, Info
6e40384 to 826ceaf
🤖 Finished Review · ✅ Success · Started 3:10 PM UTC · Completed 3:26 PM UTC |
| } | ||
| ``` | ||
|
|
||
| **Implementation:** The `progressParser` in `internal/cli/progress.go` |
[medium] stale-reference
The prose references `internal/cli/progress.go` but the actual file is `internal/runtime/claude_progress.go`.
Suggested fix: Change `internal/cli/progress.go` to `internal/runtime/claude_progress.go`.
| judges, and baselined roughly until statistical evals provide observed | ||
| distributions. | ||
|
|
||
| ### 6. `fullsend-runner.sh` propagates `metrics.json` |
[low] stale-reference
Section 6 heading references `fullsend-runner.sh` but the actual file is `eval/scripts/run-fullsend.sh`.
Suggested fix: Change heading from `fullsend-runner.sh` to `run-fullsend.sh`.
| git -C "$TARGET_DIR" commit -m "eval: initial content" | ||
| git -C "$TARGET_DIR" push origin HEAD | ||
| fi | ||
|
|
[low] edge-case
For a `pull_request` fixture with empty `fixture.files` (the default), `gh pr create` would fail because no diff exists from `main`. No current test case exercises this path.
| || rc=$? | ||
|
|
||
| if [[ $rc -ne 0 ]]; then | ||
| echo "WARNING: fullsend run exited with status $rc" >&2 |
[low] secrets-handling
The `.eval-env` file contains plaintext secrets on disk during execution. Five defense layers are in place. Leakage is unlikely.
|
|
||
| if ! gh repo delete "$EPHEMERAL_REPO" --yes 2>&1; then | ||
| echo "WARNING: failed to delete $EPHEMERAL_REPO — may need manual cleanup" >&2 | ||
| echo "::warning::Ephemeral repo $EPHEMERAL_REPO was not deleted — manual cleanup required" |
[info] gha-workflow-command-injection
The `::warning::` annotation interpolates `$EPHEMERAL_REPO` without sanitizing for `::` sequences. `EVAL_ORG` is set by repo admins, not arbitrary user input. Risk is negligible.
|
🤖 Finished Review · ✅ Success · Started 5:07 PM UTC · Completed 5:20 PM UTC |
|
🤖 Finished Review · ✅ Success · Started 6:06 PM UTC · Completed 6:18 PM UTC |
|
🤖 Review · |
|
🤖 Finished Review · ✅ Success · Started 6:39 PM UTC · Completed 6:53 PM UTC |
|
|
||
| AGENT="${1:?agent name required}" | ||
| EVAL_DIR="$(cd "$(dirname "$0")" && pwd)" | ||
| REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)" |
[medium] logic-error
The existence check for `$EVAL_YAML_SRC` (line 27) occurs after `yq` has already consumed it on line 24. Under `set -euo pipefail`, `yq` aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
Suggested fix: Move the existence check to before the `mktemp` / `yq` rewrite block.
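The reordering suggested above can be sketched as follows. This is a minimal illustration, not the script's actual code: `rewrite_eval_yaml` is a hypothetical helper, and `cat` stands in for the `yq` rewrite expression, which this thread does not show.

```bash
# Hypothetical helper: validate the source file BEFORE any tool reads it,
# so the diagnostic is printed instead of a tool-specific error.
rewrite_eval_yaml() {
    local src="$1" tmp
    if [ ! -f "$src" ]; then
        echo "ERROR: eval config not found: $src" >&2
        return 1
    fi
    tmp="$(mktemp)" || return 1
    # Placeholder for the real yq rewrite step (assumption: it reads $src
    # and writes the rewritten config to $tmp).
    cat "$src" > "$tmp" || { rm -f "$tmp"; return 1; }
    printf '%s\n' "$tmp"
}
```

With the guard first, a missing file produces the intended message and a non-zero return, and the rewrite step is never reached.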
| ``` | ||
| eval/ | ||
| fullsend-runner.sh # CLI runner: fixture setup -> fullsend run -> capture state | ||
| run-functional.sh # Orchestrator: iterate cases, score |
[medium] stale-reference
The directory layout diagram shows `fullsend-runner.sh` as a top-level file under `eval/`, but the actual file is `eval/scripts/run-fullsend.sh`.
Suggested fix: Update the ADR diagram from `fullsend-runner.sh` to `scripts/run-fullsend.sh`.
| } | ||
| ``` | ||
|
|
||
| **Implementation:** The `progressParser` in `internal/cli/progress.go` |
[medium] stale-reference
The prose references `internal/cli/progress.go` but the actual file is `internal/runtime/claude_progress.go`. The files-changed table also has incorrect paths.
Suggested fix: Update all references to use the correct file paths.
| @@ -0,0 +1,72 @@ | |||
| --- | |||
[medium] internal-inconsistency
Title/heading numbers (47, 48) do not match filename numbers (0051, 0052).
Suggested fix: Align the title/heading numbers with the filename numbers (51 and 52).
| fi | ||
|
|
||
| # Remove env file to prevent secrets from being uploaded as artifacts | ||
| rm -f "$ENV_FILE" |
[low] secrets-in-artifacts
The `.eval-env` file containing credentials is written to `$OUTPUT_DIR`. Three cleanup layers exist but writing to `/tmp` would eliminate the residual risk.
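One way to follow the suggestion above is to create the env file with `mktemp` outside `$OUTPUT_DIR` and remove it from an `EXIT` trap. This is a sketch under assumptions: `with_env_file` is a hypothetical wrapper, and the real script's variable names and `fullsend` invocation are not reproduced here.

```bash
# Hypothetical wrapper: with_env_file CONTENT COMMAND [ARGS...]
# Writes CONTENT to a private temp file, runs COMMAND with the file path
# appended as its last argument, and removes the file when the subshell exits.
with_env_file() (
    umask 077                      # owner-only permissions for the temp file
    env_file="$(mktemp "${TMPDIR:-/tmp}/eval-env.XXXXXX")" || exit 1
    trap 'rm -f "$env_file"' EXIT  # fires on success and on failure
    printf '%s\n' "$1" > "$env_file"
    shift
    "$@" "$env_file"
)
```

Because the function body is a subshell, the trap is scoped to this call and does not replace any `EXIT` trap the calling script has already registered.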
| ToolCalls int `json:"tool_calls"` | ||
| } | ||
|
|
||
| func writeMetricsJSON(dir string, m aggregateMetrics) error { |
[low] go-error-wrapping
`writeMetricsJSON` returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
Suggested fix: Wrap with `fmt.Errorf("marshaling metrics: %w", err)`.
Add a functional test framework for agent pipelines using agent-eval-harness lifecycle hooks. The harness drives case iteration with before_each/after_each hooks for ephemeral repo management, while fullsend runs inside openshell sandboxes.
Key components:
- eval/scripts/setup-fixture.sh: before_each hook creates ephemeral GitHub repos and fixtures (issues/PRs) from input.yaml
- eval/scripts/run-fullsend.sh: CLI runner invokes fullsend run
- eval/scripts/capture-fixture.sh: after_each hook snapshots fixture state for judges
- eval/scripts/teardown-fixture.sh: after_each hook deletes repos
- eval/run-functional.sh: orchestrator calling workspace.py, execute.py, and score.py with behavioral threshold checks
- eval/triage/: first eval suite with LLM judge and label checks
Also includes CI workflow, behavioral thresholds (max_turns, max_cost_usd), metrics capture from Claude Code stream events, ADRs, and documentation.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

- Missing metrics.json is now a FAIL (not a warning), ensuring behavioral thresholds cannot be silently bypassed when the agent crashes or fullsend run fails.
- Validate that jq output is numeric before threshold comparison, preventing null/malformed values from silently passing as 0.
- Add shellcheck SC2317 disable directives for trap handler commands that shellcheck incorrectly flags as unreachable.
Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

The harness was referenced twice: once as a git submodule and once as a pip install from the same git URL. Install from the already-checked-out submodule so the fork URL only appears in .gitmodules.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Move max_turns and max_cost_usd checks from custom shell code in run-functional.sh into deterministic check judges in eval.yaml. The harness's score.py now enforces these via min_pass_rate: 1.0 thresholds.
Extract the pre-flight annotation validation into a standalone eval/lint-cases.sh linter, wired up as `make lint-eval-cases` and included in `make test`. This runs cheaply without executing agents.
Net effect: ~90 lines removed from run-functional.sh.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Extend lint-cases.sh to verify that eval.yaml declares max_turns and max_cost judges, not just that annotations.yaml declares the thresholds.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

The CLI runner receives {output_dir} which is workspace/output. Writing
metrics.json to $OUTPUT_DIR/output/ created a double-nested path that
score.py couldn't find. Write to $OUTPUT_DIR/metrics.json instead so
the file appears at the expected output/metrics.json key.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

The triage agent consistently takes ~23 turns on this case. The previous threshold of 15 was too tight.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Add three new triage eval cases covering the remaining major outcomes:
- 002-needs-info-vague-crash: vague issue with no repro steps, expects action "insufficient" and needs-info label
- 003-feature-request: clear feature request, expects action "sufficient" with category "feature" and triaged+feature labels (not ready-to-code)
- 004-duplicate-issue: issue duplicating a seed issue, expects action "duplicate" with duplicate label
Supporting changes:
- setup-fixture.sh: support seed_issues in input.yaml for pre-populating issues before the main fixture (needed by duplicate test)
- eval.yaml: add forbidden_labels judge to verify wrong labels are NOT applied (needs-info must not get ready-to-code, etc.)
- lint-cases.sh: check for forbidden_labels judge presence
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
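
The `forbidden_labels` check described in this commit boils down to a set-intersection test between the labels actually applied and the labels that must not appear. The sketch below is illustrative shell only, not the harness's judge syntax (which this thread does not show); `check_forbidden_labels` and the comma-separated argument format are assumptions.

```bash
# check_forbidden_labels APPLIED FORBIDDEN
# Both arguments are comma-separated label lists (assumed format).
# Prints one FAIL line per forbidden label that was applied and
# returns non-zero if any were found.
check_forbidden_labels() {
    local applied=",$1," label rc=0
    local old_ifs="$IFS"
    IFS=','
    for label in $2; do
        case "$applied" in
            *",$label,"*)
                echo "FAIL: forbidden label applied: $label"
                rc=1
                ;;
        esac
    done
    IFS="$old_ifs"
    return "$rc"
}
```

Wrapping the applied list in leading and trailing commas makes the match exact, so a label such as `bug` does not match inside `bugfix`.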

Cases use isolated ephemeral repos with UUID suffixes, so there's no shared state. Sequential execution took ~15 min of agent time across 4 cases; parallel should bring wall-clock down to ~6 min.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

ADR 0045 was taken on main by forge-portable-harness-schema while this branch was out of date. Renumber both branch ADRs and update all cross-references.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

ADR 0046-host-side-api-server-design was merged to main while this branch was open, creating a duplicate number. Renumber 0046-functional-tests-for-agent-pipelines to 0048 and update all cross-references. Also fix stale heading numbers in ADRs 0047 and 0048.
Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

- Write partial metrics.json on agent failure so downstream judges can inspect behavioral data even when the agent errors out
- Add CLOUD_ML_REGION to runner.env in eval.yaml (was only in execution.env)
- Let teardown-fixture.sh fail visibly instead of swallowing errors with 2>/dev/null || true (the harness on_failure: continue already handles non-fatal teardown)
- Distinguish infrastructure failures from agent failures in run-functional.sh by checking for case output before proceeding to scoring
- Fix stale reference to eval/fullsend-runner.sh in ADR 0047 (actual path is eval/scripts/run-fullsend.sh)
Signed-off-by: Ralph Bean <rbean@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>

Main acquired ADR numbers 0047-0050 while this branch was open. Renumber the PR's ADRs and update all cross-references.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ralph Bean <rbean@redhat.com>
|
🤖 Finished Review · ✅ Success · Started 7:12 PM UTC · Completed 7:27 PM UTC |
|
|
||
| AGENT="${1:?agent name required}" | ||
| EVAL_DIR="$(cd "$(dirname "$0")" && pwd)" | ||
| REPO_ROOT="$(cd "${EVAL_DIR}/.." && pwd)" |
[medium] logic-error
The existence check for `$EVAL_YAML_SRC` (line 27) occurs after `yq` has already consumed it on line 24. Under `set -euo pipefail`, `yq` aborts with a cryptic error before the intended diagnostic message is shown. The guard is dead code.
Suggested fix: Move the `if [[ ! -f "$EVAL_YAML_SRC" ]]` check to before the `mktemp` / `yq` rewrite block.
| @@ -0,0 +1,72 @@ | |||
| --- | |||
| title: "47. agent-eval-harness for test infrastructure" | |||
| status: Accepted | |||
[medium] internal-inconsistency
The YAML frontmatter title says `47. agent-eval-harness` and the H1 heading says `# 47.` but the filename is 0051-. Similarly, `0052-functional-tests-for-agent-pipelines.md` has title/heading number 48 but filename 0052. Cross-references by number are ambiguous.
Suggested fix: Update the YAML title and H1 heading in 0051-.md to 51. and in 0052-.md to 52. to match the filenames.
|
|
||
| # --- Create fixture --- | ||
| FIXTURE_URL="" | ||
| FIXTURE_NUMBER="" |
[medium] edge-case
The ephemeral repo is created with `gh repo create` but `.hook-outputs.yaml` (which propagates `EPHEMERAL_REPO` to teardown) is written at the end of the script. A mid-script failure leaves an orphaned repo with no teardown path.
Suggested fix: Register a trap to delete the ephemeral repo on failure, or write `.hook-outputs.yaml` immediately after repo creation.
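The first option in the suggested fix can be sketched as below. This is an illustration under assumptions, not the contents of `setup-fixture.sh`: `setup_with_cleanup` and the `DELETE_CMD` hook are hypothetical, and the hook defaults to `echo` so the sketch runs without GitHub access. In a real script the delete step would be the `gh repo delete "$EPHEMERAL_REPO" --yes` call already used by teardown.

```bash
# Hypothetical hook: replaced by the real delete command in practice.
DELETE_CMD="${DELETE_CMD:-echo would-delete}"

# setup_with_cleanup REPO STEP [ARGS...]
# Assumes REPO has just been created. Runs STEP (standing in for fixture
# creation); if anything fails before the subshell finishes, REPO is deleted.
setup_with_cleanup() (
    repo="$1"
    shift
    # The trap sees the subshell's exit status and only cleans up on failure.
    trap 'rc=$?; if [ "$rc" -ne 0 ]; then $DELETE_CMD "$repo"; fi' EXIT
    "$@" || exit 1
    echo "fixture ready for $repo"
)
```

The alternative named in the comment, writing `.hook-outputs.yaml` right after repo creation, avoids the trap entirely but relies on the harness still running teardown after a failed setup hook.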
| } | ||
| ``` | ||
|
|
||
| When retries occur, all values are summed. The functional test cares about |
[low] technical-doc-inaccuracy
The design spec references `internal/cli/progress.go` but the actual file is `internal/runtime/claude_progress.go`. The Files changed table also references wrong paths.
| EPHEMERAL_REPO: "${EPHEMERAL_REPO}" | ||
| FIXTURE_URL: "${FIXTURE_URL}" | ||
| FIXTURE_NUMBER: "${FIXTURE_NUMBER}" | ||
| FIXTURE_TYPE: "${FIXTURE_TYPE}" |
[low] yaml-injection
The `.hook-outputs.yaml` heredoc interpolates shell variables without YAML escaping. Attack surface is limited to developers with commit access, but quoting values would be defense-in-depth.
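A minimal sketch of the escaping this comment asks for, assuming single-line values. The heredoc shown above wraps each value in double quotes, which breaks if a value itself contains `"` or `\`. YAML single-quoted scalars need only one escape (a literal `'` is written as `''`), so they are simpler to produce from shell. `yaml_squote` and `write_hook_outputs` are hypothetical names; the four keys are the ones in the diff.

```bash
# yaml_squote VALUE: emit VALUE as a YAML single-quoted scalar.
# Inside single quotes the only escape is a doubled quote ('').
yaml_squote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# write_hook_outputs FILE: same four keys as the heredoc, values escaped.
write_hook_outputs() {
    {
        printf 'EPHEMERAL_REPO: %s\n' "$(yaml_squote "$EPHEMERAL_REPO")"
        printf 'FIXTURE_URL: %s\n'    "$(yaml_squote "$FIXTURE_URL")"
        printf 'FIXTURE_NUMBER: %s\n' "$(yaml_squote "$FIXTURE_NUMBER")"
        printf 'FIXTURE_TYPE: %s\n'   "$(yaml_squote "$FIXTURE_TYPE")"
    } > "$1"
}
```

Values containing newlines would be folded by a YAML parser in this style, so this approach only suits single-line values such as repo names, URLs, and issue numbers.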
| # OUTPUT_DIR is {output_dir} from the harness, which is workspace/output. | ||
| # score.py loads files relative to case_dir/output, so metrics.json needs | ||
| # to be at OUTPUT_DIR/metrics.json (not OUTPUT_DIR/output/metrics.json). | ||
| METRICS_FILE=$(find "$OUTPUT_DIR" -maxdepth 3 -name metrics.json -not -path "$OUTPUT_DIR/metrics.json" 2>/dev/null | head -1) |
[low] edge-case
The `metrics.json` copy logic uses `find ... | head -1` with non-deterministic ordering. A direct path would be more robust.
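The direct path the comment recommends is not shown in this thread, so the sketch below takes a different route to the same goal: it keeps the `find` expression from the diff but refuses to guess when the result is ambiguous. `find_single_metrics` is a hypothetical helper name.

```bash
# find_single_metrics DIR
# Prints the path of the one nested metrics.json under DIR (ignoring
# DIR/metrics.json itself). Fails if there are zero or several matches
# instead of silently picking whichever one find happens to list first.
find_single_metrics() {
    local dir="$1" matches count
    matches="$(find "$dir" -maxdepth 3 -name metrics.json \
        -not -path "$dir/metrics.json" 2>/dev/null | LC_ALL=C sort)"
    if [ -z "$matches" ]; then
        echo "ERROR: no metrics.json found under $dir" >&2
        return 1
    fi
    count="$(printf '%s\n' "$matches" | wc -l | tr -d ' ')"
    if [ "$count" -ne 1 ]; then
        echo "ERROR: expected one metrics.json under $dir, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}
```

If the runner's output location is known, replacing the search with that fixed path remains the simpler option.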
|
|
||
| # --- Create fixture --- | ||
| FIXTURE_URL="" | ||
| FIXTURE_NUMBER="" |
[low] ephemeral-repo-visibility
Ephemeral repos are created with `--public`. If teardown fails, orphaned public repos persist. Content is not sensitive but `--private` would be defense-in-depth.
| ToolCalls int `json:"tool_calls"` | ||
| } | ||
|
|
||
| func writeMetricsJSON(dir string, m aggregateMetrics) error { |
[low] go-error-wrapping
`writeMetricsJSON` returns raw errors without wrapping with descriptive context, inconsistent with the established error-handling pattern in the rest of the file.
Suggested fix: Wrap with `fmt.Errorf` for marshaling and writing errors.
|
🤖 Finished Retro · ✅ Success · Started 7:34 PM UTC · Completed 7:49 PM UTC |
Retro: PR #1682 (Functional test framework)
PR type: Human-authored (Ralph Bean), branch [...]
Review agent behavior was the primary concern. The [...]
Meanwhile, human reviewers (waynesun09's 9-agent review squad, maruiz93, ascerra) found ~8 genuinely novel high/medium findings the bot never detected across all 10 rounds:
Existing issues already cover the main improvements:
This PR is a strong data point for prioritizing #1013 and #1285. The 86 duplicate comments represent significant token waste and human attention cost. The bot consumed ~10x the review budget of the human reviewers while finding fewer unique issues. No new proposals filed — all identified improvements are covered by existing open issues. |
Summary
- [...] `001-bug-url-encoding`) deliberately tests whether the triage agent reads source code critically vs. parroting the issue description
- [...] `max_turns`, `max_cost_usd`) enforced universally by the orchestrator: agents that pass quality judges but loop or burn tokens fail the test
What's in the box
- `eval/fullsend-runner.sh`: [...]
- `eval/run-functional.sh`: [...]
- `eval/triage/eval.yaml`: [...]
- `eval/triage/cases/001-*`: [...] `+` but issue claims it doesn't)
- `eval/triage/repos/python-webapp/`: [...]
- `.github/workflows/functional-tests.yml`: [...] `eval/` or `internal/scaffold/` changes
- `Makefile`: `make functional-tests` target
- `internal/cli/progress.go`: [...] `num_turns`, `total_cost_usd`, token counts from Claude Code stream events
- `internal/cli/run.go`: [...] `metrics.json` with aggregated behavioral metrics after all iterations
- `docs/testing/functional-tests.md`: [...]
Behavioral thresholds
Every test case declares `max_turns` and `max_cost_usd` in `annotations.yaml`. The orchestrator validates these before running and compares against `metrics.json` after each run. This catches efficiency regressions (looping agents, model cost spikes) that quality judges can't see.
Thresholds are universal invariants, not per-skill judges, so every skill gets them automatically and new skills can't opt out.
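A minimal sketch of the post-run comparison described above, assuming the observed value has already been extracted from `metrics.json` (for example with `jq`). The function names `is_numeric` and `check_threshold` are hypothetical and not taken from `run-functional.sh`. The numeric guard reflects the concern raised in the commit history that a `null` or malformed value must not pass silently as 0.

```bash
# is_numeric VALUE: succeed only for plain non-negative decimals such as
# 23 or 0.42. Empty strings, "null", and anything with other characters fail.
is_numeric() {
    case "$1" in
        ''|.|*[!0-9.]*|*.*.*) return 1 ;;
        *) return 0 ;;
    esac
}

# check_threshold NAME OBSERVED MAX
# Prints a PASS or FAIL line; returns non-zero when the threshold is
# exceeded or the observed value is not a number.
check_threshold() {
    local name="$1" observed="$2" max="$3"
    if ! is_numeric "$observed"; then
        echo "FAIL: $name is not numeric: '$observed'"
        return 1
    fi
    # awk handles the floating-point comparison needed for cost values.
    if awk -v o="$observed" -v m="$max" 'BEGIN { exit !(o <= m) }'; then
        echo "PASS: $name $observed <= $max"
    else
        echo "FAIL: $name $observed > $max"
        return 1
    fi
}
```

For example, `check_threshold max_turns 23 25` reports a pass, while `check_threshold max_cost_usd null 1.00` fails rather than treating `null` as zero.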
Current results
The LLM judge scores the triage agent at 3/5: correct labels and reasonable comment, but the agent accepts the issue at face value without noticing the regex actually handles `+`. Threshold set to 2.5 with a TODO to raise it.
Still TODO
- [...] `evals` GitHub environment with secrets (`EVAL_GH_TOKEN`, `GCP_CREDENTIALS`) and vars (`EVAL_ORG`, `ANTHROPIC_VERTEX_PROJECT_ID`)
- [...] `execute.py` expectations so we can use their parallelism
Refs #499, #73
Closes #346
🤖 Generated with Claude Code