Add baseline scenarios, fixtures, and harness-failure classification for uploads #12

Merged
Swiftyos merged 3 commits into main from baseline-fixtures on Apr 15, 2026

@Swiftyos Swiftyos commented Apr 15, 2026

Intent

Ship the baseline evaluation surface so runs can execute against real fixtures and distinguish harness failures from agent failures during attachment upload.

  • Add data/baseline-scenarios.yaml (330 lines) describing the baseline scenario set with attachment references.
  • Add data/fixture-manifest.json plus the full fixture tree under data/fixtures/ (CSVs, JSON, YAML, MD, PDFs, archives) so scenarios have reproducible inputs on disk.
  • Classify attachment upload errors as harness failures rather than agent failures in src/domains/evaluation/run-suite.ts and src/providers/sdk/http-endpoint.ts, with a matching helper in src/shared/utils/errors.ts.

Behavior changes

  • Baseline scenarios and fixtures are now available to the runner as on-disk sources of truth.
  • When an HTTP endpoint fails to accept an attachment upload, the run is recorded as a harness failure (infrastructure) instead of being blamed on the agent, preserving agent-quality signal in aggregate metrics.
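
The classification described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `AttachmentUploadError`, `classifyFailure`, `RunResult`, and `agentPassRate` are hypothetical names, and excluding harness failures from the pass-rate denominator is only one assumed way of keeping the agent-quality signal clean.

```typescript
// Hypothetical sketch: none of these names are taken from the PR's diff.

type FailureKind = "harness" | "agent";

// Assumed error type raised when an endpoint rejects an attachment upload.
class AttachmentUploadError extends Error {
  attachmentPath: string;

  constructor(message: string, attachmentPath: string) {
    super(message);
    this.name = "AttachmentUploadError";
    this.attachmentPath = attachmentPath;
  }
}

// Upload errors are treated as infrastructure (harness) failures;
// any other error is attributed to the agent.
function classifyFailure(error: unknown): FailureKind {
  const isUploadError =
    error instanceof Error && error.name === "AttachmentUploadError";
  return isUploadError ? "harness" : "agent";
}

interface RunResult {
  scenarioId: string;
  passed: boolean;
  failureKind?: FailureKind; // set only when the run failed
}

// Pass rate over runs that actually reached the agent.
// Returns NaN when every run was a harness failure.
function agentPassRate(results: RunResult[]): number {
  const scored = results.filter((r) => r.failureKind !== "harness");
  if (scored.length === 0) return NaN;
  return scored.filter((r) => r.passed).length / scored.length;
}
```

With this shape, a run whose upload was rejected still appears in the results with `failureKind: "harness"`, so it can be counted and retried separately, but it does not lower the agent's score.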

Validation

  • ./scripts/fast-feedback.sh passed
  • New unit coverage in tests/unit/db.test.ts and tests/unit/runner.test.ts for the failure-classification path
  • Behavior docs updated (if behavior changed)

Screenshots / video

N/A for CLI-only changes.

@Swiftyos Swiftyos changed the title from "Baseline fixtures" to "Add baseline scenarios, fixtures, and harness-failure classification for uploads" Apr 15, 2026
@Swiftyos Swiftyos merged commit 2327faf into main Apr 15, 2026
3 checks passed