SMA [7] Add eval isolation: bare-vs-guided proof that reference docs change output by seanoc5 · Pull Request #46 · o19s/opensearch-migrations

seanoc5 · 2026-04-28T15:39:15Z

Summary

Adds cwd isolation to the eval test runner so bare-vs-guided comparisons work correctly
Proves that reference docs (specifically 10-data-migration-tooling.md) measurably change LLM output quality
Bare Claude honestly refuses to fabricate SolrReader class names; guided Claude names them correctly from the reference

What changed

claude_requests.py: _resolve_cwd() copies fixture dirs to /tmp for isolation; allowed_tools configurable per provider
eval-datastrat-proof.yaml: bare-vs-guided proof test
fixtures/skill-bare/CLAUDE.md: minimal advisor prompt
fixtures/skill-guided/CLAUDE.md: advisor prompt + inlined reference content
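
The cwd isolation described for _resolve_cwd() can be sketched roughly as follows. This is a minimal illustration, not the code in claude_requests.py: the function name resolve_cwd_sketch, the temp-directory prefix, and the error handling are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_cwd_sketch(fixture_dir: str) -> str:
    """Copy a fixture directory into a fresh temp dir and return the copy's path.

    Hypothetical stand-in for _resolve_cwd(); the real helper may differ.
    """
    src = Path(fixture_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"fixture directory not found: {src}")

    # mkdtemp creates a unique directory under the system temp location
    # (typically /tmp on Linux), so the copy has no skill directory above it.
    tmp_root = Path(tempfile.mkdtemp(prefix="eval-cwd-"))
    dest = tmp_root / src.name
    shutil.copytree(src, dest)
    return str(dest)
```

The returned path would then be passed to the agent as its working directory in place of the fixture path inside the repository.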

The problem this solves

Eval tests run from inside the skill directory. Without isolation, the agent can traverse up and find reference docs even in a "bare" test — making it impossible to prove that steering content actually changes output.
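
To make the leak concrete, here is a small sketch of the kind of upward search an agent can effectively perform from its working directory. The helper name find_upward and the relative path used in the usage note are hypothetical; the repository's actual layout is not shown in this PR.

```python
from pathlib import Path
from typing import Optional


def find_upward(start: str, relative_path: str) -> Optional[Path]:
    """Return the first match for relative_path in start or any ancestor, else None."""
    current = Path(start).resolve()
    for directory in (current, *current.parents):
        candidate = directory / relative_path
        if candidate.exists():
            return candidate
    return None
```

Under this assumption, calling find_upward(cwd, "references/10-data-migration-tooling.md") from a fixture inside the skill directory would return a path, while the same call from an isolated copy under /tmp would return None.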

Test plan

cd tests/evals && PROMPTFOO_PYTHON=../../.venv/bin/python promptfoo eval -c eval-datastrat-proof.yaml --no-cache
Verify bare=RED, guided=GREEN in promptfoo view

Supersedes #17.

🤖 Generated with Claude Code

…utput

The eval test runner (claude_requests.py) now supports cwd isolation: when a provider specifies a fixture directory as cwd, it's copied to /tmp so the agent cannot traverse up into the real skill directory and discover reference docs that should be hidden.

Provider config also supports allowed_tools restriction (bare tests get no WebFetch/Skill).

Includes a proof test (eval-datastrat-proof.yaml) that demonstrates:
bare (no references) → FAIL: "I don't want to risk fabricating class names"
guided (with CLAUDE.md) → PASS: names SolrBackupSource, SolrDocumentSource, RFS pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The Claude Agent SDK spawns a subprocess; when the event loop closes after query completion, Python writes "Loop ... is closed" to stderr. Promptfoo captures all stderr as ERROR, making clean runs look broken. Filter out the harmless warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
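
One way to filter such a warning is sketched below. It is an illustration only: the function name filter_stderr and the regular expression are assumptions based on the "Loop ... is closed" wording quoted in the commit message, and the actual change in claude_requests.py may filter differently.

```python
import re

# Matches lines shaped like the 'Loop ... is closed' message quoted in the commit.
_LOOP_CLOSED = re.compile(r"Loop .* is closed")


def filter_stderr(text: str) -> str:
    """Drop lines matching the loop-closed warning; keep every other line."""
    kept = [line for line in text.splitlines() if not _LOOP_CLOSED.search(line)]
    return "\n".join(kept)
```

Matching per line keeps any genuine error output intact, so a run that fails for a real reason would still surface as ERROR.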

seanoc5 · 2026-05-11T16:35:54Z

Quick note for anyone watching this PR: I briefly pushed an unrelated WIP commit (5856ca7f, steering/SKILL.md refinements) on top of this branch this morning, then realized that muddied PR 46's scope. I just git push --force-with-lease'd the branch back to its original ea32f2a6 + dcbd0659 — this PR's diff and scope are unchanged from what you originally saw.

The rewound commit lives on a separate branch feat/sma-steering-refs-discussion-wip and is referenced from #TBD (steering/references discussion). No action needed here — just flagging in case the branch-update notification caught your eye.

seanoc5 and others added 2 commits April 28, 2026 11:36

seanoc5 force-pushed the feature/sma-skill-impact branch from 5856ca7 to dcbd065 May 11, 2026 16:35

seanoc5 wants to merge 2 commits into main from feature/sma-skill-impact
