SMA [7] Add eval isolation: bare-vs-guided proof that reference docs change output #46
Open
seanoc5 wants to merge 2 commits into
…utput

The eval test runner (claude_requests.py) now supports cwd isolation: when a provider specifies a fixture directory as cwd, it's copied to /tmp so the agent cannot traverse up into the real skill directory and discover reference docs that should be hidden. Provider config also supports allowed_tools restriction (bare tests get no WebFetch/Skill).

Includes a proof test (eval-datastrat-proof.yaml) that demonstrates:
- bare (no references) → FAIL: "I don't want to risk fabricating class names"
- guided (with CLAUDE.md) → PASS: names SolrBackupSource, SolrDocumentSource, RFS pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
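A minimal sketch of how a per-provider allowed_tools restriction could be resolved. The real logic in claude_requests.py is not shown in this PR page, so the function name `resolve_allowed_tools`, the `DEFAULT_TOOLS` list, and the shape of the provider config dict are all assumptions made for illustration; only the `allowed_tools` key and the WebFetch/Skill tool names come from the commit message.

```python
# Hypothetical default tool set; the actual defaults are not stated in the PR.
DEFAULT_TOOLS = ["Read", "WebFetch", "Skill"]


def resolve_allowed_tools(provider_config: dict) -> list[str]:
    """Return the tool list for one provider.

    An explicit `allowed_tools` entry (even an empty list) wins over the
    defaults, so a bare provider can opt out of WebFetch and Skill.
    """
    tools = provider_config.get("allowed_tools")
    if tools is None:
        return list(DEFAULT_TOOLS)
    return list(tools)


# Illustrative provider configs: the bare run drops WebFetch and Skill.
bare_provider = {"cwd": "fixtures/skill-bare", "allowed_tools": ["Read"]}
guided_provider = {"cwd": "fixtures/skill-guided"}

bare_tools = resolve_allowed_tools(bare_provider)
guided_tools = resolve_allowed_tools(guided_provider)
```

Treating a missing key differently from an empty list keeps "no restriction configured" distinct from "no tools at all", which matters when the bare run must be provably more limited than the guided one.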
The Claude Agent SDK spawns a subprocess; when the event loop closes after query completion, Python writes "Loop ... is closed" to stderr. Promptfoo captures all stderr as ERROR, making clean runs look broken. Filter out the harmless warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
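The filtering described in this commit might look roughly like the sketch below. It is not the code from the PR: the helper names `is_loop_closed_noise` and `filter_stderr` are hypothetical, and matching on the substrings "Loop" and "is closed" is an assumption based only on the warning text quoted above.

```python
def is_loop_closed_noise(line: str) -> bool:
    """Heuristic match for the 'Loop ... is closed' warning (assumed wording)."""
    return "Loop" in line and "is closed" in line


def filter_stderr(text: str) -> str:
    """Drop loop-closed warning lines and keep every other stderr line intact."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not is_loop_closed_noise(line)
    ]
    return "".join(kept)


# Illustrative captured stderr: one noise line, one genuine error.
captured = "Loop <example loop repr> is closed\nreal error: boom\n"
cleaned = filter_stderr(captured)
```

Filtering line by line, instead of discarding stderr whenever the warning appears, keeps genuine errors visible to Promptfoo in the same run.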
5856ca7 to
dcbd065
Quick note for anyone watching this PR: I briefly pushed an unrelated WIP commit ( The rewound commit lives on a separate branch |
Summary
10-data-migration-tooling.md) measurably change LLM output quality

What changed
- claude_requests.py: _resolve_cwd() copies fixture dirs to /tmp for isolation; allowed_tools configurable per provider
- eval-datastrat-proof.yaml: bare-vs-guided proof test
- fixtures/skill-bare/CLAUDE.md: minimal advisor prompt
- fixtures/skill-guided/CLAUDE.md: advisor prompt + inlined reference content

The problem this solves
Eval tests run from inside the skill directory. Without isolation, the agent can traverse up and find reference docs even in a "bare" test — making it impossible to prove that steering content actually changes output.
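The isolation idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's `_resolve_cwd()`: the function name `resolve_isolated_cwd`, the `eval-cwd-` prefix, and the use of `tempfile.mkdtemp` (which places the copy under the system temp directory, typically /tmp on Linux) are choices made here for the example.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_isolated_cwd(fixture_dir: str) -> Path:
    """Copy a fixture directory into a fresh temp directory and return the copy.

    The copy's parent is a new, otherwise empty directory, so walking up from
    the returned path never reaches the skill directory or its reference docs.
    """
    source = Path(fixture_dir).resolve()
    if not source.is_dir():
        raise NotADirectoryError(f"fixture directory not found: {source}")
    # mkdtemp creates a unique directory under the system temp location.
    sandbox = Path(tempfile.mkdtemp(prefix="eval-cwd-"))
    target = sandbox / source.name
    shutil.copytree(source, target)
    return target
```

A caller would pass the provider's configured cwd (for example fixtures/skill-bare) and hand the returned path to the agent as its working directory. Cleaning up the temp directory after the run is left out of the sketch.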
Test plan
- cd tests/evals && PROMPTFOO_PYTHON=../../.venv/bin/python promptfoo eval -c eval-datastrat-proof.yaml --no-cache
- promptfoo view

Supersedes #17.
🤖 Generated with Claude Code