
SMA [7] Add eval isolation: bare-vs-guided proof that reference docs change output #46

Open: seanoc5 wants to merge 2 commits into main from feature/sma-skill-impact

seanoc5 (Collaborator) commented Apr 28, 2026

Summary

  • Adds cwd isolation to the eval test runner so bare-vs-guided comparisons work correctly
  • Proves that reference docs (specifically 10-data-migration-tooling.md) measurably change LLM output quality
  • Bare Claude honestly refuses to fabricate SolrReader class names; guided Claude names them correctly from the reference

What changed

  • claude_requests.py: _resolve_cwd() copies fixture dirs to /tmp for isolation; allowed_tools configurable per provider
  • eval-datastrat-proof.yaml: bare-vs-guided proof test
  • fixtures/skill-bare/CLAUDE.md: minimal advisor prompt
  • fixtures/skill-guided/CLAUDE.md: advisor prompt + inlined reference content
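
The body of _resolve_cwd() is not shown in this description. As a minimal sketch of the copy-to-temp idea, assuming the helper only needs to copy a fixture directory into a fresh temporary directory and return the new path (the name isolate_fixture and its signature are hypothetical, not the code in claude_requests.py):

```python
import shutil
import tempfile
from pathlib import Path


def isolate_fixture(fixture_dir: str) -> Path:
    """Copy fixture_dir into a fresh temp directory and return the copy's path.

    Hypothetical stand-in for _resolve_cwd(); the real helper may differ.
    """
    src = Path(fixture_dir).resolve()
    if not src.is_dir():
        raise NotADirectoryError(f"fixture directory not found: {src}")
    # mkdtemp creates a unique directory (under /tmp on most Linux systems),
    # so the copy has no ancestors inside the real skill directory.
    tmp_root = Path(tempfile.mkdtemp(prefix="eval-cwd-"))
    dest = tmp_root / src.name
    shutil.copytree(src, dest)
    return dest
```

The returned path would then be passed to the agent as its working directory. Cleanup of the temporary directory is left out of the sketch.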

The problem this solves

Eval tests run from inside the skill directory. Without isolation, the agent can traverse up the directory tree and find reference docs even in a "bare" test, which makes it impossible to prove that steering content actually changes output.
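
To illustrate the leak, here is a simplified sketch of an upward search. It assumes, purely for illustration, that a reference doc sits directly in some ancestor of the fixture directory; the real repository layout is not shown here, and find_upwards is a hypothetical helper, not part of the test runner.

```python
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, filename: str) -> Optional[Path]:
    """Return the first match for filename in start or any ancestor, else None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

Run from a fixture inside the repository, a search like this can reach files above the fixture. Run from a copy under a temporary directory, the only ancestors are the temp root and the filesystem root, so nothing from the skill directory is reachable this way.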

Test plan

  • cd tests/evals && PROMPTFOO_PYTHON=../../.venv/bin/python promptfoo eval -c eval-datastrat-proof.yaml --no-cache
  • Verify bare=RED, guided=GREEN in promptfoo view
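
The assertion inside eval-datastrat-proof.yaml is not reproduced here. As a hedged sketch of what a RED/GREEN check could look like, the following uses promptfoo's Python assertion hook (get_assert). The expected strings are the class names quoted in the commit message; whether the real test checks exactly these is an assumption.

```python
from typing import Any, Dict

# Class names taken from the commit message; whether the real YAML checks
# exactly these strings is an assumption.
EXPECTED_NAMES = ("SolrBackupSource", "SolrDocumentSource")


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass only if the model output mentions every expected class name."""
    missing = [name for name in EXPECTED_NAMES if name not in output]
    return {
        "pass": not missing,
        "score": 1.0 - len(missing) / len(EXPECTED_NAMES),
        "reason": (
            "all expected names present"
            if not missing
            else f"missing: {', '.join(missing)}"
        ),
    }
```

With a check of this shape, a bare run that declines to name classes scores 0 and shows RED, while a guided run that names both classes shows GREEN.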

Supersedes #17.

🤖 Generated with Claude Code

seanoc5 and others added 2 commits April 28, 2026 11:36
…utput

The eval test runner (claude_requests.py) now supports cwd isolation:
when a provider specifies a fixture directory as cwd, it's copied to
/tmp so the agent cannot traverse up into the real skill directory and
discover reference docs that should be hidden. Provider config also
supports allowed_tools restriction (bare tests get no WebFetch/Skill).

Includes a proof test (eval-datastrat-proof.yaml) that demonstrates:
  bare (no references) → FAIL: "I don't want to risk fabricating class names"
  guided (with CLAUDE.md) → PASS: names SolrBackupSource, SolrDocumentSource, RFS pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The Claude Agent SDK spawns a subprocess; when the event loop closes
after query completion, Python writes "Loop ... is closed" to stderr.
Promptfoo captures all stderr as ERROR, making clean runs look broken.
Filter out the harmless warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
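
How the second commit filters the warning is not shown in this thread. One possible approach, sketched under the assumption that the message reaches Python's sys.stderr object (output written directly to the file descriptor by a child process would bypass it), is a thin wrapper that drops matching lines. FilteredStderr and the regex are hypothetical.

```python
import io
import re
from typing import TextIO

# Pattern based on the message quoted in the commit; the exact wording
# emitted by a given Python version is an assumption.
_LOOP_CLOSED = re.compile(r"Loop .* is closed")


class FilteredStderr(io.TextIOBase):
    """Wrap a text stream and drop lines matching the loop-closed warning."""

    def __init__(self, target: TextIO) -> None:
        self._target = target
        self._buffer = ""

    def write(self, text: str) -> int:
        self._buffer += text
        # Forward only complete lines so a message split across
        # several write() calls is still matched as a whole.
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if not _LOOP_CLOSED.search(line):
                self._target.write(line + "\n")
        return len(text)

    def flush(self) -> None:
        # Emit any trailing partial line unless it is the warning itself.
        if self._buffer and not _LOOP_CLOSED.search(self._buffer):
            self._target.write(self._buffer)
        self._buffer = ""
        self._target.flush()
```

Installing it would be a single assignment, sys.stderr = FilteredStderr(sys.stderr), made before the SDK query runs. All other stderr output passes through unchanged, so real errors still surface in promptfoo.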
@seanoc5 seanoc5 force-pushed the feature/sma-skill-impact branch from 5856ca7 to dcbd065 Compare May 11, 2026 16:35
seanoc5 (Collaborator, Author) commented May 11, 2026

Quick note for anyone watching this PR: I briefly pushed an unrelated WIP commit (5856ca7f, steering/SKILL.md refinements) on top of this branch this morning, then realized that it muddied the scope of PR 46. I have just reset the branch to its original commits (ea32f2a6 + dcbd0659) with git push --force-with-lease, so this PR's diff and scope are unchanged from what you originally saw.

The rewound commit lives on a separate branch feat/sma-steering-refs-discussion-wip and is referenced from #TBD (steering/references discussion). No action needed here — just flagging in case the branch-update notification caught your eye.
