SMA: add bare-model stumper triage harness #48
Open
seanoc5 wants to merge 1 commit
Adds tooling to identify questions a bare LLM cannot answer about Solr→OpenSearch migration — the foundation for building targeted skill content. Runs each candidate question against two target models from /tmp isolation and applies contains-any criteria to surface knowledge gaps.

- `tests/evals/stumper-candidates.md`: 20 candidate questions with two-criterion assertions, organized by topic (Querqy/SMUI, schema migration, query translation, AWS-managed constraints, consultant judgment).
- `tests/evals/triage/sweep.py`: harness that runs each question against Claude Haiku 4.5 (cloud, via the `claude` CLI from /tmp for skill isolation) and qwen2.5:7b (local Ollama HTTP). Responses are cached by sha256(model + prompt), so criterion iteration is free.
- `tests/evals/triage/responses/*.txt`: bare-model responses for all 20 questions × both models (42 files, including q05's pre- and post-reframe variants). Lets reviewers see the failure mode without re-running.
- `tests/evals/triage/triage-results.md`: pass/fail summary table.
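The sha256(model + prompt) caching described above could be sketched roughly like this — the helper names and `cache/` directory are hypothetical, not the actual `sweep.py` code:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical location; sweep.py may differ

def cache_key(model: str, prompt: str) -> str:
    """Key each response by sha256(model + prompt) so editing
    pass/fail criteria never triggers a fresh model call."""
    return hashlib.sha256((model + prompt).encode("utf-8")).hexdigest()

def cached_call(model: str, prompt: str, call_model) -> str:
    """Return a cached response if present; otherwise call the model
    (the "(live)" path) and persist the result (the "(cache)" path)."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt)}.txt"
    if path.exists():
        return path.read_text()
    response = call_model(model, prompt)
    path.write_text(response)
    return response
```

Keying on the raw (model, prompt) pair means any wording change to a question deliberately invalidates its cache entry, while re-grading with new criteria stays free.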
Test plan

Verifies the harness runs and emits responses + cache + summary table.

```bash
cd AIAdvisor/skills/solr-opensearch-migration-advisor/tests/evals/triage

# Prereqs
which claude                   # Claude Code CLI (uses local OAuth credential)
ollama list | grep qwen2.5:7b  # qwen2.5:7b pulled locally

# Smoke test: one question, both models
python3 sweep.py q01-querqy-class
# Expected: ~10s total. responses/q01-querqy-class.{haiku,qwen}.txt created.
# triage-results.md updated with one row showing A=RED B=RED for both models.

# Cache test: re-run, should be 0.0s per call
python3 sweep.py q01-querqy-class
# Expected: "(cache)" tag in both lines, 0.0s timing.

# Cache-skip test
python3 sweep.py --refresh-model haiku q01-querqy-class
# Expected: haiku "(live)" ~7s, qwen "(cache)" 0.0s.

# Full 20-question sweep (~10 minutes)
python3 sweep.py
# Expected: triage-results.md table populated for all 20 questions.
```

Smoke test results from my own run are committed under `responses/`.
SMA: add bare-model stumper triage harness

Adds tooling under `tests/evals/triage/` to identify questions a bare
LLM cannot answer correctly about Solr→OpenSearch migration — the
foundation for deciding which skill content matters.

What it does

- `tests/evals/stumper-candidates.md` — 20 candidate questions, each
  with two-criterion contains-any assertions. Organized by topic:
  Querqy/SMUI, schema migration, query translation, AWS-managed
  constraints, consultant judgment.
- `tests/evals/triage/sweep.py` — runs each question against Claude
  Haiku 4.5 (cloud, via `claude --print` from /tmp for skill
  isolation) and qwen2.5:7b (local Ollama HTTP). Responses cached by
  sha256(model + prompt) so iterating on criteria is free.
- `tests/evals/triage/responses/*.txt` — bare-model responses for all
  20 questions × 2 models (42 files, including q05's pre- and
  post-reframe variants). Reviewers can see the failure mode without
  re-running.
- `tests/evals/triage/triage-results.md` — pass/fail summary table.

What this surfaces

Of 20 questions, 12 are stumpers (both models fail at least one
criterion). The 7-question subset used by the follow-up eval (#49)
originates entirely from this triage.
Test plan
(See first author comment for full run-through.)
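For reference, a minimal non-streaming call to a local Ollama model such as qwen2.5:7b uses the standard `/api/generate` endpoint; this is a sketch of that HTTP shape, and `sweep.py`'s actual request code may differ:

```python
import json
import urllib.request

def ask_ollama(prompt: str, model: str = "qwen2.5:7b",
               host: str = "http://localhost:11434") -> str:
    """One non-streaming generation request against a local Ollama
    server; the JSON reply carries the full text in "response"."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode("utf-8")
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `"stream": False` Ollama returns a single JSON object instead of newline-delimited chunks, which keeps the harness side trivial.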