SMA: add bare-model stumper triage harness#48

Open
seanoc5 wants to merge 1 commit into main from feat/sma-stumper-triage-harness

Conversation


@seanoc5 seanoc5 commented May 11, 2026

SMA: add bare-model stumper triage harness

Adds tooling under tests/evals/triage/ to identify questions a bare
LLM cannot answer correctly about Solr→OpenSearch migration — the
foundation for deciding which skill content matters.

What it does

  • 20 candidate questions in tests/evals/stumper-candidates.md, each
    with two-criterion contains-any assertions. Organized by topic:
    Querqy/SMUI, schema migration, query translation, AWS-managed
    constraints, consultant judgment.
  • tests/evals/triage/sweep.py runs each question against Claude
    Haiku 4.5 (cloud, via claude --print from /tmp for skill
    isolation) and qwen2.5:7b (local Ollama HTTP). Responses cached by
    sha256(model + prompt) so iterating on criteria is free.
  • tests/evals/triage/responses/*.txt — bare-model responses for all
    20 questions × 2 models (42 files including q05's pre- and
    post-reframe variants). Reviewers can see the failure mode without
    re-running.
  • tests/evals/triage/triage-results.md — pass/fail summary table.
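
The caching and grading mechanics above can be sketched roughly as follows (a minimal sketch; helper names like `cache_key` and `is_stumper_for` are illustrative, and the actual sweep.py may differ):

```python
import hashlib

def cache_key(model: str, prompt: str) -> str:
    # Responses are cached by sha256(model + prompt), so editing a
    # question's criteria (not its prompt) never triggers a live re-run.
    return hashlib.sha256((model + prompt).encode("utf-8")).hexdigest()

def contains_any(response: str, terms: list[str]) -> bool:
    # A criterion passes if the response mentions any one of its terms.
    lower = response.lower()
    return any(t.lower() in lower for t in terms)

def is_stumper_for(response: str, criteria: list[list[str]]) -> bool:
    # A model is stumped on a question when at least one criterion fails.
    return not all(contains_any(response, terms) for terms in criteria)
```

Because the key covers only the model and prompt, iterating on criteria re-grades cached responses without any model calls.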

What this surfaces

Of the 20 questions, 12 are stumpers (both models fail at least one
criterion). All 7 questions in the subset used by the follow-up eval
(#49) originate from this triage.

Test plan

(See first author comment for full run-through.)


seanoc5 commented May 11, 2026

Test plan

Verifies the harness runs and emits responses + cache + summary table.

cd AIAdvisor/skills/solr-opensearch-migration-advisor/tests/evals/triage

# Prereqs
which claude    # Claude Code CLI (uses local OAuth credential)
ollama list | grep qwen2.5:7b   # qwen2.5:7b pulled locally
# Smoke test: one question, both models
python3 sweep.py q01-querqy-class
# Expected: ~10s total. responses/q01-querqy-class.{haiku,qwen}.txt
# created. triage-results.md updated with one row showing A=RED B=RED
# for both models.

# Cache test: re-run, should be 0.0s per call
python3 sweep.py q01-querqy-class
# Expected: "(cache)" tag in both lines, 0.0s timing.

# Cache-skip test
python3 sweep.py --refresh-model haiku q01-querqy-class
# Expected: haiku "(live)" ~7s, qwen "(cache)" 0.0s.

# Full 20-question sweep (~10 minutes)
python3 sweep.py
# Expected: triage-results.md table populated for all 20 questions.

Smoke test results from my own run are committed at
responses/q01-querqy-class.{haiku,qwen}.txt — Haiku honestly
declines; Qwen guesses the wrong package
(org.opensearch.queries.rewrite.commonrules.CommonRulesRewriter).
Both score A=RED.
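
The skill-isolation detail above (invoking `claude --print` from /tmp) amounts to a working-directory swap around the subprocess call. A minimal sketch, with an illustrative helper name:

```python
import subprocess

def run_isolated(cmd: list[str], workdir: str = "/tmp") -> str:
    # Running from a neutral directory keeps project-level skill and
    # context files out of whatever the CLI auto-loads from its cwd.
    result = subprocess.run(
        cmd, cwd=workdir, capture_output=True, text=True, check=True
    )
    return result.stdout

# e.g. run_isolated(["claude", "--print", question_text])
```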

Notes

  • cache/ and *.json are .gitignored — addressable by hash, not
    useful in diffs.
  • The claude CLI uses the local ~/.claude/.credentials.json OAuth
    credential; no ANTHROPIC_API_KEY env var needed.
  • Ollama provider hits http://localhost:11434/api/generate by
    default.
