SMA: add bare-model stumper triage harness #48
Open
seanoc5 wants to merge 1 commit
Adds tooling to identify questions a bare LLM cannot answer about Solr→OpenSearch migration — the foundation for building targeted skill content. Runs each candidate question against two target models from /tmp isolation and applies contains-any criteria to surface knowledge gaps.

- `tests/evals/stumper-candidates.md`: 20 candidate questions with two-criterion assertions, organized by topic (Querqy/SMUI, schema migration, query translation, AWS-managed constraints, consultant judgment).
- `tests/evals/triage/sweep.py`: harness that runs each question against Claude Haiku 4.5 (cloud, via the `claude` CLI from /tmp for skill isolation) and qwen2.5:7b (local Ollama HTTP). Responses are cached by sha256(model + prompt), so criterion iteration is free.
- `tests/evals/triage/responses/*.txt`: bare-model responses for all 20 questions × both models (42 files, including q05's pre- and post-reframe variants). Lets reviewers see the failure mode without re-running.
- `tests/evals/triage/triage-results.md`: pass/fail summary table.
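The sha256(model + prompt) caching described above could be sketched roughly like this — the helper names and `cache/` directory are hypothetical, not the actual `sweep.py` code:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical location; sweep.py may differ

def cache_key(model: str, prompt: str) -> str:
    """Key each response by sha256(model + prompt) so editing
    pass/fail criteria never triggers a fresh model call."""
    return hashlib.sha256((model + prompt).encode("utf-8")).hexdigest()

def cached_call(model: str, prompt: str, call_model) -> str:
    """Return a cached response if present; otherwise call the model
    (the "(live)" path) and persist the result (the "(cache)" path)."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt)}.txt"
    if path.exists():
        return path.read_text()
    response = call_model(model, prompt)
    path.write_text(response)
    return response
```

Keying on the raw (model, prompt) pair means any wording change to a question deliberately invalidates its cache entry, while re-grading with new criteria stays free.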
Test plan

Verifies the harness runs and emits responses + cache + summary table.

```bash
cd AIAdvisor/skills/solr-opensearch-migration-advisor/tests/evals/triage

# Prereqs
which claude                   # Claude Code CLI (uses local OAuth credential)
ollama list | grep qwen2.5:7b  # qwen2.5:7b pulled locally

# Smoke test: one question, both models
python3 sweep.py q01-querqy-class
# Expected: ~10s total. responses/q01-querqy-class.{haiku,qwen}.txt created.
# triage-results.md updated with one row showing A=RED B=RED for both models.

# Cache test: re-run, should be 0.0s per call
python3 sweep.py q01-querqy-class
# Expected: "(cache)" tag in both lines, 0.0s timing.

# Cache-skip test
python3 sweep.py --refresh-model haiku q01-querqy-class
# Expected: haiku "(live)" ~7s, qwen "(cache)" 0.0s.

# Full 20-question sweep (~10 minutes)
python3 sweep.py
# Expected: triage-results.md table populated for all 20 questions.
```

Smoke test results from my own run are committed under `responses/`.
SMA: add bare-model stumper triage harness

Adds tooling under `tests/evals/triage/` to identify questions a bare
LLM cannot answer correctly about Solr→OpenSearch migration — the
foundation for deciding which skill content matters.

What it does

- `tests/evals/stumper-candidates.md` — 20 candidate questions, each
  with two-criterion contains-any assertions. Organized by topic:
  Querqy/SMUI, schema migration, query translation, AWS-managed
  constraints, consultant judgment.
- `tests/evals/triage/sweep.py` — runs each question against Claude
  Haiku 4.5 (cloud, via `claude --print` from /tmp for skill
  isolation) and qwen2.5:7b (local Ollama HTTP). Responses cached by
  sha256(model + prompt) so iterating on criteria is free.
- `tests/evals/triage/responses/*.txt` — bare-model responses for all
  20 questions × 2 models (42 files, including q05's pre- and
  post-reframe variants). Reviewers can see the failure mode without
  re-running.
- `tests/evals/triage/triage-results.md` — pass/fail summary table.

What this surfaces

Of 20 questions, 12 are stumpers (both models fail at least one
criterion). The 7-question subset used by the follow-up eval (#49)
originates entirely from this triage.
Test plan
(See first author comment for full run-through.)
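For reference, a minimal non-streaming call to a local Ollama model such as qwen2.5:7b uses the standard `/api/generate` endpoint; this is a sketch of that HTTP shape, and `sweep.py`'s actual request code may differ:

```python
import json
import urllib.request

def ask_ollama(prompt: str, model: str = "qwen2.5:7b",
               host: str = "http://localhost:11434") -> str:
    """One non-streaming generation request against a local Ollama
    server; the JSON reply carries the full text in "response"."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode("utf-8")
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `"stream": False` Ollama returns a single JSON object instead of newline-delimited chunks, which keeps the harness side trivial.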