Goal: build a compatibility matrix — for each (provider, model), measure how well our installable skills work when loaded into a coding-agent-style system prompt. Identifies the minimum model per vendor family that still passes.
The tool is promptfoo. One declarative promptfooconfig.yaml at the repo root drives everything.
Every eval runs this shape:
- System prompt: the installable skills' content embedded into one system message.
ydb-coreis loaded asSKILL.mdplus itsreferences/balancing.md,references/session-lifecycle.md,references/embed/go.md, andrules/embed/go.md.ydb-tableis loaded asSKILL.mdplus itsreferences/working-with-data.md,references/embed/java.md,references/embed/go.md,rules/embed/java.md, andrules/embed/go.md— i.e. the full body of each skill, as if the agent had read every Load-Sources entry. - User prompt: supplied by the test case (
tests/<skill>/<case>.yaml). - Grader:
anthropic/claude-haiku-4.5judges eachllm-rubricassertion against a criterion block written in plain English.
This approximates the ceiling per model — every loadable file is fully in context. A model that fails this can't work in a real runtime (Claude Code, Codex, Cursor, etc.) where the agent additionally has to decide which skill to load and which references to read.
Runtime-specific behavior (how Claude Code chooses skills vs. how Codex does) is not tested here. That requires installing each runtime and running against it, which is a manual exercise — see the known gap below.
One-time:
export OPENROUTER_API_KEY="<token>"The matrix talks to public OpenRouter (https://openrouter.ai/api/v1/chat/completions); the URL is hardcoded in promptfooconfig.yaml, only the key has to come from the environment.
promptfoo runs via npx — no local install needed:
npx promptfoo@latest eval # run all tests across all providers
npx promptfoo@latest view # open the matrix in the browserDefault concurrency is set to 12 in each promptfooconfig*.yaml via
evaluateOptions.maxConcurrency. On a healthy OpenRouter key this
finishes the full matrix in roughly a third of the time of the
promptfoo default (4). If your key's monthly limit caps single-request
size or rate, lower concurrency to avoid clustered 402s:
npx promptfoo@latest eval -j 4 # CLI override for slower keysTwo providers benefit from extra config to avoid burning tokens on internal reasoning:
- Moonshot Kimi K2.6 — reasoning mode is on by default and consumes
the whole
max_tokensbudget on thinking before emitting any visible answer. We setreasoning: { enabled: false }on this provider so the budget goes to the actual response. - Google Gemini 3.1 Pro (preview) — same problem in principle, but
the OpenRouter route to Gemini rejects
reasoning: { enabled: false }with HTTP 400 ("Reasoning is mandatory for this endpoint"). Live with the thinking budget;max_tokens: 2048is just enough.
Other models on the matrix don't accept the reasoning parameter at
all and return HTTP 400 if you set it — do not blanket-apply it.
For a subset (faster iteration while writing a new skill or test):
# single provider (regex match on provider id)
npx promptfoo@latest eval --filter-providers 'qwen3-coder'
# single test (regex match on the test's `description` field)
npx promptfoo@latest eval --filter-pattern 'keyset pagination'
# first N tests only
npx promptfoo@latest eval --filter-first-n 3
# combine: one test on one provider
npx promptfoo@latest eval \
--filter-pattern 'Cloud auth' \
--filter-providers 'anthropic/claude-haiku-4.5'Results are stored in ~/.promptfoo/ and rendered as a matrix: rows = models, columns = test cases, cells = pass/fail + grader reasoning.
- Copy an existing file under
tests/<skill>/as a starting point — structure isdescription,vars.user_prompt,assert. - Write the
user_promptas a realistic user turn. Avoid toy examples — model size matters less on trivial prompts. - Write the
llm-rubriccriteria as bullet points the grader can check against the skill content. Every criterion should be something the grader can verify from the skill body alone. Do not introduce facts that aren't in any SKILL.md — the test would be impossible to pass on principle. - Run once with a cheap model and read the grader's reasoning. If the criteria are too loose (everything passes) or too strict (everything fails), tighten / relax them.
- Commit once the rubric gives stable results across two runs.
Worked example — a minimal outcome test:
description: Query · Keyset pagination in YQL + Go
vars:
user_prompt: |
Write a YQL query and the Go code to paginate a `users` table by
`created_at` (Timestamp) and `id` (Uint64) — 50 rows per page.
assert:
- type: llm-rubric
value: |
The response should:
- Use keyset pagination: `WHERE (created_at, id) > ...`. Avoid `OFFSET`.
- Declare parameters with `DECLARE` — no raw string interpolation.
- Wrap the Go query in a session-retry scope (db.Query().Do(...)).
Partial credit if pagination is correct but Go wrapper is missing.Edit promptfooconfig.yaml. Copy one of the existing provider rows:
- id: openai:chat:<vendor>/<model-slug>
label: <Human-friendly label>
config: *openrouterThe <vendor>/<model-slug> must match what OpenRouter exposes. Check the live list at https://openrouter.ai/models.
npx promptfoo@latest view opens an HTML matrix.
- Green cell (pass) — the grader judged the response to satisfy every criterion. Sanity-check by clicking the cell and reading the reasoning; occasionally the grader is too generous.
- Red cell (fail) — the grader flagged at least one criterion as unmet. Read the reasoning to see whether this is a genuine failure or a too-strict rubric.
- Yellow / partial — one or more criteria failed but others passed; the test opted into partial credit via the rubric wording.
Per-model summary: pass rate column on the right. This is what drives the "minimum model per family" decision.
- A new major model ships and is available on
/openrouter. Add a row, re-run. - A model gets deprecated. Remove the row.
- A customer asks "does our skill work on model X?" and X is not in the matrix. Add it, commit the matrix run output to
matrix.md(optional — see below).
Not done by default. The matrix changes every time any skill or test changes, so committing full results would be noisy. If you want a snapshot for an external stakeholder, run npx promptfoo@latest eval --output matrix.md --format markdown.
Independent of the promptfoo matrix, scripts/validate-skills.py enforces structural invariants over skills/ — SKILL.md frontmatter shape, no TODO(author) markers, language-agnostic top-level references/ files, relative-link resolution, and that every RULE-<PREFIX>-<NN> ID uses a prefix registered in docs/authoring.md. It runs as a pre-flight inside install.sh (failure aborts the install) and is also runnable standalone:
python3 scripts/validate-skills.pyExit code 0 = all checks passed; 1 = at least one violation, printed to stderr with file path and reason. Use this before opening a PR to catch the cheap mistakes without burning a promptfoo run.
A second promptfoo config — promptfooconfig.routing.yaml — tests the
selector behavior instead of the ceiling. The system prompt contains only
each skill's description: field (extracted from SKILL.md frontmatter
by scripts/extract-descriptions.py into tests/routing/descriptions.md);
no skill body is loaded. The model is asked to reply with one token — the
slug of the skill that should fire, or none.
This approximates what Claude Code / Cursor / Codex do internally when they decide which skill to load. It catches two failure modes the ceiling matrix can't see:
- False negatives —
description:is too narrow; valid requests don't trigger the skill at all. - False positives —
description:is too broad; the skill loads on unrelated requests (worse than not loading — the agent reads irrelevant context).
Run it the same way as the main matrix, with -c:
npx promptfoo@latest eval -c promptfooconfig.routing.yamlAsserts are deterministic regex (no LLM grader) — cheap to re-run after
any description: edit. scripts/validate-skills.py calls
extract-descriptions.py --check and fails if tests/routing/descriptions.md
is stale relative to the SKILL.md frontmatter.
This setup tests models. It does not test runtimes — Claude Code / Cursor / Windsurf / Codex / Gemini CLI each have their own skill-loading mechanics (description-first routing, varying system-prompt construction, different tool sets) that can change the outcome from what the matrix shows.
Runtime-level testing requires installing each runtime and running against it with the skill installed. That is manual today:
- Install skills into the target runtime (
./install.sh --agent=<name>). - Spin up the runtime with a target model.
- Paste a prompt from
tests/into the runtime's chat. - Eye-check the response.
A non-trivial runtime-testing harness would need per-runtime drivers (subprocess calls or embedded automation). Deferred.
- CI workflow. Promptfoo runs locally against an OpenAI-compatible endpoint. CI wiring would require that endpoint to be reachable from CI and an API key provisioned as a secret — out of scope right now.
- Cost tracking. The matrix runs every provider × every test. Use
--filter-providers/--filter-patternduring iteration to keep the spend low. - Trigger-only tests. The current layout always loads all installable skills. If you need to measure "does the model correctly pick ydb-table given only the descriptions?" — that's a second-stage rig, not built here.
- promptfoo docs — the framework itself.
- promptfoo OpenAI-compat provider — the provider type all our models use.
promptfooconfig.yaml— the matrix config.prompts/coding-agent.yaml— the shared system prompt.docs/authoring.md— how to write a skill; conventions that tests check against.