This project is a scaffold for an OpenRouter-backed LLM-as-Judge benchmark. It runs target models against a Seahorse emoji test suite, with a judge LLM evaluating each target response. The OpenRouter key is read from `.env` when available.
Prerequisites: Bun is installed and available on your PATH. The benchmark script is written as a TypeScript module and includes a shebang, so it can be executed directly as a script.
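For reference, a Bun-executable TypeScript script typically begins with a `bun` shebang; this is a minimal sketch, and the actual header of `scripts/run-benchmark.ts` may differ:

```ts
#!/usr/bin/env bun
// The shebang lets the OS hand the file to Bun when it is executed
// directly (./scripts/run-benchmark.ts) after `chmod +x`.
```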
- Make the script executable (required for direct invocation via the shebang):
  `chmod +x ./scripts/run-benchmark.ts`
- Run the full benchmark with the single Sonoma model (defaults shown):
  `bun ./scripts/run-benchmark.ts`
- Output: the final JSON is printed to stdout; redirect it to a file if you want to persist it:
  `bun ./scripts/run-benchmark.ts > results/sonoma_benchmark.json`
- Optional: override defaults with a config file:
  `BENCH_CONFIG_PATH=./tests/configs/one-model.json bun ./scripts/run-benchmark.ts`
- One-model setup (already in place): the target is Sonoma Dusk Alpha, configured in `models/target-models.json` (an illustrative entry is sketched below).
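For illustration only, a single-model entry in `models/target-models.json` might look like the sketch below; both the JSON shape and the model slug are assumptions, not the file's actual contents:

```json
[
  {
    "id": "openrouter/sonoma-dusk-alpha",
    "label": "Sonoma Dusk Alpha"
  }
]
```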
- Environment notes:
  - If you place a valid `OPENROUTER_KEY` in a `.env` at the project root, real OpenRouter calls will be made. If not, the runner uses a mock path for testing (a sample `.env` is sketched below).
  - To override the judge model used by the judge calls, set `OPENROUTER_JUDGE_MODEL` or edit `judges/judge-config.json`.
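A minimal `.env` at the project root might look like this; the key value is a placeholder, and `OPENROUTER_JUDGE_MODEL` is optional:

```bash
# .env (project root) — placeholder values
OPENROUTER_KEY=sk-or-...your-key...
# Optional: override the judge model
OPENROUTER_JUDGE_MODEL=some/judge-model
```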
- You can set a custom run-set identifier with the `BENCH_RUN_ID` environment variable. If you don't set it, the runner auto-generates an identifier like `runset-<timestamp>-<random>` for each run set (see the sketch below).
- Output for a run is now grouped under a single run-set folder: `results/<runSetId>/...`
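As an illustration of the auto-generated identifier format described above (a sketch only; the runner's actual generation code may differ):

```ts
// Sketch: build a run-set id of the form runset-<timestamp>-<random>.
function makeRunSetId(): string {
  const timestamp = Date.now();
  const random = Math.floor(Math.random() * 10000);
  return `runset-${timestamp}-${random}`;
}

const runSetId = process.env.BENCH_RUN_ID ?? makeRunSetId();
console.log(runSetId); // e.g. runset-1690000000000-1234
```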
- Inside the run-set folder the layout is:
  - `results/<runSetId>/<model_id>/<run_number>/` — files for each individual target run, for example:
    - `target_prompt.txt` — the prompt sent to the target model
    - `target.txt` — the target model's response
    - `judge_1_response_try1.txt`, `judge_1_parsed_try1.json`, etc. — raw judge responses and parsed attempts
    - `judge_tallies.json` and `meta.json` — per-run metadata and tallies
  - `results/<runSetId>/<model_id>/summary.json` — per-model summary (question percentages and run paths; an illustrative shape is sketched below)
  - `results/<runSetId>/summary.json` — top-level summary for the whole run-set (aggregated results and metadata)

This keeps a single run-set self-contained and easier to move/publish.
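The per-model `summary.json` is described above only as containing question percentages and run paths; the sketch below is purely illustrative, and all field names are assumptions:

```json
{
  "model": "<model_id>",
  "runs": [
    "results/<runSetId>/<model_id>/1",
    "results/<runSetId>/<model_id>/2"
  ],
  "questionPercentages": {
    "Does the target conclude...": 80.0
  }
}
```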
- A helper script, `scripts/release-run.ts`, converts a completed run set into a release that can be checked into git:
  - Usage (direct):
    `bunx ts-node ./scripts/release-run.ts <runSetId>`
  - npm shortcut:
    `npm run release -- <runSetId>` (added to `package.json`)
  - The script will prompt for confirmation unless you pass `-y` or `--yes`.
  - If a simple rename from `results/<runSetId>` → `results/releases/<runSetId>` is not possible (e.g., different filesystems), the script falls back to copying (a sketch of this fallback appears below).
  - Optionally remove the original source after copying with `--remove-src` (the flag `--rm` is accepted too).
  - Example:
    `npm run release -- runset-1690000000000-1234 -y --remove-src`
- `.gitignore` has been updated so `results/` is still ignored by default but `results/releases/` is allowed to be checked in. The release script will ensure the project `.gitignore` contains the rule `!results/releases/`.
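As a rough sketch of the rename-with-copy-fallback and `.gitignore` behaviour described above (the function name and exact error handling are illustrative, not the script's actual code):

```ts
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: move a run set into results/releases/, copying when a plain
// rename is impossible (e.g. the destination is on another filesystem).
function releaseRunSet(runSetId: string, removeSrc = false): void {
  const src = path.join("results", runSetId);
  const dest = path.join("results", "releases", runSetId);
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  try {
    fs.renameSync(src, dest);
  } catch {
    fs.cpSync(src, dest, { recursive: true }); // fallback: copy the tree
    if (removeSrc) fs.rmSync(src, { recursive: true, force: true });
  }

  // Ensure the repository keeps releases even though results/ is ignored.
  const gitignore = ".gitignore";
  const rule = "!results/releases/";
  const current = fs.existsSync(gitignore) ? fs.readFileSync(gitignore, "utf8") : "";
  if (!current.includes(rule)) {
    fs.appendFileSync(gitignore, `\n${rule}\n`);
  }
}
```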
Judge outputs are often noisy (numbered lists, added prefixes, or raw yes/no tokens). The following parsing improvements were added:
- The judge parsing logic (`src/judge.ts`) now tries multiple strategies to interpret judge output (a rough sketch follows this list):
  - Extracts a JSON array when present and parses it into `{ question, answer }` entries.
  - Parses numbered or line-based answers such as `1. yes` / `1) no` and maps them by index or sequence.
  - As a fallback, scans for `yes` / `no` tokens and assigns them in order to the questions.
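A minimal sketch of these strategies (illustrative only, not the actual `src/judge.ts` implementation; the numbered-line and token strategies are collapsed into one step here):

```ts
type Verdict = { question: string; answer: "yes" | "no" };

// Sketch: interpret noisy judge output against the expected question list.
function parseJudgeAnswers(raw: string, questions: string[]): Verdict[] {
  // Strategy 1: a JSON array of { question, answer } objects embedded in the text.
  const jsonMatch = raw.match(/\[[\s\S]*\]/);
  if (jsonMatch) {
    try {
      const arr = JSON.parse(jsonMatch[0]);
      if (Array.isArray(arr)) return arr as Verdict[];
    } catch {
      // not valid JSON; fall through to the text-based strategies
    }
  }

  // Strategies 2/3: numbered lines such as "1. yes" / "2) no", or bare
  // yes/no tokens, assigned to the questions in order.
  const tokens = (raw.match(/\b(yes|no)\b/gi) ?? []).map((t) => t.toLowerCase());
  return questions
    .map((question, i) => ({ question, answer: tokens[i] as "yes" | "no" }))
    .filter((v) => v.answer === "yes" || v.answer === "no");
}
```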
- Important improvement: when a parsed response contains a `question` string that merely prefixes or numbers the real question (for example: `"1. Does the target conclude..."`), the code attempts to match that returned string to the canonical expected question and records the tally under the canonical question text (a sketch of this matching follows the list).
  - Matching rules include:
    - a case-insensitive substring match
    - stripping common list-numbering prefixes like `1.` or `1)` and matching
    - a normalized alphanumeric substring match (punctuation removed, multiple whitespace collapsed)
  - If a match is found, the tally uses the expected question text; otherwise the returned question string is used (so nothing is silently dropped).
  - This reduces mismatches when judges prepend numbering or otherwise slightly alter the question text.
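A rough sketch of this question-matching behaviour (the project's real `matchExpectedQuestion` may differ; names and details here are illustrative):

```ts
// Sketch: try to map a question string returned by the judge back to one
// of the canonical expected questions.
function matchExpectedQuestion(returned: string, expected: string[]): string | null {
  const normalize = (s: string) =>
    s.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
  const stripNumbering = (s: string) => s.replace(/^\s*\d+[.)]\s*/, "");

  for (const candidate of expected) {
    const ret = returned.toLowerCase();
    const exp = candidate.toLowerCase();
    if (ret.includes(exp) || exp.includes(ret)) return candidate;               // substring match
    if (stripNumbering(returned).toLowerCase().includes(exp)) return candidate; // drop "1." / "1)"
    if (normalize(returned).includes(normalize(candidate))) return candidate;   // normalized match
  }
  return null; // caller falls back to the returned string, so nothing is dropped
}
```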
- `scripts/run-benchmark.ts` — main runner (existing), updated to write results under `results/<runSetId>/...` and a run-set `summary.json`.
- `scripts/release-run.ts` — new utility to move/copy a run-set to `results/releases/<runSetId>` and update `.gitignore`.
- `package.json`:
  - `bench:tsnode` — existing bench runner invocation
  - `release` — new npm shortcut to run the release helper: `npm run release -- <runSetId>` (an illustrative scripts block is sketched below)
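For illustration, the relevant `package.json` scripts might look roughly like this; the exact commands (especially for `bench:tsnode`) are assumptions, not the file's real contents:

```json
{
  "scripts": {
    "bench:tsnode": "bunx ts-node ./scripts/run-benchmark.ts",
    "release": "bunx ts-node ./scripts/release-run.ts"
  }
}
```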
- Run with an explicit ID:
  `BENCH_RUN_ID=my-run-001 bun ./scripts/run-benchmark.ts`
- Release a run-set (interactive):
  `npm run release -- my-run-001`
- Release a run-set (non-interactive, remove original):
  `npm run release -- my-run-001 -y --remove-src`
- Matching is first-match and simple: if two expected questions are almost identical, the first matching question will be chosen. I can make matching stricter (longest-match) or add fuzzy scoring if you prefer.
- I can add a `scripts/list-runs.ts` helper to list available `results/<runSetId>` folders and simplify finding runSetIds to release.
- I can add unit tests around the judge parsing and `matchExpectedQuestion` behavior.
If you'd like any of the follow-ups (list helper, stricter matching, tests), tell me which and I'll add them.