This project is a scaffold for an OpenRouter-backed LLM-as-Judge benchmark. It runs target models against a Seahorse emoji test suite, with a judge LLM evaluating each target response. The OpenRouter key is read from `.env` when available.
Prerequisites: Bun is installed and available on your PATH. The benchmark script is written as a TypeScript module and includes a shebang, so it can be executed directly as a script.
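For reference, a Bun-executable TypeScript script typically begins with a `bun` shebang; this is a minimal sketch, and the actual header of `scripts/run-benchmark.ts` may differ:

```ts
#!/usr/bin/env bun
// The shebang lets the OS hand the file to Bun when it is executed
// directly (./scripts/run-benchmark.ts) after `chmod +x`.
```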
- Make the script executable (required for direct invocation via the shebang):
  `chmod +x ./scripts/run-benchmark.ts`
- Run the full benchmark with the single Sonoma model (defaults shown):
  `bun ./scripts/run-benchmark.ts`
- Output: the final JSON is printed to stdout; redirect it to a file if you want to persist it:
  `bun ./scripts/run-benchmark.ts > results/sonoma_benchmark.json`
- Optional: override defaults with a config file:
  `BENCH_CONFIG_PATH=./tests/configs/one-model.json bun ./scripts/run-benchmark.ts`
- One-model setup (already in place): the target is Sonoma Dusk Alpha, configured in `models/target-models.json` (an illustrative entry is sketched below).
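For illustration only, a single-model entry in `models/target-models.json` might look like the sketch below; both the JSON shape and the model slug are assumptions, not the file's actual contents:

```json
[
  {
    "id": "openrouter/sonoma-dusk-alpha",
    "label": "Sonoma Dusk Alpha"
  }
]
```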
- Environment notes:
  - If you place a valid `OPENROUTER_KEY` in a `.env` at the project root, real OpenRouter calls will be made. If not, the runner uses a mock path for testing (a sample `.env` is sketched below).
  - To override the judge model used by the judge calls, set `OPENROUTER_JUDGE_MODEL` or edit `judges/judge-config.json`.
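A minimal `.env` at the project root might look like this; the key value is a placeholder, and `OPENROUTER_JUDGE_MODEL` is optional:

```bash
# .env (project root) — placeholder values
OPENROUTER_KEY=sk-or-...your-key...
# Optional: override the judge model
OPENROUTER_JUDGE_MODEL=some/judge-model
```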
- You can set a custom run-set identifier with the `BENCH_RUN_ID` environment variable. If you don't set it, the runner auto-generates an identifier like `runset-<timestamp>-<random>` for each run set (see the sketch below).
- Output for a run is now grouped under a single run-set folder: `results/<runSetId>/...`
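As an illustration of the auto-generated identifier format described above (a sketch only; the runner's actual generation code may differ):

```ts
// Sketch: build a run-set id of the form runset-<timestamp>-<random>.
function makeRunSetId(): string {
  const timestamp = Date.now();
  const random = Math.floor(Math.random() * 10000);
  return `runset-${timestamp}-${random}`;
}

const runSetId = process.env.BENCH_RUN_ID ?? makeRunSetId();
console.log(runSetId); // e.g. runset-1690000000000-1234
```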
- Inside the run-set folder the layout is:
  - `results/<runSetId>/<model_id>/<run_number>/` — files for each individual target run, for example:
    - `target_prompt.txt` — the prompt sent to the target model
    - `target.txt` — the target model's response
    - `judge_1_response_try1.txt`, `judge_1_parsed_try1.json`, etc. — raw judge responses and parsed attempts
    - `judge_tallies.json` and `meta.json` — per-run metadata and tallies
  - `results/<runSetId>/<model_id>/summary.json` — per-model summary (question percentages and run paths; an illustrative shape is sketched below)
  - `results/<runSetId>/summary.json` — top-level summary for the whole run-set (aggregated results and metadata)

This keeps a single run-set self-contained and easier to move/publish.
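The per-model `summary.json` is described above only as containing question percentages and run paths; the sketch below is purely illustrative, and all field names are assumptions:

```json
{
  "model": "<model_id>",
  "runs": [
    "results/<runSetId>/<model_id>/1",
    "results/<runSetId>/<model_id>/2"
  ],
  "questionPercentages": {
    "Does the target conclude...": 80.0
  }
}
```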
- A helper script, `scripts/release-run.ts`, converts a completed run set into a release that can be checked into git:
  - Usage (direct):
    `bunx ts-node ./scripts/release-run.ts <runSetId>`
  - npm shortcut:
    `npm run release -- <runSetId>` (added to `package.json`)
  - The script will prompt for confirmation unless you pass `-y` or `--yes`.
  - If a simple rename from `results/<runSetId>` → `results/releases/<runSetId>` is not possible (e.g., different filesystems), the script falls back to copying (a sketch of this fallback appears below).
  - Optionally remove the original source after copying with `--remove-src` (the flag `--rm` is accepted too).
  - Example:
    `npm run release -- runset-1690000000000-1234 -y --remove-src`
- `.gitignore` has been updated so `results/` is still ignored by default but `results/releases/` is allowed to be checked in. The release script will ensure the project `.gitignore` contains the rule `!results/releases/`.
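As a rough sketch of the rename-with-copy-fallback and `.gitignore` behaviour described above (the function name and exact error handling are illustrative, not the script's actual code):

```ts
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: move a run set into results/releases/, copying when a plain
// rename is impossible (e.g. the destination is on another filesystem).
function releaseRunSet(runSetId: string, removeSrc = false): void {
  const src = path.join("results", runSetId);
  const dest = path.join("results", "releases", runSetId);
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  try {
    fs.renameSync(src, dest);
  } catch {
    fs.cpSync(src, dest, { recursive: true }); // fallback: copy the tree
    if (removeSrc) fs.rmSync(src, { recursive: true, force: true });
  }

  // Ensure the repository keeps releases even though results/ is ignored.
  const gitignore = ".gitignore";
  const rule = "!results/releases/";
  const current = fs.existsSync(gitignore) ? fs.readFileSync(gitignore, "utf8") : "";
  if (!current.includes(rule)) {
    fs.appendFileSync(gitignore, `\n${rule}\n`);
  }
}
```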
Judge outputs are often noisy (numbered lists, added prefixes, or raw yes/no tokens). The following parsing improvements were added:
- The judge parsing logic (`src/judge.ts`) now tries multiple strategies to interpret judge output (a rough sketch follows this list):
  - Extracts a JSON array when present and parses it into `{ question, answer }` entries.
  - Parses numbered or line-based answers such as `1. yes` / `1) no` and maps them by index or sequence.
  - As a fallback, scans for `yes` / `no` tokens and assigns them in order to the questions.
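A minimal sketch of these strategies (illustrative only, not the actual `src/judge.ts` implementation; the numbered-line and token strategies are collapsed into one step here):

```ts
type Verdict = { question: string; answer: "yes" | "no" };

// Sketch: interpret noisy judge output against the expected question list.
function parseJudgeAnswers(raw: string, questions: string[]): Verdict[] {
  // Strategy 1: a JSON array of { question, answer } objects embedded in the text.
  const jsonMatch = raw.match(/\[[\s\S]*\]/);
  if (jsonMatch) {
    try {
      const arr = JSON.parse(jsonMatch[0]);
      if (Array.isArray(arr)) return arr as Verdict[];
    } catch {
      // not valid JSON; fall through to the text-based strategies
    }
  }

  // Strategies 2/3: numbered lines such as "1. yes" / "2) no", or bare
  // yes/no tokens, assigned to the questions in order.
  const tokens = (raw.match(/\b(yes|no)\b/gi) ?? []).map((t) => t.toLowerCase());
  return questions
    .map((question, i) => ({ question, answer: tokens[i] as "yes" | "no" }))
    .filter((v) => v.answer === "yes" || v.answer === "no");
}
```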
- Important improvement: when a parsed response contains a `question` string that merely prefixes or numbers the real question (for example: `"1. Does the target conclude..."`), the code attempts to match that returned string to the canonical expected question and records the tally under the canonical question text (a sketch of this matching follows the list).
  - Matching rules include:
    - a case-insensitive substring match
    - stripping common list-numbering prefixes like `1.` or `1)` and matching
    - a normalized alphanumeric substring match (punctuation removed, multiple whitespace collapsed)
  - If a match is found, the tally uses the expected question text; otherwise the returned question string is used (so nothing is silently dropped).
  - This reduces mismatches when judges prepend numbering or otherwise slightly alter the question text.
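A rough sketch of this question-matching behaviour (the project's real `matchExpectedQuestion` may differ; names and details here are illustrative):

```ts
// Sketch: try to map a question string returned by the judge back to one
// of the canonical expected questions.
function matchExpectedQuestion(returned: string, expected: string[]): string | null {
  const normalize = (s: string) =>
    s.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
  const stripNumbering = (s: string) => s.replace(/^\s*\d+[.)]\s*/, "");

  for (const candidate of expected) {
    const ret = returned.toLowerCase();
    const exp = candidate.toLowerCase();
    if (ret.includes(exp) || exp.includes(ret)) return candidate;               // substring match
    if (stripNumbering(returned).toLowerCase().includes(exp)) return candidate; // drop "1." / "1)"
    if (normalize(returned).includes(normalize(candidate))) return candidate;   // normalized match
  }
  return null; // caller falls back to the returned string, so nothing is dropped
}
```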
- `scripts/run-benchmark.ts` — main runner (existing), updated to write results under `results/<runSetId>/...` and a run-set `summary.json`.
- `scripts/release-run.ts` — new utility to move/copy a run-set to `results/releases/<runSetId>` and update `.gitignore`.
- `package.json`:
  - `bench:tsnode` — existing bench runner invocation
  - `release` — new npm shortcut to run the release helper: `npm run release -- <runSetId>` (an illustrative scripts block is sketched below)
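For illustration, the relevant `package.json` scripts might look roughly like this; the exact commands (especially for `bench:tsnode`) are assumptions, not the file's real contents:

```json
{
  "scripts": {
    "bench:tsnode": "bunx ts-node ./scripts/run-benchmark.ts",
    "release": "bunx ts-node ./scripts/release-run.ts"
  }
}
```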
- Run with an explicit ID:
  `BENCH_RUN_ID=my-run-001 bun ./scripts/run-benchmark.ts`
- Release a run-set (interactive):
  `npm run release -- my-run-001`
- Release a run-set (non-interactive, remove original):
  `npm run release -- my-run-001 -y --remove-src`
- Matching is first-match and simple: if two expected questions are almost identical, the first matching question will be chosen. I can make matching stricter (longest-match) or add fuzzy scoring if you prefer.
- I can add a `scripts/list-runs.ts` helper to list available `results/<runSetId>` folders and simplify finding runSetIds to release.
- I can add unit tests around the judge parsing and `matchExpectedQuestion` behavior.
If you'd like any of the follow-ups (list helper, stricter matching, tests), tell me which and I'll add them.