Seahorse Benchmark

This project is a scaffold for an OpenRouter-backed LLM-as-judge benchmark. It runs target models against a Seahorse emoji test suite, with a judge LLM evaluating each target response. The OpenRouter API key is read from .env when available.

Running with Bun (native TS)

Prerequisites: Bun is installed and available on your PATH. The benchmark script is written as a TS module and includes a shebang so it can be executed directly as a script.

  1. Make the script executable (needed only if you invoke the script directly via its shebang; plain bun invocation works without it):
  • chmod +x ./scripts/run-benchmark.ts
  2. Run the full bench with the single Sonoma model (defaults shown):
  • bun ./scripts/run-benchmark.ts
  • Output: the final JSON is printed to stdout; redirect it to a file if you want to persist it:
  • bun ./scripts/run-benchmark.ts > results/sonoma_benchmark.json
  3. Optional: override the defaults with a config file:
  • BENCH_CONFIG_PATH=./tests/configs/one-model.json bun ./scripts/run-benchmark.ts
  4. One-model setup (already in place):
  • Model: Sonoma Dusk Alpha in models/target-models.json
  5. Environment notes
  • If you place a valid OPENROUTER_KEY in a .env at the project root, real OpenRouter calls are made. If not, the runner falls back to a mock path for testing (see the sketch after this list).
  • To override the judge model used for judge calls, set OPENROUTER_JUDGE_MODEL or edit judges/judge-config.json.
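
As a rough illustration of the environment handling above, here is a minimal sketch of how a runner can switch between real OpenRouter calls and a mock path. This is not the actual code in scripts/run-benchmark.ts; the mock string and function name are illustrative only.

    // Sketch only: env-driven switch between real OpenRouter calls and a mock path.
    // Bun loads .env from the project root automatically, so the key shows up on process.env.
    const apiKey = process.env.OPENROUTER_KEY;

    async function callTarget(model: string, prompt: string): Promise<string> {
      if (!apiKey) {
        // No key found: return a canned response so the pipeline can be exercised offline.
        return `MOCK RESPONSE for prompt: ${prompt.slice(0, 40)}...`;
      }
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }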

Run-set IDs and output layout

  • You can set a custom run-set identifier with the BENCH_RUN_ID environment variable. If you don't set it, the runner auto-generates an identifier like runset-<timestamp>-<random> for each run set.
  • Output for a run is now grouped under a single run-set folder: results/<runSetId>/...
  • Inside the run-set folder the layout is:
    • results/<runSetId>/<model_id>/<run_number>/ — files for each individual target run, for example:
      • target_prompt.txt — the prompt sent to the target model
      • target.txt — the target model's response
      • judge_1_response_try1.txt, judge_1_parsed_try1.json, etc. — raw judge responses and parsed attempts
      • judge_tallies.json and meta.json — per-run metadata and tallies
    • results/<runSetId>/<model_id>/summary.json — per-model summary (question percentages and run paths)
    • results/<runSetId>/summary.json — top-level summary for the whole run-set (aggregated results and metadata)

This keeps a single run-set self-contained and easier to move/publish.
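
For concreteness, here is a minimal sketch of how the run-set ID and per-run folder described above could be derived. The helper name is hypothetical and the real runner may differ in detail.

    import { join } from "node:path";

    // Honour BENCH_RUN_ID if set, otherwise auto-generate runset-<timestamp>-<random>.
    const runSetId =
      process.env.BENCH_RUN_ID ?? `runset-${Date.now()}-${Math.floor(Math.random() * 10000)}`;

    // Folder for a single target run, e.g. results/<runSetId>/<model_id>/<run_number>/
    function runDir(modelId: string, runNumber: number): string {
      return join("results", runSetId, modelId, String(runNumber));
    }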

Release workflow (publish a dataset)

  • A helper script scripts/release-run.ts converts a completed run set into a release that can be checked into git:

    • Usage (direct): bunx ts-node ./scripts/release-run.ts <runSetId>
    • npm shortcut: npm run release -- <runSetId> (added to package.json)
    • The script will prompt for confirmation unless you pass -y or --yes.
    • If a simple rename from results/<runSetId> to results/releases/<runSetId> is not possible (e.g., different filesystems), the script falls back to copying (sketched after this list).
    • Optionally remove the original source after copying with --remove-src (flag --rm is accepted too).
    • Example: npm run release -- runset-1690000000000-1234 -y --remove-src
  • .gitignore has been updated so results/ is still ignored by default but results/releases/ is allowed to be checked in. The release script will ensure the project .gitignore contains the rule !results/releases/.
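
The rename-with-copy fallback mentioned above can be sketched roughly as follows. This is an assumed outline, not the actual scripts/release-run.ts, which also handles the confirmation prompt and the .gitignore update.

    import { rename, cp, rm } from "node:fs/promises";

    // Move a run set into the releases folder, copying when a plain rename fails
    // (for example across filesystems). Names are illustrative.
    async function moveRunSet(runSetId: string, removeSrc = false): Promise<void> {
      const src = `results/${runSetId}`;
      const dest = `results/releases/${runSetId}`;
      try {
        await rename(src, dest);                  // fast path: same filesystem
      } catch {
        await cp(src, dest, { recursive: true }); // fallback: recursive copy
        if (removeSrc) {
          await rm(src, { recursive: true, force: true }); // honour --remove-src
        }
      }
    }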

Robust judge parsing and question matching

Judge outputs are often noisy (numbered lists, added prefixes, or raw yes/no tokens). The parsing improvements below address this:

  • The judge parsing logic (src/judge.ts) now tries multiple strategies to interpret judge output:

    • Extracts a JSON array when present and parses it into { question, answer } entries.
    • Parses numbered or line-based answers such as 1. yes / 1) no and maps by index or sequence.
    • As a fallback, scans for yes/no tokens and assigns them in-order to questions.
  • Important improvement: when a parsed response contains a question string that merely prefixes or numbers the real question (for example: "1. Does the target conclude..."), the code attempts to match that returned string to the canonical expected question and record the tally under the canonical question text.

    • Matching rules include (see the sketch after this list):
      • case-insensitive substring match
      • stripping common list numbering prefixes like 1. or 1) and matching
      • a normalized alphanumeric substring match (punctuation removed, multiple whitespace collapsed)
    • If a match is found the tally uses the expected question text; otherwise the returned question string is used (so nothing is silently dropped).
    • This reduces mismatches when judges prepend numbering or otherwise slightly alter the question text.
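
A rough sketch of these matching rules, with hypothetical helper names (the real logic lives in src/judge.ts):

    // Normalize a question: strip list numbering, punctuation and extra whitespace.
    function normalizeQuestion(s: string): string {
      return s
        .toLowerCase()
        .replace(/^\s*\d+[.)]\s*/, "") // drop prefixes like "1." or "1)"
        .replace(/[^a-z0-9\s]/g, "")   // remove punctuation
        .replace(/\s+/g, " ")
        .trim();
    }

    // Map a returned question string to the canonical expected question, if any.
    function matchExpectedQuestion(returned: string, expected: string[]): string {
      const r = normalizeQuestion(returned);
      const hit = expected.find((q) => {
        const e = normalizeQuestion(q);
        return r.includes(e) || e.includes(r);
      });
      return hit ?? returned; // fall back to the returned text so nothing is dropped
    }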

Scripts added / changed

  • scripts/run-benchmark.ts — main runner (existing), updated to write results under results/<runSetId>/... and a run-set summary.json.
  • scripts/release-run.ts — new utility to move/copy a run-set to results/releases/<runSetId> and update .gitignore.
  • package.json:
    • bench:tsnode — existing bench runner invocation
    • release — new npm shortcut to run the release helper: npm run release -- <runSetId>

Examples

  • Run with an explicit ID:
    • BENCH_RUN_ID=my-run-001 bun ./scripts/run-benchmark.ts
  • Release a run-set (interactive):
    • npm run release -- my-run-001
  • Release a run-set (non-interactive, remove original):
    • npm run release -- my-run-001 -y --remove-src

Notes, testing and future ideas

  • Matching is first-match and deliberately simple: if two expected questions are almost identical, the first matching question is chosen. Stricter matching (longest match) or fuzzy scoring could be added if needed.
  • A scripts/list-runs.ts helper could list the available results/<runSetId> folders and make it easier to find run-set IDs to release.
  • Unit tests around the judge parsing and matchExpectedQuestion behavior would be a useful addition.

None of these follow-ups (list helper, stricter matching, tests) are implemented yet.
