A tiny, hackable eval harness for comparing local LLMs on real tasks. Built for the SundAI Small Models Hack.
Defaults to Google Gemma 4. Swap to any OpenAI-compatible model with `--model` and `--base-url`.
```
smallbench/
├── client.py        # one OpenAI-compatible client used everywhere
├── bench.py         # run tasks against a model, save JSON results
├── report.py        # turn results JSON into a comparison table
└── tasks/
    ├── coding.py      # implement-this-function tasks with executable graders
    ├── tool_use.py    # measure schema-valid tool calls
    └── extraction.py  # JSON extraction from messy text
```
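As a rough sketch of the single-client idea (the `make_client` name and the defaults here are assumptions, not the repo's actual code), `client.py` could be as small as:

```python
# Hypothetical sketch of client.py: one OpenAI-compatible client,
# configured from an explicit base URL or environment variables.
import os
from openai import OpenAI

def make_client(base_url: str | None = None) -> OpenAI:
    """Build the single client shared by bench.py and all tasks."""
    return OpenAI(
        base_url=base_url or os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", "ollama"),  # local servers ignore the key
    )
```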
```bash
./setup.sh
```
This installs Python deps and pulls two Gemma 4 sizes (`gemma4:4b`, `gemma4:1b`). Edit the script for other models.
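The script itself isn't reproduced here; assuming Ollama as the local runtime, it presumably amounts to something like:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of setup.sh: install Python deps, pull both model sizes.
set -euo pipefail
pip install -r requirements.txt
ollama pull gemma4:4b
ollama pull gemma4:1b
```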
```bash
python bench.py --model gemma4:4b --out results/gemma4-4b.json
python bench.py --model gemma4:1b --out results/gemma4-1b.json
```
Each run executes every task in `tasks/` 3 times (configurable with `--trials`) and writes a JSON file with per-trial outputs, latencies, and pass/fail.
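The exact schema isn't documented here; based on the description above, a plausible shape (field names are illustrative, not the harness's actual output) is:

```json
{
  "model": "gemma4:4b",
  "trials": 3,
  "results": [
    {
      "task": "coding/fizzbuzz",
      "trial": 1,
      "output": "...",
      "latency_s": 1.2,
      "passed": true,
      "reason": "all cases matched"
    }
  ]
}
```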
```bash
python report.py results/*.json
```
Prints a side-by-side table:
```
Task                 gemma4:4b     gemma4:1b
coding/fizzbuzz      3/3 (1.2s)    2/3 (0.5s)
coding/palindrome    3/3 (0.9s)    3/3 (0.4s)
tool_use/weather     3/3 (1.1s)    1/3 (0.4s)
extraction/person    3/3 (0.8s)    3/3 (0.3s)
overall              100%          75%
```
Drop a new file in `tasks/` that exposes a `TASKS` list. Each task is a dict with:

- `id`: unique string
- `prompt`: what to send the model
- `grader(response: str) -> tuple[bool, str]`: returns `(pass, reason)`
- `tools` (optional): OpenAI-style tool schemas if the task requires tool calls

`bench.py` auto-discovers anything in `tasks/`, so just add the file.
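For concreteness, here is a minimal sketch of a new task file. The file name, prompt, and grading logic are made up for illustration; only the `TASKS` list and dict shape come from the contract above.

```python
# tasks/reverse.py -- hypothetical example task file (not part of the repo)

def _grade(response: str) -> tuple[bool, str]:
    """Pass if the model's reply contains the reversed string."""
    if "gnirts" in response.lower():
        return True, "found reversed string"
    return False, "expected 'gnirts' in the response"

TASKS = [
    {
        "id": "strings/reverse",  # unique string
        "prompt": "Reverse the string 'string'. Reply with only the result.",
        "grader": _grade,         # returns (pass, reason)
        # "tools": [...],         # optional OpenAI-style tool schemas
    },
]
```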
```bash
OPENAI_BASE_URL=https://api.openai.com/v1 OPENAI_API_KEY=sk-... \
python bench.py --model gpt-4o-mini --out results/gpt-4o-mini.json
```
Or run against Anthropic via an OpenAI-compatible proxy. The point is to see how close Gemma 4 gets.
Pick one and ship by demo time:
- **The Coding Gauntlet**: Replace `tasks/coding.py` with 50 real GitHub issues. Score by whether the patch passes the original test suite. Bonus: graph quality vs. context length across `gemma4:1b`, `gemma4:4b`, and `gemma4:12b`.
- **Tool-Use Reliability**: Expand `tasks/tool_use.py` to a 200-trace eval with multi-step tool chains. Measure schema-valid rate, retry count, and end-to-end task success.
- **Latency-That-Matters**: Add a `--measure-ttft` flag to `bench.py` that records time-to-first-token separately from decode rate (see the sketch after this list). Produce a "what hardware do I need" cheat sheet.
- **Refusal & Drift Audit**: Add `tasks/refusal.py` with 100 benign-but-touchy prompts. Measure refusal rate per Gemma 4 size.
- **Quantization Sweep**: Pull `gemma4:4b-q4_K_M`, `gemma4:4b-q8_0`, and `gemma4:4b-fp16`. Show the quality cliff (if any).
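For the `--measure-ttft` idea, time-to-first-token falls out of a streaming request. A minimal sketch, assuming the standard OpenAI Python streaming API; the flag wiring into `bench.py` is omitted:

```python
# Hypothetical helper: time-to-first-token vs. decode time for one prompt.
import time
from openai import OpenAI

def measure_ttft(client: OpenAI, model: str, prompt: str) -> tuple[float | None, float]:
    """Return (time_to_first_token, decode_time) in seconds."""
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first content token arrived
    total = time.perf_counter() - start
    return ttft, total - (ttft or 0.0)  # decode time excludes the wait for token 1
```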
Point `--base-url` at any OpenAI-compatible endpoint:

```bash
python bench.py --model my-model --base-url http://localhost:8000/v1
```
llama.cpp's server, vLLM, LM Studio, and most local serving stacks expose this by default.