A tiny, hackable eval harness for comparing local LLMs on real tasks. Built for the SundAI Small Models Hack.
Defaults to Google Gemma 4. Swap to any OpenAI-compatible model with `--model` and `--base-url`.
```
smallbench/
├── client.py        # one OpenAI-compatible client used everywhere
├── bench.py         # run tasks against a model, save JSON results
├── report.py        # turn results JSON into a comparison table
└── tasks/
    ├── coding.py      # implement-this-function tasks with executable graders
    ├── tool_use.py    # measure schema-valid tool calls
    └── extraction.py  # JSON extraction from messy text
```
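As a rough sketch of the single-client idea (the `make_client` name and the defaults here are assumptions, not the repo's actual code), `client.py` could be as small as:

```python
# Hypothetical sketch of client.py: one OpenAI-compatible client,
# configured from an explicit base URL or environment variables.
import os
from openai import OpenAI

def make_client(base_url: str | None = None) -> OpenAI:
    """Build the single client shared by bench.py and all tasks."""
    return OpenAI(
        base_url=base_url or os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", "ollama"),  # local servers ignore the key
    )
```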
```bash
./setup.sh
```
This installs Python deps and pulls two Gemma 4 sizes (`gemma4:4b`, `gemma4:1b`). Edit the script for other models.
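The script itself isn't reproduced here; assuming Ollama as the local runtime, it presumably amounts to something like:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of setup.sh: install Python deps, pull both model sizes.
set -euo pipefail
pip install -r requirements.txt
ollama pull gemma4:4b
ollama pull gemma4:1b
```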
```bash
python bench.py --model gemma4:4b --out results/gemma4-4b.json
python bench.py --model gemma4:1b --out results/gemma4-1b.json
```
Each run executes every task in `tasks/` 3 times (configurable with `--trials`) and writes a JSON file with per-trial outputs, latencies, and pass/fail.
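The exact schema isn't documented here; based on the description above, a plausible shape (field names are illustrative, not the harness's actual output) is:

```json
{
  "model": "gemma4:4b",
  "trials": 3,
  "results": [
    {
      "task": "coding/fizzbuzz",
      "trial": 1,
      "output": "...",
      "latency_s": 1.2,
      "passed": true,
      "reason": "all cases matched"
    }
  ]
}
```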
```bash
python report.py results/*.json
```
Prints a side-by-side table:
```
Task                 gemma4:4b     gemma4:1b
coding/fizzbuzz      3/3 (1.2s)    2/3 (0.5s)
coding/palindrome    3/3 (0.9s)    3/3 (0.4s)
tool_use/weather     3/3 (1.1s)    1/3 (0.4s)
extraction/person    3/3 (0.8s)    3/3 (0.3s)
overall              100%          75%
```
Drop a new file in `tasks/` that exposes a `TASKS` list. Each task is a dict with:

- `id`: unique string
- `prompt`: what to send the model
- `grader(response: str) -> tuple[bool, str]`: returns `(pass, reason)`
- `tools` (optional): OpenAI-style tool schemas if the task requires tool calls

`bench.py` auto-discovers anything in `tasks/`, so just add the file.
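For concreteness, here is a minimal sketch of a new task file. The file name, prompt, and grading logic are made up for illustration; only the `TASKS` list and dict shape come from the contract above.

```python
# tasks/reverse.py -- hypothetical example task file (not part of the repo)

def _grade(response: str) -> tuple[bool, str]:
    """Pass if the model's reply contains the reversed string."""
    if "gnirts" in response.lower():
        return True, "found reversed string"
    return False, "expected 'gnirts' in the response"

TASKS = [
    {
        "id": "strings/reverse",  # unique string
        "prompt": "Reverse the string 'string'. Reply with only the result.",
        "grader": _grade,         # returns (pass, reason)
        # "tools": [...],         # optional OpenAI-style tool schemas
    },
]
```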
```bash
OPENAI_BASE_URL=https://api.openai.com/v1 OPENAI_API_KEY=sk-... \
python bench.py --model gpt-4o-mini --out results/gpt-4o-mini.json
```
Or run against Anthropic via an OpenAI-compatible proxy. The point is to see how close Gemma 4 gets.
Pick one and ship by demo time:
- **The Coding Gauntlet**: Replace `tasks/coding.py` with 50 real GitHub issues. Score by whether the patch passes the original test suite. Bonus: graph quality vs. context length across `gemma4:1b`, `gemma4:4b`, and `gemma4:12b`.
- **Tool-Use Reliability**: Expand `tasks/tool_use.py` to a 200-trace eval with multi-step tool chains. Measure schema-valid rate, retry count, and end-to-end task success.
- **Latency-That-Matters**: Add a `--measure-ttft` flag to `bench.py` that records time-to-first-token separately from decode rate (see the sketch after this list). Produce a "what hardware do I need" cheat sheet.
- **Refusal & Drift Audit**: Add `tasks/refusal.py` with 100 benign-but-touchy prompts. Measure refusal rate per Gemma 4 size.
- **Quantization Sweep**: Pull `gemma4:4b-q4_K_M`, `gemma4:4b-q8_0`, and `gemma4:4b-fp16`. Show the quality cliff (if any).
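For the `--measure-ttft` idea, time-to-first-token falls out of a streaming request. A minimal sketch, assuming the standard OpenAI Python streaming API; the flag wiring into `bench.py` is omitted:

```python
# Hypothetical helper: time-to-first-token vs. decode time for one prompt.
import time
from openai import OpenAI

def measure_ttft(client: OpenAI, model: str, prompt: str) -> tuple[float | None, float]:
    """Return (time_to_first_token, decode_time) in seconds."""
    start = time.perf_counter()
    ttft = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first content token arrived
    total = time.perf_counter() - start
    return ttft, total - (ttft or 0.0)  # decode time excludes the wait for token 1
```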
Point `--base-url` at any OpenAI-compatible endpoint:

```bash
python bench.py --model my-model --base-url http://localhost:8000/v1
```
llama.cpp's server, vLLM, LM Studio, and most local serving stacks expose this by default.