Adding a new local model to the microbench

End-to-end walkthrough: from "I have a HuggingFace model name" to "PR submitted with my model's results in MMBT." Built so that a new contributor can rerun the same 12-task-family microbench against any vLLM-supported model and produce comparable, reproducible results.

TL;DR: Pull weights, add a launch command, smoke-test (5 min), full chain (3-7 hr), grade, summarize, write up. Half-day to one-day total operator time.

Two contribution shapes

This walkthrough covers both:

New model entry — a model that isn't currently in the comparison (any vLLM-supported local model with a working tool-call parser).
Same model, different quant — re-running an existing comparison-arm model at a different precision (e.g. official Qwen FP8, Unsloth UD4 GGUF, BF16) of the same base model. This is currently the highest-priority external contribution path — see ROADMAP.md item 1, motivated by field reports that the published Cyankiwi 4-bit AWQ quants underperform other precisions of the same models.

The mechanics are the same in both cases: only the HuggingFace model path and the vLLM launch flags change. Use a tag like qwen3.6-27b-fp8 or qwen3-coder-next-unsloth-ud4 so the result is comparable and clearly differentiated from the published Cyankiwi 4-bit AWQ entries.

What you need

Hardware: a Linux box with a CUDA-capable GPU. Tower2 (the published runs' hardware) has 2× RTX PRO 6000 Blackwell (96 GB each); the canonical flags assume that. For smaller GPUs (24-48 GB), drop --max-model-len to 32k-65k and --gpu-memory-utilization as needed. The tasks themselves rarely exceed 30k tokens of context — long context isn't needed.

Software: Docker with NVIDIA Container Toolkit, Python 3.10+ on the host. ~50 GB free disk for the vLLM image plus your model.

Model: any HuggingFace model that vLLM supports AND that has a working tool-call parser. See launch-commands.md § "Tool-call-parser quick reference" for a parser ↔ model-family table covering Qwen, Llama, Mistral, DeepSeek, Gemma, GLM, Phi, Hermes.

API access (optional): nothing required for the local-model path. The market-research task fetches public web pages from the sandbox; if you're behind a strict firewall, that one task will fail, the other 11 will work.

The friendly path (4 commands once setup is done)

# 1. Build the sandbox image (one-time, ~5 min)
docker build -t bench-sandbox:latest tooling/

# 2. Pull the vLLM image (one-time, ~12 GB)
docker pull vllm/vllm-openai:latest

# 3. Start vLLM serving your model (see "Step-by-step" below for the docker run)
docker run -d ... --served-model-name my-new-model --port 8001 ...

# 4. The friendly path
bash tooling/scripts/smoke_test.sh my-new-model 8001          # ~3 min
bash tooling/scripts/run_microbench.sh my-new-model 8001 my-label   # 3-7 hr
bash tooling/scripts/grade_microbench.sh my-label             # ~5 min
bash tooling/scripts/summarize.sh my-label                    # instant

That last command prints a per-task PASS/FAIL table next to the published Coder-Next + 27B reference numbers, so you can see at-a-glance where your model is stronger or weaker.

Step-by-step

Step 1: Pick a model and verify vLLM supports it

Check vLLM's supported models list. Most modern instruction-tuned models are supported. Specifically check that:

The base model architecture is supported (e.g. LlamaForCausalLM, Qwen3ForCausalLM)
A tool-call parser exists for the model family (see the table in launch-commands.md)
If the model has a thinking/reasoning mode, a reasoning parser exists or is omittable

If your model is a fresh release, check the vLLM GitHub issues for "tool calling" + the model name. Sometimes parsers land weeks after the model.

Step 2: Pull the weights

# Option A: huggingface-cli (recommended for most users)
huggingface-cli download <org>/<model> --local-dir ~/models/<org>-<model>

# Option B: curl + manifest (for slow/buffer-bloated connections — see Tower2 memory note)
# Faster than hf_transfer on links with bufferbloat; see https://huggingface.co/docs/hub/datasets-downloading

Path convention: ~/models/<org>-<model>/. The vLLM container will mount this directory read-only.

Step 3: Write a launch command

Open launch-commands.md and add a new section for your model, copying the closest existing entry as a template. The two flags that matter most:

--tool-call-parser <parser> — see the quick-reference table. Wrong parser = empty tool_calls arrays = agent never executes any tool. This is the single most common failure mode for a new model.
--reasoning-parser <parser> — only if the model has a thinking mode. Adding it to a non-thinking model causes all output to land in the reasoning field with empty content.

Example template for an unfamiliar model:

docker run -d \
  --name vllm-my-new-model \
  --gpus '"device=0"' \
  --shm-size 8g \
  -v ~/models:/models:ro \
  -p 127.0.0.1:8001:8000 \
  vllm/vllm-openai:latest \
  --model /models/<org>-<model> \
  --served-model-name my-new-model \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser <parser-from-table>

For thinking models, add --reasoning-parser <parser>. For Qwen3 it's qwen3; for DeepSeek-R1 it's deepseek_r1; for GLM-4.5 thinking it's glm45.

Wait for it to come up:

until curl -sf http://127.0.0.1:8001/v1/models >/dev/null; do sleep 5; done
echo "ready"
curl -s http://127.0.0.1:8001/v1/models | jq '.data[].id'

Step 4: Smoke test (CRITICAL)

bash tooling/scripts/smoke_test.sh my-new-model 8001

This runs the structured-extraction task at N=1 (~2-5 min wall) and grades the output. Expected:

==> Smoke test result
    verdict:  PASS
    accuracy: 0.95-1.0
    log dir:  logs/smoke_extract_...
✓ Smoke test PASS — your setup works.

If smoke test fails, do NOT proceed to the 3-7 hour full chain. Common failures:

Symptom	Cause	Fix
`tcs=0` on every iter in transcript.jsonl	Wrong `--tool-call-parser`	Try a different parser from the quick-ref table
Output in `reasoning_content`, `content` empty	`--reasoning-parser` set on non-thinking model	Drop the flag
4xx/5xx from vLLM endpoint	Container loading or model name mismatch	`docker logs vllm-...`; check `/v1/models` shows your model
Stuck-detector fires immediately	Agent isn't calling tools (parser issue) OR model is too small for the task	Verify smoke test passes on a known-good model first
OOM on container start	`--max-model-len` too high for your GPU memory	Drop to 32768 or 16384

Once smoke passes, the rest is mechanical.

Step 5: Full microbench chain

bash tooling/scripts/run_microbench.sh my-new-model 8001 my-label 3

Args: <served-model-name> <port> <run-name-tag> <N>. The chain runs all 12 task families × N replicates.

Wall time:

Tower2-class hardware (2× 96 GB Blackwell, 4-bit AWQ models): 3-7 hours for N=3
Smaller GPUs / FP8 / BF16: probably 6-15 hours for N=3
N=1 instead of N=3 cuts wall by ~3× but loses variance signal

The chain is idempotent: if you Ctrl-C and rerun, it skips runs that already have summary.json and workspace_final.tar.gz. Safe to run overnight, check, resume.

You can monitor progress with:

ls logs/p[1-3]_*_${LABEL}_v*/ | wc -l    # how many runs started
ls logs/p[1-3]_*_${LABEL}_v*/summary.json | wc -l    # how many runs finished cleanly

Step 6: Grade

bash tooling/scripts/grade_microbench.sh my-label

This iterates logs/p[1-3]_*_<label>_v*/, picks the right grader per task family, and writes grade.json next to each run's other artifacts. Idempotent. Takes ~5 min.

For the 5 Phase 3 tasks, the graders emit programmatic verdicts AND hand_rating_placeholders for subjective dimensions (prose quality, source skepticism, citation validity, audience tone fit, etc). Filling those is a separate step:

Quick path: leave them as null and report only the programmatic PASS/FAIL. This is honest and matches what the published microbench did originally.
Honest path: hand-grade them (yourself or with a second model — record the grader provenance per the existing convention; see KNOWN-LIMITATIONS.md § "claude-grading-claude" for the meta-issue).
Rigorous path: hand-grade with explicit rubric, sample-validate citations on the market-research entry. Documented in SCORECARD.md "What would change this picture" #1.

Step 7: Summarize

bash tooling/scripts/summarize.sh my-label

Prints a per-task table comparing your model against the published Coder-Next + 27B reference cells:

Microbench results for: llama3-70b

task                   pass     cost_med   wall_med   | 27B (ref)  Coder (ref)
-----------------------------------------------------------------------------------
p1_bugfix              3/3      $0.0184    14.2m      | 3/3        2/3
p1_testwrite           0/3      $0.0098    8.1m       | 0/3 †      0/3 †
...
Total: 22/36 PASS

The reference column is hard-coded to the published microbench-2026-04-28 numbers, so you can see directly where your model differs.

Step 8: Cost extraction

for d in logs/p[1-3]_*_my-label_v*/; do
  python3 tooling/scripts/extract_cost.py "$d"
done

Writes cost.json next to each run. Recorded fields: wall, completion + prompt tokens, throughput, GPU power+memory snapshot, energy upper-bound USD cost. Takes ~10s.

Step 9: Write up

A new MMBT contribution needs at minimum:

A row in SCORECARD.md — under the microbench-2026-04-28 section, add a new row format showing your model. Or, if your model is dramatically different from the existing two, propose a separate section.
A short findings note — what does your model do well vs poorly relative to Coder-Next + 27B? Particularly note any task families where it inverts the existing pattern (e.g. wins where Coder-Next wins, loses where 27B wins, etc.). 1-2 pages of markdown.
At least one full per-model entry for the highest-signal task family — copy the structure from benchmarks/microbench-2026-04-28/adversarial-hallucination/Qwen3.6-27B-AWQ/ as a template.

A more thorough contribution adds:

Lean per-model entries for all 12 task families (mirror the existing 9 lean entries' structure)
A failure-mode label.json per run, classified per FAILURE-TAXONOMY.md. The published tooling/scripts/apply_labels.py is hand-coded per run name; you'd add a LABELS dict entry per run.
Hand-graded subjective dimensions on Phase 3 with explicit _GRADER_ provenance.
Sample-validated citations on the market-research entry (the published 27B entry samples 18/33 URLs; you'd do the same on yours).

Step 10: Submit a PR

git checkout -b add-<model-label>-microbench-results
git add SCORECARD.md benchmarks/microbench-2026-04-28/<your-additions>
git commit -m "Add <model> microbench results"
git push -u origin add-<model-label>-microbench-results
gh pr create --title "Add <model> microbench results" --body "..."

PR template suggestions:

Headline pass-rate (e.g. "23/36 PASS, comparable to 27B's 17/30 excluding test-design rows")
2-3 specific cells where your model differs interestingly from the published baselines
Cost-per-attempt comparison
Anything you noticed about failure modes that isn't captured in the taxonomy

Common pitfalls

Skipping the smoke test. Wastes hours on a 3-7 hour chain that was always going to fail at iter 1. The smoke test exists for this exact reason.
Running before vLLM is fully ready. vLLM takes 30-90s to load weights and another 20-60s to compile CUDA graphs. Poll /v1/models until 200, don't assume readiness.
Reading cost.json as actual energy cost. It's cost_usd_upper_bound — assumes the GPU drew at power.limit for the entire wall time. Real draw is significantly lower. Useful for ranking, not for absolute economics.
Drawing strong conclusions from a single N=3 run. Confidence intervals on N=3 are wide. Treat point estimates as point estimates.
Cherry-picking-best-of-N without saying so. The MMBT convention for the older benchmarks is "publish best of N with explicit variance discussion." For the microbench, the convention shifted to "publish N=3 PASS rate so variance is visible." Pick one and be clear.
Not pinning the harness git SHA. Receipts capture this automatically, but if you modified harness.py for a new tool-call parser or schema, commit those changes BEFORE running the chain. Receipts with git_dirty: true aren't reproducible.

What if I'm not on Tower2-class hardware?

The published flags assume 96 GB GPUs. Smaller setups need adjustments:

Hardware	Likely changes
1× RTX 4090 (24 GB)	`--max-model-len 16384`, `--gpu-memory-utilization 0.85`. Models >7B at 4-bit might OOM.
1× RTX 5090 (32 GB)	`--max-model-len 32768`. Most 7B-13B-class works fine.
1× RTX 6000 Ada (48 GB)	`--max-model-len 65536`. 27B-class at 4-bit fits.
1× H100 (80 GB)	Default flags work for most models up to ~32B AWQ-4bit
2× 24-48 GB	Use `--tensor-parallel-size 2` for larger models

If --max-model-len is too low for a task's prompt, the harness will hit "completion_token_cap" and the run will FAIL with a non-substantive cause. The microbench tasks rarely exceed 30k tokens of context, so 32-65k should be comfortable.

Cost / time budget cheatsheet

For one new model, full microbench at N=3:

Item	Tower2-class	Smaller GPU
Smoke test	3 min	5 min
Full chain (12 tasks × N=3)	3-7 hr	6-15 hr
Grading	5 min	5 min
Cost extraction	10 sec	10 sec
Hand-grading Phase 3	30-60 min (claude) / 2-4 hr (human)	same
Citation validation on market-research	30-60 min	same
Write-up + PR	1-3 hr	same
Total	~6-12 hr	~10-20 hr

Operator time is roughly 4-6 hours of focused work spread over a day or two. Compute time is mostly unattended.

What this benchmark won't tell you

How your model performs at higher N (N=10+ on the highest-signal cells)
How your model performs on a different hardware profile (cost numbers don't normalize)
How your model performs vs cloud LLMs on these tasks (no cloud entries here yet)
How your model performs on tasks not in this suite (this is 12 task families, not the universe of agentic work)
Per-claim factual accuracy on long-form outputs (graders cover keyword presence + structure; deep fact-by-fact is hand-grading)

These are listed in KNOWN-LIMITATIONS.md and SCORECARD.md "What would change this picture." If your contribution moves the needle on any of them, that's a high-value add.

Questions / issues

Open a GitHub issue on Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests with:

Your model + quantization
Your hardware profile
The smoke-test output if it failed (logs/smoke_*/transcript.jsonl first 50 lines is usually enough to diagnose tool-call-parser issues)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding a new local model to the microbench

Two contribution shapes

What you need

The friendly path (4 commands once setup is done)

Step-by-step

Step 1: Pick a model and verify vLLM supports it

Step 2: Pull the weights

Step 3: Write a launch command

Step 4: Smoke test (CRITICAL)

Step 5: Full microbench chain

Step 6: Grade

Step 7: Summarize

Step 8: Cost extraction

Step 9: Write up

Step 10: Submit a PR

Common pitfalls

What if I'm not on Tower2-class hardware?

Cost / time budget cheatsheet

What this benchmark won't tell you

Questions / issues

Uh oh!

FilesExpand file tree

ADDING-A-MODEL.md

Latest commit

History

ADDING-A-MODEL.md

File metadata and controls

Adding a new local model to the microbench

Two contribution shapes

What you need

The friendly path (4 commands once setup is done)

Step-by-step

Step 1: Pick a model and verify vLLM supports it

Step 2: Pull the weights

Step 3: Write a launch command

Step 4: Smoke test (CRITICAL)

Step 5: Full microbench chain

Step 6: Grade

Step 7: Summarize

Step 8: Cost extraction

Step 9: Write up

Step 10: Submit a PR

Common pitfalls

What if I'm not on Tower2-class hardware?

Cost / time budget cheatsheet

What this benchmark won't tell you

Questions / issues