Exact docker run commands for each model under test. Each run's logs/<run>/receipt.json records the actual container Args at run time, so this file documents the intended canonical form for new launches.
- Host: Tower2 (WRX90E, TR PRO 7965WX, 2× RTX PRO 6000 Blackwell, 96 GB each)
- vLLM image: vllm/vllm-openai:latest (resolves to vLLM 0.19.1 as of 2026-04-26; check the image digest in each run's receipt for the exact pin)
- Sandbox image: bench-sandbox:latest (built from agent-pilot/Dockerfile)
- Models live at: ~/models/<org-name>-<model>/
- Mount convention: -v ~/models:/models:ro
- GPU power cap on Tower2: 535 W per GPU (set via sudo nvidia-smi -i 0,1 -pl 535; resets to the 600 W default on reboot)
- Each model gets its own container; we don't share containers across models
- Container names: vllm-<short-name>
- Host port: 8000 for GPU1's primary model, 8001 for GPU0's primary model
- --gpu-memory-utilization 0.92 is the default (vLLM allocates ~88 GB of the 96 GB)
- --max-model-len 262144 matches the native context for the Qwen3.x family
- --temperature 0.0 is set per-request in the harness (not at launch); receipts record it
- --tensor-parallel-size 1 for now (each model on a single GPU); change if both GPUs are needed for one model
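The ~88 GB figure follows directly from the utilization fraction. A minimal sketch of the arithmetic, assuming the 96 GB per-GPU figure is used as-is (the function name is hypothetical and the GB/GiB distinction is ignored):

```python
def reserved_gb(total_gb: float, utilization: float) -> float:
    """Memory vLLM is allowed to claim for a given --gpu-memory-utilization."""
    return total_gb * utilization


total = 96.0          # per-GPU memory stated for Tower2
utilization = 0.92    # value passed to --gpu-memory-utilization

claimed = reserved_gb(total, utilization)
print(f"vLLM budget: {claimed:.2f} GB")          # prints "vLLM budget: 88.32 GB"
print(f"left over:   {total - claimed:.2f} GB")  # prints "left over:   7.68 GB"
```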
The tool-call parser is the single highest-leverage flag when adding a new model. A wrong --tool-call-parser silently produces empty tool_calls arrays: the run looks like it is executing, but the agent never actually calls any tools.
vLLM's tool-call parsers (as of vLLM 0.19.x):
| Model family | --tool-call-parser | --reasoning-parser | Notes |
|---|---|---|---|
| Qwen3.x dense (thinking) | qwen3_xml | qwen3 | 0.6B / 1.7B / 4B / 8B / 14B / 27B / 32B / 72B (when re-released) |
| Qwen3-Coder (Coder-Next, Coder-30B-A3B) | qwen3_coder | (omit) | Not thinking-mode; adding the reasoning parser breaks output |
| Qwen3.6-35B-A3B (MoE thinking) | qwen3_xml | qwen3 | Same parser as 27B |
| Qwen2.5 / Qwen2.5-Coder | hermes | (omit) | Trained on Hermes-style tool calling |
| Llama 3.1 / 3.2 / 3.3 instruct | llama3_json | (omit) | The _json parser handles their [<TOOLCALL>...JSON...</TOOLCALL>] format |
| Llama 4 instruct | pythonic | (omit) | Llama 4 emits Python-call-style tool invocations |
| Mistral / Mixtral instruct | mistral | (omit) | Native Mistral tool-call format |
| DeepSeek-V3 / R1 | deepseek_v3 | deepseek_r1 (R1 only) | R1 has thinking; V3 doesn't |
| DeepSeek-Coder-V2 | hermes | (omit) | Hermes-trained variant |
| Gemma 2 / Gemma 3 instruct | pythonic | (omit) | Pythonic tool-call format |
| GLM-4 / GLM-4.5 | glm45 | glm45 (if thinking variant) | Vendor-shipped parser |
| Phi-3.5 / Phi-4 | phi4_mini_json | (omit) | JSON-style |
| Hermes-3 (Nous) | hermes | (omit) | Origin of the Hermes parser |
If the model isn't listed above, check vllm/entrypoints/openai/tool_parsers/ in the vLLM source for available parsers, or the model's HuggingFace model card. Some new models ship with a custom Jinja chat template that bakes the parser format into the prompt — those usually work with hermes or pythonic as the closest match.
Smoke test before committing to a full run. tooling/scripts/smoke_test.sh runs one task at N=1 (~2-5 min) and reports PASS/FAIL. If the smoke test fails with tcs=0 consistently in transcript.jsonl, your tool-call-parser is wrong.
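The silent failure described above can also be checked directly on a response from the OpenAI-compatible endpoint. The sketch below counts tool calls in a chat-completion response body. It does not read transcript.jsonl (its schema is not shown here), the function name is hypothetical, and both sample payloads are hand-written for illustration rather than captured from a run.

```python
from typing import Any


def count_tool_calls(response: dict[str, Any]) -> int:
    """Count tool calls across all choices of an OpenAI-style chat completion."""
    total = 0
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        # tool_calls may be missing, None, or an empty list when parsing fails
        total += len(message.get("tool_calls") or [])
    return total


# Hand-written sample: the parser extracted one call to the bash tool.
healthy = {
    "choices": [{
        "message": {
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "bash", "arguments": "{\"cmd\": \"ls\"}"},
            }],
        }
    }]
}

# Hand-written sample: the call stayed in the text and tool_calls is empty.
silent_failure = {
    "choices": [{
        "message": {
            "content": "<function name=\"bash\">...</function>",
            "tool_calls": [],
        }
    }]
}

print(count_tool_calls(healthy))         # prints 1
print(count_tool_calls(silent_failure))  # prints 0
```

A count of zero on a prompt that clearly requires a tool is the signal to revisit --tool-call-parser before launching a full run.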
docker run -d \
--name vllm-qwen36-awq \
--gpus '"device=1"' \
--shm-size 8g \
-v ~/models:/models:ro \
-p 127.0.0.1:8000:8000 \
vllm/vllm-openai:latest \
--model /models/cyankiwi-Qwen3.6-27B-AWQ-INT4 \
--served-model-name qwen3.6-27b-awq \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.92 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml

Flags rationale:
- --reasoning-parser qwen3: the model emits <think>...</think> sections; the parser splits them into reasoning_content so that content contains only the final answer. Without this flag, the final answer ends up in content mixed with the thinking text.
- --tool-call-parser qwen3_xml: Qwen3.6 dense models emit tool calls in XML form (<function name="..."><parameter ...>). The qwen3_xml parser extracts these into the standard OpenAI tool_calls array.
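To make the reasoning split concrete, here is a toy illustration of separating <think>...</think> sections from the final answer. It is not vLLM's parser, only a regex sketch of the same idea; the function name and the sample string are made up for the example.

```python
import re

# Non-greedy match so several <think> blocks are handled independently.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, content) from raw model text containing <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(raw))
    content = THINK_RE.sub("", raw).strip()
    return reasoning, content


raw_output = "<think>plan the answer</think>\nFinal answer: 42"
reasoning, content = split_reasoning(raw_output)
print(repr(reasoning))  # prints 'plan the answer'
print(repr(content))    # prints 'Final answer: 42'
```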
docker run -d \
--name vllm-coder-next \
--gpus '"device=0"' \
--shm-size 8g \
-v ~/models:/models:ro \
-p 127.0.0.1:8001:8000 \
vllm/vllm-openai:latest \
--model /models/cyankiwi-Qwen3-Coder-Next-AWQ-4bit \
--served-model-name qwen3-coder-next-awq \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.92 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder

Flags rationale:
- No --reasoning-parser: Coder-Next is not a thinking-mode model. Adding --reasoning-parser qwen3 causes the entire output to be misclassified into the reasoning field, leaving content empty (we hit this exact bug in the first round of testing; see findings/2026-04-26-coder-next-investment-memo-pilot.md).
- --tool-call-parser qwen3_coder: Coder-Next was trained on the Qwen3-Coder tool format (different from qwen3_xml).
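The misclassification above has a recognizable shape in the response message: empty content, no tool calls, and all the text in the reasoning field. A minimal sketch of a check for it, assuming the message is an OpenAI-style dict; the function name is hypothetical, and both reasoning and reasoning_content are checked because this document uses both names.

```python
def looks_misclassified(message: dict) -> bool:
    """True when the whole output appears to have landed in the reasoning field."""
    content = (message.get("content") or "").strip()
    reasoning = (
        message.get("reasoning_content") or message.get("reasoning") or ""
    ).strip()
    has_tool_calls = bool(message.get("tool_calls"))
    # Empty answer, no tool use, but non-empty reasoning: suspicious.
    return not content and not has_tool_calls and bool(reasoning)


# Hand-written samples for illustration only.
swallowed = {"content": "", "reasoning": "Here is the memo you asked for..."}
normal = {"content": "Here is the memo you asked for...", "reasoning": None}

print(looks_misclassified(swallowed))  # prints True
print(looks_misclassified(normal))     # prints False
```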
Qwen3.6-35B-A3B is the swap candidate for GPU0 (alternates with Coder-Next). Used in the 2026-04-26 grid (1/6 across memo+board) and the 2026-04-27 PR-audit family. Disqualified from the daily-driver comparison after a floor failure at N=1 PR audit (n1_35ba3b_v1: zero artifacts written in 28 iters before model_stopped); kept here for replay.
docker rm -f vllm-coder-next # free GPU0 first
docker run -d \
--name vllm-qwen36-35ba3b \
--gpus '"device=0"' \
--shm-size 8g \
-v ~/models:/models:ro \
-p 127.0.0.1:8001:8000 \
vllm/vllm-openai:latest \
--model /models/cyankiwi-Qwen3.6-35B-A3B-AWQ-4bit \
--served-model-name qwen3.6-35ba3b-awq \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.92 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml

Flags rationale: same pattern as the 27B, since 35B-A3B is also a Qwen3.6-family thinking model.
Reference variant: the FP8 build of Qwen3.6-27B. 8-bit weights are bandwidth-heavier per token than the AWQ 4-bit build; useful for FP8-vs-AWQ comparison.
docker run -d \
--name vllm-qwen36-fp8 \
--gpus '"device=1"' \
--shm-size 8g \
-v ~/models:/models:ro \
-p 127.0.0.1:8000:8000 \
vllm/vllm-openai:latest \
--model /models/Qwen-Qwen3.6-27B-FP8 \
--served-model-name qwen3.6-27b-fp8 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.92 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Flags rationale: only the FP8 build ships the MTP heads (mtp.safetensors in the repo); enabling --speculative-config here gets a free decode speedup. AWQ builds don't include MTP, so the flag is omitted there.
Per-request inference defaults are recorded in every receipt under inference_request_defaults. Current values:
- temperature: configurable per-run via --temperature (default 0.0). Receipts reflect the actual value used. 0.3-0.5 is recommended for agentic tasks (see the HARNESS-CHANGELOG --temperature entry for the determinism trap that motivated making it configurable).
- max_tokens per request: dynamically computed as min(180000, max_model_len - last_prompt_tokens - 14000) to leave headroom for prompt growth as conversation history accumulates. Floor is 2048.
- max_model_len: 262144 (the Qwen3.x family's native context)
- seed: 42 on every request (kept for replay reproducibility; harmless at non-zero temperature)
- stream: false (we capture full responses; streaming would help for live progress but complicates token accounting)
- tool_choice: "auto"
- tools: bash, write_file, read_file, done
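The max_tokens rule in the list above can be worked through with a few prompt sizes. This is a sketch, not the harness code: the function and constant names are hypothetical, and applying the 2048 floor as an outer max() is an assumption about how the floor combines with the formula.

```python
MAX_MODEL_LEN = 262144   # native context, from the list above
HARD_CAP = 180000        # upper bound on a single completion
HEADROOM = 14000         # reserved for prompt growth
FLOOR = 2048             # assumed to be applied last, via max()


def compute_max_tokens(last_prompt_tokens: int,
                       max_model_len: int = MAX_MODEL_LEN) -> int:
    """min(180000, max_model_len - last_prompt_tokens - 14000), floored at 2048."""
    budget = min(HARD_CAP, max_model_len - last_prompt_tokens - HEADROOM)
    return max(FLOOR, budget)


print(compute_max_tokens(10_000))   # prints 180000 (the cap binds)
print(compute_max_tokens(100_000))  # prints 148144 (the context limit binds)
print(compute_max_tokens(247_000))  # prints 2048   (the floor binds)
```

Early in a conversation the 180000 cap is the binding term; as history accumulates, the remaining context takes over, and the floor only matters once the prompt is within about 16000 tokens of max_model_len.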
bench-sandbox:latest rebuild:
docker build -t bench-sandbox:latest <this-tooling-dir>/

Contents (see agent-pilot/Dockerfile):
- python 3.11-slim base
- system tools: git, curl, jq, poppler-utils (PDF extraction), unzip, build-essential, graphviz, font packages, cairo/pango (for python-pptx + reportlab rendering)
- gh CLI (from cli.github.com's apt repo), for tasks that read GitHub PRs/issues
- docker CLI (static binary from download.docker.com); talks to the host daemon when --docker-socket is passed
- python libs: requests, beautifulsoup4, lxml, pandas, numpy, openpyxl, yfinance, sec-edgar-downloader, markdown, reportlab, python-pptx, matplotlib, plotly, kaleido, Pillow, graphviz, pytest, pytest-cov, ruff
The image digest at run time is in each receipt under sandbox.image_id. Per-run sandbox config (whether --docker-socket, --gpus, --gh-token, --input-mount were passed) lands in sandbox.runtime.
cat agent-pilot/logs/<run>/receipt.json | jq '.vllm.containers[].args' # see exact launch flags
cat agent-pilot/logs/<run>/receipt.json | jq '.harness' # harness git SHA + dirty flag + sha256
cat agent-pilot/logs/<run>/receipt.json | jq '.task' # task file SHA
cat agent-pilot/logs/<run>/receipt.json | jq '.hardware.nvidia_smi' # GPU state at start

If any of those changed and you want to re-run identically: git checkout <harness_git_sha>, rebuild the sandbox image (use the Dockerfile from that SHA), launch vLLM with the captured args, then run the harness.
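When comparing a receipt against a fresh launch, it can help to diff the captured flags rather than eyeball them. The sketch below assumes each entry under .vllm.containers[] has an args field holding a flat list of strings, as the jq query above suggests; the exact schema is not shown here, and all function names are hypothetical.

```python
import json
from pathlib import Path


def load_launch_args(receipt_path: str) -> list:
    """Return one args list per vLLM container recorded in a receipt (assumed schema)."""
    receipt = json.loads(Path(receipt_path).read_text())
    containers = receipt.get("vllm", {}).get("containers", [])
    return [container.get("args", []) for container in containers]


def flags_to_dict(args: list) -> dict:
    """Pair each --flag with its value; bare flags map to True."""
    flags = {}
    i = 0
    while i < len(args):
        token = args[i]
        if token.startswith("--"):
            has_value = i + 1 < len(args) and not args[i + 1].startswith("--")
            flags[token] = args[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # skip stray positional tokens
    return flags


def diff_flags(old_args: list, new_args: list) -> dict:
    """Map each changed flag to an (old, new) pair; None means the flag is absent."""
    old, new = flags_to_dict(old_args), flags_to_dict(new_args)
    return {
        flag: (old.get(flag), new.get(flag))
        for flag in sorted(old.keys() | new.keys())
        if old.get(flag) != new.get(flag)
    }


captured = ["--max-model-len", "262144", "--enable-auto-tool-choice",
            "--tool-call-parser", "qwen3_xml"]
relaunch = ["--max-model-len", "262144", "--enable-auto-tool-choice",
            "--tool-call-parser", "hermes", "--reasoning-parser", "qwen3"]
print(diff_flags(captured, relaunch))
```

An empty result means the relaunch matches the captured flags; any entry is a flag to fix before treating the new run as a replay.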