Canonical vLLM launch commands

Exact docker run commands for each model under test. Each run's logs/<run>/receipt.json records the actual container Args at run time, so this file documents the intended canonical form for new launches.

Context

Host: Tower2 (WRX90E, TR PRO 7965WX, 2× RTX PRO 6000 Blackwell, 96 GB each)
vLLM image: vllm/vllm-openai:latest (resolves to vLLM 0.19.1 as of 2026-04-26 — check image digest in each run's receipt for the exact pin)
Sandbox image: bench-sandbox:latest (built from agent-pilot/Dockerfile)
Models live at: ~/models/<org-name>-<model>/
Mount convention: -v ~/models:/models:ro
GPU power cap on Tower2: 535 W per GPU (set via sudo nvidia-smi -i 0,1 -pl 535; resets on reboot to default 600)

Conventions

Each model gets its own container; we don't share containers across models
Container names: vllm-<short-name>
Host port: 8000 for GPU1's primary model, 8001 for GPU0's primary model
--gpu-memory-utilization 0.92 is the default (vLLM allocates ~88 GB of the 96 GB)
--max-model-len 262144 matches the native context for the Qwen3.x family
--temperature 0.0 is set per-request in the harness (not at launch); receipts record it
--tensor-parallel-size 1 for now (each model on a single GPU); change if both GPUs are needed for one model

Tool-call-parser quick reference (for new models)

The single highest-leverage flag when adding a new model. Wrong --tool-call-parser produces empty tool_calls arrays silently — the run looks like it's executing but the agent never actually calls any tools.

vLLM's tool-call parsers (as of vLLM 0.19.x):

Model family	`--tool-call-parser`	`--reasoning-parser`	Notes
Qwen3.x dense (thinking)	`qwen3_xml`	`qwen3`	0.6B / 1.7B / 4B / 8B / 14B / 27B / 32B / 72B (when re-released)
Qwen3-Coder (Coder-Next, Coder-30B-A3B)	`qwen3_coder`	(omit)	Not thinking-mode; adding the reasoning parser breaks output
Qwen3.6-35B-A3B (MoE thinking)	`qwen3_xml`	`qwen3`	Same parser as 27B
Qwen2.5 / Qwen2.5-Coder	`hermes`	(omit)	Trained on Hermes-style tool calling
Llama 3.1 / 3.2 / 3.3 instruct	`llama3_json`	(omit)	The `_json` parser handles their `[<TOOLCALL>...JSON...</TOOLCALL>]` format
Llama 4 instruct	`pythonic`	(omit)	Llama 4 emits Python-call-style tool invocations
Mistral / Mixtral instruct	`mistral`	(omit)	Native Mistral tool-call format
DeepSeek-V3 / R1	`deepseek_v3`	`deepseek_r1` (R1 only)	R1 has thinking; V3 doesn't
DeepSeek-Coder-V2	`hermes`	(omit)	Hermes-trained variant
Gemma 2 / Gemma 3 instruct	`pythonic`	(omit)	Pythonic tool-call format
GLM-4 / GLM-4.5	`glm45`	`glm45` (if thinking variant)	Vendor-shipped parser
Phi-3.5 / Phi-4	`phi4_mini_json`	(omit)	JSON-style
Hermes-3 (Nous)	`hermes`	(omit)	Origin of the Hermes parser

If the model isn't listed above, check vllm/entrypoints/openai/tool_parsers/ in the vLLM source for available parsers, or the model's HuggingFace model card. Some new models ship with a custom Jinja chat template that bakes the parser format into the prompt — those usually work with hermes or pythonic as the closest match.

Smoke test before committing to a full run. tooling/scripts/smoke_test.sh runs one task at N=1 (~2-5 min) and reports PASS/FAIL. If the smoke test fails with tcs=0 consistently in transcript.jsonl, your tool-call-parser is wrong.

Per-model launch

Qwen3.6-27B AWQ (dense, thinking-mode)

docker run -d \
  --name vllm-qwen36-awq \
  --gpus '"device=1"' \
  --shm-size 8g \
  -v ~/models:/models:ro \
  -p 127.0.0.1:8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/cyankiwi-Qwen3.6-27B-AWQ-INT4 \
  --served-model-name qwen3.6-27b-awq \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

Flags rationale:

--reasoning-parser qwen3 — model emits <think>...</think> sections; parser splits them into reasoning_content so content only contains the final answer. Without this, the answer ends up nested inside content mixed with thinking.
--tool-call-parser qwen3_xml — Qwen3.6 dense models emit tool calls in XML form (<function name="..."><parameter ...>). The qwen3_xml parser extracts these into the standard OpenAI tool_calls array.

Qwen3-Coder-Next AWQ (MoE 80B/3B, agentic)

docker run -d \
  --name vllm-coder-next \
  --gpus '"device=0"' \
  --shm-size 8g \
  -v ~/models:/models:ro \
  -p 127.0.0.1:8001:8000 \
  vllm/vllm-openai:latest \
  --model /models/cyankiwi-Qwen3-Coder-Next-AWQ-4bit \
  --served-model-name qwen3-coder-next-awq \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Flags rationale:

No --reasoning-parser — Coder-Next is not a thinking-mode model. Adding --reasoning-parser qwen3 causes the entire output to be misclassified into the reasoning field, leaving content empty (we hit this exact bug in the first round of testing — see findings/2026-04-26-coder-next-investment-memo-pilot.md).
--tool-call-parser qwen3_coder — Coder-Next was trained on the Qwen3-Coder tool format (different from qwen3_xml).

Qwen3.6-35B-A3B AWQ (MoE 35B/3B, thinking-mode)

Swap candidate for GPU0 (alternates with Coder-Next). Used in the 2026-04-26 grid (1/6 across memo+board) and the 2026-04-27 PR-audit family. Disqualified from the daily-driver comparison after a floor failure at N=1 PR audit (n1_35ba3b_v1: zero artifacts written in 28 iters before model_stopped) — kept here for replay.

docker rm -f vllm-coder-next   # free GPU0 first
docker run -d \
  --name vllm-qwen36-35ba3b \
  --gpus '"device=0"' \
  --shm-size 8g \
  -v ~/models:/models:ro \
  -p 127.0.0.1:8001:8000 \
  vllm/vllm-openai:latest \
  --model /models/cyankiwi-Qwen3.6-35B-A3B-AWQ-4bit \
  --served-model-name qwen3.6-35ba3b-awq \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

Flags rationale: same pattern as the 27B since 35B-A3B is also a Qwen3.6-family thinking model.

Qwen3.6-27B-FP8 (official, 8-bit, thinking-mode)

Reference variant. 8-bit is bandwidth-heavier per token; useful for FP8-vs-AWQ comparison.

docker run -d \
  --name vllm-qwen36-fp8 \
  --gpus '"device=1"' \
  --shm-size 8g \
  -v ~/models:/models:ro \
  -p 127.0.0.1:8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/Qwen-Qwen3.6-27B-FP8 \
  --served-model-name qwen3.6-27b-fp8 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Flags rationale: only the FP8 build ships the MTP heads (mtp.safetensors in the repo); enabling --speculative-config here gets free decode speedup. AWQ builds don't include MTP, so the flag is omitted there.

Inference-request defaults (set by harness, not at launch)

Recorded in every receipt under inference_request_defaults. Current values:

temperature: configurable per-run via --temperature (default 0.0). Receipts reflect the actual value used. 0.3-0.5 is recommended for agentic tasks (see HARNESS-CHANGELOG --temperature entry for the determinism trap that motivated making it configurable).
max_tokens per request: dynamically computed as min(180000, max_model_len - last_prompt_tokens - 14000) to leave headroom for prompt growth as conversation history accumulates. Floor is 2048.
max_model_len: 262144 (the Qwen3.x family's native context)
seed: 42 on every request (kept for replay reproducibility; harmless at non-zero temperature)
stream: false (we capture full responses; streaming would help for live progress but complicates token accounting)
tool_choice: "auto"
tools: bash, write_file, read_file, done

Sandbox image

bench-sandbox:latest rebuild:

docker build -t bench-sandbox:latest <this-tooling-dir>/

Contents (see agent-pilot/Dockerfile):

python 3.11-slim base
system tools: git, curl, jq, poppler-utils (PDF extraction), unzip, build-essential, graphviz, font packages, cairo/pango (for python-pptx + reportlab rendering)
gh CLI (from cli.github.com's apt repo) — for tasks that read GitHub PRs/issues
docker CLI (static binary from download.docker.com) — talks to the host daemon when --docker-socket is passed
python libs: requests, beautifulsoup4, lxml, pandas, numpy, openpyxl, yfinance, sec-edgar-downloader, markdown, reportlab, python-pptx, matplotlib, plotly, kaleido, Pillow, graphviz, pytest, pytest-cov, ruff

The image digest at run time is in each receipt under sandbox.image_id. Per-run sandbox config (whether --docker-socket, --gpus, --gh-token, --input-mount were passed) lands in sandbox.runtime.

Auditing a past run

cat agent-pilot/logs/<run>/receipt.json | jq '.vllm.containers[].args'      # see exact launch flags
cat agent-pilot/logs/<run>/receipt.json | jq '.harness'                      # harness git SHA + dirty flag + sha256
cat agent-pilot/logs/<run>/receipt.json | jq '.task'                         # task file SHA
cat agent-pilot/logs/<run>/receipt.json | jq '.hardware.nvidia_smi'          # GPU state at start

If any of those changed and you want to re-run identically: git checkout <harness_git_sha>, rebuild the sandbox image (use the Dockerfile from that SHA), launch vLLM with the captured args, then run the harness.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Canonical vLLM launch commands

Context

Conventions

Tool-call-parser quick reference (for new models)

Per-model launch

Qwen3.6-27B AWQ (dense, thinking-mode)

Qwen3-Coder-Next AWQ (MoE 80B/3B, agentic)

Qwen3.6-35B-A3B AWQ (MoE 35B/3B, thinking-mode)

Qwen3.6-27B-FP8 (official, 8-bit, thinking-mode)

Inference-request defaults (set by harness, not at launch)

Sandbox image

Auditing a past run

Uh oh!

FilesExpand file tree

launch-commands.md

Latest commit

History

launch-commands.md

File metadata and controls

Canonical vLLM launch commands

Context

Conventions

Tool-call-parser quick reference (for new models)

Per-model launch

Qwen3.6-27B AWQ (dense, thinking-mode)

Qwen3-Coder-Next AWQ (MoE 80B/3B, agentic)

Qwen3.6-35B-A3B AWQ (MoE 35B/3B, thinking-mode)

Qwen3.6-27B-FP8 (official, 8-bit, thinking-mode)

Inference-request defaults (set by harness, not at launch)

Sandbox image

Auditing a past run