vLLM Deployment Guide

Run Wazuh OpenClaw Autopilot with self-hosted open-source models via vLLM. Zero cloud API costs, full data sovereignty.

Quick Start

1. Start vLLM with Tool Calling

Qwen3 32B (recommended — best open-source tool calling):

vllm serve Qwen/Qwen3-32B \
  --served-model-name qwen3-32b \
  --api-key "your-secure-key" \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95

Llama 3.3 70B (strong general reasoning):

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name llama3.3-70b \
  --api-key "your-secure-key" \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95

2. Install the Config

cp openclaw/openclaw-vllm.json ~/.openclaw/openclaw.json

Edit ~/.openclaw/openclaw.json:

Update models.providers.vllm.baseUrl if vLLM is on a different host/port
Set VLLM_API_KEY to match --api-key from step 1
Adjust agents.defaults.model.primary to match the model you're running

3. Refresh the Model Catalog (MANDATORY)

openclaw models scan

This step is critical. OpenClaw's openai-completions provider only sends tool definitions to models that have toolUse: true in its capability catalog. Custom providers (like vLLM) are not in the catalog by default. Running openclaw models scan probes your vLLM endpoint and populates the metadata. Without this step, ALL agents will silently fail — they output tool calls as plain text with stopReason: "stop" instead of actually invoking tools like web_fetch.

4. Verify

# Check vLLM is serving
curl http://localhost:8000/v1/models -H "Authorization: Bearer your-secure-key"

# Start OpenClaw
openclaw start
openclaw status --all

Recommended Models

Model	Params	VRAM (FP16)	Tool Parser	Best For
Qwen3 32B	32B	~64 GB	`hermes`	Primary agent model — excellent tool calling
Llama 3.3 70B	70B	~140 GB	`llama3_json`	Complex investigation/reasoning tasks
DeepSeek-R1 70B	70B	~140 GB	`deepseek_v32`	Chain-of-thought reasoning
Mistral Large	123B	~246 GB	`mistral`	Multi-language environments
Qwen3 8B	8B	~14 GB	`hermes`	Budget/testing — single consumer GPU

Any model on HuggingFace that vLLM supports can be used. The table above lists models tested with Wazuh Autopilot's multi-step tool calling pipeline.

Model Selection Tips

For Wazuh Autopilot: Qwen3 32B is the sweet spot — reliable tool calling (hermes format), strong reasoning, fits on ~64 GB VRAM.
For investigation-heavy workloads: Llama 3.3 70B provides deeper analysis but needs ~140 GB VRAM (multi-GPU).
For reasoning models (DeepSeek-R1): Add --reasoning-parser flag and set "reasoning": true in the config.
Avoid 7B models for production — they struggle with multi-step tool calling chains required by the investigation and response planner agents.

Critical vLLM Flags

Flag	Required	Purpose
`--enable-auto-tool-choice`	Yes	Enables tool/function calling support
`--tool-call-parser <parser>`	Yes	Maps model's tool output format to OpenAI API
`--served-model-name <name>`	Recommended	Must match the model `"id"` in openclaw config
`--tensor-parallel-size N`	For multi-GPU	Split model across N GPUs
`--max-model-len <tokens>`	Recommended	Match to `"contextWindow"` in config
`--gpu-memory-utilization 0.95`	Recommended	Use most of available VRAM
`--reasoning-parser <parser>`	For reasoning models	Extracts chain-of-thought from output
`--trust-remote-code`	Some models	Required by certain HuggingFace models with custom code

Tool Call Parsers by Model Family

hermes        → Qwen3, Qwen2.5, NousResearch/Hermes
llama3_json   → Llama 3.1, Llama 3.2, Llama 3.3
mistral       → Mistral, Mixtral
deepseek_v32  → DeepSeek-V3, DeepSeek-R1
minimax_m2    → MiniMax-M2.1
internlm      → InternLM 2.5+
jamba          → AI21 Jamba

Hardware Requirements

vLLM supports both NVIDIA (CUDA) and AMD (ROCm) GPUs. The key constraint is total VRAM — pick a model that fits your available GPU memory:

Model Size	VRAM Needed (FP16)	Example Hardware
7-8B	~14 GB	Single consumer GPU (24 GB)
14B	~28 GB	Single data center GPU (40 GB)
32B	~64 GB	Single high-end GPU (80 GB)
70B	~140 GB	2 GPUs with `--tensor-parallel-size 2`
130B+	~192 GB+	Multi-GPU or high-VRAM accelerators

Quantized models (AWQ, GPTQ) halve VRAM requirements at the cost of some tool-calling accuracy. Use --quantization awq to load quantized variants.

Platform-Specific Install

NVIDIA (CUDA):

pip install vllm

AMD (ROCm):

pip install vllm==0.15.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700

Free GPU Cloud Credits

Several cloud providers offer free credits for GPU instances suitable for running vLLM. Check your preferred provider's developer program for available offers.

Docker Deployment

Single GPU

docker run -d \
  --gpus '"device=0"' \
  --name vllm-autopilot \
  -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-32B \
  --served-model-name qwen3-32b \
  --api-key "${VLLM_API_KEY}" \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95

Multi-GPU (70B models)

docker run -d \
  --gpus all \
  --name vllm-autopilot \
  -p 8000:8000 \
  --shm-size 16g \
  -v /models:/models \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name llama3.3-70b \
  --api-key "${VLLM_API_KEY}" \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95

AMD ROCm: Use vllm/vllm-openai-rocm:latest image with --device=/dev/kfd --device=/dev/dri --group-add video instead of --gpus. Same vLLM flags apply.

Air-Gapped Deployment

For environments without internet access, pre-download models and run offline.

Pre-download Models (on a machine with internet)

pip install huggingface-hub
huggingface-cli download Qwen/Qwen3-32B --local-dir /models/qwen3-32b

Run Offline

export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

vllm serve /models/qwen3-32b \
  --served-model-name qwen3-32b \
  --api-key "${VLLM_API_KEY}" \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95

Combine with openclaw-vllm.json (which has no cloud dependencies) for a fully air-gapped setup. See AIR_GAPPED_DEPLOYMENT.md for the full air-gapped guide.

Production Recommendations

systemd Service

[Unit]
Description=vLLM Inference Server for Wazuh Autopilot
After=network.target

[Service]
Type=simple
User=vllm
Environment=VLLM_API_KEY=your-secure-key
Environment=HF_HOME=/opt/models
ExecStart=/opt/vllm/bin/vllm serve Qwen/Qwen3-32B \
  --served-model-name qwen3-32b \
  --api-key ${VLLM_API_KEY} \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Performance Tuning

--gpu-memory-utilization 0.95: Use 95% of VRAM. Increase to 0.99 if not sharing the GPU.
--max-model-len: Set to actual context you need. Lower = more concurrent requests.
--dtype auto: vLLM auto-selects BF16/FP16. Use --dtype float16 for older GPUs without BF16.
--quantization awq: Load AWQ-quantized models to halve VRAM. Some tool-calling accuracy loss.
--enforce-eager: Disables CUDA graph compilation. Slower steady-state but faster startup.

Health Checks

# vLLM health
curl -f http://localhost:8000/health

# Model loaded and responding
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-32b","messages":[{"role":"user","content":"ping"}],"max_tokens":5}'

Troubleshooting

"Tool web_fetch not found" or agents output tool calls as text

vLLM was started without --enable-auto-tool-choice --tool-call-parser <parser>. Restart vLLM with both flags.

404 "Model not found"

The model "id" in openclaw-vllm.json doesn't match --served-model-name. They must be identical.

Out of memory (OOM)

Reduce --max-model-len (e.g., 32768 instead of 131072)
Reduce --gpu-memory-utilization (e.g., 0.90)
Use a quantized model variant (AWQ/GPTQ)
Add more GPUs with --tensor-parallel-size

Slow inference / timeouts

Increase timeoutSeconds in config (default 900, try 1200 for 70B+ models)
Reduce maxConcurrent to 1 if running a single vLLM instance
Monitor GPU utilization: nvidia-smi -l 1 or rocm-smi

"reasoning" errors

If you see model does not support thinking, set "reasoning": false in the model config. Only models started with --reasoning-parser support reasoning mode.

OpenClaw says model has no tool support / agents output tool calls as text

This is the most common vLLM deployment issue. OpenClaw's openai-completions provider only sends tool definitions if the model has toolUse: true in its internal capability catalog. Custom providers like vLLM are not in the catalog by default.

Fix: Run openclaw models scan to probe your vLLM endpoint and populate the catalog. This must be done after every config change that adds or modifies models. If the gateway is already running, restart it after the scan.

Running Multiple Models

You can run multiple vLLM instances on different ports or GPUs:

# GPU 0: Qwen3 32B for most agents
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-32B \
  --served-model-name qwen3-32b --port 8000 ...

# GPU 1-2: Llama 3.3 70B for investigation agent
CUDA_VISIBLE_DEVICES=1,2 vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name llama3.3-70b --port 8001 --tensor-parallel-size 2 ...

Then configure two providers in your config:

"models": {
  "providers": {
    "vllm-small": {
      "baseUrl": "http://127.0.0.1:8000/v1",
      "api": "openai-completions",
      "apiKey": "${VLLM_API_KEY}",
      "models": [{ "id": "qwen3-32b", ... }]
    },
    "vllm-large": {
      "baseUrl": "http://127.0.0.1:8001/v1",
      "api": "openai-completions",
      "apiKey": "${VLLM_API_KEY}",
      "models": [{ "id": "llama3.3-70b", ... }]
    }
  }
}

Then assign per-agent: "model": { "primary": "vllm-large/llama3.3-70b" } for the investigation agent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM Deployment Guide

Quick Start

1. Start vLLM with Tool Calling

2. Install the Config

3. Refresh the Model Catalog (MANDATORY)

4. Verify

Recommended Models

Model Selection Tips

Critical vLLM Flags

Tool Call Parsers by Model Family

Hardware Requirements

Platform-Specific Install

Free GPU Cloud Credits

Docker Deployment

Single GPU

Multi-GPU (70B models)

Air-Gapped Deployment

Pre-download Models (on a machine with internet)

Run Offline

Production Recommendations

systemd Service

Performance Tuning

Health Checks

Troubleshooting

"Tool web_fetch not found" or agents output tool calls as text

404 "Model not found"

Out of memory (OOM)

Slow inference / timeouts

"reasoning" errors

OpenClaw says model has no tool support / agents output tool calls as text

Running Multiple Models

FilesExpand file tree

VLLM_DEPLOYMENT.md

Latest commit

History

VLLM_DEPLOYMENT.md

File metadata and controls

vLLM Deployment Guide

Quick Start

1. Start vLLM with Tool Calling

2. Install the Config

3. Refresh the Model Catalog (MANDATORY)

4. Verify

Recommended Models

Model Selection Tips

Critical vLLM Flags

Tool Call Parsers by Model Family

Hardware Requirements

Platform-Specific Install

Free GPU Cloud Credits

Docker Deployment

Single GPU

Multi-GPU (70B models)

Air-Gapped Deployment

Pre-download Models (on a machine with internet)

Run Offline

Production Recommendations

systemd Service

Performance Tuning

Health Checks

Troubleshooting

"Tool web_fetch not found" or agents output tool calls as text

404 "Model not found"

Out of memory (OOM)

Slow inference / timeouts

"reasoning" errors

OpenClaw says model has no tool support / agents output tool calls as text

Running Multiple Models