OpenCode -> Candle-vLLM (OpenAI-compatible)

This guide connects OpenCode directly to candle-vllm through its built-in OpenAI-compatible /v1/chat/completions endpoint.
Start the candle-vllm server:

```shell
cargo run --release --features cuda,nccl,graph,flashinfer,cutlass -- \
  --m Qwen/Qwen3.5-27B-FP8 \
  --d 0 \
  --prefix-cache \
  --p 8000 \
  --gpu-memory-fraction 0.7 \
  --enforce-parser qwen_coder
```

If you prefer FlashAttention, replace flashinfer with flashattn.
Verify the server is reachable and list the served models:

```shell
curl http://localhost:8000/v1/models
```

Use the returned id in the OpenCode config.
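The /v1/models endpoint returns the standard OpenAI list schema. As a minimal sketch (the sample payload below is illustrative, not a captured response), the id of each entry in data is the value OpenCode needs:

```python
import json

# Hypothetical sample of a /v1/models response in the OpenAI list schema;
# in practice, fetch the real one with: curl http://localhost:8000/v1/models
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen3.5-27B-FP8", "object": "model", "owned_by": "candle-vllm"}
  ]
}
""")

# Each entry's "id" is what goes into the OpenCode model config.
served_ids = [m["id"] for m in sample["data"]]
print(served_ids)
```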
Install OpenCode:

```shell
curl -fsSL https://opencode.ai/install | bash
```

Or:

```shell
npm i -g opencode-ai
```

Create ~/.config/opencode/config.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local-candle-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Candle-vLLM Local",
      "options": {
        "baseURL": "http://localhost:8000/v1"
      },
      "models": {
        "qwen3-coder": {
          "name": "Qwen/Qwen3.5-27B-FP8"
        }
      }
    }
  },
  "model": "local-candle-vllm/qwen3-coder"
}
```

Then start OpenCode:

```shell
opencode
```

- Tool calls follow the normal OpenAI request/response loop.
- For Qwen coder models, --enforce-parser qwen_coder is usually the most reliable parser setting.
- If OpenCode reports a model mismatch, compare your configured model against GET /v1/models.
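The tool-call loop mentioned above can be sketched offline. This follows the OpenAI chat schema; the get_weather tool and the response body are purely illustrative, not candle-vllm output:

```python
import json

# One turn of the OpenAI tool-call loop that OpenCode drives against
# /v1/chat/completions. The tool definition here is hypothetical.
request = {
    "model": "Qwen/Qwen3.5-27B-FP8",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# Assume the server answered with a tool call (illustrative response body).
response_message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}

# The client runs the tool, then appends the assistant message and the tool
# result before the next request -- that is the "normal loop".
call = response_message["tool_calls"][0]
args = json.loads(call["function"]["arguments"])
request["messages"] += [
    response_message,
    {"role": "tool", "tool_call_id": call["id"], "content": "18C, clear"},
]
print(call["function"]["name"], args["city"], len(request["messages"]))
```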
Chat logger:

```shell
export CANDLE_VLLM_CHAT_LOGGER=1
```

Reasoning routing for tool-enabled requests:

```shell
export CANDLE_VLLM_STREAM_AS_REASONING_CONTENT=1
```

Set CANDLE_VLLM_STREAM_AS_REASONING_CONTENT=0 if the client expects reasoning to stay inside content instead of separate reasoning_content fields.
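To make the difference concrete, here is a sketch of how a client might accumulate streamed deltas when reasoning is routed into a separate reasoning_content field; the delta dicts are illustrative, not captured server output:

```python
# With CANDLE_VLLM_STREAM_AS_REASONING_CONTENT=1, reasoning tokens arrive
# under "reasoning_content" while the final answer arrives under "content".
# These sample deltas are hypothetical.
deltas = [
    {"reasoning_content": "Let me check the config."},
    {"reasoning_content": " Looks fine."},
    {"content": "Your config "},
    {"content": "is valid."},
]

reasoning, content = "", ""
for d in deltas:
    reasoning += d.get("reasoning_content", "")
    content += d.get("content", "")

# With the flag set to 0, the same reasoning text would instead arrive
# interleaved in "content".
print(content)
```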