
llama-qwen

Qwen3.6-27B (Q8_0 GGUF) served over OpenAI-compatible HTTP on AMD Strix Halo.



What this is

A thin Docker Compose wrapper that runs unsloth/Qwen3.6-27B-GGUF (Q8_0) behind an OpenAI-compatible HTTP API on an AMD Ryzen AI Max+ 395 "Strix Halo" (gfx1151, RDNA 3.5, 128 GB UMA). Serves /v1/chat/completions and /v1/completions from the same server. Native 256K context with a quantized KV cache (q8_0), which keeps the cache small enough that the full context still fits.

llama.cpp is built from source against a TheRock nightly ROCm SDK for gfx1151, with rocWMMA flash-attention kernels enabled. No patches and no forks: upstream llama.cpp works as-is for this architecture, which is a key difference from the vLLM path.

Want the official BF16 weights, vision input, or separated reasoning on /v1/responses? See the sibling repo vllm-qwen: same hardware, Qwen 3.6-27B BF16 via vLLM. Official Qwen safetensors, image understanding, Responses API with reasoning items. Trade-off: decode is ~43% slower than this repo (4.3 vs 7.5 t/s) and tool calling currently has three open upstream parser bugs. Pick that one for vision / structured reasoning, this one for raw chat + agentic speed.

See the full side-by-side comparison below.


Stack

Layer Version
Host OS Ubuntu 26.04 (container base)
ROCm TheRock 7.13.0a20260424 (S3 nightly; resolves to latest at build time)
llama.cpp upstream HEAD (build pinned to 15fa3c493, 3 commits behind at bench time, all unrelated to HIP/gfx1151)
Model unsloth/Qwen3.6-27B-GGUF, Q8_0 quant
Build flags GGML_HIP=ON + GGML_HIP_ROCWMMA_FATTN=ON + GPU_TARGETS=gfx1151
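
For reference, the HIP source build boils down to roughly the following CMake configuration using the flags listed above. This is a sketch; the authoritative invocation, ROCm paths, and compiler selection live in the Dockerfile.

cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j 4   # -j 4 matches the ~10 min build time quoted below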

Hardware

Tested on: Ryzen AI Max+ 395 / 128 GB UMA (Radeon 8060S iGPU, gfx1151). Kernel ≥ 6.18. Docker with /dev/kfd + /dev/dri access.

This is the only supported configuration today. Other Strix Halo variants (8050S / 8040S / lower RAM) will likely work but haven't been tested.
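
Before the first start, it's worth confirming that the device nodes the container needs actually exist on the host (a quick sanity check; the compose file doesn't do this for you):

ls -l /dev/kfd /dev/dri/renderD*
# both must exist and be accessible to the user running Docker
# (typically via membership in the render / video groups)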

Host memory setup (required, one-time)

Strix Halo is UMA: system RAM and GPU VRAM share the same physical pool. Out of the box the BIOS reserves a fixed chunk as "dedicated VRAM" and the Linux TTM subsystem caps how much of the rest the GPU driver may map as GTT. Both defaults are wrong for this workload: the 27 GiB Q8_0 model plus a 256K KV cache won't fit unless you widen them.

1. BIOS / UEFI: set the dedicated GPU VRAM carve-out to its minimum (2 GB / 2048 MB). You want the GPU to take memory from the shared pool on demand via GTT, not from a fixed-size pre-allocation. Menu name varies by vendor; look for UMA Frame Buffer Size, UMA Buffer Size, iGPU Memory, or GPU Shared Memory.

2. Ubuntu GRUB: raise the TTM page limit so the kernel will actually let the GPU driver map ~116 GiB of GTT. Edit /etc/default/grub, set:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=30408704 amdgpu.noretry=0 amdgpu.gpu_recovery=1"

Then:

sudo update-grub
sudo reboot

Verify after reboot:

cat /sys/class/drm/card1/device/mem_info_gtt_total
# expect ~124554670080  (≈ 116 GiB)

ttm.pages_limit=30408704 is 30,408,704 × 4 KiB pages = 116 GiB, which leaves 12 GiB of the 128 GiB pool for the OS and desktop. amdgpu.noretry=0 + amdgpu.gpu_recovery=1 are stability flags; keep them on for long-running inference.
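
If you want a different split, the arithmetic is just the target GTT size divided by the 4 KiB page size. For example, reproducing the 116 GiB figure used above:

echo $(( 116 * 1024 * 1024 * 1024 / 4096 ))
# 30408704  -> the value to pass as ttm.pages_limit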


Quick start

cp .env.template .env
# edit .env:
#   - LLAMA_HOST_MODELS_DIR: your HF cache directory
#   - HF_TOKEN: optional (the Unsloth GGUF repo is public, not gated)

# Fetch the Q8_0 GGUF into your cache directory
hf download unsloth/Qwen3.6-27B-GGUF \
  --include "*Q8_0*" \
  --cache-dir "$LLAMA_HOST_MODELS_DIR/hub"

# Confirm the snapshot path `hf download` created, then update
# LLAMA_MODEL_FILE in .env to point at the actual file. The sha in the
# template default is pinned to the version used at bench time and
# won't match your snapshot dir.
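# Hypothetical example of the resulting value (the snapshot hash and exact
# filename are placeholders; check what `hf download` actually created):
#   LLAMA_MODEL_FILE=<models-dir>/hub/models--unsloth--Qwen3.6-27B-GGUF/snapshots/<sha>/<Q8_0 gguf filename>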

docker compose up -d --build

# First build: ~10 min (ROCm tarball + llama.cpp HIP source build, -j 4)
# First start: <15s (no JIT, llama.cpp compiles all kernels ahead of time)
# Subsequent starts: <10s

# Verify
curl -s http://127.0.0.1:8080/v1/models | python3 -m json.tool

First boot

Unlike the vLLM build, llama.cpp has no Triton JIT phase. Every HIP kernel is compiled once at Docker-build time (the ~10 min CMake step), then the runtime just mmaps the .so. Cold-start timing:

Phase Time Visible activity
Container init ~2s dynamic linker
Weight load ~7s 28 GiB Q8_0 streaming into UMA (no mmap → full prefetch)
KV alloc + warm ~3s allocate q8_0 cache for 256K slots
Server bind ~1s HTTP server listens on :8080
Total ~13s

No "silent window" to misread as stuck. llama-server logs its load progress continuously. You can issue the first request the moment /health returns 200.


API

Endpoint Purpose
POST /v1/chat/completions OpenAI chat with messages. Supports streaming (stream: true).
POST /v1/completions OpenAI text completion, raw prompt string.
GET /v1/models Lists the loaded GGUF.
GET /health Liveness probe.
GET /props llama.cpp-specific: loaded model metadata, context size, chat template.

The model's native thinking mode (<think>...</think>) is on by default. Add /no_think anywhere in the user message to suppress it for a single turn (Qwen convention); the benchmark's needle tests use this so the answer lands in the visible output instead of inside <think>.
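
For example, a single-turn request with thinking suppressed. This is a sketch: the model value is a placeholder, so use the id that GET /v1/models reports.

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<id from /v1/models>",
        "messages": [
          {"role": "user", "content": "Summarize RAII in one sentence. /no_think"}
        ],
        "max_tokens": 128
      }' | python3 -m json.tool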

No /v1/responses and no vision input. llama.cpp doesn't implement the Responses API, and the Unsloth GGUF is text-only (the vision tower isn't included in the GGUF conversion). For either, use vllm-qwen.


Benchmark

Hardware: Ryzen AI Max+ 395, 128 GB UMA, Ubuntu 26.04, TheRock 7.13 nightly. Model: unsloth/Qwen3.6-27B-GGUF Q8_0, 256K context with q8_0 KV. Warmup once, then numbers averaged over 3 iterations. Temperature 0. Server-reported timings.prompt_per_second and timings.predicted_per_second are the canonical throughput fields.

Shape prompt_tok completion_tok wall (s) prefill t/s decode t/s
Short decode ("Hi.") 12 202 27.1 42.4 7.55
Medium needle (~8K prompt) 8,039 198 66.9 200.0 7.43

Decode is rock-steady at 7.4-7.6 t/s across configs. That's the real Q8_0 ceiling for a 27B model on gfx1151, bound by weight-streaming bandwidth through the UMA, not by compute.
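
The same counters can be pulled from a single ad-hoc request; the timings object is a llama.cpp extension to the OpenAI-style response, so this only works against llama-server (model id is again a placeholder):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<id from /v1/models>","messages":[{"role":"user","content":"Hi. /no_think"}],"max_tokens":64}' \
  | python3 -c "import json,sys; t=json.load(sys.stdin)['timings']; print(t['prompt_per_second'], t['predicted_per_second'])"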

Decode throughput across tuning attempts

I swept a handful of obvious knobs. None of them moved the needle; decode is pure memory bandwidth on this GPU:

Config Short decode (t/s) Medium prefill (t/s)
baseline (defaults) 7.53 196.91
-t 16 -tb 32 (physical cores + full SMT) 7.56 199.97
-b 4096 -ub 1024 (bigger prefill batches) 7.48 188.83
GGML_HIP_UMA=1 (unified memory mode) 7.55 199.17

Bigger prefill batches actually hurt prefill (197 → 188 t/s): the kernel already saturates at its default batch sizes, and the extra synchronization cost dominates.
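
For reference, the sweeps above map onto standard llama-server flags. A rough sketch of the two runtime variants (model path and this repo's other arguments omitted):

# thread sweep: 16 generation threads, 32 prefill/batch threads
llama-server -m <model.gguf> -c 262144 -ctk q8_0 -ctv q8_0 -t 16 -tb 32
# batch sweep: larger logical / micro prefill batches
llama-server -m <model.gguf> -c 262144 -ctk q8_0 -ctv q8_0 -b 4096 -ub 1024
# GGML_HIP_UMA is a build-time CMake option (-DGGML_HIP_UMA=ON), not a runtime flag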

Decode vs. vLLM BF16 on the same hardware

Runtime Precision Weights on disk Decode t/s
vLLM BF16 51.2 GiB 4.30
llama.cpp Q8_0 ~27 GiB 7.50

Q8_0 weights are a bit over half the size of BF16, so each decoded token streams roughly half the bytes, and decode ends up ~75% faster in practice. That is the only material reason to prefer this path: quality-wise the Q8 output is essentially indistinguishable from BF16 for everyday use.
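
Back-of-envelope check, assuming a dense model where every decoded token streams the full weight set: 7.5 t/s × ~27 GiB ≈ 200 GiB/s for this repo, and 4.3 t/s × 51.2 GiB ≈ 220 GiB/s for vLLM. Both runtimes are pushing against roughly the same UMA bandwidth ceiling, so the smaller weights account for essentially the entire speedup.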

Functional verification

Spot-checked the endpoints on the current build to confirm behavior:

Endpoint / feature Result
/v1/chat/completions with thinking ON (300-tok cap) decode 7.5 t/s, prefill 53 t/s on 23-tok prompt
/v1/completions ("The capital of Argentina is") decode 7.9 t/s, returns " Buenos Aires, which is…"
/v1/chat/completions single tool call (get_weather, Tokyo) finish_reason=tool_calls, clean {"city":"Tokyo"}, empty content
/v1/chat/completions parallel tool calls (Tokyo + Rosario) 2 structured calls, zero content leakage
/v1/chat/completions tool call with optional arg clean {"city":"Rosario, Argentina","unit":"celsius"}
/v1/chat/completions with image_url HTTP 500: image input is not supported - hint: if this is unexpected, you may need to provide the mmproj, confirms the GGUF has no vision tower

Tool-call parsing on this path is not affected by the vLLM reasoning/tool-parser bugs (vllm#40783, #40785, #40787); llama.cpp has its own independent extractor.
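
The single-tool-call row above corresponds to a standard OpenAI tools request. A minimal sketch with an illustrative get_weather schema (the server must be running with --jinja, which this setup enables; model id is a placeholder):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<id from /v1/models>",
        "messages": [{"role": "user", "content": "What is the weather in Tokyo right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'

Expect finish_reason=tool_calls, empty content, and arguments of {"city":"Tokyo"}, matching the verification table.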


Reproduce

python3 test/bench.py
# writes test/bench_results.json with full per-run detail

test/bench.py warms up once, then runs the three prompt shapes and records server-reported timings.* counters. No external deps beyond Python 3.


Which should I run: llama-qwen or vllm-qwen?

llama-qwen (this repo) vllm-qwen
Weights format Q8_0 GGUF (Unsloth re-quant) BF16 safetensors (official)
Decode speed ~7.5 t/s 4.3 t/s
Prefill speed (8K) 200 t/s ~38 t/s
Vision input ✘ (GGUF has no mmproj) ✓
/v1/responses + separated reasoning ✘ (not implemented in llama.cpp) ✓
Tool calling (OpenAI format) ✓ (via --jinja, verified clean) ⚠ broken on current vLLM commit (see vllm#40783 / #40785 / #40787)
Context 256K 256K
Memory footprint ~35 GiB total ~105 GiB total
Boot time cold ~13s ~4 min
Build from source needs patches no yes (12 local patches)
Official weights ✘ (Unsloth re-quant) ✓ (Qwen BF16)

Rule of thumb: if you need raw speed, agentic loops, or a fast desktop sidekick → llama-qwen. If you need vision, reasoning separated for structured pipelines, or weights directly from the Qwen team → vllm-qwen.


Known limitations

Target Status Root cause
Vision / image_url ✘ unsupported Unsloth's GGUF conversion doesn't include the Qwen 3.6 vision tower. Fixable upstream in the GGUF export path, not here.
/v1/responses with separated reasoning ✘ unsupported llama.cpp doesn't implement the OpenAI Responses API.
FP8 KV cache n/a RDNA 3.5 has no hardware FP8 path; Q8_0 is already the correct quant for this GPU.

Repo layout

.
├── Dockerfile              multi-stage: Ubuntu + TheRock + llama.cpp HIP source build
├── docker-compose.yml      one service, one model, host-mounted cache
├── .env.template           the one config file you need to edit
├── scripts/
│   └── install_rocm_sdk.sh TheRock S3 nightly tarball → /opt/rocm
└── test/
    └── bench.py            reproducible 3-shape benchmark harness

License

The Unlicense. Public domain dedication: use, modify, distribute, sell, fork, do whatever you want with this code. No attribution required, no warranty given.
