11# AI Stack Guide
22
3- Local AI infrastructure running on dual RTX 3090s (48GB VRAM) + 400GB system RAM.
3+ Local AI infrastructure running on dual RTX 3090s (24GB each) + 400GB system RAM.
4+ Each GPU runs one workload — llama-server on GPU 0, ComfyUI on GPU 1.
45
56## Architecture
67
@@ -12,39 +13,37 @@ Local AI infrastructure running on dual RTX 3090s (48GB VRAM) + 400GB system RAM
1213 (LLM inference) (image+video) (web search)
1314 port 8080 port 8188
1415 ┌──────────┐ ┌──────────────────────┐
15- │ 4 models │ │ Z-Image-Turbo (t2i) │
16- │ via │ │ Qwen-Image-Edit (i2i)│
17- │ presets │ │ Wan 2.2 T2V (video) │
18- └──────────┘ │ Wan 2.2 I2V (video) │
19- 2x RTX 3090 │ Florence-2 (caption) │
20- (48GB VRAM) │ WD14 Tagger (tags) │
16+ │ Qwen3.5 │ │ Z-Image-Turbo (t2i) │
17+ │ 35B-A3B │ │ Qwen-Image-Edit (i2i)│
18+ │ Q4_K_XL │ │ Wan 2.2 T2V (video) │
19+ │ +mmproj │ │ Wan 2.2 I2V (video) │
20+ └──────────┘ │ Florence-2 (caption) │
21+ GPU 0 (24GB) │ WD14 Tagger (tags) │
2122 └──────────────────────┘
22- 1x RTX 3090 (24GB VRAM )
23+ GPU 1 (24GB)
2324```
2425
25- ## LLM Models (llama-cpp)
26+ ## LLM Model (llama-cpp)
2627
27- All models served via a single ` llama-server ` with multi-model routing ( ` --models-max 8 ` ).
28- Models load on-demand and swap in/out of VRAM .
28+ Single model served via ` llama-server ` on GPU 0. Qwen3.5-35B-A3B is a MoE model (3B active params)
29+ with native multimodal support (vision + language). Q4_K_XL (20.6GB) + mmproj (858MB) fits in 1x 24GB RTX 3090 .
2930
3031| Preset | Model | Active Params | VRAM | Context | Use Case |
3132| --------| -------| --------------| ------| ---------| ----------|
32- | ` reasoning - nemotron3-nano ` | Nemotron-3-Nano-30B-A3B Q4_K_XL | 3B (MoE) | ~ 15GB | 32K | Chat, background tasks (title gen, tagging) |
33- | ` coder - qwen3-coder-next ` | Qwen3-Coder-Next-80B-A3B Q3_K_XL | 3B (MoE) | ~ 37GB | 256K | Coding, tool calling, Claude Code CLI |
34- | ` vision - qwen3-vl-thinking ` | Qwen3-VL-30B-A3B-Thinking Q8_0 | 3B (MoE) | ~ 48GB | 32K | Image understanding, OCR |
35- | ` experimental slow - qwen3.5 ` | Qwen3.5-397B-A17B Q4_K_XL | 17B (MoE) | 48GB+RAM | 128K | General reasoning (~ 5-15 tok/s, uses cpu-moe) |
33+ | ` general - qwen3.5 ` | Qwen3.5-35B-A3B Q4_K_XL | 3B (MoE) | ~ 21GB | 16K | General, vision, coding, captioning |
34+
35+ Aliases: ` qwen3.5 ` , ` general ` , ` vision ` , ` image ` , ` multimodal ` , ` coder ` , ` code `
3636
3737### Key llama-server Optimizations
3838
3939| Setting | Value | Why |
4040| ---------| -------| -----|
41- | ` cache-type-k = q8_0 ` | All models | Halves KV key cache VRAM (~ 0.002 perplexity cost) |
42- | ` cache-type-v = q4_0 ` | All models | Thirds KV value cache VRAM (values tolerate aggressive quant) |
43- | ` cpu-moe = 1 ` | Qwen3.5 only | Keeps attention on GPU, offloads MoE experts to CPU. Much faster than unified memory swapping |
41+ | ` cache-type-k = q8_0 ` | KV cache | Halves KV key cache VRAM (~ 0.002 perplexity cost) |
42+ | ` cache-type-v = q8_0 ` | KV cache | Quantized KV value cache for VRAM savings |
4443| ` --no-mmap ` | Global | Prevents page fault stalls during inference (we have 400GB RAM) |
45- | ` -b 4096 -ub 1024 ` | Global | Larger batch sizes for faster prompt processing |
44+ | ` -b 2048 -ub 512 ` | Global | Batch sizes for prompt processing |
4645| ` --parallel 1 ` | Global | Single-user -- maximize VRAM for context, not concurrent slots |
47- | ` CUDA_SCALE_LAUNCH_QUEUES=4x ` | Env var | Larger CUDA command buffer for dual-GPU kernel launches |
46+ | ` --fit on ` | Global | Auto-fit dense layers to available VRAM |
4847
4948### Using with Claude Code CLI
5049
@@ -54,7 +53,7 @@ llama-server natively supports the Anthropic Messages API at `/v1/messages`. No
5453export ANTHROPIC_BASE_URL=" http://llama.vanillax.me"
5554export ANTHROPIC_AUTH_TOKEN=" no-key-required"
5655export ANTHROPIC_API_KEY=" "
57- claude --model " coder - qwen3-coder-next "
56+ claude --model " general - qwen3.5 "
5857```
5958
6059### Using with OpenClaw / Other Tools
@@ -110,8 +109,10 @@ Two options, both pre-installed in the megapak Docker image:
110109
111110** Workflows** : ` workflows/florence2-caption.json ` , ` workflows/wd14-tagger.json `
112111
113- For deeper image analysis (visual Q&A, reasoning), use ** Qwen3-VL ** via Open WebUI chat
112+ For deeper image analysis (visual Q&A, reasoning), use ** Qwen3.5 ** via Open WebUI chat
114113(upload image -> ask questions). This goes through llama-server, not ComfyUI.
114+ ComfyUI can also call llama-server's vision API directly via the ` comfyui-llamacpp-client ` node
115+ (URL: ` http://llama-cpp-service.llama-cpp.svc.cluster.local:8080 ` ).
115116
116117## Video Generation (ComfyUI)
117118
@@ -237,8 +238,7 @@ The job downloads (skips existing):
237238### Task Model
238239
239240Background tasks (title generation, chat tagging, follow-up suggestions) use
240- ` reasoning - nemotron3-nano ` -- fast 3B active MoE model. Previously this was set to
241- ` coder - qwen3-coder-next ` which was overkill for generating chat titles.
241+ ` general - qwen3.5 ` -- the single consolidated model handles all tasks.
242242
243243### RAG Tuning
244244
@@ -288,13 +288,6 @@ to be compiled with `-DGGML_CUDA_FA_ALL_QUANTS=ON`. If the pre-built `b8006` ima
288288have this, flash attention silently falls back to f16 KV cache. Test by checking VRAM usage --
289289if 262K context still uses ~ 40GB of KV cache, the flag is missing and you need a newer build.
290290
291- ### Experimental Slow Model (Qwen3.5-397B)
292-
293- Even with ` cpu-moe = 1 ` , the 397B model will be significantly slower than the 3B-active models
294- because the 17B active parameters still require substantial compute, and expert weights shuttle
295- between CPU and GPU. Expect ~ 5-15 tok/s vs ~ 70 tok/s for the coder. It's the "quality over speed"
296- option for complex reasoning. Named "experimental slow" in the model list to set expectations.
297-
298291### ComfyUI Model Swapping
299292
300293ComfyUI loads one model at a time into VRAM. Switching between image and video models
0 commit comments