Skip to content

Commit 2d76ff7

Browse files
committed
2 parents 8581474 + 1318034 commit 2d76ff7

1 file changed

Lines changed: 24 additions & 31 deletions

File tree

my-apps/ai/README.md

Lines changed: 24 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
# AI Stack Guide
22

3-
Local AI infrastructure running on dual RTX 3090s (48GB VRAM) + 400GB system RAM.
3+
Local AI infrastructure running on dual RTX 3090s (24GB each) + 400GB system RAM.
4+
Each GPU runs one workload — llama-server on GPU 0, ComfyUI on GPU 1.
45

56
## Architecture
67

@@ -12,39 +13,37 @@ Local AI infrastructure running on dual RTX 3090s (48GB VRAM) + 400GB system RAM
1213
(LLM inference) (image+video) (web search)
1314
port 8080 port 8188
1415
┌──────────┐ ┌──────────────────────┐
15-
4 models │ │ Z-Image-Turbo (t2i) │
16-
via │ │ Qwen-Image-Edit (i2i)│
17-
presets │ │ Wan 2.2 T2V (video) │
18-
└──────────┘ │ Wan 2.2 I2V (video) │
19-
2x RTX 3090 │ Florence-2 (caption) │
20-
(48GB VRAM) │ WD14 Tagger (tags) │
16+
Qwen3.5 │ │ Z-Image-Turbo (t2i) │
17+
35B-A3B │ │ Qwen-Image-Edit (i2i)│
18+
Q4_K_XL │ │ Wan 2.2 T2V (video) │
19+
│ +mmproj │ │ Wan 2.2 I2V (video) │
20+
└──────────┘ │ Florence-2 (caption) │
21+
GPU 0 (24GB) │ WD14 Tagger (tags) │
2122
└──────────────────────┘
22-
1x RTX 3090 (24GB VRAM)
23+
GPU 1 (24GB)
2324
```
2425

25-
## LLM Models (llama-cpp)
26+
## LLM Model (llama-cpp)
2627

27-
All models served via a single `llama-server` with multi-model routing (`--models-max 8`).
28-
Models load on-demand and swap in/out of VRAM.
28+
Single model served via `llama-server` on GPU 0. Qwen3.5-35B-A3B is a MoE model (3B active params)
29+
with native multimodal support (vision + language). Q4_K_XL (20.6GB) + mmproj (858MB) fits in 1x 24GB RTX 3090.
2930

3031
| Preset | Model | Active Params | VRAM | Context | Use Case |
3132
|--------|-------|--------------|------|---------|----------|
32-
| `reasoning - nemotron3-nano` | Nemotron-3-Nano-30B-A3B Q4_K_XL | 3B (MoE) | ~15GB | 32K | Chat, background tasks (title gen, tagging) |
33-
| `coder - qwen3-coder-next` | Qwen3-Coder-Next-80B-A3B Q3_K_XL | 3B (MoE) | ~37GB | 256K | Coding, tool calling, Claude Code CLI |
34-
| `vision - qwen3-vl-thinking` | Qwen3-VL-30B-A3B-Thinking Q8_0 | 3B (MoE) | ~48GB | 32K | Image understanding, OCR |
35-
| `experimental slow - qwen3.5` | Qwen3.5-397B-A17B Q4_K_XL | 17B (MoE) | 48GB+RAM | 128K | General reasoning (~5-15 tok/s, uses cpu-moe) |
33+
| `general - qwen3.5` | Qwen3.5-35B-A3B Q4_K_XL | 3B (MoE) | ~21GB | 16K | General, vision, coding, captioning |
34+
35+
Aliases: `qwen3.5`, `general`, `vision`, `image`, `multimodal`, `coder`, `code`
3636

3737
### Key llama-server Optimizations
3838

3939
| Setting | Value | Why |
4040
|---------|-------|-----|
41-
| `cache-type-k = q8_0` | All models | Halves KV key cache VRAM (~0.002 perplexity cost) |
42-
| `cache-type-v = q4_0` | All models | Thirds KV value cache VRAM (values tolerate aggressive quant) |
43-
| `cpu-moe = 1` | Qwen3.5 only | Keeps attention on GPU, offloads MoE experts to CPU. Much faster than unified memory swapping |
41+
| `cache-type-k = q8_0` | KV cache | Halves KV key cache VRAM (~0.002 perplexity cost) |
42+
| `cache-type-v = q8_0` | KV cache | Quantized KV value cache for VRAM savings |
4443
| `--no-mmap` | Global | Prevents page fault stalls during inference (we have 400GB RAM) |
45-
| `-b 4096 -ub 1024` | Global | Larger batch sizes for faster prompt processing |
44+
| `-b 2048 -ub 512` | Global | Batch sizes for prompt processing |
4645
| `--parallel 1` | Global | Single-user -- maximize VRAM for context, not concurrent slots |
47-
| `CUDA_SCALE_LAUNCH_QUEUES=4x` | Env var | Larger CUDA command buffer for dual-GPU kernel launches |
46+
| `--fit on` | Global | Auto-fit dense layers to available VRAM |
4847

4948
### Using with Claude Code CLI
5049

@@ -54,7 +53,7 @@ llama-server natively supports the Anthropic Messages API at `/v1/messages`. No
5453
export ANTHROPIC_BASE_URL="http://llama.vanillax.me"
5554
export ANTHROPIC_AUTH_TOKEN="no-key-required"
5655
export ANTHROPIC_API_KEY=""
57-
claude --model "coder - qwen3-coder-next"
56+
claude --model "general - qwen3.5"
5857
```
5958

6059
### Using with OpenClaw / Other Tools
@@ -110,8 +109,10 @@ Two options, both pre-installed in the megapak Docker image:
110109

111110
**Workflows**: `workflows/florence2-caption.json`, `workflows/wd14-tagger.json`
112111

113-
For deeper image analysis (visual Q&A, reasoning), use **Qwen3-VL** via Open WebUI chat
112+
For deeper image analysis (visual Q&A, reasoning), use **Qwen3.5** via Open WebUI chat
114113
(upload image -> ask questions). This goes through llama-server, not ComfyUI.
114+
ComfyUI can also call llama-server's vision API directly via the `comfyui-llamacpp-client` node
115+
(URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`).
115116

116117
## Video Generation (ComfyUI)
117118

@@ -237,8 +238,7 @@ The job downloads (skips existing):
237238
### Task Model
238239

239240
Background tasks (title generation, chat tagging, follow-up suggestions) use
240-
`reasoning - nemotron3-nano` -- fast 3B active MoE model. Previously this was set to
241-
`coder - qwen3-coder-next` which was overkill for generating chat titles.
241+
`general - qwen3.5` -- the single consolidated model handles all tasks.
242242

243243
### RAG Tuning
244244

@@ -288,13 +288,6 @@ to be compiled with `-DGGML_CUDA_FA_ALL_QUANTS=ON`. If the pre-built `b8006` ima
288288
have this, flash attention silently falls back to f16 KV cache. Test by checking VRAM usage --
289289
if 262K context still uses ~40GB of KV cache, the flag is missing and you need a newer build.
290290

291-
### Experimental Slow Model (Qwen3.5-397B)
292-
293-
Even with `cpu-moe = 1`, the 397B model will be significantly slower than the 3B-active models
294-
because the 17B active parameters still require substantial compute, and expert weights shuttle
295-
between CPU and GPU. Expect ~5-15 tok/s vs ~70 tok/s for the coder. It's the "quality over speed"
296-
option for complex reasoning. Named "experimental slow" in the model list to set expectations.
297-
298291
### ComfyUI Model Swapping
299292

300293
ComfyUI loads one model at a time into VRAM. Switching between image and video models

0 commit comments

Comments
 (0)