|
| 1 | +# Single-GPU Llama.cpp + ComfyUI Vision Integration |
| 2 | + |
| 3 | +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. |
| 4 | +
|
| 5 | +**Goal:** Split dual 3090s so llama-cpp uses 1 GPU with Qwen3.5 Q4_K_XL (only model) and ComfyUI uses the other, with ComfyUI able to call llama-cpp for vision/captioning. |
| 6 | + |
| 7 | +**Architecture:** llama-cpp drops from 2 GPUs to 1, switches from dual-model (Coder + Qwen3.5 Q6) to single-model (Qwen3.5 Q4_K_XL with mmproj for vision). Context reduced to 16K for captioning use case. ComfyUI already has `comfyui-llamacpp-client` node installed and requests 1 GPU — no changes needed to ComfyUI deployment. Both pods schedule on the same GPU node, each claiming 1 of 2 GPUs. |
| 8 | + |
| 9 | +**Tech Stack:** llama.cpp server (CUDA), Qwen3.5-35B-A3B multimodal, ComfyUI, comfyui-llamacpp-client node |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +### Task 1: Update llama-cpp ConfigMap — single Qwen3.5 Q4_K_XL preset |
| 14 | + |
| 15 | +**Files:** |
| 16 | +- Modify: `my-apps/ai/llama-cpp/configmap.yaml` |
| 17 | + |
| 18 | +**Step 1: Replace configmap with single-model preset** |
| 19 | + |
| 20 | +Replace entire `data.presets.ini` content with: |
| 21 | + |
| 22 | +```ini |
| 23 | +# ========================================================== |
| 24 | +# QWEN3.5-35B-A3B [MULTIMODAL] — Single GPU (RTX 3090 24GB) |
| 25 | +# ========================================================== |
| 26 | +[qwen3.5] |
| 27 | +# 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention |
| 28 | +# Natively multimodal (vision + language) |
| 29 | +# Q4_K_XL (20.6GB) + mmproj (858MB) fits in single 24GB 3090 |
| 30 | +# Feb 27 2026: Updated Unsloth Dynamic 2.0 quant (MXFP4 retired from attention) |
| 31 | +# Qwen official "precise" thinking params |
| 32 | +model = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf |
| 33 | +mmproj = /models/mmproj-F16.gguf |
| 34 | +alias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, code |
| 35 | +ctx-size = 16384 |
| 36 | +n-gpu-layers = 99 |
| 37 | +temp = 0.6 |
| 38 | +top-p = 0.95 |
| 39 | +top-k = 20 |
| 40 | +min-p = 0.0 |
| 41 | +presence-penalty = 0.0 |
| 42 | +chat-template-kwargs = {"enable_thinking": true} |
| 43 | +jinja = 1 |
| 44 | +``` |
| 45 | + |
| 46 | +Key changes: |
| 47 | +- Removed Qwen3-Coder-Next preset entirely |
| 48 | +- Switched Q6_K_XL → Q4_K_XL |
| 49 | +- Removed `tensor-split = 1,1` (single GPU) |
| 50 | +- Context 131072 → 16384 (captioning/prompting use case) |
| 51 | +- Added `coder, code` aliases so existing API consumers still resolve |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +### Task 2: Update llama-cpp Deployment — single GPU, reduced resources |
| 56 | + |
| 57 | +**Files:** |
| 58 | +- Modify: `my-apps/ai/llama-cpp/deployment.yaml` |
| 59 | + |
| 60 | +**Step 1: Update server args** |
| 61 | + |
| 62 | +Change global `-c` from `131072` to `16384`. |
| 63 | + |
| 64 | +Remove `--models-max 8` (only 1 model now — remove or set to `1`). |
| 65 | + |
| 66 | +Remove `-b 4096` and `-ub 1024` (single GPU with smaller context doesn't need oversized batches). Replace with `-b 2048` and `-ub 512`. |
| 67 | + |
| 68 | +Keep: `--models-preset`, `-ngl 99`, `-fa on`, `--jinja`, `--fit on`, `--no-mmap`, `--cache-type-k q8_0`, `--cache-type-v q8_0`, `--parallel 1`, `--host`, `--port`. |
| 69 | + |
| 70 | +**Step 2: Update env vars for single GPU** |
| 71 | + |
| 72 | +```yaml |
| 73 | +env: |
| 74 | + - name: NVIDIA_VISIBLE_DEVICES |
| 75 | + value: "all" |
| 76 | + - name: CUDA_VISIBLE_DEVICES |
| 77 | + value: "0" |
| 78 | + - name: NVIDIA_DRIVER_CAPABILITIES |
| 79 | + value: "compute,utility" |
| 80 | + - name: GGML_CUDA_ENABLE_UNIFIED_MEMORY |
| 81 | + value: "1" |
| 82 | +``` |
| 83 | +
|
| 84 | +Remove: |
| 85 | +- `GGML_CUDA_PEER_MAX_BATCH_SIZE` (multi-GPU peer transfer, not needed) |
| 86 | +- `CUDA_SCALE_LAUNCH_QUEUES` (multi-GPU launch queue optimization, not needed) |
| 87 | + |
| 88 | +**Step 3: Update resource requests/limits** |
| 89 | + |
| 90 | +```yaml |
| 91 | +resources: |
| 92 | + limits: |
| 93 | + cpu: "32" |
| 94 | + memory: 64Gi # Q4_K_XL (20.6GB) + KV cache + overhead, RAM for expert paging |
| 95 | + nvidia.com/gpu: "1" # Was 2 |
| 96 | + ephemeral-storage: "50Gi" |
| 97 | + requests: |
| 98 | + cpu: "8" |
| 99 | + memory: 32Gi |
| 100 | + nvidia.com/gpu: "1" # Was 2 |
| 101 | + ephemeral-storage: "10Gi" |
| 102 | +``` |
| 103 | + |
| 104 | +**Step 4: Reduce /dev/shm** |
| 105 | + |
| 106 | +Change `sizeLimit: 32Gi` → `sizeLimit: 8Gi` (single GPU, smaller context). |
| 107 | + |
| 108 | +**Step 5: Update comments** |
| 109 | + |
| 110 | +- `terminationGracePeriodSeconds: 300` comment → update from "400GB memory unmapping" to "model unload time" |
| 111 | +- `GGML_CUDA_ENABLE_UNIFIED_MEMORY` comment → update to reference single 3090 |
| 112 | + |
| 113 | +--- |
| 114 | + |
| 115 | +### Task 3: Create vision captioning workflow for ComfyUI |
| 116 | + |
| 117 | +**Files:** |
| 118 | +- Create: `my-apps/ai/comfyui/workflows/qwen35-vision-caption.json` |
| 119 | + |
| 120 | +This workflow: Load Image → LlamaCpp Client (vision) → Show Text |
| 121 | + |
| 122 | +The `comfyui-llamacpp-client` node needs the llama-cpp service URL: |
| 123 | +`http://llama-cpp-service.llama-cpp.svc.cluster.local:8080` |
| 124 | + |
| 125 | +Note: The exact class_type and parameter names depend on the installed version of `comfyui-llamacpp-client`. The workflow should be created in the ComfyUI UI and exported, or verified against the node's actual parameter schema. Create a minimal reference workflow: |
| 126 | + |
| 127 | +```json |
| 128 | +{ |
| 129 | + "1": { |
| 130 | + "class_type": "LoadImage", |
| 131 | + "inputs": { |
| 132 | + "image": "input.png" |
| 133 | + } |
| 134 | + }, |
| 135 | + "2": { |
| 136 | + "class_type": "LlamaCppClient", |
| 137 | + "inputs": { |
| 138 | + "server_url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080", |
| 139 | + "endpoint": "/v1/chat/completions", |
| 140 | + "prompt": "Describe this image in detail for use as a Stable Diffusion prompt. Focus on composition, lighting, colors, style, and subject matter.", |
| 141 | + "image": ["1", 0], |
| 142 | + "temperature": 0.6, |
| 143 | + "top_p": 0.95, |
| 144 | + "top_k": 20, |
| 145 | + "max_tokens": 512 |
| 146 | + } |
| 147 | + }, |
| 148 | + "3": { |
| 149 | + "class_type": "ShowText|pysssss", |
| 150 | + "inputs": { |
| 151 | + "text": ["2", 0] |
| 152 | + } |
| 153 | + } |
| 154 | +} |
| 155 | +``` |
| 156 | + |
| 157 | +**Important:** This workflow JSON is a reference template. The actual node class_type and input names must be verified from the installed `comfyui-llamacpp-client` node in the ComfyUI UI. The user may need to recreate it visually in ComfyUI to match the actual node interface. |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +### Task 4: Update ComfyUI pre-start to copy vision workflow |
| 162 | + |
| 163 | +**Files:** |
| 164 | +- Modify: `my-apps/ai/comfyui/configmap.yaml` (the `comfyui-pre-start` ConfigMap) |
| 165 | + |
| 166 | +**Step 1: Add workflow copy to pre-start.sh** |
| 167 | + |
| 168 | +After the WanVideoWrapper workflow copy section, add: |
| 169 | + |
| 170 | +```bash |
| 171 | +# ── LlamaCpp Vision Workflows ──────────────────────────── |
| 172 | +# Copy from ConfigMap-mounted workflows (if available) |
| 173 | +LLAMA_WF="/opt/workflows/qwen35-vision-caption.json" |
| 174 | +if [ -f "$LLAMA_WF" ]; then |
| 175 | + cp -f "$LLAMA_WF" "$DEST/" && \ |
| 176 | + echo "[INFO] Copied Qwen3.5 vision captioning workflow" || true |
| 177 | +fi |
| 178 | +``` |
| 179 | + |
| 180 | +**Step 2: Mount workflow as ConfigMap in ComfyUI deployment** |
| 181 | + |
| 182 | +Create a new ConfigMap from the workflow JSON and mount it, OR simply document that the workflow should be loaded manually in the ComfyUI UI. |
| 183 | + |
| 184 | +Given that workflows are typically created/edited in the UI and the JSON structure needs verification against the actual node, the simpler approach is: **skip auto-deployment** and have the user create the workflow in ComfyUI UI using these parameters: |
| 185 | +- Server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080` |
| 186 | +- Endpoint: `/v1/chat/completions` |
| 187 | +- Model alias: `qwen3.5` (or any alias from the preset) |
| 188 | + |
| 189 | +This avoids fragile JSON that might not match the node's actual schema. |
| 190 | + |
| 191 | +**Decision: Skip Task 3 and Task 4.** The workflow JSON depends on the exact node interface which is better created in the UI. Document the connection URL instead. |
| 192 | + |
| 193 | +--- |
| 194 | + |
| 195 | +### Task 5: Commit and verify |
| 196 | + |
| 197 | +**Step 1: Commit changes** |
| 198 | + |
| 199 | +```bash |
| 200 | +git add my-apps/ai/llama-cpp/configmap.yaml my-apps/ai/llama-cpp/deployment.yaml |
| 201 | +git commit -m "feat(llama-cpp): single GPU Qwen3.5 Q4_K_XL, free GPU for ComfyUI |
| 202 | +
|
| 203 | +- Drop from 2 GPUs to 1 (frees RTX 3090 for ComfyUI) |
| 204 | +- Remove Qwen3-Coder-Next model, use only Qwen3.5-35B-A3B |
| 205 | +- Switch Q6_K_XL → Q4_K_XL (20.6GB fits in single 24GB 3090) |
| 206 | +- Reduce context 131K → 16K (captioning/prompting use case) |
| 207 | +- Remove multi-GPU env vars and tensor-split |
| 208 | +- Reduce memory/CPU requests for single-model single-GPU" |
| 209 | +``` |
| 210 | + |
| 211 | +**Step 2: Verify after ArgoCD sync** |
| 212 | + |
| 213 | +```bash |
| 214 | +# Check both pods are running (each on 1 GPU) |
| 215 | +kubectl get pods -n llama-cpp |
| 216 | +kubectl get pods -n comfyui |
| 217 | +
|
| 218 | +# Verify llama-cpp loaded model |
| 219 | +kubectl logs -n llama-cpp -l app=llama-cpp-server --tail=50 |
| 220 | +
|
| 221 | +# Verify GPU allocation (should show 1 GPU each) |
| 222 | +kubectl describe node <gpu-node> | grep -A5 "Allocated resources" |
| 223 | +
|
| 224 | +# Test vision API |
| 225 | +kubectl run -it --rm curl --image=curlimages/curl --restart=Never -- \ |
| 226 | + curl http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/health |
| 227 | +``` |
| 228 | + |
| 229 | +**Step 3: Configure ComfyUI llamacpp-client node** |
| 230 | + |
| 231 | +In ComfyUI UI: |
| 232 | +1. Add "LlamaCpp Client" node from AI/LlamaCpp category |
| 233 | +2. Set server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080` |
| 234 | +3. Connect a LoadImage node to its image input |
| 235 | +4. Set prompt: "Describe this image in detail for use as a Stable Diffusion prompt" |
| 236 | +5. Connect output to text display or directly to a prompt input |
0 commit comments