Skip to content

Commit 94759a1

Browse files
committed
single gpu llm
1 parent f5361c5 commit 94759a1

5 files changed

Lines changed: 259 additions & 46 deletions

File tree

Lines changed: 236 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,236 @@
1+
# Single-GPU Llama.cpp + ComfyUI Vision Integration
2+
3+
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
4+
5+
**Goal:** Split dual 3090s so llama-cpp uses 1 GPU with Qwen3.5 Q4_K_XL (only model) and ComfyUI uses the other, with ComfyUI able to call llama-cpp for vision/captioning.
6+
7+
**Architecture:** llama-cpp drops from 2 GPUs to 1, switches from dual-model (Coder + Qwen3.5 Q6) to single-model (Qwen3.5 Q4_K_XL with mmproj for vision). Context reduced to 16K for captioning use case. ComfyUI already has `comfyui-llamacpp-client` node installed and requests 1 GPU — no changes needed to ComfyUI deployment. Both pods schedule on the same GPU node, each claiming 1 of 2 GPUs.
8+
9+
**Tech Stack:** llama.cpp server (CUDA), Qwen3.5-35B-A3B multimodal, ComfyUI, comfyui-llamacpp-client node
10+
11+
---
12+
13+
### Task 1: Update llama-cpp ConfigMap — single Qwen3.5 Q4_K_XL preset
14+
15+
**Files:**
16+
- Modify: `my-apps/ai/llama-cpp/configmap.yaml`
17+
18+
**Step 1: Replace configmap with single-model preset**
19+
20+
Replace entire `data.presets.ini` content with:
21+
22+
```ini
23+
# ==========================================================
24+
# QWEN3.5-35B-A3B [MULTIMODAL] — Single GPU (RTX 3090 24GB)
25+
# ==========================================================
26+
[qwen3.5]
27+
# 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention
28+
# Natively multimodal (vision + language)
29+
# Q4_K_XL (20.6GB) + mmproj (858MB) fits in single 24GB 3090
30+
# Feb 27 2026: Updated Unsloth Dynamic 2.0 quant (MXFP4 retired from attention)
31+
# Qwen official "precise" thinking params
32+
model = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
33+
mmproj = /models/mmproj-F16.gguf
34+
alias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, code
35+
ctx-size = 16384
36+
n-gpu-layers = 99
37+
temp = 0.6
38+
top-p = 0.95
39+
top-k = 20
40+
min-p = 0.0
41+
presence-penalty = 0.0
42+
chat-template-kwargs = {"enable_thinking": true}
43+
jinja = 1
44+
```
45+
46+
Key changes:
47+
- Removed Qwen3-Coder-Next preset entirely
48+
- Switched Q6_K_XL → Q4_K_XL
49+
- Removed `tensor-split = 1,1` (single GPU)
50+
- Context 131072 → 16384 (captioning/prompting use case)
51+
- Added `coder, code` aliases so existing API consumers still resolve
52+
53+
---
54+
55+
### Task 2: Update llama-cpp Deployment — single GPU, reduced resources
56+
57+
**Files:**
58+
- Modify: `my-apps/ai/llama-cpp/deployment.yaml`
59+
60+
**Step 1: Update server args**
61+
62+
Change global `-c` from `131072` to `16384`.
63+
64+
Remove `--models-max 8` (only 1 model now — remove or set to `1`).
65+
66+
Remove `-b 4096` and `-ub 1024` (single GPU with smaller context doesn't need oversized batches). Replace with `-b 2048` and `-ub 512`.
67+
68+
Keep: `--models-preset`, `-ngl 99`, `-fa on`, `--jinja`, `--fit on`, `--no-mmap`, `--cache-type-k q8_0`, `--cache-type-v q8_0`, `--parallel 1`, `--host`, `--port`.
69+
70+
**Step 2: Update env vars for single GPU**
71+
72+
```yaml
73+
env:
74+
- name: NVIDIA_VISIBLE_DEVICES
75+
value: "all"
76+
- name: CUDA_VISIBLE_DEVICES
77+
value: "0"
78+
- name: NVIDIA_DRIVER_CAPABILITIES
79+
value: "compute,utility"
80+
- name: GGML_CUDA_ENABLE_UNIFIED_MEMORY
81+
value: "1"
82+
```
83+
84+
Remove:
85+
- `GGML_CUDA_PEER_MAX_BATCH_SIZE` (multi-GPU peer transfer, not needed)
86+
- `CUDA_SCALE_LAUNCH_QUEUES` (multi-GPU launch queue optimization, not needed)
87+
88+
**Step 3: Update resource requests/limits**
89+
90+
```yaml
91+
resources:
92+
limits:
93+
cpu: "32"
94+
memory: 64Gi # Q4_K_XL (20.6GB) + KV cache + overhead, RAM for expert paging
95+
nvidia.com/gpu: "1" # Was 2
96+
ephemeral-storage: "50Gi"
97+
requests:
98+
cpu: "8"
99+
memory: 32Gi
100+
nvidia.com/gpu: "1" # Was 2
101+
ephemeral-storage: "10Gi"
102+
```
103+
104+
**Step 4: Reduce /dev/shm**
105+
106+
Change `sizeLimit: 32Gi` → `sizeLimit: 8Gi` (single GPU, smaller context).
107+
108+
**Step 5: Update comments**
109+
110+
- `terminationGracePeriodSeconds: 300` comment → update from "400GB memory unmapping" to "model unload time"
111+
- `GGML_CUDA_ENABLE_UNIFIED_MEMORY` comment → update to reference single 3090
112+
113+
---
114+
115+
### Task 3: Create vision captioning workflow for ComfyUI
116+
117+
**Files:**
118+
- Create: `my-apps/ai/comfyui/workflows/qwen35-vision-caption.json`
119+
120+
This workflow: Load Image → LlamaCpp Client (vision) → Show Text
121+
122+
The `comfyui-llamacpp-client` node needs the llama-cpp service URL:
123+
`http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
124+
125+
Note: The exact class_type and parameter names depend on the installed version of `comfyui-llamacpp-client`. The workflow should be created in the ComfyUI UI and exported, or verified against the node's actual parameter schema. Create a minimal reference workflow:
126+
127+
```json
128+
{
129+
"1": {
130+
"class_type": "LoadImage",
131+
"inputs": {
132+
"image": "input.png"
133+
}
134+
},
135+
"2": {
136+
"class_type": "LlamaCppClient",
137+
"inputs": {
138+
"server_url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080",
139+
"endpoint": "/v1/chat/completions",
140+
"prompt": "Describe this image in detail for use as a Stable Diffusion prompt. Focus on composition, lighting, colors, style, and subject matter.",
141+
"image": ["1", 0],
142+
"temperature": 0.6,
143+
"top_p": 0.95,
144+
"top_k": 20,
145+
"max_tokens": 512
146+
}
147+
},
148+
"3": {
149+
"class_type": "ShowText|pysssss",
150+
"inputs": {
151+
"text": ["2", 0]
152+
}
153+
}
154+
}
155+
```
156+
157+
**Important:** This workflow JSON is a reference template. The actual node class_type and input names must be verified from the installed `comfyui-llamacpp-client` node in the ComfyUI UI. The user may need to recreate it visually in ComfyUI to match the actual node interface.
158+
159+
---
160+
161+
### Task 4: Update ComfyUI pre-start to copy vision workflow
162+
163+
**Files:**
164+
- Modify: `my-apps/ai/comfyui/configmap.yaml` (the `comfyui-pre-start` ConfigMap)
165+
166+
**Step 1: Add workflow copy to pre-start.sh**
167+
168+
After the WanVideoWrapper workflow copy section, add:
169+
170+
```bash
171+
# ── LlamaCpp Vision Workflows ────────────────────────────
172+
# Copy from ConfigMap-mounted workflows (if available)
173+
LLAMA_WF="/opt/workflows/qwen35-vision-caption.json"
174+
if [ -f "$LLAMA_WF" ]; then
175+
cp -f "$LLAMA_WF" "$DEST/" && \
176+
echo "[INFO] Copied Qwen3.5 vision captioning workflow" || true
177+
fi
178+
```
179+
180+
**Step 2: Mount workflow as ConfigMap in ComfyUI deployment**
181+
182+
Create a new ConfigMap from the workflow JSON and mount it, OR simply document that the workflow should be loaded manually in the ComfyUI UI.
183+
184+
Given that workflows are typically created/edited in the UI and the JSON structure needs verification against the actual node, the simpler approach is: **skip auto-deployment** and have the user create the workflow in ComfyUI UI using these parameters:
185+
- Server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
186+
- Endpoint: `/v1/chat/completions`
187+
- Model alias: `qwen3.5` (or any alias from the preset)
188+
189+
This avoids fragile JSON that might not match the node's actual schema.
190+
191+
**Decision: Skip Task 3 and Task 4.** The workflow JSON depends on the exact node interface which is better created in the UI. Document the connection URL instead.
192+
193+
---
194+
195+
### Task 5: Commit and verify
196+
197+
**Step 1: Commit changes**
198+
199+
```bash
200+
git add my-apps/ai/llama-cpp/configmap.yaml my-apps/ai/llama-cpp/deployment.yaml
201+
git commit -m "feat(llama-cpp): single GPU Qwen3.5 Q4_K_XL, free GPU for ComfyUI
202+
203+
- Drop from 2 GPUs to 1 (frees RTX 3090 for ComfyUI)
204+
- Remove Qwen3-Coder-Next model, use only Qwen3.5-35B-A3B
205+
- Switch Q6_K_XL → Q4_K_XL (20.6GB fits in single 24GB 3090)
206+
- Reduce context 131K → 16K (captioning/prompting use case)
207+
- Remove multi-GPU env vars and tensor-split
208+
- Reduce memory/CPU requests for single-model single-GPU"
209+
```
210+
211+
**Step 2: Verify after ArgoCD sync**
212+
213+
```bash
214+
# Check both pods are running (each on 1 GPU)
215+
kubectl get pods -n llama-cpp
216+
kubectl get pods -n comfyui
217+
218+
# Verify llama-cpp loaded model
219+
kubectl logs -n llama-cpp -l app=llama-cpp-server --tail=50
220+
221+
# Verify GPU allocation (should show 1 GPU each)
222+
kubectl describe node <gpu-node> | grep -A5 "Allocated resources"
223+
224+
# Test vision API
225+
kubectl run -it --rm curl --image=curlimages/curl --restart=Never -- \
226+
curl http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/health
227+
```
228+
229+
**Step 3: Configure ComfyUI llamacpp-client node**
230+
231+
In ComfyUI UI:
232+
1. Add "LlamaCpp Client" node from AI/LlamaCpp category
233+
2. Set server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
234+
3. Connect a LoadImage node to its image input
235+
4. Set prompt: "Describe this image in detail for use as a Stable Diffusion prompt"
236+
5. Connect output to text display or directly to a prompt input

my-apps/ai/llama-cpp/configmap.yaml

Lines changed: 6 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -6,39 +6,20 @@ metadata:
66
data:
77
presets.ini: |
88
# ==========================================================
9-
# CODING (48GB VRAM) - Qwen3-Coder-Next [DEFAULT]
10-
# ==========================================================
11-
[coder - qwen3-coder-next]
12-
# Unsloth docs: temp=1.0, top_p=0.95, top_k=40, min_p=0.01
13-
# 80B total / 3B active (MoE) - Hybrid DeltaNet + Gated Attention
14-
# Primary use: Claude Code backend, OpenClaw coding agent
15-
model = /models/Qwen3-Coder-Next-UD-Q3_K_XL.gguf
16-
alias = coder, code, qwen3-coder, qwen3 coder next
17-
ctx-size = 65536
18-
n-gpu-layers = 99
19-
tensor-split = 1,1
20-
temp = 1.0
21-
top-p = 0.95
22-
top-k = 40
23-
min-p = 0.01
24-
chat-template-kwargs = {"enable_thinking": true}
25-
jinja = 1
26-
27-
# ==========================================================
28-
# GENERAL + VISION - Qwen3.5-35B-A3B [MULTIMODAL]
9+
# GENERAL + VISION + CODE - Qwen3.5-35B-A3B [MULTIMODAL]
10+
# Single GPU (1x RTX 3090) — second GPU freed for ComfyUI
2911
# ==========================================================
3012
[general - qwen3.5]
3113
# 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention
3214
# Natively multimodal (vision + language), 256K context native
33-
# Q6_K_XL (28GB) + mmproj (858MB) fits in 48GB VRAM
15+
# Q4_K_XL (20.6GB) + mmproj (858MB) fits in 1x 24GB RTX 3090
3416
# Qwen official "precise" thinking params: temp=0.6, top_p=0.95, top_k=20, presence_penalty=0.0
3517
# General thinking params use presence_penalty=1.5 but causes thinking loops on simple questions
36-
model = /models/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf
18+
model = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
3719
mmproj = /models/mmproj-F16.gguf
38-
alias = qwen3.5, qwen 3.5, general, vision, image, multimodal
39-
ctx-size = 131072
20+
alias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, code
21+
ctx-size = 16384
4022
n-gpu-layers = 99
41-
tensor-split = 1,1
4223
temp = 0.6
4324
top-p = 0.95
4425
top-k = 20

my-apps/ai/llama-cpp/deployment.yaml

Lines changed: 15 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -38,11 +38,11 @@ spec:
3838
command: ["/app/llama-server"]
3939
args:
4040
- "--models-max"
41-
- "8"
41+
- "1"
4242
- "--models-preset"
4343
- "/config/presets.ini"
4444
- "-c"
45-
- "131072"
45+
- "16384"
4646
- "-ngl"
4747
- "99"
4848
- "-fa"
@@ -56,9 +56,9 @@ spec:
5656
- "--cache-type-v"
5757
- "q8_0"
5858
- "-b"
59-
- "4096" # Larger logical batch for faster prompt processing
59+
- "2048" # Logical batch for prompt processing
6060
- "-ub"
61-
- "1024" # Larger physical batch for better GPU saturation
61+
- "512" # Physical batch for GPU saturation
6262
- "--parallel"
6363
- "1" # Single-user coding assistant - maximize context per request
6464
- "--host"
@@ -69,15 +69,11 @@ spec:
6969
- name: NVIDIA_VISIBLE_DEVICES
7070
value: "all"
7171
- name: CUDA_VISIBLE_DEVICES
72-
value: "0,1"
72+
value: "0"
7373
- name: NVIDIA_DRIVER_CAPABILITIES
7474
value: "compute,utility"
7575
- name: GGML_CUDA_ENABLE_UNIFIED_MEMORY
76-
value: "1" # Bridges VRAM and 400GB RAM for Qwen3.5-397B MoE expert offloading
77-
- name: GGML_CUDA_PEER_MAX_BATCH_SIZE
78-
value: "128"
79-
- name: CUDA_SCALE_LAUNCH_QUEUES
80-
value: "4x" # Larger command buffer reduces kernel launch bottleneck on multi-GPU
76+
value: "1" # Spill to system RAM if VRAM is tight with Q4_K_XL + KV cache
8177
ports:
8278
- name: http
8379
containerPort: 8080
@@ -108,15 +104,15 @@ spec:
108104
failureThreshold: 3
109105
resources:
110106
limits:
111-
cpu: "52"
112-
memory: 440Gi # Matches your physical 400GB + overhead
113-
nvidia.com/gpu: "2"
114-
ephemeral-storage: "150Gi"
107+
cpu: "32"
108+
memory: 64Gi
109+
nvidia.com/gpu: "1"
110+
ephemeral-storage: "50Gi"
115111
requests:
116-
cpu: "24" # Optimized for MoE expert fetching on CPU
117-
memory: 390Gi
118-
nvidia.com/gpu: "2"
119-
ephemeral-storage: "25Gi"
112+
cpu: "8"
113+
memory: 32Gi
114+
nvidia.com/gpu: "1"
115+
ephemeral-storage: "10Gi"
120116
volumeMounts:
121117
- name: models-storage
122118
mountPath: /models
@@ -138,6 +134,6 @@ spec:
138134
- name: dshm
139135
emptyDir:
140136
medium: Memory
141-
sizeLimit: 32Gi
137+
sizeLimit: 8Gi
142138
nodeSelector:
143139
feature.node.kubernetes.io/pci-0300_10de.present: "true"

my-apps/ai/open-webui/configmap.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ data:
2121
VISION_MODELS: "general - qwen3.5"
2222

2323
# Default parameters (Qwen3.5 precise thinking: temp=0.6, top_p=0.95, top_k=20)
24-
CONTEXT_WINDOW: "131072"
24+
CONTEXT_WINDOW: "16384"
2525
TEMPERATURE: "0.6"
2626
TOP_P: "0.95"
2727
MIN_P: "0.0"

my-apps/development/n8n/workflows/daily-cluster-report.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -217,7 +217,7 @@
217217
"url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/v1/chat/completions",
218218
"sendBody": true,
219219
"specifyBody": "json",
220-
"jsonBody": "={{ JSON.stringify({ model: 'coder - qwen3-coder-next', messages: [ { role: 'system', content: 'You are a Kubernetes cluster health reporter. Summarize the following metrics into a concise daily report. Highlight any issues, warnings, or anomalies. Use these sections: **Nodes** (readiness, CPU, memory, disk), **Pods** (restarts, failures), **Storage** (Longhorn usage), **ArgoCD Apps** (sync/health status), **Backups** (PVC Plumber health). Keep it brief and actionable. Use markdown formatting.' }, { role: 'user', content: $json.metricsPayload } ], max_tokens: 2048, temperature: 0.3 }) }}",
220+
"jsonBody": "={{ JSON.stringify({ model: 'general - qwen3.5', messages: [ { role: 'system', content: 'You are a Kubernetes cluster health reporter. Summarize the following metrics into a concise daily report. Highlight any issues, warnings, or anomalies. Use these sections: **Nodes** (readiness, CPU, memory, disk), **Pods** (restarts, failures), **Storage** (Longhorn usage), **ArgoCD Apps** (sync/health status), **Backups** (PVC Plumber health). Keep it brief and actionable. Use markdown formatting.' }, { role: 'user', content: $json.metricsPayload } ], max_tokens: 2048, temperature: 0.3 }) }}",
221221
"options": {}
222222
},
223223
"id": "d0a1b2c3-0000-4000-8000-000000000012",

0 commit comments

Comments
 (0)