single gpu llm

mitchross · mitchross · commit 94759a1040fa · 2026-02-28T14:41:43.000-05:00
diff --git a/docs/plans/2026-02-28-single-gpu-llamacpp-comfyui-vision.md b/docs/plans/2026-02-28-single-gpu-llamacpp-comfyui-vision.md
@@ -0,0 +1,236 @@
+# Single-GPU Llama.cpp + ComfyUI Vision Integration
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Split dual 3090s so llama-cpp uses 1 GPU with Qwen3.5 Q4_K_XL (only model) and ComfyUI uses the other, with ComfyUI able to call llama-cpp for vision/captioning.
+
+**Architecture:** llama-cpp drops from 2 GPUs to 1, switches from dual-model (Coder + Qwen3.5 Q6) to single-model (Qwen3.5 Q4_K_XL with mmproj for vision). Context reduced to 16K for captioning use case. ComfyUI already has `comfyui-llamacpp-client` node installed and requests 1 GPU — no changes needed to ComfyUI deployment. Both pods schedule on the same GPU node, each claiming 1 of 2 GPUs.
+
+**Tech Stack:** llama.cpp server (CUDA), Qwen3.5-35B-A3B multimodal, ComfyUI, comfyui-llamacpp-client node
+
+---
+
+### Task 1: Update llama-cpp ConfigMap — single Qwen3.5 Q4_K_XL preset
+
+**Files:**
+- Modify: `my-apps/ai/llama-cpp/configmap.yaml`
+
+**Step 1: Replace configmap with single-model preset**
+
+Replace entire `data.presets.ini` content with:
+
+```ini
+# ==========================================================
+# QWEN3.5-35B-A3B [MULTIMODAL] — Single GPU (RTX 3090 24GB)
+# ==========================================================
+[qwen3.5]
+# 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention
+# Natively multimodal (vision + language)
+# Q4_K_XL (20.6GB) + mmproj (858MB) fits in single 24GB 3090
+# Feb 27 2026: Updated Unsloth Dynamic 2.0 quant (MXFP4 retired from attention)
+# Qwen official "precise" thinking params
+model = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
+mmproj = /models/mmproj-F16.gguf
+alias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, code
+ctx-size = 16384
+n-gpu-layers = 99
+temp = 0.6
+top-p = 0.95
+top-k = 20
+min-p = 0.0
+presence-penalty = 0.0
+chat-template-kwargs = {"enable_thinking": true}
+jinja = 1
+```
+
+Key changes:
+- Removed Qwen3-Coder-Next preset entirely
+- Switched Q6_K_XL → Q4_K_XL
+- Removed `tensor-split = 1,1` (single GPU)
+- Context 131072 → 16384 (captioning/prompting use case)
+- Added `coder, code` aliases so existing API consumers still resolve
+
+---
+
+### Task 2: Update llama-cpp Deployment — single GPU, reduced resources
+
+**Files:**
+- Modify: `my-apps/ai/llama-cpp/deployment.yaml`
+
+**Step 1: Update server args**
+
+Change global `-c` from `131072` to `16384`.
+
+Remove `--models-max 8` (only 1 model now — remove or set to `1`).
+
+Remove `-b 4096` and `-ub 1024` (single GPU with smaller context doesn't need oversized batches). Replace with `-b 2048` and `-ub 512`.
+
+Keep: `--models-preset`, `-ngl 99`, `-fa on`, `--jinja`, `--fit on`, `--no-mmap`, `--cache-type-k q8_0`, `--cache-type-v q8_0`, `--parallel 1`, `--host`, `--port`.
+
+**Step 2: Update env vars for single GPU**
+
+```yaml
+env:
+  - name: NVIDIA_VISIBLE_DEVICES
+    value: "all"
+  - name: CUDA_VISIBLE_DEVICES
+    value: "0"
+  - name: NVIDIA_DRIVER_CAPABILITIES
+    value: "compute,utility"
+  - name: GGML_CUDA_ENABLE_UNIFIED_MEMORY
+    value: "1"
+```
+
+Remove:
+- `GGML_CUDA_PEER_MAX_BATCH_SIZE` (multi-GPU peer transfer, not needed)
+- `CUDA_SCALE_LAUNCH_QUEUES` (multi-GPU launch queue optimization, not needed)
+
+**Step 3: Update resource requests/limits**
+
+```yaml
+resources:
+  limits:
+    cpu: "32"
+    memory: 64Gi        # Q4_K_XL (20.6GB) + KV cache + overhead, RAM for expert paging
+    nvidia.com/gpu: "1"  # Was 2
+    ephemeral-storage: "50Gi"
+  requests:
+    cpu: "8"
+    memory: 32Gi
+    nvidia.com/gpu: "1"  # Was 2
+    ephemeral-storage: "10Gi"
+```
+
+**Step 4: Reduce /dev/shm**
+
+Change `sizeLimit: 32Gi` → `sizeLimit: 8Gi` (single GPU, smaller context).
+
+**Step 5: Update comments**
+
+- `terminationGracePeriodSeconds: 300` comment → update from "400GB memory unmapping" to "model unload time"
+- `GGML_CUDA_ENABLE_UNIFIED_MEMORY` comment → update to reference single 3090
+
+---
+
+### Task 3: Create vision captioning workflow for ComfyUI
+
+**Files:**
+- Create: `my-apps/ai/comfyui/workflows/qwen35-vision-caption.json`
+
+This workflow: Load Image → LlamaCpp Client (vision) → Show Text
+
+The `comfyui-llamacpp-client` node needs the llama-cpp service URL:
+`http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
+
+Note: The exact class_type and parameter names depend on the installed version of `comfyui-llamacpp-client`. The workflow should be created in the ComfyUI UI and exported, or verified against the node's actual parameter schema. Create a minimal reference workflow:
+
+```json
+{
+  "1": {
+    "class_type": "LoadImage",
+    "inputs": {
+      "image": "input.png"
+    }
+  },
+  "2": {
+    "class_type": "LlamaCppClient",
+    "inputs": {
+      "server_url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080",
+      "endpoint": "/v1/chat/completions",
+      "prompt": "Describe this image in detail for use as a Stable Diffusion prompt. Focus on composition, lighting, colors, style, and subject matter.",
+      "image": ["1", 0],
+      "temperature": 0.6,
+      "top_p": 0.95,
+      "top_k": 20,
+      "max_tokens": 512
+    }
+  },
+  "3": {
+    "class_type": "ShowText|pysssss",
+    "inputs": {
+      "text": ["2", 0]
+    }
+  }
+}
+```
+
+**Important:** This workflow JSON is a reference template. The actual node class_type and input names must be verified from the installed `comfyui-llamacpp-client` node in the ComfyUI UI. The user may need to recreate it visually in ComfyUI to match the actual node interface.
+
+---
+
+### Task 4: Update ComfyUI pre-start to copy vision workflow
+
+**Files:**
+- Modify: `my-apps/ai/comfyui/configmap.yaml` (the `comfyui-pre-start` ConfigMap)
+
+**Step 1: Add workflow copy to pre-start.sh**
+
+After the WanVideoWrapper workflow copy section, add:
+
+```bash
+# ── LlamaCpp Vision Workflows ────────────────────────────
+# Copy from ConfigMap-mounted workflows (if available)
+LLAMA_WF="/opt/workflows/qwen35-vision-caption.json"
+if [ -f "$LLAMA_WF" ]; then
+  cp -f "$LLAMA_WF" "$DEST/" && \
+    echo "[INFO] Copied Qwen3.5 vision captioning workflow" || true
+fi
+```
+
+**Step 2: Mount workflow as ConfigMap in ComfyUI deployment**
+
+Create a new ConfigMap from the workflow JSON and mount it, OR simply document that the workflow should be loaded manually in the ComfyUI UI.
+
+Given that workflows are typically created/edited in the UI and the JSON structure needs verification against the actual node, the simpler approach is: **skip auto-deployment** and have the user create the workflow in ComfyUI UI using these parameters:
+- Server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
+- Endpoint: `/v1/chat/completions`
+- Model alias: `qwen3.5` (or any alias from the preset)
+
+This avoids fragile JSON that might not match the node's actual schema.
+
+**Decision: Skip Task 3 and Task 4.** The workflow JSON depends on the exact node interface which is better created in the UI. Document the connection URL instead.
+
+---
+
+### Task 5: Commit and verify
+
+**Step 1: Commit changes**
+
+```bash
+git add my-apps/ai/llama-cpp/configmap.yaml my-apps/ai/llama-cpp/deployment.yaml
+git commit -m "feat(llama-cpp): single GPU Qwen3.5 Q4_K_XL, free GPU for ComfyUI
+
+- Drop from 2 GPUs to 1 (frees RTX 3090 for ComfyUI)
+- Remove Qwen3-Coder-Next model, use only Qwen3.5-35B-A3B
+- Switch Q6_K_XL → Q4_K_XL (20.6GB fits in single 24GB 3090)
+- Reduce context 131K → 16K (captioning/prompting use case)
+- Remove multi-GPU env vars and tensor-split
+- Reduce memory/CPU requests for single-model single-GPU"
+```
+
+**Step 2: Verify after ArgoCD sync**
+
+```bash
+# Check both pods are running (each on 1 GPU)
+kubectl get pods -n llama-cpp
+kubectl get pods -n comfyui
+
+# Verify llama-cpp loaded model
+kubectl logs -n llama-cpp -l app=llama-cpp-server --tail=50
+
+# Verify GPU allocation (should show 1 GPU each)
+kubectl describe node <gpu-node> | grep -A5 "Allocated resources"
+
+# Test vision API
+kubectl run -it --rm curl --image=curlimages/curl --restart=Never -- \
+  curl http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/health
+```
+
+**Step 3: Configure ComfyUI llamacpp-client node**
+
+In ComfyUI UI:
+1. Add "LlamaCpp Client" node from AI/LlamaCpp category
+2. Set server URL: `http://llama-cpp-service.llama-cpp.svc.cluster.local:8080`
+3. Connect a LoadImage node to its image input
+4. Set prompt: "Describe this image in detail for use as a Stable Diffusion prompt"
+5. Connect output to text display or directly to a prompt input
diff --git a/my-apps/ai/llama-cpp/configmap.yaml b/my-apps/ai/llama-cpp/configmap.yaml
@@ -6,39 +6,20 @@ metadata:
 data:
   presets.ini: |
     # ==========================================================
-    # CODING (48GB VRAM) - Qwen3-Coder-Next [DEFAULT]
-    # ==========================================================
-    [coder - qwen3-coder-next]
-    # Unsloth docs: temp=1.0, top_p=0.95, top_k=40, min_p=0.01
-    # 80B total / 3B active (MoE) - Hybrid DeltaNet + Gated Attention
-    # Primary use: Claude Code backend, OpenClaw coding agent
-    model = /models/Qwen3-Coder-Next-UD-Q3_K_XL.gguf
-    alias = coder, code, qwen3-coder, qwen3 coder next
-    ctx-size = 65536
-    n-gpu-layers = 99
-    tensor-split = 1,1
-    temp = 1.0
-    top-p = 0.95
-    top-k = 40
-    min-p = 0.01
-    chat-template-kwargs = {"enable_thinking": true}
-    jinja = 1
-
-    # ==========================================================
-    # GENERAL + VISION - Qwen3.5-35B-A3B [MULTIMODAL]
+    # GENERAL + VISION + CODE - Qwen3.5-35B-A3B [MULTIMODAL]
+    # Single GPU (1x RTX 3090) — second GPU freed for ComfyUI
     # ==========================================================
     [general - qwen3.5]
     # 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention
     # Natively multimodal (vision + language), 256K context native
-    # Q6_K_XL (28GB) + mmproj (858MB) fits in 48GB VRAM
+    # Q4_K_XL (20.6GB) + mmproj (858MB) fits in 1x 24GB RTX 3090
     # Qwen official "precise" thinking params: temp=0.6, top_p=0.95, top_k=20, presence_penalty=0.0
     # General thinking params use presence_penalty=1.5 but causes thinking loops on simple questions
-    model = /models/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf
+    model = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
     mmproj = /models/mmproj-F16.gguf
-    alias = qwen3.5, qwen 3.5, general, vision, image, multimodal
-    ctx-size = 131072
+    alias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, code
+    ctx-size = 16384
     n-gpu-layers = 99
-    tensor-split = 1,1
     temp = 0.6
     top-p = 0.95
     top-k = 20
diff --git a/my-apps/ai/llama-cpp/deployment.yaml b/my-apps/ai/llama-cpp/deployment.yaml
@@ -38,11 +38,11 @@ spec:
           command: ["/app/llama-server"]
           args:
             - "--models-max"
-            - "8"
+            - "1"
             - "--models-preset"
             - "/config/presets.ini"
             - "-c"
-            - "131072"
+            - "16384"
             - "-ngl"
             - "99"
             - "-fa"
@@ -56,9 +56,9 @@ spec:
             - "--cache-type-v"
             - "q8_0"
             - "-b"
-            - "4096"            # Larger logical batch for faster prompt processing
+            - "2048"            # Logical batch for prompt processing
             - "-ub"
-            - "1024"            # Larger physical batch for better GPU saturation
+            - "512"             # Physical batch for GPU saturation
             - "--parallel"
             - "1"               # Single-user coding assistant - maximize context per request
             - "--host"
@@ -69,15 +69,11 @@ spec:
             - name: NVIDIA_VISIBLE_DEVICES
               value: "all"
             - name: CUDA_VISIBLE_DEVICES
-              value: "0,1"
+              value: "0"
             - name: NVIDIA_DRIVER_CAPABILITIES
               value: "compute,utility"
             - name: GGML_CUDA_ENABLE_UNIFIED_MEMORY
-              value: "1" # Bridges VRAM and 400GB RAM for Qwen3.5-397B MoE expert offloading
-            - name: GGML_CUDA_PEER_MAX_BATCH_SIZE
-              value: "128"
-            - name: CUDA_SCALE_LAUNCH_QUEUES
-              value: "4x"   # Larger command buffer reduces kernel launch bottleneck on multi-GPU
+              value: "1" # Spill to system RAM if VRAM is tight with Q4_K_XL + KV cache
           ports:
             - name: http
               containerPort: 8080
@@ -108,15 +104,15 @@ spec:
             failureThreshold: 3
           resources:
             limits:
-              cpu: "52"
-              memory: 440Gi   # Matches your physical 400GB + overhead
-              nvidia.com/gpu: "2"
-              ephemeral-storage: "150Gi"
+              cpu: "32"
+              memory: 64Gi
+              nvidia.com/gpu: "1"
+              ephemeral-storage: "50Gi"
             requests:
-              cpu: "24"       # Optimized for MoE expert fetching on CPU
-              memory: 390Gi
-              nvidia.com/gpu: "2"
-              ephemeral-storage: "25Gi"
+              cpu: "8"
+              memory: 32Gi
+              nvidia.com/gpu: "1"
+              ephemeral-storage: "10Gi"
           volumeMounts:
             - name: models-storage
               mountPath: /models
@@ -138,6 +134,6 @@ spec:
         - name: dshm
           emptyDir:
             medium: Memory
-            sizeLimit: 32Gi
+            sizeLimit: 8Gi
       nodeSelector:
         feature.node.kubernetes.io/pci-0300_10de.present: "true"
diff --git a/my-apps/ai/open-webui/configmap.yaml b/my-apps/ai/open-webui/configmap.yaml
@@ -21,7 +21,7 @@ data:
   VISION_MODELS: "general - qwen3.5"
 
   # Default parameters (Qwen3.5 precise thinking: temp=0.6, top_p=0.95, top_k=20)
-  CONTEXT_WINDOW: "131072"
+  CONTEXT_WINDOW: "16384"
   TEMPERATURE: "0.6"
   TOP_P: "0.95"
   MIN_P: "0.0"
diff --git a/my-apps/development/n8n/workflows/daily-cluster-report.json b/my-apps/development/n8n/workflows/daily-cluster-report.json
@@ -217,7 +217,7 @@
         "url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/v1/chat/completions",
         "sendBody": true,
         "specifyBody": "json",
-        "jsonBody": "={{ JSON.stringify({ model: 'coder - qwen3-coder-next', messages: [ { role: 'system', content: 'You are a Kubernetes cluster health reporter. Summarize the following metrics into a concise daily report. Highlight any issues, warnings, or anomalies. Use these sections: **Nodes** (readiness, CPU, memory, disk), **Pods** (restarts, failures), **Storage** (Longhorn usage), **ArgoCD Apps** (sync/health status), **Backups** (PVC Plumber health). Keep it brief and actionable. Use markdown formatting.' }, { role: 'user', content: $json.metricsPayload } ], max_tokens: 2048, temperature: 0.3 }) }}",
+        "jsonBody": "={{ JSON.stringify({ model: 'general - qwen3.5', messages: [ { role: 'system', content: 'You are a Kubernetes cluster health reporter. Summarize the following metrics into a concise daily report. Highlight any issues, warnings, or anomalies. Use these sections: **Nodes** (readiness, CPU, memory, disk), **Pods** (restarts, failures), **Storage** (Longhorn usage), **ArgoCD Apps** (sync/health status), **Backups** (PVC Plumber health). Keep it brief and actionable. Use markdown formatting.' }, { role: 'user', content: $json.metricsPayload } ], max_tokens: 2048, temperature: 0.3 }) }}",
         "options": {}
       },
       "id": "d0a1b2c3-0000-4000-8000-000000000012",