Commit 9ddc6b9

Copilot and mitchross committed
fix llama-cpp: add --fit on arg, fix GGML comment, update qwen3.5 preset with override-tensor
Co-authored-by: mitchross <6330506+mitchross@users.noreply.github.com>
1 parent 5ab43dd commit 9ddc6b9

2 files changed

Lines changed: 7 additions & 5 deletions

my-apps/ai/llama-cpp/configmap.yaml

Lines changed: 4 additions & 4 deletions
@@ -54,18 +54,18 @@ data:
     # 397B total / 17B active (MoE) - Unsloth Dynamic Q4_K_XL
     # WARNING: ~5-15 tok/s due to cpu-moe offloading. Quality over speed.
     # Natively multimodal (vision + language), 256K context native
-    # cpu-moe keeps attention on GPU, experts on CPU - MUCH faster than
+    # override-tensor keeps attention on GPU, experts on CPU - MUCH faster than
     # unified memory swapping (targeted offload vs indiscriminate CUDA paging)
     model = /models/UD-Q4_K_XL/Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf
     alias = qwen3.5, qwen 3.5, general, experimental slow
     ctx-size = 32768
     n-gpu-layers = 99
     tensor-split = 1,1
+    override-tensor = exps=CPU
     cache-type-k = q8_0
     cache-type-v = q4_0
-    cpu-moe = 1
-    temp = 0.6
+    temp = 0.7
     top-p = 0.95
-    top-k = 20
+    top-k = 40
     min-p = 0.0
     jinja = 1
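
The `override-tensor = exps=CPU` line pairs a tensor-name pattern (`exps`) with a buffer type (`CPU`). The sketch below illustrates, under assumptions, why that keeps attention weights on the GPU and sends expert weights to system RAM. It treats the pattern as a regular expression searched against each tensor name. The tensor names are hypothetical examples in the common GGUF style and were not read from this model file, and `place_tensors` is an illustrative helper, not a llama.cpp function.

```python
import re

# Hypothetical tensor names in the usual GGUF naming style (assumption:
# the actual names in the Qwen3.5 GGUF shards may differ).
TENSOR_NAMES = [
    "blk.0.attn_q.weight",
    "blk.0.attn_k.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "output.weight",
]


def place_tensors(names, override, default_buffer="GPU"):
    """Map each tensor name to a buffer type given one 'pattern=buffer' override."""
    pattern, _, buffer_type = override.partition("=")
    regex = re.compile(pattern)
    # A tensor whose name contains a match for the pattern goes to the
    # override buffer; everything else stays on the default buffer.
    return {
        name: buffer_type if regex.search(name) else default_buffer
        for name in names
    }


placement = place_tensors(TENSOR_NAMES, "exps=CPU")
for name, buffer_type in placement.items():
    print(f"{buffer_type:4s} {name}")
```

In this sketch only the three `*_exps` tensors land on `CPU`. The attention tensors, the router (`ffn_gate_inp`), and the output projection stay on the default buffer, which matches the intent stated in the preset comment.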

my-apps/ai/llama-cpp/deployment.yaml

Lines changed: 3 additions & 1 deletion
@@ -48,6 +48,8 @@ spec:
     - "-fa"
     - "on" # Explicitly set to 'on' so --jinja is read correctly
     - "--jinja"
+    - "--fit" # Auto-fit dense layers to available VRAM
+    - "on"
     - "--no-mmap" # Prevent page fault stalls - we have 400GB RAM to spare
     - "-b"
     - "4096" # Larger logical batch for faster prompt processing
@@ -67,7 +69,7 @@ spec:
     - name: NVIDIA_DRIVER_CAPABILITIES
       value: "compute,utility"
     - name: GGML_CUDA_ENABLE_UNIFIED_MEMORY
-      value: "1" # Vital for Kimi-K2 1T model to bridge VRAM and 400GB RAM
+      value: "1" # Bridges VRAM and 400GB RAM for Qwen3.5-397B MoE expert offloading
     - name: GGML_CUDA_PEER_MAX_BATCH_SIZE
       value: "128"
     - name: CUDA_SCALE_LAUNCH_QUEUES
