Commit 4d945fd

Evrard-Nil and claude committed
perf: add DeepGEMM cache, multithreaded loading, and context length limit
- Mount deepgemm_cache volume to persist JIT-compiled kernels across restarts
- Add --model-loader-extra-config for multithreaded model loading (64 threads)
- Set --context-length 202000 to avoid EAGLE off-by-two crash near max pos embeddings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3352b5e commit 4d945fd
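
The commit message attributes the crash to an off-by-two overrun near the maximum position embeddings. A minimal Python sketch of the headroom check this implies, assuming the overrun is exactly two positions. The names `has_headroom`, `margin`, and `ASSUMED_MAX_POS` are illustrative and not part of the server or the model config, and the limit below is a placeholder rather than a value taken from this commit.

```python
def has_headroom(context_length: int, max_position_embeddings: int, margin: int = 2) -> bool:
    """Return True if context_length stays at least `margin` positions below the limit."""
    return context_length + margin <= max_position_embeddings


# Placeholder limit for illustration only; read the real value from the
# model's config.json before relying on this check.
ASSUMED_MAX_POS = 202_752

# The value set by this commit, checked against the assumed limit.
print(has_headroom(202_000, ASSUMED_MAX_POS))

# A context length equal to the limit leaves no room for the extra positions.
print(has_headroom(ASSUMED_MAX_POS, ASSUMED_MAX_POS))
```

The chosen value of 202000 leaves far more than two positions of slack under the assumed limit, so the check is a lower bound on the margin rather than an explanation of the exact number.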

1 file changed

Lines changed: 4 additions & 0 deletions

GLM-5.yaml

@@ -83,11 +83,14 @@ services:
       --speculative-num-draft-tokens 4
       --mem-fraction-static 0.90
       --max-running-requests 16
+      --context-length 202000
+      --model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}'
       --port 8000
       --host 0.0.0.0
       --enable-cache-report
     volumes:
       - hugginface_cache:/root/.cache/huggingface
+      - deepgemm_cache:/root/.deep_gemm
     environment:
       - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
       - NVIDIA_DRIVER_CAPABILITIES=compute,utility
@@ -111,6 +114,7 @@ networks:
 
 volumes:
   hugginface_cache:
+  deepgemm_cache:
   certs:
     external: true
     name: certs
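
The value passed to `--model-loader-extra-config` is a JSON string in which `enable_multithread_load` is the quoted string `"true"` rather than a JSON boolean. A minimal Python sketch of what that decodes to, assuming the server parses the flag with a standard JSON decoder (the commit does not show how the value is consumed):

```python
import json

# The exact value passed to --model-loader-extra-config in GLM-5.yaml.
raw = '{"enable_multithread_load": "true", "num_threads": 64}'

config = json.loads(raw)

# The quoted "true" decodes to a Python str, not a bool.
flag = config["enable_multithread_load"]
threads = config["num_threads"]

print(type(flag).__name__, repr(flag))
print(type(threads).__name__, threads)

# Any non-empty string is truthy, so a plain truthiness check would treat
# this as enabled. The string "false" would be treated the same way.
print(bool(flag), bool("false"))
```

If the loader only tests truthiness, the quoted form behaves like an enabled flag. Whether a bare JSON `true` is also accepted is not confirmed by this commit, so treat any change to the value as something to verify against the server's documentation.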
