Releases: nearai/cvm-compose-files
v0.0.44
Changes
- feat: Add model-proxy registrar sidecar to all model configs (DeepSeek-V3.1, GLM-5, Qwen3.5-122B, small-models) for automatic endpoint/model registration with the proxy fleet
- fix: Remove `prefill_token_shift` and `num_draft_tokens` from Qwen3-30B speculative config; these params were removed in vLLM v0.16.0
v0.0.43
Changes
- Remove LMCache entirely: lmcache image, env vars, and `--kv-transfer-config` flags removed to fix crashes
- Upgrade all vLLM images to v0.16.0 (`sha256:4801151759655c57606c844662e5213403c032a62d149c7ce61d615759a821ef`)
- GPT-OSS-120B: `--max-num-seqs` 128→64, `--max-num-batched-tokens` 8K→16K
- Qwen3-30B-A3B: `--max-num-batched-tokens` 16K→24K
- Qwen3-VL-30B-A3B: add `--gpu-memory-utilization 0.95`, `--max-model-len 32768`, `--max-num-seqs 64`, `--max-num-batched-tokens 16K` (was completely unconfigured)
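
As a rough illustration of the Qwen3-VL-30B-A3B settings in this release, a compose service might look like the sketch below. The service name and image repository are assumptions, not taken from the release notes. Only the digest and the four flags come from the changelog, and 16K is assumed to mean 16384 tokens.

```yaml
services:
  qwen3-vl-30b:          # hypothetical service name
    # Repository name is assumed; only the digest appears in the release notes.
    image: vllm/vllm-openai@sha256:4801151759655c57606c844662e5213403c032a62d149c7ce61d615759a821ef
    command:
      # Model selection and any other arguments are omitted here.
      - --gpu-memory-utilization
      - "0.95"
      - --max-model-len
      - "32768"
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "16384"          # "16K" in the changelog, assumed to be 16 * 1024
```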
Add cloud-api usage reporting & JSON logs
- Add `CLOUD_API_URL=https://cloud-api.near.ai` to all Rust proxy services (small-models, Qwen3.5-122B)
- Fix `MODEL_NAME` in GLM-5.yaml: `zai-org/GLM-5` → `zai-org/GLM-5-FP8` (was causing 404 on usage reporting)
- Add `LOG_FORMAT=json` to all proxy services for structured logging
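
A minimal sketch of how these three variables could appear in a proxy service's environment block. The service name is hypothetical, and the variables are combined in one service only for brevity; in the actual compose files they are spread across different model configs.

```yaml
services:
  model-proxy:           # hypothetical service name
    environment:
      - CLOUD_API_URL=https://cloud-api.near.ai   # usage-reporting endpoint
      - MODEL_NAME=zai-org/GLM-5-FP8              # corrected name from GLM-5.yaml
      - LOG_FORMAT=json                           # structured logging
```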
v0.0.41
fix: use pip instead of uv for transformers git install (glm_moe_dsa support)
v0.0.40
- DeepGEMM JIT cache volume for persistent kernel compilation
- Multithreaded model loading (`num_threads: 64`)
- Context length limit (`--context-length 202000`) to avoid EAGLE off-by-two crash
- Inline Dockerfile with latest transformers for `glm_moe_dsa` support
- Switch small-models proxy from Python to Rust (`vllm-proxy-rs`)
v0.0.39
Changes
- small-models: Switch all proxy services from Python (`vllm-proxy`) to Rust (`vllm-proxy-rs`)
v0.0.38
Changes (GLM-5)
- Fix: Install latest transformers from source via inline Dockerfile (fixes `glm_moe_dsa` architecture not recognized)
- Fix: Add `--max-running-requests 16` to prevent server hanging under load with EAGLE speculative decoding
- Fix: Set `--context-length 202000` to avoid EAGLE off-by-two crash near max position embeddings
- Perf: Mount `deepgemm_cache` volume to persist JIT-compiled kernels across container restarts
- Perf: Enable multithreaded model loading (`--model-loader-extra-config` with 64 threads)
- Image: Update sglang from `glm5-hopper` to `glm5-hopper-patched`
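
The GLM-5 changes in this release could be combined roughly as in the sketch below. It assumes the image's entrypoint launches the sglang server, so `command` carries only flags. The service name, the cache mount path, and the `enable_multithread_load` key are assumptions; the release notes name only the flags, the volume, and the thread count.

```yaml
services:
  glm5:                  # hypothetical service name
    command:
      # Model path and other arguments are omitted here.
      - --max-running-requests
      - "16"
      - --context-length
      - "202000"
      - --model-loader-extra-config
      # JSON key names are assumed; the changelog only states 64 threads.
      - '{"enable_multithread_load": true, "num_threads": 64}'
    volumes:
      # Mount path is an assumption about where DeepGEMM stores JIT output.
      - deepgemm_cache:/root/.cache/deep_gemm

volumes:
  deepgemm_cache:        # named volume, so compiled kernels survive restarts
```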
v0.0.37
Changes
- GLM-5: Use inline Dockerfile to install latest transformers from source (fixes `glm_moe_dsa` architecture not recognized in `glm5-hopper-patched` image)
- GLM-5: Add `--max-running-requests 16` to prevent server hanging under load
v0.0.34
Changes
- GLM-5: Add `--max-running-requests 16` to prevent server hanging under load (EAGLE speculative decoding default of 48 is too aggressive at 90% memory fraction)
- GLM-5: Update sglang image from `glm5-hopper` to `glm5-hopper-patched` (Feb 25)
v0.0.33
Changes
- Enable cache report in glm service configuration
- Update server names and add TLS configurations in GLM and small-models YAML files