Releases · nearai/cvm-compose-files · GitHub

04 Mar 12:57

Evrard-Nil

v0.0.44

Changes

feat: Add model-proxy registrar sidecar to all model configs (DeepSeek-V3.1, GLM-5, Qwen3.5-122B, small-models) for automatic endpoint/model registration with the proxy fleet
fix: Remove prefill_token_shift and num_draft_tokens from the Qwen3-30B speculative config; these parameters were removed in vLLM v0.16.0


02 Mar 11:36

Evrard-Nil

v0.0.43

Changes

Remove LMCache entirely to fix crashes: the lmcache image, environment variables, and --kv-transfer-config flags are all removed
Upgrade all vLLM images to v0.16.0 (sha256:4801151759655c57606c844662e5213403c032a62d149c7ce61d615759a821ef)
GPT-OSS-120B: --max-num-seqs 128→64, --max-num-batched-tokens 8K→16K
Qwen3-30B-A3B: --max-num-batched-tokens 16K→24K
Qwen3-VL-30B-A3B: add --gpu-memory-utilization 0.95, --max-model-len 32768, --max-num-seqs 64, --max-num-batched-tokens 16K (previously unconfigured)
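
For orientation, the Qwen3-VL-30B-A3B settings listed above could look roughly like the Compose fragment below. This is a sketch under stated assumptions, not the repository's actual file: the service name, the image repository (vllm/vllm-openai), the model ID variant, the explicit entrypoint, and reading "16K" as 16384 are all guesses. Only the flags, their values, and the image digest come from the notes.

```yaml
services:
  qwen3-vl-30b:            # hypothetical service name
    # Digest is the one quoted in the notes; the repository name is an assumption.
    image: vllm/vllm-openai@sha256:4801151759655c57606c844662e5213403c032a62d149c7ce61d615759a821ef
    # Set explicitly so the fragment does not depend on the image's default entrypoint.
    entrypoint: ["vllm", "serve"]
    command:
      - Qwen/Qwen3-VL-30B-A3B-Instruct   # assumed variant; the notes name only Qwen3-VL-30B-A3B
      - --gpu-memory-utilization
      - "0.95"
      - --max-model-len
      - "32768"
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "16384"                          # assumes "16K" means 16 * 1024
```

GPU reservations, ports, and volumes are omitted here because the notes do not describe them.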


02 Mar 08:01

Evrard-Nil

Add cloud-api usage reporting & JSON logs

Add CLOUD_API_URL=https://cloud-api.near.ai to all Rust proxy services (small-models, Qwen3.5-122B)
Fix MODEL_NAME in GLM-5.yaml: zai-org/GLM-5 → zai-org/GLM-5-FP8 (was causing 404 on usage reporting)
Add LOG_FORMAT=json to all proxy services for structured logging
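
As an illustration only, a proxy service carrying all three settings from this release might be declared as follows. The service name and image are placeholders, and combining the three variables on one service is an assumption made for brevity; in the notes, CLOUD_API_URL applies to the small-models and Qwen3.5-122B proxies, while the MODEL_NAME fix applies to GLM-5.yaml.

```yaml
services:
  proxy:                                   # hypothetical service name
    image: example/vllm-proxy-rs:latest    # placeholder image reference
    environment:
      # Endpoint that receives usage reports.
      CLOUD_API_URL: https://cloud-api.near.ai
      # Must match the served model ID exactly; the old value zai-org/GLM-5 caused 404s.
      MODEL_NAME: zai-org/GLM-5-FP8
      # Emit structured JSON log lines.
      LOG_FORMAT: json
```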


01 Mar 11:55

Evrard-Nil

v0.0.41

fix: use pip instead of uv for transformers git install (glm_moe_dsa support)
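
A minimal sketch of what such an inline Dockerfile could look like in a Compose file, assuming the `dockerfile_inline` build option and a base image that ships both pip and git. The service name and base image repository are hypothetical; the notes give only the image tag.

```yaml
services:
  glm-5:                                   # hypothetical service name
    build:
      dockerfile_inline: |
        # Repository name is an assumption; only the tag appears in the notes.
        FROM lmsysorg/sglang:glm5-hopper-patched
        # Plain pip (not uv) installs transformers from its main branch,
        # which is where glm_moe_dsa support is expected to come from.
        RUN pip install --no-cache-dir --upgrade git+https://github.com/huggingface/transformers.git
```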


01 Mar 10:14

Evrard-Nil

v0.0.40

GLM-5: DeepGEMM JIT cache volume so compiled kernels persist across container restarts
GLM-5: Multithreaded model loading (num_threads: 64)
GLM-5: Context length limit (--context-length 202000) to avoid EAGLE off-by-two crash
GLM-5: Inline Dockerfile with latest transformers for glm_moe_dsa support
small-models: Switch proxy from Python to Rust (vllm-proxy-rs)


01 Mar 09:47

Evrard-Nil

v0.0.39

Changes

small-models: Switch all proxy services from Python (vllm-proxy) to Rust (vllm-proxy-rs)


01 Mar 09:43

Evrard-Nil

v0.0.38

Changes (GLM-5)

Fix: Install latest transformers from source via inline Dockerfile (fixes glm_moe_dsa architecture not recognized)
Fix: Add --max-running-requests 16 to prevent server hanging under load with EAGLE speculative decoding
Fix: Set --context-length 202000 to avoid EAGLE off-by-two crash near max position embeddings
Perf: Mount deepgemm_cache volume to persist JIT-compiled kernels across container restarts
Perf: Enable multithreaded model loading (--model-loader-extra-config with 64 threads)
Image: Update sglang from glm5-hopper to glm5-hopper-patched
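
Taken together, the GLM-5 changes above might translate into a Compose fragment along these lines. Treat it as a sketch: the service name, image repository, model ID, and cache mount path are assumptions, and the EAGLE and parallelism flags are left out because the notes do not give their values.

```yaml
services:
  glm-5:                                   # hypothetical service name
    image: lmsysorg/sglang:glm5-hopper-patched   # repository is an assumption; tag is from the notes
    command:
      - python3
      - -m
      - sglang.launch_server
      - --model-path
      - zai-org/GLM-5-FP8                  # assumed to match the MODEL_NAME used elsewhere in these notes
      - --context-length
      - "202000"                           # stays below the limit where EAGLE crashes
      - --max-running-requests
      - "16"                               # caps concurrency to avoid hangs under load
      - --model-loader-extra-config
      - '{"enable_multithread_load": true, "num_threads": 64}'
    volumes:
      # Assumed DeepGEMM cache location inside the container.
      - deepgemm_cache:/root/.cache/deep_gemm

volumes:
  deepgemm_cache: {}                       # named volume keeps JIT-compiled kernels across restarts
```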


01 Mar 08:30

Evrard-Nil

v0.0.37

Changes

GLM-5: Use inline Dockerfile to install latest transformers from source (fixes glm_moe_dsa architecture not recognized in glm5-hopper-patched image)
GLM-5: Add --max-running-requests 16 to prevent server hanging under load


28 Feb 09:44

Evrard-Nil

v0.0.34

Changes

GLM-5: Add --max-running-requests 16 to prevent server hanging under load (EAGLE speculative decoding default of 48 is too aggressive at 90% memory fraction)
GLM-5: Update sglang image from glm5-hopper to glm5-hopper-patched (Feb 25)


26 Feb 07:59

Evrard-Nil

v0.0.33

Changes

Enable cache report in the GLM service configuration
Update server names and add TLS configurations in GLM and small-models YAML files
