## Overview
First release of RTP-LLM, version 0.2.0 (September 2025).
## Features
### Framework Advanced Features
- PD Disaggregation and PD Entrance Transpose
- Additional attention backends: XQA, FlashInfer
- Speculative Decoding
- EPLB
- MicroBatch & Overlapping
- MTP
- DeepEP
- Load Balancing
- 3FS
- FP8 KVCache
- KV Cache Reuse
- Quantization
- MultiLoRA
- Attention FFN Disaggregation
- Frontend/Backend Disaggregation
## New Models
| Model Family (Variants) | Example HuggingFace Identifier | Description | Supported CardType |
|---|---|---|---|
| DeepSeek (v1, v2, v3/R1) | deepseek-ai/DeepSeek-R1 | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. RTP-LLM provides DeepSeek v3/R1 model-specific optimizations. | NV ✅ AMD ✅ |
| Kimi (Kimi-K2) | moonshotai/Kimi-K2-Instruct | Moonshot's MoE LLMs with 1 trillion parameters, exceptional at agentic intelligence. | NV ✅ AMD ✅ |
| Qwen (v1, v1.5, v2, v2.5, v3, QWQ, Qwen3-Coder) | Qwen/Qwen3-235B-A22B | Series of advanced reasoning-optimized models with significantly improved performance on logical reasoning, mathematics, science, coding, and academic benchmarks, achieving state-of-the-art results among open-source thinking models; markedly better general capabilities (instruction following, tool usage, text generation, alignment with human preferences); enhanced 256K long-context understanding. | NV ✅ AMD ✅ |
| QwenVL (VL2, VL2.5, VL3) | Qwen/Qwen2-VL-2B | Series of advanced vision-language models based on Qwen2.5/Qwen3. | NV ✅ AMD ❌ |
| Llama | meta-llama/Llama-4-Scout-17B-16E-Instruct | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and the new Llama 4) with well-recognized performance. | NV ✅ AMD ✅ |
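The models above are served through RTP-LLM's HTTP interface. Below is a minimal sketch of querying a running server with an OpenAI-style chat completions request; the host, port, endpoint path, and model name are assumptions, so adapt them to your deployment.

```python
import requests

# Assumed address of a locally running RTP-LLM server; adjust host/port
# and endpoint path to match your deployment.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    # Model name as configured at server start; this identifier is
    # illustrative only.
    "model": "Qwen/Qwen3-235B-A22B",
    "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
    "max_tokens": 128,
}

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```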
## Bug Fixes
- P/D Disaggregation deadlock caused by requests being canceled or failing before remote execution starts
- Raw request streams with stop_words causing a spurious hang
## Known Issues
- In the 3FS case, when using Frontend Disaggregation in P/D mode, more memory is needed; alternatively, set FRONTEND_SERVER_COUNT=1 to reduce frontend_server memory usage (see the sketch after this list).
- Serving many dynamic LoRA adapters requires a larger reserver_runtime_mem_mb.
- MoE models are not supported on AMD.
- MoE models without a shared expert cannot use enable-layer-micro-batch.
- P/D Disaggregation with EPLB and an MTP step > 1 may cause the prefill stage to hang.
- VL model embeddings are broken because position IDs are computed incorrectly.
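A minimal sketch of applying the two memory-related workarounds above when launching the backend. The entry-point module and the environment-variable spelling of reserver_runtime_mem_mb are assumptions; verify them against your own launch scripts.

```python
import os
import subprocess

# Workaround for frontend memory pressure in P/D + 3FS deployments:
# run a single frontend server process (from the known issues above).
os.environ["FRONTEND_SERVER_COUNT"] = "1"

# Workaround for many dynamic LoRA adapters: enlarge reserved runtime
# memory. This is an assumed env-var form of reserver_runtime_mem_mb.
os.environ["RESERVER_RUNTIME_MEM_MB"] = "4096"

# Illustrative launch command; substitute your actual RTP-LLM entry point.
subprocess.run(["python", "-m", "rtp_llm.start_server"], check=True)
```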
## Performance

## Compatibility

## Package
### Docker Image
| CardType | Image | Tag |
|---|---|---|
| CUDA-SM9x | ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch/rtp_llm_sm9x_opensource | 0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12 |
| CUDA-SM8x | ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch/rtp_llm_sm8x_opensource | 0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12 |
| CUDA-SM7x | ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch/rtp_llm_sm7x_opensource | 0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12 |
| Frontend | ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch/rtp_llm_frontend-opensource | 0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12 |
| AMD | ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch/rtp_llm_rocm_opensource | 0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12 |
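To fetch one of the images above, a small helper can assemble the full reference and shell out to Docker (a sketch; it assumes a local Docker daemon and access to the registry):

```python
import subprocess

REGISTRY = "ali-hangzhou-hub-registry.cn-hangzhou.cr.aliyuncs.com/isearch"
IMAGE = "rtp_llm_sm9x_opensource"  # CUDA-SM9x image from the table above
TAG = "0.2.0_0.2.0_2025_10_31_10_23_d1e93ce12"

# Equivalent to: docker pull <registry>/<image>:<tag>
subprocess.run(["docker", "pull", f"{REGISTRY}/{IMAGE}:{TAG}"], check=True)
```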
### Wheel
Images for master are coming soon.