PaddlePaddle · bobby-cloudforge · Apr 20, 2026 · Apr 20, 2026 · Apr 20, 2026 · Apr 20, 2026
diff --git a/docs/best_practices/MiniMax-M1.md b/docs/best_practices/MiniMax-M1.md
@@ -0,0 +1,67 @@
+[简体中文](../zh/best_practices/MiniMax-M1.md)
+
+# MiniMax-M1 Model
+
+## I. Environment Preparation
+
+### 1.1 Support Requirements
+
+MiniMax-M1 support in FastDeploy uses a hybrid decoder stack. Details:
+
+- Standard full-attention layers run through the existing FastDeploy attention backend.
+- Linear-attention layers use the Lightning Attention Triton kernels in `fastdeploy/model_executor/ops/triton_ops/lightning_attn.py`.
+- Current first-pass support targets BF16 inference.
+
+### 1.2 Installing FastDeploy
+
+Installation process reference document [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md)
+
+## II. How to Use
+
+### 2.1 Basics: Starting the Service
+
+```shell
+MODEL_PATH=/models/MiniMax-Text-01
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --max-model-len 32768 \
+    --max-num-seqs 32
+```
+
+### 2.2 Quantized Deployment
+
+MiniMax-M1 (456B params) requires quantization for practical deployment. Approximate GPU requirements:
+
+| Mode | GPU Memory | Example Config |
+|------|-----------|----------------|
+| BF16 | ~912 GB | 12× A800-80GB, `--tensor-parallel-size 12` |
+| FP8 | ~456 GB | 6× A800-80GB, `--tensor-parallel-size 6` |
+| WINT4 | ~228 GB | 4× A800-80GB, `--tensor-parallel-size 4` |
+
+```shell
+# WINT4 quantization (recommended minimum)
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --quantization wint4 \
+    --tensor-parallel-size 4 \
+    --port 8180 \
+    --max-model-len 4096 \
+    --max-num-seqs 4
+```
+
+### 2.3 Model Notes
+
+- HuggingFace architecture: `MiniMaxText01ForCausalLM`
+- Hybrid layer layout: 70 linear-attention layers and 10 full-attention layers
+- MoE routing: 32 experts, top-2 experts per token
+
+## III. Known Limitations
+
+- This initial integration is focused on model structure and backend wiring.
+- Low-bit quantization support still requires follow-up validation against MiniMax-M1 weights.
+- Production validation should include GPU runtime checks for Lightning Attention decode/prefill paths.
+- Linear attention KV history uses instance variables, which needs migration to slot-based cache for proper multi-request isolation (TODO in code).
diff --git a/docs/supported_models.md b/docs/supported_models.md
@@ -38,6 +38,7 @@ These models accept text input.
 |⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
 |⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen/qwen2-72B;<br>Qwen/Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
 |⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|MINIMAX-M1|BF16|[MiniMaxAI/MiniMax-Text-01](./best_practices/MiniMax-M1.md);<br>MiniMaxAI/MiniMax-Text-01-Large, etc.|
 |⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
 |⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6<br>&emsp;[最佳实践](./best_practices/GLM-4-MoE-Text.md) etc.|
 

diff --git a/docs/zh/best_practices/MiniMax-M1.md b/docs/zh/best_practices/MiniMax-M1.md
@@ -0,0 +1,67 @@
+[English](../../best_practices/MiniMax-M1.md)
+
+# MiniMax-M1 模型
+
+## 一、环境准备
+
+### 1.1 支持说明
+
+FastDeploy 中的 MiniMax-M1 模型采用混合解码器结构：
+
+- 全注意力层复用 FastDeploy 现有 Attention 后端。
+- 线性注意力层使用 `fastdeploy/model_executor/ops/triton_ops/lightning_attn.py` 中的 Lightning Attention Triton kernel。
+- 当前首版支持以 BF16 推理为主。
+
+### 1.2 安装 FastDeploy
+
+安装流程可参考 [FastDeploy GPU 安装文档](../get_started/installation/nvidia_gpu.md)
+
+## 二、使用方式
+
+### 2.1 基础启动命令
+
+```shell
+MODEL_PATH=/models/MiniMax-Text-01
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --max-model-len 32768 \
+    --max-num-seqs 32
+```
+
+### 2.2 量化部署
+
+MiniMax-M1（456B 参数）在实际部署中需要量化。不同模式的 GPU 显存需求参考：
+
+| 模式 | 显存需求 | 配置示例 |
+|------|---------|----------|
+| BF16 | ~912 GB | 12× A800-80GB, `--tensor-parallel-size 12` |
+| FP8 | ~456 GB | 6× A800-80GB, `--tensor-parallel-size 6` |
+| WINT4 | ~228 GB | 4× A800-80GB, `--tensor-parallel-size 4` |
+
+```shell
+# WINT4 量化部署（推荐最小配置）
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --quantization wint4 \
+    --tensor-parallel-size 4 \
+    --port 8180 \
+    --max-model-len 4096 \
+    --max-num-seqs 4
+```
+
+### 2.3 模型特性
+
+- HuggingFace 架构名：`MiniMaxText01ForCausalLM`
+- 层类型分布：70 层线性注意力 + 10 层全注意力
+- MoE 路由：32 个专家，每个 token 选择 top-2 专家
+
+## 三、当前限制
+
+- 当前版本优先完成模型组网与后端接线。
+- 各类低比特量化推理能力还需要结合真实权重进一步验证。
+- Lightning Attention 的 prefill/decode 路径仍需在 GPU 环境完成端到端验证。
+- 线性注意力的 KV history 当前使用实例变量存储，多请求并发场景下需迁移至 slot-based cache（已有 TODO 标注）。
diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md
@@ -36,6 +36,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 |⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
 |⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen/qwen2-72B;<br>Qwen/Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
 |⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|MINIMAX-M1|BF16|[MiniMaxAI/MiniMax-Text-01](./best_practices/MiniMax-M1.md);<br>MiniMaxAI/MiniMax-Text-01-Large, etc.|
 |⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
 |⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6<br>&emsp;[最佳实践](./best_practices/GLM-4-MoE-Text.md) etc.|