[Feature]【Hackathon 10th Spring No.47】Add MiniMax-M1 model for FastDeploy #7510

bobby-cloudforge wants to merge 2 commits into PaddlePaddle:develop from
Conversation

Thanks for your contribution!
- rotary_embedding.py: use the 'MiniMax' prefix so both the MiniMaxM1 and MiniMaxText01 architectures match (the HF alias was missing, which selected the wrong RoPE).
- test_minimax_m1.py: assert that residual is None (DeepNorm folds the residual into hidden_states, so the decoder returns None).
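The DeepNorm behavior referenced in the second commit note can be sketched as follows. This is an illustrative NumPy reference, not the PR's Paddle code; the function name, `alpha`, and `eps` values are assumptions.

```python
import numpy as np

def deepnorm_postnorm(hidden, sublayer_out, alpha, norm_weight, eps=1e-6):
    """Sketch of DeepNorm-style post-norm: the residual is scaled by alpha
    and folded into the normalized hidden states, so the layer returns
    (hidden_states, None) instead of carrying a separate residual tensor."""
    x = alpha * hidden + sublayer_out  # residual folded in with DeepNorm alpha
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * norm_weight, None  # no separate residual survives
```

This is why the test asserts `residual is None` after a decoder-layer forward.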
PaddlePaddle-bot
left a comment
🤖 AI Code Review — 2026-04-20 17:05:29

## 📋 Review Summary

PR overview: adds support for the MiniMax-M1 model (456B MoE, 70 linear-attention layers + 10 full-attention layers) to FastDeploy, including a Lightning Attention Triton kernel, dual-path weight loading, and dual-architecture registration.

Scope of changes: model_executor/models/, model_executor/ops/triton_ops/, model_executor/layers/rotary_embedding.py, docs/, tests/

Impact tags: Models, OP

### Issues

| Severity | File | Summary |
|------|------|------|
| 🔴 Bug | minimax_m1.py:358 | Linear-attention KV history is stored in an instance variable; concurrent requests pollute each other's state |
| 🟡 Suggestion | minimax_m1.py:685 | The v0 `set_state_dict` and v1 `load_weights` paths handle weights inconsistently, and the v0 path lacks an incomplete-buffer warning |
| 🟡 Suggestion | lightning_attn.py:576 | The `_fwd_none_diag_kernel` grid dimensions do not match the kernel's internal `program_id(2)` |
| ❓ Question | rotary_embedding.py:344 | Is reusing the Qwen RoPE path for MiniMax confirmed to be compatible? |

### Overall Assessment

The model implementation is cleanly structured, and complex features such as MoE, DeepNorm, and the dual weight-loading paths are well covered. The core blocking issue is that storing Linear Attention's `_kv_history` in an instance variable causes correctness problems under concurrent serving; at minimum, add an explicit limitation note or warning before merging. The Triton kernel implementation is of high quality, but the grid dimension configuration must be kept consistent with the kernel.
```python
# Retrieve or initialize KV history for recurrent state persistence.
# TODO: Migrate to ForwardMeta.caches / slot-based cache management for
# proper multi-request isolation in production serving scenarios.
if not hasattr(self, "_kv_history") or self._kv_history is None or self._kv_history.shape[0] != batch_size:
```
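For context, the recurrence this persisted KV history implements can be sketched in plain NumPy. This is an illustrative reference for decayed linear attention, not the PR's Triton kernel; the function name and per-head layout are assumptions.

```python
import numpy as np

def lightning_attention_ref(q, k, v, slope):
    """O(n) recurrence for decayed linear attention: a single running KV
    state is decayed by exp(-slope) each step before accumulating
    k_t^T v_t, so token t attends to token s with weight
    exp(-slope * (t - s)) * (q_t . k_s)."""
    seq, d = q.shape
    kv = np.zeros((d, v.shape[1]))        # the "KV history" state
    out = np.zeros((seq, v.shape[1]))
    decay = np.exp(-slope)
    for t in range(seq):
        kv = decay * kv + np.outer(k[t], v[t])  # accumulate with decay
        out[t] = q[t] @ kv                      # readout for this token
    return out
```

Because `kv` must survive across decode steps, how it is stored (instance variable vs. per-request cache) is exactly what the comment below is about.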
🔴 Bug: Linear attention stores its KV history in the instance variable `_kv_history`. Under concurrent serving, different requests share the same KV state and produce incorrect inference results.

Although the code carries a TODO, this is a blocking correctness issue: FastDeploy is an inference-serving framework, and concurrent serving is its core scenario. With the current implementation:

- State pollution: request A's KV history is overwritten by request B, so the attention outputs of both requests are wrong.
- State loss: whenever `batch_size` changes, `_kv_history` is reset to a zero tensor (discarding all history), which triggers frequently under continuous batching.

Suggestions:

- Migrate the KV history to `ForwardMeta.caches` or a slot-based cache so that every request has an independent KV state.
- At minimum, add a `logger.warning` in this PR stating that the current implementation only supports single-request scenarios, to prevent misuse in production.
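A minimal sketch of the slot-based direction the review suggests. All names here (`KVHistoryPool`, its constructor parameters) are illustrative, not the PR's API:

```python
import numpy as np

class KVHistoryPool:
    """Keep one recurrent KV state per request slot so concurrent
    requests cannot overwrite each other's linear-attention state."""

    def __init__(self, num_heads, head_dim, value_dim):
        self.shape = (num_heads, head_dim, value_dim)
        self._states = {}  # slot_id -> np.ndarray

    def get(self, slot_id):
        # Lazily allocate a zeroed state the first time a slot is seen.
        if slot_id not in self._states:
            self._states[slot_id] = np.zeros(self.shape, dtype=np.float32)
        return self._states[slot_id]

    def free(self, slot_id):
        # Drop state when a request finishes (slot reuse under
        # continuous batching then starts from a fresh zero state).
        self._states.pop(slot_id, None)
```

Keyed per-slot storage avoids both failure modes above: no cross-request overwrites, and no wholesale reset when the batch composition changes.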
```python
name = re.sub(r"\.w2\.", ".down_proj.", name)
renamed[name] = weight
# Full attention: merge separate q/k/v into qkv_proj
elif ".self_attn.q_proj." in name or ".self_attn.k_proj." in name or ".self_attn.v_proj." in name:
```
🟡 Suggestion: `set_state_dict` (v0 path) and `load_weights` (v1 path) handle q/k/v weight merging inconsistently.

- The v1 path explicitly distinguishes linear-attention layers (plain concat, no shard_id) from full-attention layers (with shard_id), and logs a warning for incomplete qkv buffers.
- The v0 path applies `concat([q, k, v], axis=0)` uniformly to all layers, and incomplete buffers are silently dropped.

The result happens to be correct today (both qkv_proj variants expect an axis-0 concat), but consider:

- adding the same incomplete-buffer warning log as the v1 path;
- documenting in a comment why the v0 path does not need to distinguish linear from full attention, to avoid confusing future maintainers.
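A hedged sketch of the warning the review asks for in the v0 path. The helper name, buffer layout, and `concat_fn` parameter (standing in for the framework's concat) are illustrative, not the PR's code:

```python
import logging

logger = logging.getLogger("minimax_m1_loader")

def merge_qkv(buffers, concat_fn):
    """Merge buffered q/k/v shards per layer, warning (instead of
    silently dropping) when a layer has an incomplete set.
    buffers: {layer_name: {"q": w, "k": w, "v": w}}."""
    merged, incomplete = {}, []
    for layer, parts in buffers.items():
        if all(key in parts for key in ("q", "k", "v")):
            # Both linear and full attention expect an axis-0 concat here.
            merged[layer] = concat_fn([parts["q"], parts["k"], parts["v"]])
        else:
            incomplete.append(layer)
            logger.warning("layer %s has incomplete qkv buffer: %s",
                           layer, sorted(parts))
    return merged, incomplete
```

Surfacing the incomplete layers makes a truncated or mismatched checkpoint fail loudly at load time rather than at inference.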
```python
)

# Step 4: Compute non-diagonal blocks of attention
grid = (b * h, NUM_BLOCK * NUM_CBLOCK)
```
🟡 Suggestion: `_fwd_none_diag_kernel` uses `tl.program_id(2)` internally to index `off_e` (the feature-block offset), but the grid here is only two-dimensional: `(b * h, NUM_BLOCK * NUM_CBLOCK)`.

With the current `NUM_FBLOCK = 1`, `program_id(2)` defaults to 0 and the result is correct, but the mismatch between the grid dimensions and the kernel's internal `program_id` usage is a latent hazard:

- If `NUM_FBLOCK > 1` in the future, the kernel will only execute the first feature block, producing silently wrong results.
- It hurts readability: reading the kernel suggests a third parallel dimension exists.

Suggestion: use an explicit three-dimensional grid:

```python
grid = (b * h, NUM_BLOCK * NUM_CBLOCK, NUM_FBLOCK)
```
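The hazard can be made concrete with a small helper (illustrative, not FastDeploy code): Triton treats a missing grid axis as size 1, so with a 2-D grid `tl.program_id(2)` is always 0 and only feature block 0 is ever launched.

```python
def covered_feature_blocks(grid, num_fblock):
    """Count how many feature blocks a launch grid actually covers.
    A missing third axis behaves as size 1, mirroring Triton's
    treatment of absent grid dimensions."""
    axis2 = grid[2] if len(grid) > 2 else 1
    return min(axis2, num_fblock)
```

With `num_fblock=1` the 2-D grid is coincidentally complete; raising `num_fblock` without extending the grid silently under-covers the work.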
```diff
 architecture = model_config.architectures[0]
-if architecture.startswith("Qwen"):
+if architecture.startswith("Qwen") or architecture.startswith("MiniMax"):
```
❓ Question: MiniMax-M1's full-attention layers reuse `QwenRotaryEmbedding`. Please confirm that MiniMax-M1's RoPE implementation is fully compatible with the Qwen family, including how `rope_theta`, `rope_scaling`, and related parameters are handled.

Also, `architecture.startswith("MiniMax")` matches broadly. If a future MiniMax architecture uses a different RoPE implementation, a more precise match is preferable, e.g. `architecture in ("MiniMaxM1ForCausalLM", "MiniMaxText01ForCausalLM")`.
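The stricter match could look like the sketch below. The two MiniMax architecture names come from this PR's registration; the helper itself and its name are illustrative, not the repository's code:

```python
# Architectures known to be compatible with the Qwen RoPE path.
_MINIMAX_QWEN_ROPE_ARCHS = ("MiniMaxM1ForCausalLM", "MiniMaxText01ForCausalLM")

def uses_qwen_rope(architecture: str) -> bool:
    """Exact-match MiniMax architectures instead of a broad prefix,
    so a future MiniMax model with a different RoPE is not silently
    routed down the Qwen path."""
    return (architecture.startswith("Qwen")
            or architecture in _MINIMAX_QWEN_ROPE_ARCHS)
```

An unknown `MiniMax*` architecture would then fall through to an explicit error or its own branch rather than inheriting Qwen's RoPE by accident.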
Motivation

Add the ability to deploy the MiniMaxAI/MiniMax-M1-40k model family in FastDeploy.
This PR adds support for deploying the MiniMax-M1 (456B MoE, 45.9B active) model family in FastDeploy, as required by Hackathon 10th Spring No.47.
MiniMax-M1 is a hybrid-attention Mixture-of-Experts LLM, registered under both the `MiniMaxM1ForCausalLM` and `MiniMaxText01ForCausalLM` architectures.

Modifications
Model Code (`fastdeploy/model_executor/models/minimax_m1.py`, ~800 lines)

9 classes implementing the full model:

- `MiniMaxM1MLP`: gate/up merged projection with SiLU activation
- `MiniMaxM1MoE`: FusedMoE with 32 experts, top-2 routing, renormalize=True, quantization-aware `weight_key_map` (w4a8, w4afp8 static/dynamic, tensor_wise_fp8, block_wise_fp8)
- `MiniMaxM1FullAttention`: standard GQA with RoPE, used in 10 of 80 layers
- `MiniMaxM1LinearAttention`: lightning attention with SiLU-gated QKV, sigmoid output gate, RMSNorm, persistent KV state history
- `MiniMaxM1DecoderLayer`: dispatches to linear/full attention based on `attn_type_list`, DeepNorm scaling with separate alpha/beta per attention type, postnorm support
- `MiniMaxM1Model`: full transformer with embedding and final RMSNorm
- `MiniMaxM1ForCausalLM`: causal LM wrapper with dual weight loading (v0 `set_state_dict` + v1 `load_weights`)
- `MiniMaxM1PretrainedModel`: tensor-parallel column/row split mappings

Lightning Attention Kernels (`fastdeploy/model_executor/ops/triton_ops/lightning_attn.py`, 711 lines)

Triton kernels for O(n) linear attention with exponential decay:
- `_fwd_kernel`: intra-block attention with causal masking and decay factors
- `_fwd_kv_kernel`: inter-block KV state accumulation with block-level decay
- `lightning_attention()`: Python wrapper dispatching to Triton with automatic block sizing, dtype management, and KV history persistence

Documentation
- `docs/best_practices/MiniMax-M1.md` + `docs/zh/best_practices/MiniMax-M1.md`: bilingual deployment guide
- `docs/supported_models.md` + `docs/zh/supported_models.md`: added MiniMax-M1 to the LLM model table

Usage or Command
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model MiniMaxAI/MiniMax-M1-40k \
  --tensor-parallel-size 8 \
  --max-model-len 40960 \
  --max-num-seqs 64
```

See docs/best_practices/MiniMax-M1.md for the full deployment guide.
Accuracy Tests

Unit Tests (36/36 passed — CI verified on H20 GPU)

- `tests/model_executor/test_minimax_m1.py` (576 lines, 36 function-based tests)
- `_build_attn_type_list` correctness, `_build_slope_tensor` shape and values
- `ModelRegistry` registration and class identity
- `weight_key_map` coverage
- `monkeypatch.setattr` + lightweight stubs

Pre-commit Validation
All hooks passing: black, isort, flake8, ruff, clang-format, trailing whitespace, large file check.
Checklist

- Model code (`minimax_m1.py`, ~800 lines): 9 classes with full weight loading + quantization
- Lightning attention kernels (`lightning_attn.py`, 711 lines)
- Registered `MiniMaxM1ForCausalLM` + `MiniMaxText01ForCausalLM`