Add MiMo model support by Belugaaaa · Pull Request #4588 · PaddlePaddle/PaddleFormers

Belugaaaa · 2026-06-01T09:31:15Z

Before submitting

Lint code. If there are lint issues, please format the code first.

Basic checks passed locally:

python3 -m py_compile scripts/mimo/*.py paddleformers/transformers/mimo/*.py tests/transformers/mimo/*.py
bash -n scripts/mimo/*.sh tests/integration_test/mimo_sft_single_card.sh

Unit tests passed locally:

CUDA_VISIBLE_DEVICES=0 python -m unittest tests.transformers.mimo.test_modeling -v

Result: Ran 22 tests ... OK (skipped=3).

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

Added test coverage:

tests/transformers/mimo/test_modeling.py
tests/integration_test/mimo_sft_single_card.sh
tests/config/ci/mimo_sft_single.yaml
tests/config/benchmark/config/sft/MiMo-7B-Base.yaml
tests/config/benchmark/config/sft/MiMo-7B-Base-Reduced-Depth-FullWidth.yaml

PR types

New features

PR changes

Models, Docs

Description

This PR adds PaddleFormers support for MiMo, including:

MiMo model/config implementation.
AutoConfig and AutoModelForCausalLM registration.
HF-to-Paddle checkpoint conversion helpers.
Forward/generation alignment helpers.
Tiny and reduced-depth full-width asset generation helpers.
GSM8K SFT configs for full and reduced-depth validation.
Compiler on/off inference and training benchmark helpers.
Unit tests and single-card SFT CI smoke config.
MiMo acceptance and migration status documentation.

Local validation results:

Full official MiMo FP32 forward/generation alignment passed after converting the HF safetensors to a Paddle native bf16 checkpoint.
Full official FP32 logits: max_diff=0.003246307373046875, mean_diff=4.875975355389528e-05.
Greedy generation first 10 tokens match Transformers.
Reduced-depth full-width Paddle GSM8K 300-step SFT completed successfully.
Paddle reduced-depth SFT final eval_loss=2.16945743560791, train_loss=3.152836615641912.
Reduced-depth inference compiler benchmark passed: dynamic 10840.92 tokens/s, to_static 17253.67 tokens/s, speedup 59.15%.
LoRA training compiler fallback completed dynamic/static 30-step runs with the same final train_loss=11.528600597381592; dynamic 7906.42 tokens/s, static 8369.32 tokens/s, speedup 5.85%.
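The figures above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not the PR's actual alignment or benchmark helper: the function names `logit_diff_stats` and `speedup_percent` are hypothetical, and it assumes that `max_diff` and `mean_diff` denote the maximum and mean absolute element-wise difference between the two frameworks' logits, which the description does not state explicitly.

```python
from typing import Sequence, Tuple


def logit_diff_stats(
    reference: Sequence[float], candidate: Sequence[float]
) -> Tuple[float, float]:
    """Return (max_abs_diff, mean_abs_diff) for two flattened logit vectors.

    Assumption: this mirrors what max_diff / mean_diff mean in the PR text.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


def speedup_percent(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain of optimized over baseline, in percent."""
    return (optimized_tps / baseline_tps - 1.0) * 100.0


# Throughput figures (tokens/s) quoted in the validation results.
inference_gain = speedup_percent(10840.92, 17253.67)
lora_gain = speedup_percent(7906.42, 8369.32)
print(f"inference: {inference_gain:.2f}%  LoRA training: {lora_gain:.2f}%")
```

Rounded to two decimals, the two gains agree with the 59.15% and 5.85% reported in the description.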

Known remaining acceptance items:

Paddle and ms-swift GSM8K loss curves both decrease but are not yet numerically aligned. We checked and ruled out the following visible differences: system prompt, weight_decay, adam_beta2, scheduler type, fused vs. non-fused AdamW, sample shuffle order, MTP vs. no-MTP checkpoint structure, and sampled mapped weights.
Current explanation for the remaining loss gap is framework-level training semantics: ms-swift/HF normalizes accumulated loss by total non-ignored label tokens across the gradient accumulation window, while PaddleFormers computes micro-batch mean loss and averages gradients over gradient_accumulation_steps. On variable-length GSM8K samples this changes per-sample weighting during optimization.
Full-parameter static training reached the optimizer step but hit local GPU memory pressure while creating optimizer states. The LoRA run is documented as a resource-constrained fallback validation and is not treated as satisfying the formal full-training 20% speedup target.
The CE tiny checkpoint still needs to be uploaded to an approved location; after that, the CE baseline losses and generation tokens can be filled in.
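The loss-normalization difference named above as the current explanation for the gap can be illustrated with a toy accumulation window. This is a sketch under stated assumptions, not code from ms-swift, Transformers, or PaddleFormers: both function names are hypothetical, the per-token loss values are made up, and it compares scalar losses only (gradients are linear in the loss, so the same per-token weighting carries over to them).

```python
from typing import List


def token_normalized_loss(micro_batches: List[List[float]]) -> float:
    """Sum every non-ignored token loss in the window, divide by token count.

    Models the ms-swift/HF behaviour described in the PR text.
    """
    total = sum(sum(mb) for mb in micro_batches)
    count = sum(len(mb) for mb in micro_batches)
    return total / count


def mean_of_micro_batch_means(micro_batches: List[List[float]]) -> float:
    """Average each micro-batch first, then average over accumulation steps.

    Models the PaddleFormers behaviour described in the PR text.
    """
    means = [sum(mb) / len(mb) for mb in micro_batches]
    return sum(means) / len(means)


# Illustrative window: a short sample (2 label tokens) and a long one (6).
window = [[1.0, 1.0], [3.0] * 6]

per_token = token_normalized_loss(window)      # (2 + 18) / 8
per_batch = mean_of_micro_batch_means(window)  # (1 + 3) / 2
print(per_token, per_batch)
```

With equal-length micro-batches the two values coincide. With variable-length samples the token-normalized form weights every token equally, while the mean-of-means form weights every micro-batch equally, so tokens in short samples count for more.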

Paddle-CI-Bot · 2026-06-02T03:24:28Z

PaddleFormers Log Analysis

Run #26796278351 · Attempt 1

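Log Analysis Report

Pipeline	Issue label	Suggested fix	Log snippet
Unittest GPU CI	Other (network/image download failure)	Validate the HTTP response status code in `video_processing_utils.py::_decode_and_sample_videos`, or replace the remote image URLs in the test with local fixtures	Error code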

Failed test cases:

FAILED tests/transformers/qwen2_5_vl/test_processor.py::Qwen2_5_VLProcessorTest::test_apply_chat_template_video_frame_sampling

==== 1 failed, 4222 passed, 314 skipped, 15 warnings in 1357.65s (0:22:37) =====

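Root cause analysis:

The last branch of the test sets the video input to a list of frame-image URLs ("url": ["https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg", ...]). _decode_and_sample_videos at video_processing_utils.py:286 passes this list straight through to upstream transformers.image_processing_base.fetch_images → httpx.get(url).content → PIL.Image.open(BytesIO(content)). In the CI environment the httpx request raised no exception (so the HTTP connection itself succeeded), but the returned content was not a valid image (possibly a rate-limit or error response page from BCEBos). PIL could not identify the format and raised PIL.UnidentifiedImageError. This PR (Add MiMo model support) does not directly modify video_processing_utils.py or qwen2_5_vl/processor.py; the failure comes from an existing code path whose missing response-content validation was exposed by network jitter.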

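Suggested fixes:

Validate the HTTP response content: in video_processing_utils.py::_decode_and_sample_videos, when an element of videos is a URL list (frame mode), check that the httpx.get() response has response.status_code == 200 and len(content) > 0; otherwise raise an explicit exception that includes the URL, to make troubleshooting easier.
Remove the test's network dependency: in test_apply_chat_template_video_frame_sampling, replace the remote BCEBos frame-image URLs with valid local JPEG/PNG files written via tempfile, eliminating intermittent failures caused by network jitter.
Short-term workaround: rerun the pipeline to check whether this is an intermittent network issue; if it reproduces consistently, apply the first or second fix above.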
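The first suggestion can be sketched as a small validation step that runs between the download and the image decode. This is an illustration only, not the code in video_processing_utils.py: the helper name `check_image_response` is hypothetical, it takes the status code and body as plain arguments so that it runs without a network, and the magic-byte check is an extra assumption that goes beyond what the bot proposed.

```python
# Leading bytes of the two formats named in the suggestion.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def check_image_response(url: str, status_code: int, content: bytes) -> bytes:
    """Return content unchanged if it looks like a usable image download.

    Raises ValueError naming the URL otherwise, so a CI log shows which
    fetch went wrong instead of a bare image-decoding error.
    """
    if status_code != 200:
        raise ValueError(f"image fetch failed for {url}: HTTP {status_code}")
    if len(content) == 0:
        raise ValueError(f"image fetch returned an empty body for {url}")
    # Assumption beyond the bot's suggestion: also reject bodies that are
    # neither JPEG nor PNG, e.g. an HTML error page served with status 200.
    if not content.startswith((JPEG_MAGIC, PNG_MAGIC)):
        raise ValueError(f"response from {url} is not a JPEG/PNG image")
    return content
```

In a real fetch path the arguments would come from the HTTP response object (its status code and body) before the bytes are handed to the image decoder.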

Updated automatically after each re-run.

Belugaaaa added 3 commits May 26, 2026 21:39

WIP add MiMo model migration

c9da26e

Add MiMo acceptance helpers and validation notes

5eca069

Finalize MiMo acceptance notes and LoRA compile fallback

5027071

Belugaaaa wants to merge 3 commits into PaddlePaddle:develop from Belugaaaa:feat/mimo
