PaddlePaddle · Belugaaaa · May 26, 2026 · May 31, 2026 · Jun 1, 2026
diff --git a/docs/mimo_acceptance_status.md b/docs/mimo_acceptance_status.md
diff --git a/docs/mimo_migration.md b/docs/mimo_migration.md
@@ -0,0 +1,171 @@
+# MiMo PaddleFormers Migration Notes
+
+## Scope
+
+This migration adds XiaomiMiMo/MiMo-7B-Base as a Qwen2-compatible decoder model with MiMo-specific configuration and MTP layer registration.
+
+## Implemented
+
+- `MiMoConfig` with `model_type = "mimo"` and MiMo-7B-Base defaults.
+- `MiMoModel`, `MiMoForCausalLM`, and `MiMoForCausalLMPipe` Auto registration.
+- MTP layers are represented in the model state tree so full HF checkpoints can map the additional weights.
+- Tiny model unit tests under `tests/transformers/mimo`.
+- Reduced-depth full-width asset and benchmark helpers under `scripts/mimo`.
+- GSM8K 300-step SFT configs:
+  - `examples/config/sft/mimo_gsm8k_300.yaml`
+  - `examples/config/sft/mimo_gsm8k_reduced_depth_fullwidth_300.yaml`
+  - `tests/config/benchmark/config/sft/MiMo-7B-Base.yaml`
+  - `tests/config/benchmark/config/sft/MiMo-7B-Base-Reduced-Depth-FullWidth.yaml`
+
+## Useful Commands
+
+Create a tiny CI smoke checkpoint:
+
+```bash
+python scripts/mimo/create_tiny_random.py --output-dir ./.cache/mimo/tiny-random-mimo
+```
+
+Prepare both tiny and reduced CE assets, optionally copying a Qwen2/MiMo tokenizer:
+
+```bash
+TOKENIZER_DIR=/path/to/mimo-or-qwen2-tokenizer bash scripts/mimo/prepare_ce_assets.sh
+```
+
+Create a same-width reduced-depth checkpoint for acceptance fallback:
+
+```bash
+LAYERS=4 OUTPUT_DIR=./.cache/mimo/reduced-depth-4l-fullwidth-random bash scripts/mimo/prepare_reduced_assets.sh
+```
+
+Local Paddle native assets are loaded in training configs with:
+
+```yaml
+convert_from_hf: false
+load_checkpoint_format: sharding_io
+save_checkpoint_format: flex_checkpoint
+```
+
+Compare full HF and Paddle logits/generation:
+
+```bash
+python scripts/mimo/compare_forward.py --model XiaomiMiMo/MiMo-7B-Base --dtype bfloat16
+```
+
+For the official checkpoint, the current validated local path is to convert HF
+safetensors to Paddle native first:
+
+```bash
+python scripts/mimo/convert_hf_to_paddle_native.py \
+  --hf-dir /path/to/MiMo-7B-Base \
+  --output-dir /path/to/MiMo-7B-Base-paddle-bf16 \
+  --dtype bfloat16
+```
+
+Create a reduced-depth full-width checkpoint from that converted checkpoint:
+
+```bash
+python scripts/mimo/create_reduced_from_paddle_checkpoint.py \
+  --source-dir /path/to/MiMo-7B-Base-paddle-bf16 \
+  --output-dir ./.cache/mimo/reduced-depth-4l-fullwidth \
+  --num-hidden-layers 4
+```
+
+Run 300-step GSM8K SFT on the reduced-depth checkpoint:
+
+```bash
+CUDA_VISIBLE_DEVICES=2 \
+PATH=/path/to/paddle-env/bin:$PATH \
+PADDLEFORMERS_DIST_LOG=/tmp/mimo_assets/dist_log \
+paddleformers-cli train /tmp/mimo_reduced_real_sft_300.yaml \
+  2>&1 | tee /tmp/mimo_assets/logs/mimo_reduced_real_sft_300.log
+```
+
+Validated local result for the true-weight 4-layer full-width checkpoint:
+`eval_loss=2.16945743560791`, `train_loss=3.152836615641912`, and
+`Total_Tokens_per_second_per_gpu=666.939347597846`.
+
+Create the matching reduced-depth HF checkpoint and run the ms-swift baseline:
+
+```bash
+python scripts/mimo/create_reduced_from_hf_checkpoint.py \
+  --source-dir /path/to/MiMo-7B-Base \
+  --output-dir /path/to/MiMo-7B-Base-reduced-4l-hf-bf16 \
+  --num-hidden-layers 4
+
+CUDA_VISIBLE_DEVICES=2 swift sft \
+  --model /path/to/MiMo-7B-Base-reduced-4l-hf-bf16 \
+  --template qwen \
+  --system 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' \
+  --dataset /tmp/mimo_assets/ms_swift/gsm8k_train.jsonl \
+  --val_dataset /tmp/mimo_assets/ms_swift/gsm8k_test.jsonl \
+  --tuner_type full \
+  --torch_dtype bfloat16 \
+  --attn_impl eager \
+  --max_length 512 \
+  --per_device_train_batch_size 1 \
+  --per_device_eval_batch_size 1 \
+  --gradient_accumulation_steps 4 \
+  --learning_rate 1e-5 \
+  --warmup_steps 20 \
+  --weight_decay 0.0 \
+  --adam_beta2 0.999 \
+  --max_steps 300 \
+  --eval_steps 50 \
+  --save_steps 100 \
+  --logging_steps 1 \
+  --output_dir /tmp/mimo_assets/ms_swift/output-reduced-4l-300-paddle-aligned \
+  --report_to none \
+  --save_total_limit 1 \
+  --seed 23 \
+  --data_seed 23
+```
+
+Validated local ms-swift result for the same 4-layer full-width checkpoint:
+`eval_loss=3.2244072`, `train_loss=4.205`. The curve decreases but remains
+higher than Paddle's curve, so this acceptance item is not fully closed yet.
+
+A second ms-swift run with `--lr_scheduler_type linear` was also completed to
+match Paddle's printed LR schedule. It ended at `eval_loss=3.267` and
+`train_loss=4.266`, so scheduler mismatch is not the primary source of the
+remaining loss gap. Tokenized sample inputs/labels and sampled mapped weights
+match between the reduced HF and Paddle checkpoints. Follow-up controls also
+ruled out `adamw_torch_fused` versus `adamw_torch` and train-sample shuffle
+order: the non-fused run ended at `eval_loss=3.280`, and the no-shuffle run
+ended at `eval_loss=3.265`. A single-sample initial-loss check gave HF shifted
+loss `14.7959` and Paddle shifted loss `14.6266`, so the remaining gap appears
+after optimization starts; the leading suspect is framework-level training
+semantics, especially mixed precision/master weights and gradient/loss
+normalization during gradient accumulation. A Paddle control with
+`fp16_opt_level: O1` was attempted, but it OOMed at the first optimizer step
+while initializing AdamW accumulators, so this cannot be isolated locally
+without a larger card or further memory reductions.
+
+Compare compiler on/off inference and training:
+
+```bash
+bash scripts/mimo/compare_inference_compile_reduced.sh
+bash scripts/mimo/compare_training_compile_reduced.sh
+```
+
+Validated local inference compiler result for the true-weight reduced-depth
+checkpoint: dynamic `10840.92 tokens/s`, to_static `17253.67 tokens/s`,
+speedup `59.15%`.
+
+The inference compiler check passed locally with a `59.15%` speedup. For
+training, full-parameter static SFT reached the optimizer step and then hit
+local GPU memory pressure while creating optimizer states. The LoRA fallback
+completed dynamic and static 30-step runs with the same final loss and a
+`5.85%` speedup; it is recorded as a resource-constrained static-path
+validation, not as the formal full-training 20% target.
+
+## Acceptance Items To Run With Full Assets
+
+1. Single-card forward alignment against Transformers: target logits diff within `1e-2`.
+2. Greedy generation alignment: first 10 generated tokens match Transformers.
+3. GSM8K SFT for 300 steps with the hyperparameters in `examples/config/sft/mimo_gsm8k_300.yaml`. Reduced-depth Paddle and ms-swift runs are complete, but the loss curves are not numerically aligned yet.
+4. CI/CE tiny model upload and CE config wiring after a PaddleFormers/tiny-random-mimo checkpoint is available.
+5. Compiler on/off train and inference benchmark: inference exceeds the 20% target locally; training static mode passes with the LoRA fallback, while full-parameter static training needs a GPU with more free memory to measure the formal speedup target.
+
+## Notes
+
+MiMo's HF remote code subclasses Qwen2 and does not call the MTP layers during normal causal LM forward. PaddleFormers follows the same behavior: base logits and generation use the Qwen2-compatible main decoder, while the MTP layers are present for checkpoint compatibility and future speculative decoding work.
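The migration notes name gradient/loss normalization during gradient accumulation as a leading suspect for the Paddle versus ms-swift loss gap. A minimal sketch of why that can matter: averaging per-micro-batch means weights every micro-batch equally, while a token-weighted mean weights every label token equally, and the two disagree whenever micro-batches carry different numbers of label tokens. The per-token losses below are made up for illustration, and neither function is claimed to be what PaddleFormers or ms-swift actually does.

```python
from typing import List


def mean_of_microbatch_means(token_losses: List[List[float]]) -> float:
    """Average each micro-batch first, then average those means."""
    means = [sum(mb) / len(mb) for mb in token_losses]
    return sum(means) / len(means)


def token_weighted_mean(token_losses: List[List[float]]) -> float:
    """Average over all label tokens in the accumulation window."""
    total = sum(sum(mb) for mb in token_losses)
    count = sum(len(mb) for mb in token_losses)
    return total / count


# Hypothetical per-token losses for 4 accumulated micro-batches of unequal length.
window = [[4.0, 4.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [3.0], [1.0, 1.0, 1.0]]

per_batch = mean_of_microbatch_means(window)  # (4 + 2 + 3 + 1) / 4
per_token = token_weighted_mean(window)       # (8 + 12 + 3 + 3) / 12

# With equal-length micro-batches the two conventions coincide.
equal_window = [[4.0, 2.0], [1.0, 3.0]]
```

With `per_device_train_batch_size: 1` and `gradient_accumulation_steps: 4`, each optimizer step spans four sequences of different lengths, so the two conventions would generally produce different reported losses and different gradient scales.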
diff --git a/docs/zh/model_capability.md b/docs/zh/model_capability.md
@@ -7,6 +7,7 @@
 |GLM-4.5|✓|✓|✓|✓|✓|
 |GPT-OSS|✓|✓|✓|x|x|
 |LLaMA3|✓|✓|✓|✓|✓|
+|MiMo|x|✓|✓|x|x|
 |Phi4|✓|✓|✓|✓|✓|
 |Qwen2|✓|✓|✓|✓|✓|
 |Qwen3|✓|✓|✓|✓|✓|
@@ -25,6 +26,7 @@
 |GLM-4.5|✓|✓|✓|✓|✓|✓|
 |GPT-OSS|✓|✓|x|x|✓|✓|
 |LLaMA3|✓|✓|-|x|✓|✓|
+|MiMo|✓|✓|-|x|✓|✓|
 |Phi4|✓|✓|-|x|✓|✓|
 |Qwen2|✓|✓|x|x|✓|✓|
 |Qwen3|✓|✓|✓|✓|✓|✓|

diff --git a/examples/config/sft/mimo_gsm8k_300.yaml b/examples/config/sft/mimo_gsm8k_300.yaml
@@ -0,0 +1,57 @@
+### data
+train_dataset_type: erniekit
+eval_dataset_type: erniekit
+train_dataset_path: ./data/gsm8k_erniekit/train.jsonl
+train_dataset_prob: "1.0"
+eval_dataset_path: ./data/gsm8k_erniekit/test.jsonl
+eval_dataset_prob: "1.0"
+max_seq_len: 512
+packing: false
+mix_strategy: concat
+template_backend: custom
+template: qwen
+
+### model
+model_name_or_path: XiaomiMiMo/MiMo-7B-Base
+_attn_implementation: eager
+
+### finetuning
+stage: SFT
+fine_tuning: full
+seed: 23
+do_train: true
+do_eval: true
+per_device_eval_batch_size: 1
+per_device_train_batch_size: 1
+num_train_epochs: 1
+max_steps: 300
+eval_steps: 50
+evaluation_strategy: steps
+save_steps: 100
+save_strategy: steps
+logging_steps: 1
+save_total_limit: 1
+gradient_accumulation_steps: 4
+logging_dir: ./vdl_log_mimo_gsm8k
+output_dir: ./checkpoints/mimo-gsm8k-sft-300
+disable_tqdm: true
+eval_accumulation_steps: 16
+
+### train
+warmup_steps: 20
+learning_rate: 1.0e-5
+
+### performance
+tensor_model_parallel_size: 1
+pipeline_model_parallel_size: 1
+sharding: stage2
+recompute_granularity: full
+recompute_method: uniform
+recompute_num_layers: 1
+bf16: true
+fp16_opt_level: O2
+unified_checkpoint: false
+save_checkpoint_format: flex_checkpoint
+load_checkpoint_format: flex_checkpoint
+
+benchmark: true
diff --git a/examples/config/sft/mimo_gsm8k_reduced_depth_fullwidth_300.yaml b/examples/config/sft/mimo_gsm8k_reduced_depth_fullwidth_300.yaml
@@ -0,0 +1,58 @@
+### data
+train_dataset_type: erniekit
+eval_dataset_type: erniekit
+train_dataset_path: ./data/gsm8k_erniekit/train.jsonl
+train_dataset_prob: "1.0"
+eval_dataset_path: ./data/gsm8k_erniekit/test.jsonl
+eval_dataset_prob: "1.0"
+max_seq_len: 512
+packing: false
+mix_strategy: concat
+template_backend: custom
+template: qwen
+
+### model
+model_name_or_path: ./.cache/mimo/reduced-depth-4l-fullwidth-random
+_attn_implementation: eager
+convert_from_hf: false
+
+### finetuning
+stage: SFT
+fine_tuning: full
+seed: 23
+do_train: true
+do_eval: true
+per_device_eval_batch_size: 1
+per_device_train_batch_size: 1
+num_train_epochs: 1
+max_steps: 300
+eval_steps: 50
+evaluation_strategy: steps
+save_steps: 100
+save_strategy: steps
+logging_steps: 1
+save_total_limit: 1
+gradient_accumulation_steps: 4
+logging_dir: ./vdl_log_mimo_gsm8k_reduced_depth_fullwidth
+output_dir: ./checkpoints/mimo-gsm8k-reduced-depth-fullwidth-sft-300
+disable_tqdm: true
+eval_accumulation_steps: 16
+
+### train
+warmup_steps: 20
+learning_rate: 1.0e-5
+
+### performance
+tensor_model_parallel_size: 1
+pipeline_model_parallel_size: 1
+sharding: stage2
+recompute_granularity: full
+recompute_method: uniform
+recompute_num_layers: 1
+bf16: true
+fp16_opt_level: O2
+unified_checkpoint: false
+save_checkpoint_format: flex_checkpoint
+load_checkpoint_format: sharding_io
+
+benchmark: true
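Both SFT configs share the same step budget. A small sketch of what that budget implies, using the values from the YAML above; the single-device count is an assumption taken from the single-card commands in the migration notes, and the token figure is only an upper bound because `packing` is `false` and each sequence holds at most `max_seq_len` tokens.

```python
# Values copied from the SFT configs in this PR.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 300
warmup_steps = 20
max_seq_len = 512

# Assumption: one visible GPU, as in the documented single-card runs.
num_devices = 1

# Sequences consumed per optimizer step and over the whole run.
sequences_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
total_sequences = sequences_per_step * max_steps

# Upper bound on tokens seen, reached only if every sequence is full length.
max_total_tokens = total_sequences * max_seq_len

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / max_steps
```

The ms-swift baseline command in the notes uses the same batch size, accumulation steps, warmup steps, and step count, so both runs consume the same number of sequences under this assumption.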
diff --git a/paddleformers/cli/utils/llm_utils.py b/paddleformers/cli/utils/llm_utils.py
@@ -109,7 +109,7 @@ def get_lora_target_modules(model):
             ".*mlp.w2.*",
             ".*mlp.c_proj.*",
         ]
-    elif model.config.model_type == "qwen2":
+    elif model.config.model_type in {"qwen2", "mimo"}:
         target_modules = [
             ".*qkv_proj.*",
             ".*up_gate_proj.*",
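This hunk routes `mimo` through the same LoRA target patterns as `qwen2`. A minimal sketch of how such patterns pick out sublayers, assuming full-match regex semantics; the matching code in PaddleFormers is not part of this diff, the layer names are hypothetical, and only the two patterns visible in the hunk are used.

```python
import re
from typing import Iterable, List

# The two patterns visible in the hunk; the real list is longer.
QWEN2_STYLE_PATTERNS = [".*qkv_proj.*", ".*up_gate_proj.*"]


def select_lora_targets(names: Iterable[str], patterns: List[str]) -> List[str]:
    """Return the names that fully match at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in names if any(c.fullmatch(n) for c in compiled)]


# Hypothetical sublayer names, for illustration only.
names = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.mlp.up_gate_proj",
    "model.layers.0.input_layernorm",
]
selected = select_lora_targets(names, QWEN2_STYLE_PATTERNS)
```

Reusing the Qwen2 patterns only works if the MiMo modules keep the same fused projection names, which is consistent with the Qwen2-compatible decoder described in the migration notes.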

diff --git a/paddleformers/nn/norm.py b/paddleformers/nn/norm.py
@@ -62,13 +62,14 @@ def __init__(self, config: PretrainedConfig, hidden_size=None, norm_eps=None, in
             default_initializer=nn.initializer.Constant(1.0),
         )
         self.config = config
+        self.use_fused_rms_norm = config.get("fuse_rms_norm", True)
 
         if input_is_parallel:
             self.enable_sequence_parallel()
 
     def forward(self, hidden_states):
         current_device = detect_device()
-        if self.config.get("fuse_rms_norm", True) and current_device != "iluvatar_gpu":
+        if self.use_fused_rms_norm and current_device != "iluvatar_gpu":
             return fused_rms_norm_ext(hidden_states, self.weight, self.variance_epsilon)[0].astype(self.weight.dtype)
 
         with paddle.amp.auto_cast(False):
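This hunk moves the `fuse_rms_norm` lookup from `forward` into `__init__`. One consequence worth noting is that the flag is now a snapshot: changing the config after the layer is built no longer switches code paths. The sketch below illustrates that difference with a hypothetical dict-based stand-in for `PretrainedConfig`; it does not use Paddle and only mirrors the `.get()` calls shown in the hunk.

```python
class FakeConfig(dict):
    """Hypothetical stand-in for PretrainedConfig; only .get() is used."""


class NormFlagPerCall:
    def __init__(self, config):
        self.config = config

    def uses_fused(self) -> bool:
        # Old behaviour: look the flag up on every forward call.
        return self.config.get("fuse_rms_norm", True)


class NormFlagCached:
    def __init__(self, config):
        self.config = config
        # New behaviour: snapshot the flag once at construction.
        self.use_fused_rms_norm = config.get("fuse_rms_norm", True)

    def uses_fused(self) -> bool:
        return self.use_fused_rms_norm


config = FakeConfig()
old_layer, new_layer = NormFlagPerCall(config), NormFlagCached(config)
config["fuse_rms_norm"] = False  # flag flipped after the layers exist
```

After the flip, `old_layer.uses_fused()` follows the config while `new_layer.uses_fused()` keeps the value seen at construction, so any script that toggles the flag on an existing model would need to set `use_fused_rms_norm` on the layers as well.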

diff --git a/paddleformers/peft/lora/lora_layers.py b/paddleformers/peft/lora/lora_layers.py
@@ -49,11 +49,12 @@
 from ...utils.import_utils import is_paddlefleet_available
 from .utils import rng_ctx
 
-# Conditionally import paddlefleet modules
-if is_paddlefleet_available():
+try:
+    if not is_paddlefleet_available():
+        raise ImportError("paddlefleet is not available")
     from paddlefleet.transformer.moe.moe_expert import BMMFunction, DeepGEMMBMMFunction
-else:
-    # Define mock objects or alternative implementations when paddlefleet is not available
+except ImportError:
+    # Define mock objects or alternative implementations when paddlefleet is not available.
     class BMMFunction:
         pass
 

diff --git a/paddleformers/peft/lora/lora_model.py b/paddleformers/peft/lora/lora_model.py
@@ -38,8 +38,11 @@
 from ...transformers.model_utils import VLMS
 from ...utils.import_utils import is_paddlefleet_available
 
-# Conditionally import paddlefleet modules
-if is_paddlefleet_available():
+# Conditionally import paddlefleet modules. A partial paddlefleet installation
+# can exist without paddlefleet_ops, so guard the actual imports as well.
+try:
+    if not is_paddlefleet_available():
+        raise ImportError("paddlefleet is not installed.")
     from paddlefleet.models.gpt import GPTModel as FleetGPTModel
     from paddlefleet.parallel_state import (
         get_tensor_model_parallel_group,
@@ -50,7 +53,7 @@
     )
     from paddlefleet.tensor_parallel import RowParallelLinear as FleetRowParallelLinear
     from paddlefleet.transformer.moe.moe_expert import GroupedMLPExpert
-else:
+except ImportError:
     # Define mock objects or alternative implementations when paddlefleet is not available
     def get_tensor_model_parallel_group():
         return None
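Both LoRA hunks replace an `if/else` on `is_paddlefleet_available()` with `try/except ImportError`, so an installation that passes the availability probe but fails at import time still falls back to the stubs. A minimal sketch of that pattern, with a hypothetical probe and a hypothetical module name that is assumed not to exist, to simulate the partial-install case:

```python
import importlib


def is_backend_available() -> bool:
    """Hypothetical availability probe that wrongly reports success."""
    return True


try:
    if not is_backend_available():
        raise ImportError("backend is not available")
    # Hypothetical module name; the import fails as in a partial installation.
    backend = importlib.import_module("hypothetical_missing_backend_xyz")
    BMMFunction = backend.BMMFunction
    USING_STUB = False
except ImportError:
    # Fallback stub so module-level references still resolve.
    class BMMFunction:
        pass

    USING_STUB = True
```

`ModuleNotFoundError` is a subclass of `ImportError`, so the single `except` clause covers both the explicit raise and a failed import of a missing submodule.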