314 changes: 314 additions & 0 deletions docs/mimo_acceptance_status.md

Large diffs are not rendered by default.

171 changes: 171 additions & 0 deletions docs/mimo_migration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
# MiMo PaddleFormers Migration Notes

## Scope

This migration adds XiaomiMiMo/MiMo-7B-Base as a Qwen2-compatible decoder model with MiMo-specific configuration and MTP layer registration.

## Implemented

- `MiMoConfig` with `model_type = "mimo"` and MiMo-7B-Base defaults.
- `MiMoModel`, `MiMoForCausalLM`, and `MiMoForCausalLMPipe` Auto registration.
- MTP layers are represented in the model state tree so full HF checkpoints can map the additional weights.
- Tiny model unit tests under `tests/transformers/mimo`.
- Reduced-depth full-width asset and benchmark helpers under `scripts/mimo`.
- GSM8K 300-step SFT configs:
  - `examples/config/sft/mimo_gsm8k_300.yaml`
  - `examples/config/sft/mimo_gsm8k_reduced_depth_fullwidth_300.yaml`
  - `tests/config/benchmark/config/sft/MiMo-7B-Base.yaml`
  - `tests/config/benchmark/config/sft/MiMo-7B-Base-Reduced-Depth-FullWidth.yaml`

## Useful Commands

Create a tiny CI smoke checkpoint:

```bash
python scripts/mimo/create_tiny_random.py --output-dir ./.cache/mimo/tiny-random-mimo
```

Prepare both tiny and reduced CE assets, optionally copying a Qwen2/MiMo tokenizer:

```bash
TOKENIZER_DIR=/path/to/mimo-or-qwen2-tokenizer bash scripts/mimo/prepare_ce_assets.sh
```

Create a same-width reduced-depth checkpoint for acceptance fallback:

```bash
LAYERS=4 OUTPUT_DIR=./.cache/mimo/reduced-depth-4l-fullwidth-random bash scripts/mimo/prepare_reduced_assets.sh
```

Local Paddle native assets are loaded in training configs with:

```yaml
convert_from_hf: false
load_checkpoint_format: sharding_io
save_checkpoint_format: flex_checkpoint
```

Compare full HF and Paddle logits/generation:

```bash
python scripts/mimo/compare_forward.py --model XiaomiMiMo/MiMo-7B-Base --dtype bfloat16
```

For the official checkpoint, the currently validated local workflow is to first
convert the HF safetensors to Paddle native format:

```bash
python scripts/mimo/convert_hf_to_paddle_native.py \
--hf-dir /path/to/MiMo-7B-Base \
--output-dir /path/to/MiMo-7B-Base-paddle-bf16 \
--dtype bfloat16
```

Create a reduced-depth full-width checkpoint from that converted checkpoint:

```bash
python scripts/mimo/create_reduced_from_paddle_checkpoint.py \
--source-dir /path/to/MiMo-7B-Base-paddle-bf16 \
--output-dir ./.cache/mimo/reduced-depth-4l-fullwidth \
--num-hidden-layers 4
```
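
The idea behind a reduced-depth, full-width checkpoint can be sketched as a filter over weight names: every tensor keeps its original shape, and only decoder layers at or above the requested depth are dropped. This is a minimal illustration, not the logic of `create_reduced_from_paddle_checkpoint.py`; the key layout `model.layers.<idx>.` and the helper name `keep_first_layers` are assumptions made for the example.

```python
import re

# Assumed key layout for decoder weights: "model.layers.<idx>.<rest>".
# Keys such as "model.mtp_layers.0.weight" do not match and are kept untouched.
_DECODER_LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")


def keep_first_layers(state_dict, num_hidden_layers):
    """Return a copy of state_dict without decoder layers at index >= num_hidden_layers."""
    reduced = {}
    for name, tensor in state_dict.items():
        match = _DECODER_LAYER_KEY.match(name)
        if match and int(match.group(1)) >= num_hidden_layers:
            continue  # drop layers beyond the requested depth
        reduced[name] = tensor
    return reduced


# Toy state dict with placeholder values instead of real tensors.
toy_state = {
    "model.embed_tokens.weight": "E",
    "model.layers.0.mlp.weight": "L0",
    "model.layers.3.mlp.weight": "L3",
    "model.layers.4.mlp.weight": "L4",
    "model.layers.10.mlp.weight": "L10",
    "model.mtp_layers.0.weight": "MTP0",
}
reduced_state = keep_first_layers(toy_state, num_hidden_layers=4)
print(sorted(reduced_state))
```

A real conversion would also have to write a config whose `num_hidden_layers` matches the reduced depth; that step is left out here.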

Run 300-step GSM8K SFT on the reduced-depth checkpoint:

```bash
CUDA_VISIBLE_DEVICES=2 \
PATH=/path/to/paddle-env/bin:$PATH \
PADDLEFORMERS_DIST_LOG=/tmp/mimo_assets/dist_log \
paddleformers-cli train /tmp/mimo_reduced_real_sft_300.yaml \
2>&1 | tee /tmp/mimo_assets/logs/mimo_reduced_real_sft_300.log
```

Validated local result for the true-weight 4-layer full-width checkpoint:
`eval_loss=2.16945743560791`, `train_loss=3.152836615641912`, and
`Total_Tokens_per_second_per_gpu=666.939347597846`.

Create the matching reduced-depth HF checkpoint and run the ms-swift baseline:

```bash
python scripts/mimo/create_reduced_from_hf_checkpoint.py \
--source-dir /path/to/MiMo-7B-Base \
--output-dir /path/to/MiMo-7B-Base-reduced-4l-hf-bf16 \
--num-hidden-layers 4

CUDA_VISIBLE_DEVICES=2 swift sft \
--model /path/to/MiMo-7B-Base-reduced-4l-hf-bf16 \
--template qwen \
--system 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' \
--dataset /tmp/mimo_assets/ms_swift/gsm8k_train.jsonl \
--val_dataset /tmp/mimo_assets/ms_swift/gsm8k_test.jsonl \
--tuner_type full \
--torch_dtype bfloat16 \
--attn_impl eager \
--max_length 512 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--warmup_steps 20 \
--weight_decay 0.0 \
--adam_beta2 0.999 \
--max_steps 300 \
--eval_steps 50 \
--save_steps 100 \
--logging_steps 1 \
--output_dir /tmp/mimo_assets/ms_swift/output-reduced-4l-300-paddle-aligned \
--report_to none \
--save_total_limit 1 \
--seed 23 \
--data_seed 23
```

Validated local ms-swift result for the same 4-layer full-width checkpoint:
`eval_loss=3.2244072`, `train_loss=4.205`. The curve decreases but remains
higher than Paddle's curve, so this acceptance item is not fully closed yet.

Several controls were run to isolate the remaining loss gap:

- A second ms-swift run with `--lr_scheduler_type linear`, matching Paddle's printed LR schedule, ended at `eval_loss=3.267` and `train_loss=4.266`, so scheduler mismatch is not the primary source of the gap.
- Tokenized sample inputs/labels and sampled mapped weights match between the reduced HF and Paddle checkpoints.
- `adamw_torch_fused` versus `adamw_torch` is ruled out: the non-fused run ended at `eval_loss=3.280`.
- Train-sample shuffle order is ruled out: the no-shuffle run ended at `eval_loss=3.265`.
- A single-sample initial-loss check gave HF shifted loss `14.7959` and Paddle shifted loss `14.6266`, so the remaining gap appears after optimization starts.

The leading suspect is framework-level training semantics, especially mixed
precision/master weights and gradient/loss normalization during gradient
accumulation. A Paddle control with `fp16_opt_level: O1` was attempted, but it
ran out of memory at the first optimizer step while initializing AdamW
accumulators, so this cannot be isolated locally without a larger card or
further memory reductions.
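
To make the gradient-accumulation suspicion concrete, the sketch below shows two common ways of reducing per-token losses over an accumulation window, and that they disagree when micro-batches contain different numbers of supervised tokens. It is only an illustration with made-up numbers; it does not assert which reduction PaddleFormers or ms-swift actually applies, and both function names are hypothetical.

```python
def mean_of_micro_batch_means(micro_batches):
    """Average each micro-batch over its own tokens, then average those averages."""
    per_batch = [sum(losses) / len(losses) for losses in micro_batches]
    return sum(per_batch) / len(per_batch)


def global_token_mean(micro_batches):
    """Average over every supervised token in the whole accumulation window."""
    total = sum(sum(losses) for losses in micro_batches)
    count = sum(len(losses) for losses in micro_batches)
    return total / count


# Four micro-batches (gradient_accumulation_steps = 4) with invented per-token
# losses and different numbers of supervised tokens per sample.
window = [[4.0, 4.0], [2.0] * 6, [3.0] * 4, [1.0] * 8]

print(mean_of_micro_batch_means(window))  # 2.5
print(global_token_mean(window))          # 2.0
```

With equal token counts per micro-batch the two reductions coincide, so a difference of this kind would only show up on variable-length data such as unpacked GSM8K samples.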

Compare compiler on/off inference and training:

```bash
bash scripts/mimo/compare_inference_compile_reduced.sh
bash scripts/mimo/compare_training_compile_reduced.sh
```

Validated local inference compiler result for the true-weight reduced-depth
checkpoint: dynamic `10840.92 tokens/s`, to_static `17253.67 tokens/s`,
speedup `59.15%`.
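
The reported speedup follows from the two throughput figures if speedup is defined as the relative throughput gain over the dynamic baseline. The helper below is a small sketch under that assumption; `relative_speedup_percent` is a name chosen for this example, not a function from the benchmark scripts.

```python
def relative_speedup_percent(baseline_tokens_per_s, candidate_tokens_per_s):
    """Relative throughput gain of the candidate over the baseline, in percent."""
    return (candidate_tokens_per_s / baseline_tokens_per_s - 1.0) * 100.0


dynamic_tps = 10840.92    # dynamic-graph throughput from the run above
to_static_tps = 17253.67  # to_static throughput from the run above

speedup = relative_speedup_percent(dynamic_tps, to_static_tps)
meets_target = speedup >= 20.0  # the acceptance target is a 20% speedup

print(f"{speedup:.2f}%")  # 59.15%
print(meets_target)       # True
```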

Inference with the compiler enabled passed locally with a `59.15%` speedup. For
training, full-parameter static SFT reached the optimizer step and then hit
local GPU memory pressure while creating optimizer states. The LoRA fallback
completed dynamic and static 30-step runs with the same final loss and a
`5.85%` speedup; it is recorded as a resource-constrained static-path
validation, not as meeting the formal 20% full-training target.

## Acceptance Items To Run With Full Assets

1. Single-card forward alignment against Transformers: target a logits difference within `1e-2`.
2. Greedy generation alignment: first 10 generated tokens match Transformers.
3. GSM8K SFT for 300 steps with the hyperparameters in `examples/config/sft/mimo_gsm8k_300.yaml`. Reduced-depth Paddle and ms-swift runs are complete, but the loss curves are not numerically aligned yet.
4. CI/CE tiny model upload and CE config wiring after a PaddleFormers/tiny-random-mimo checkpoint is available.
5. Compiler on/off train and inference benchmark: inference exceeds the 20% target locally; training static mode passes with the LoRA fallback, while full-parameter static training needs a freer/larger GPU for the formal speedup target.
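
Items 1 and 2 can be expressed as two small checks. The sketch below assumes that "logits diff" means the maximum absolute element-wise difference, which the notes do not state explicitly, and it works on plain Python lists rather than framework tensors. The helper names and the sample values are illustrative only.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two flat logit lists."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def first_tokens_match(reference_ids, candidate_ids, n=10):
    """True when both sequences have at least n tokens and the first n are identical."""
    if len(reference_ids) < n or len(candidate_ids) < n:
        return False
    return reference_ids[:n] == candidate_ids[:n]


# Invented values standing in for Transformers and Paddle outputs.
hf_logits = [0.10, -1.25, 3.50]
paddle_logits = [0.104, -1.252, 3.495]
hf_tokens = [11, 42, 7, 7, 3, 99, 15, 8, 23, 4, 16]
paddle_tokens = [11, 42, 7, 7, 3, 99, 15, 8, 23, 4, 50]

forward_aligned = max_abs_diff(hf_logits, paddle_logits) <= 1e-2
generation_aligned = first_tokens_match(hf_tokens, paddle_tokens, n=10)
print(forward_aligned, generation_aligned)
```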

## Notes

MiMo's HF remote code subclasses Qwen2 and does not call the MTP layers during normal causal LM forward. PaddleFormers follows the same behavior: base logits and generation use the Qwen2-compatible main decoder, while the MTP layers are present for checkpoint compatibility and future speculative decoding work.
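
The behavior described in this note can be pictured with a toy model: the MTP layers appear in the state tree, so checkpoint keys have somewhere to load, but the forward pass never calls them. This is a framework-free sketch with hypothetical class names, not the actual `MiMoForCausalLM` implementation.

```python
class ToyLayer:
    """Stand-in for a decoder layer: multiplies its input by a single weight."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __call__(self, x):
        return x * self.weight


class ToyCausalLM:
    def __init__(self, num_hidden_layers=2, num_mtp_layers=1):
        self.layers = [ToyLayer(f"layers.{i}", 2.0) for i in range(num_hidden_layers)]
        # Registered so checkpoint keys have a destination, but never used in forward().
        self.mtp_layers = [ToyLayer(f"mtp_layers.{i}", 100.0) for i in range(num_mtp_layers)]

    def state_dict(self):
        return {f"{layer.name}.weight": layer.weight for layer in self.layers + self.mtp_layers}

    def forward(self, x):
        for layer in self.layers:  # main decoder only; MTP layers are skipped
            x = layer(x)
        return x


model = ToyCausalLM()
print(sorted(model.state_dict()))  # includes "mtp_layers.0.weight"
print(model.forward(1.0))          # 4.0, independent of the MTP weight
```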
2 changes: 2 additions & 0 deletions docs/zh/model_capability.md
@@ -7,6 +7,7 @@
|GLM-4.5|✓|✓|✓|✓|✓|
|GPT-OSS|✓|✓|✓|x|x|
|LLaMA3|✓|✓|✓|✓|✓|
|MiMo|x|✓|✓|x|x|
|Phi4|✓|✓|✓|✓|✓|
|Qwen2|✓|✓|✓|✓|✓|
|Qwen3|✓|✓|✓|✓|✓|
@@ -25,6 +26,7 @@
|GLM-4.5|✓|✓|✓|✓|✓|✓|
|GPT-OSS|✓|✓|x|x|✓|✓|
|LLaMA3|✓|✓|-|x|✓|✓|
|MiMo|✓|✓|-|x|✓|✓|
|Phi4|✓|✓|-|x|✓|✓|
|Qwen2|✓|✓|x|x|✓|✓|
|Qwen3|✓|✓|✓|✓|✓|✓|
57 changes: 57 additions & 0 deletions examples/config/sft/mimo_gsm8k_300.yaml
@@ -0,0 +1,57 @@
### data
train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./data/gsm8k_erniekit/train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: ./data/gsm8k_erniekit/test.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 512
packing: false
mix_strategy: concat
template_backend: custom
template: qwen

### model
model_name_or_path: XiaomiMiMo/MiMo-7B-Base
_attn_implementation: eager

### finetuning
stage: SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 1
max_steps: 300
eval_steps: 50
evaluation_strategy: steps
save_steps: 100
save_strategy: steps
logging_steps: 1
save_total_limit: 1
gradient_accumulation_steps: 4
logging_dir: ./vdl_log_mimo_gsm8k
output_dir: ./checkpoints/mimo-gsm8k-sft-300
disable_tqdm: true
eval_accumulation_steps: 16

### train
warmup_steps: 20
learning_rate: 1.0e-5

### performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage2
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O2
unified_checkpoint: false
save_checkpoint_format: flex_checkpoint
load_checkpoint_format: flex_checkpoint

benchmark: true
58 changes: 58 additions & 0 deletions examples/config/sft/mimo_gsm8k_reduced_depth_fullwidth_300.yaml
@@ -0,0 +1,58 @@
### data
train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./data/gsm8k_erniekit/train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: ./data/gsm8k_erniekit/test.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 512
packing: false
mix_strategy: concat
template_backend: custom
template: qwen

### model
model_name_or_path: ./.cache/mimo/reduced-depth-4l-fullwidth-random
_attn_implementation: eager
convert_from_hf: false

### finetuning
stage: SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 1
max_steps: 300
eval_steps: 50
evaluation_strategy: steps
save_steps: 100
save_strategy: steps
logging_steps: 1
save_total_limit: 1
gradient_accumulation_steps: 4
logging_dir: ./vdl_log_mimo_gsm8k_reduced_depth_fullwidth
output_dir: ./checkpoints/mimo-gsm8k-reduced-depth-fullwidth-sft-300
disable_tqdm: true
eval_accumulation_steps: 16

### train
warmup_steps: 20
learning_rate: 1.0e-5

### performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage2
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O2
unified_checkpoint: false
save_checkpoint_format: flex_checkpoint
load_checkpoint_format: sharding_io

benchmark: true
2 changes: 1 addition & 1 deletion paddleformers/cli/utils/llm_utils.py
@@ -109,7 +109,7 @@ def get_lora_target_modules(model):
             ".*mlp.w2.*",
             ".*mlp.c_proj.*",
         ]
-    elif model.config.model_type == "qwen2":
+    elif model.config.model_type in {"qwen2", "mimo"}:
         target_modules = [
             ".*qkv_proj.*",
             ".*up_gate_proj.*",
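
As a rough picture of what this hunk changes, the sketch below selects module names with the two patterns visible above once `model_type` is `"qwen2"` or `"mimo"`. It is an assumption-laden illustration: the module names are invented, only the two patterns shown in the hunk are listed, and matching with `re.fullmatch` is a guess about how the patterns are consumed, not a statement about the PEFT code.

```python
import re

# Only the two patterns visible in the hunk; the real list continues past it.
QWEN2_STYLE_PATTERNS = [".*qkv_proj.*", ".*up_gate_proj.*"]


def patterns_for(model_type):
    """Return the shared pattern list for Qwen2-compatible model types."""
    if model_type in {"qwen2", "mimo"}:
        return QWEN2_STYLE_PATTERNS
    raise ValueError(f"no LoRA target patterns sketched for {model_type!r}")


def select_lora_targets(module_names, patterns):
    """Keep the module names fully matched by at least one pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [name for name in module_names if any(rx.fullmatch(name) for rx in compiled)]


# Hypothetical module names for one decoder layer.
names = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.mlp.up_gate_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]
selected = select_lora_targets(names, patterns_for("mimo"))
print(selected)
```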
3 changes: 2 additions & 1 deletion paddleformers/nn/norm.py
@@ -62,13 +62,14 @@ def __init__(self, config: PretrainedConfig, hidden_size=None, norm_eps=None, in
             default_initializer=nn.initializer.Constant(1.0),
         )
         self.config = config
+        self.use_fused_rms_norm = config.get("fuse_rms_norm", True)

         if input_is_parallel:
             self.enable_sequence_parallel()

     def forward(self, hidden_states):
         current_device = detect_device()
-        if self.config.get("fuse_rms_norm", True) and current_device != "iluvatar_gpu":
+        if self.use_fused_rms_norm and current_device != "iluvatar_gpu":
             return fused_rms_norm_ext(hidden_states, self.weight, self.variance_epsilon)[0].astype(self.weight.dtype)

         with paddle.amp.auto_cast(False):
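
The effect of reading the flag once in `__init__` can be shown with a toy config that counts lookups. This sketch uses stand-in classes (`CountingConfig`, `ToyNorm`) rather than `PretrainedConfig` or the real norm layer, and it also surfaces the trade-off: a flag cached at construction no longer follows later changes to the config.

```python
class CountingConfig:
    """Toy config that counts how often get() is called."""

    def __init__(self, **values):
        self._values = dict(values)
        self.lookups = 0

    def get(self, key, default=None):
        self.lookups += 1
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value


class ToyNorm:
    def __init__(self, config):
        self.config = config
        # Read once at construction instead of on every forward call.
        self.use_fused_rms_norm = config.get("fuse_rms_norm", True)

    def forward(self, x):
        return ("fused" if self.use_fused_rms_norm else "unfused", x)


config = CountingConfig(fuse_rms_norm=True)
norm = ToyNorm(config)
for step in range(1000):
    norm.forward(step)
print(config.lookups)  # 1

# Trade-off: changing the config afterwards does not affect the cached flag.
config.set("fuse_rms_norm", False)
print(norm.forward(0)[0])  # fused
```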
9 changes: 5 additions & 4 deletions paddleformers/peft/lora/lora_layers.py
@@ -49,11 +49,12 @@
 from ...utils.import_utils import is_paddlefleet_available
 from .utils import rng_ctx

-# Conditionally import paddlefleet modules
-if is_paddlefleet_available():
+try:
+    if not is_paddlefleet_available():
+        raise ImportError("paddlefleet is not available")
     from paddlefleet.transformer.moe.moe_expert import BMMFunction, DeepGEMMBMMFunction
-else:
-    # Define mock objects or alternative implementations when paddlefleet is not available
+except ImportError:
+    # Define mock objects or alternative implementations when paddlefleet is not available.
     class BMMFunction:
         pass

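
The guard pattern used in this hunk, and again in `lora_model.py`, can be reproduced in isolation. The sketch below swaps in a placeholder package name (`hypothetical_fleet_ops`, which is not a real dependency) and adds a `HAS_OPTIONAL_KERNELS` flag for illustration. The point is that both "package not found" and "package found but a submodule import fails" end up in the same fallback branch.

```python
import importlib.util

OPTIONAL_PACKAGE = "hypothetical_fleet_ops"  # placeholder name, not a real package


def is_optional_package_available():
    """Cheap availability probe that only checks whether the top-level package resolves."""
    return importlib.util.find_spec(OPTIONAL_PACKAGE) is not None


try:
    if not is_optional_package_available():
        raise ImportError(f"{OPTIONAL_PACKAGE} is not installed")
    # The import itself can still fail on a partial installation,
    # which is why it sits inside the same try block.
    from hypothetical_fleet_ops.kernels import BMMFunction

    HAS_OPTIONAL_KERNELS = True
except ImportError:

    class BMMFunction:
        """Placeholder so that later references and isinstance checks still resolve."""

    HAS_OPTIONAL_KERNELS = False

print(HAS_OPTIONAL_KERNELS, BMMFunction.__name__)
```

Compared with a plain `if available: ... else: ...`, this form does not trust the probe alone, at the cost of also hiding unrelated `ImportError`s raised inside the optional package.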
9 changes: 6 additions & 3 deletions paddleformers/peft/lora/lora_model.py
@@ -38,8 +38,11 @@
 from ...transformers.model_utils import VLMS
 from ...utils.import_utils import is_paddlefleet_available

-# Conditionally import paddlefleet modules
-if is_paddlefleet_available():
+# Conditionally import paddlefleet modules. A partial paddlefleet installation
+# can exist without paddlefleet_ops, so guard the actual imports as well.
+try:
+    if not is_paddlefleet_available():
+        raise ImportError("paddlefleet is not installed.")
     from paddlefleet.models.gpt import GPTModel as FleetGPTModel
     from paddlefleet.parallel_state import (
         get_tensor_model_parallel_group,
@@ -50,7 +53,7 @@
     )
     from paddlefleet.tensor_parallel import RowParallelLinear as FleetRowParallelLinear
     from paddlefleet.transformer.moe.moe_expert import GroupedMLPExpert
-else:
+except ImportError:
     # Define mock objects or alternative implementations when paddlefleet is not available
     def get_tensor_model_parallel_group():
         return None