LoRA training compiler fallback completed dynamic/static 30-step runs with the same final train_loss=11.528600597381592; dynamic 7906.42 tokens/s, static 8369.32 tokens/s, speedup 5.85%.
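The reported speedup can be reproduced from the two throughput figures. The sketch below assumes speedup is defined as static throughput divided by dynamic throughput, minus one; the function name `speedup_percent` is hypothetical and does not come from the PR.

```python
def speedup_percent(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain of the optimized run over the baseline, in percent.

    Assumption: speedup = optimized / baseline - 1 (not stated in the PR).
    """
    return (optimized_tps / baseline_tps - 1.0) * 100.0


# Throughput figures (tokens/s) quoted for the LoRA fallback runs.
lora_speedup = speedup_percent(baseline_tps=7906.42, optimized_tps=8369.32)
print(f"{lora_speedup:.2f}%")  # 5.85%
```

Under this definition the quoted numbers are self-consistent, and the result stays well below the 20% target mentioned for formal full-parameter training.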
Known remaining acceptance items:
Paddle and ms-swift GSM8K loss curves both decrease, but are not numerically aligned yet. We aligned and ruled out visible differences including system prompt, weight_decay, adam_beta2, scheduler type, fused vs non-fused AdamW, sample shuffle order, MTP vs no-MTP checkpoint structure, and sampled mapped weights.
Current explanation for the remaining loss gap is framework-level training semantics: ms-swift/HF normalizes accumulated loss by total non-ignored label tokens across the gradient accumulation window, while PaddleFormers computes micro-batch mean loss and averages gradients over gradient_accumulation_steps. On variable-length GSM8K samples this changes per-sample weighting during optimization.
Full-parameter static training reached the optimizer step but hit local GPU memory pressure while creating optimizer states. The LoRA run is documented as a resource-constrained fallback validation and is not treated as satisfying the formal full-training 20% speedup target.
The CE tiny checkpoint still needs to be uploaded to an approved location; once it is, the CE baseline losses and generation tokens can be filled in.
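The loss-normalization difference described in the items above can be illustrated with a toy example. This is a minimal sketch of the two semantics as the PR describes them, not code from ms-swift, HF, or PaddleFormers; the per-token loss values and both function names are made up for illustration.

```python
# Hypothetical per-token losses for one gradient accumulation window with
# two micro-batches of different length. Ignored label tokens are assumed
# to have been filtered out already.
micro_batches = [
    [2.0, 2.0],                       # short sample: 2 label tokens
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],   # long sample: 6 label tokens
]


def token_normalized_loss(batches):
    """Sum every token loss, then divide by the total number of label
    tokens in the window (the behaviour attributed to ms-swift/HF)."""
    total_loss = sum(sum(b) for b in batches)
    total_tokens = sum(len(b) for b in batches)
    return total_loss / total_tokens


def micro_batch_mean_loss(batches):
    """Take the mean loss of each micro-batch, then average over the
    accumulation steps (the behaviour attributed to PaddleFormers)."""
    per_batch_means = [sum(b) / len(b) for b in batches]
    return sum(per_batch_means) / len(batches)


print(token_normalized_loss(micro_batches))  # 10 / 8 = 1.25
print(micro_batch_mean_loss(micro_batches))  # (2.0 + 1.0) / 2 = 1.5
```

In the first scheme every token carries weight 1/8. In the second, each token of the short sample carries weight 1/4 and each token of the long sample 1/12, so short samples influence the gradient more. The two schemes agree only when all micro-batches have the same number of label tokens.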
Before submitting
Basic checks passed locally:
Unit tests passed locally:
Result:
Ran 22 tests ... OK (skipped=3).
... `tests` folder. If there are codecov issues, please add test cases first.
Added test coverage:
- tests/transformers/mimo/test_modeling.py
- tests/integration_test/mimo_sft_single_card.sh
- tests/config/ci/mimo_sft_single.yaml
- tests/config/benchmark/config/sft/MiMo-7B-Base.yaml
- tests/config/benchmark/config/sft/MiMo-7B-Base-Reduced-Depth-FullWidth.yaml
PR types
New features
PR changes
Models, Docs
Description
This PR adds PaddleFormers support for MiMo, including:
Local validation results:
- max_diff=0.003246307373046875, mean_diff=4.875975355389528e-05.
- eval_loss=2.16945743560791, train_loss=3.152836615641912.
- 10840.92 tokens/s, to_static 17253.67 tokens/s, speedup 59.15%.
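The `max_diff` and `mean_diff` figures in the validation results are not defined in the description. One plausible reading, shown here as a sketch only, is the maximum and mean element-wise absolute difference between a reference output and the PaddleFormers output. The helper `diff_stats` and the sample values are hypothetical.

```python
def diff_stats(reference, candidate):
    """Return (max_abs_diff, mean_abs_diff) for two equal-length sequences.

    Assumption: the PR's max_diff/mean_diff are element-wise absolute
    differences; which tensors were compared is not stated in the PR.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(abs_diffs), sum(abs_diffs) / len(abs_diffs)


# Made-up values, chosen to be exactly representable in binary floating point.
reference_output = [0.5, -1.25, 2.0]
candidate_output = [0.5, -1.0, 2.5]

max_diff, mean_diff = diff_stats(reference_output, candidate_output)
print(max_diff, mean_diff)  # 0.5 0.25
```

A small `mean_diff` alongside a larger `max_diff`, as in the reported numbers, would indicate that most elements agree closely and only a few deviate.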