12x Faster MoE Training + Embedding support! #4020
danielhanchen announced in Announcements
Our first release of 2026! This year we've got a lot of exciting things coming, and to kick things off we're introducing faster MoE training, embedding model support, and ultra long context for Reinforcement Learning. We'll also be launching our brand new UI very soon.
We’d like to thank all of you for 50K stars on GitHub! ⭐
We’ve also added support for many new models that you can now run and fine-tune locally, including DeepSeek-OCR 2, GLM-4.7-Flash, Kimi-2.5, and more.
🚀 Faster MoE training
You can now train MoE models 12x faster with 35% less VRAM and 6x longer context via our new Triton and math kernels, with no accuracy loss. gpt-oss-20b training fits in 12.8GB of VRAM, and Qwen3-30B-A3B (16-bit LoRA) uses 63GB.
Unsloth supports fast training for gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1/V3-architecture, and GLM (4.7, Flash) models, as sketched below.
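As a rough illustration (not taken from the release itself), a minimal LoRA setup for one of the supported MoE models might look like the sketch below. The repo id, sequence length, and LoRA hyperparameters are assumptions for demonstration; the faster MoE kernels are expected to apply automatically when the model is loaded through Unsloth, with no extra flag shown here.

```python
# Minimal sketch: QLoRA fine-tuning setup for a MoE model with Unsloth.
# The repo id, max_seq_length and LoRA settings below are illustrative only.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/gpt-oss-20b",  # assumed Hugging Face repo id
    max_seq_length = 4096,
    load_in_4bit   = True,                   # QLoRA; use False for 16-bit LoRA
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```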
Faster MoE Blog
🔎 Embedding models now train 2× faster
We collaborated with Hugging Face to enable 1.8-3.3x faster training of embedding, BERT, and classifier models, with 20% less VRAM, 2x longer context, and no accuracy loss vs. FA2 setups.
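For orientation only, here is a hedged sketch of what loading an encoder/embedding model through Unsloth could look like. `FastModel`, the `full_finetuning` flag, and the BERT repo id are assumptions based on Unsloth's existing API, not necessarily the definitive entry point for this feature.

```python
# Hypothetical sketch: loading a BERT-style encoder for embedding/classifier
# training through Unsloth. The exact entry point may differ from this.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name      = "google-bert/bert-base-uncased",  # any BERT-style encoder
    max_seq_length  = 512,
    full_finetuning = True,  # encoders are often fully fine-tuned rather than LoRA'd
)
```

From there, the model can be trained with the usual Hugging Face or Sentence Transformers trainers.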
Embedding model Blog
💡 Ultra Long Context RL is here
We’re introducing new batching algorithms to enable ~7x longer context (can be more than 12x) RL training with no accuracy or speed degradation vs. other optimized setups that use FA3, kernels & chunked losses.
Unsloth now trains gpt-oss QLoRA with 380K context on a single 192GB NVIDIA B200 GPU.
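As a sketch of what a long-context RL run looks like on the user side, the snippet below pairs Unsloth with TRL's GRPO configuration. The lengths and repo id are illustrative assumptions (the 380K figure assumes a 192GB B200 as stated above), the new batching algorithms are internal so nothing special is configured for them, and the dataset and reward functions are omitted.

```python
# Hypothetical sketch: long-context GRPO (RL) configuration with Unsloth + TRL.
# Context lengths and repo id are illustrative; dataset/reward setup omitted.
from unsloth import FastLanguageModel
from trl import GRPOConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/gpt-oss-20b",  # assumed repo id
    max_seq_length = 380_000,                # headline figure on a 192GB B200
    load_in_4bit   = True,                   # QLoRA
)

training_args = GRPOConfig(
    max_prompt_length           = 4_096,
    max_completion_length       = 8_192,
    per_device_train_batch_size = 1,
    output_dir                  = "outputs",
)
```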
Long Context RL Blog
🔮 New models
🎉 Extra Updates
trl==0.27.1 and transformers==5.1.0 are now well supported. Previous coverage was 30% of all our 120 notebooks, but we now have >80% coverage, and we plan to make it 100% over the next few days.
📖 New Guides
Tip
Update Unsloth via
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
If you want PyTorch 2.9:
pip install --upgrade unsloth unsloth_zoo
February is shaping up to be an amazing month for LLM releases, and we hope you're just as excited as we are. 😊
What's Changed
Unsloth Zoo Changes
Fix FastLanguageModel.for_inference() when model.eval() is called (#392) and transformers 5 support, by @electroglyph in unsloth-zoo#393
New Contributors
Full Changelog: December-2025...February-2026