fix(deps): update dependency accelerate to v1.14.0 by konflux-internal-p02[bot] · Pull Request #321 · red-hat-data-services/lm-evaluation-harness

konflux-internal-p02 · 2025-11-07T04:24:51Z

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package	Change	Age	Confidence
accelerate	`==1.0.1` → `==1.14.0`

Release Notes

huggingface/accelerate (accelerate)

`v1.14.0`: AMD ROCm support, FSDP2 hardening

Compare Source

FSDP2 Improvements

This release brings a large batch of FSDP2 fixes and quality-of-life improvements: correct dtype handling on load, sharding of embeddings/norms, QLoRA crash prevention, and a more robust auto-wrap policy.

Fsdp2 fully_shard embedding and norm by @SunMarc in #4015
Fix fsdp2 load full state dict dtype mismatch by @SunMarc in #4021
Fix region compilation fsdpv2 by @SunMarc in #4022
[FSDP2] Cast model to uniform dtype before fully_shard to fix mixed-dtype AssertionError by @roycho96 in #3985
[FSDP2] Auto-exclude non-floating frozen Params4bit from fully_shard to prevent QLoRA crash by @roycho96 in #3987
fix(FSDP2): auto-wrap policy ignoring _no_split_modules fallback by @JohnGiorgi in #3999
fix: use key-based matching in fsdp2_load_full_state_dict by @roycho96 in #3982
fix: add missing model_has_params4bit guard to fsdp2_load_full_state_dict call by @roycho96 in #3981
Fix to-fsdp2: drop REMOVED / NOT_YET_IMPLEMENTED FSDP1 keys instead of leaking them by @lollinng in #4065
Prevent double-wrapping models in prepare_model() by @joshuaswanson in #3977

AMD ROCm support

Accelerate now works end-to-end on AMD ROCm devices. Thanks @Abdennacer-Badaoui!

Make accelerate work end-to-end on AMD ROCm by @Abdennacer-Badaoui in #4025

Neuron

Further Neuron improvements to reduce recompilation and cover missing device cases.

Add padded allgather and broadcast for Neuron devices to reduce recompilation by @czkkkkkk in #4000
fix: add missing neuron device case by @michaelbenayoun in #4042

Quantization & Offloading

We improved offloading support for quantized models, including Torchao, int8, and tied-weight handling.

Torchao offload by @SunMarc in #3973
Fix int8 offload hook detachment statistics restoration by @jiqing-feng in #4044
Fix keep_in_fp32_modules not working for tied weights in load_and_quantize_model by @jiqing-feng in #4043
Fix dtype_byte_size for FP8 fnuz / e8m0fnu dtypes by @lollinng in #4063

Data Loading

Feat: Support dynamic batch size in BatchSamplerShard with even_batches by @yuxinyuan in #3969
Fix iterable dataset sharding condition when n_shards == num_processes by @SunMarc in #3958
Fix implicit padding in split_between_processes when apply_padding=False and num_samples < num_processes by @3manifold in #4052
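
As a rough illustration of the last fix above, here is a simplified pure-Python model of splitting a list across processes. It is a sketch of the behavior the PR title suggests, not the library code: `split_for_process` is a hypothetical helper, and the slicing and padding rules are assumptions made for the example.

```python
def split_for_process(items, num_processes, process_index, apply_padding=False):
    """Return the slice of `items` that one process would receive.

    Hypothetical stand-in for a per-process split; not accelerate's code.
    """
    per_process, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    chunk = list(items[start:end])
    if apply_padding and items:
        # Pad with the last item so every process gets the same length.
        target = per_process + (1 if extras else 0)
        chunk += [items[-1]] * (target - len(chunk))
    return chunk


# Two samples, four processes: fewer samples than processes.
unpadded = [split_for_process(["a", "b"], 4, rank) for rank in range(4)]
padded = [split_for_process(["a", "b"], 4, rank, apply_padding=True) for rank in range(4)]
print(unpadded)
print(padded)
```

With apply_padding=False, the two surplus processes in this sketch receive an empty list rather than a repeated sample.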

Minor fixes

[DeepSpeed] allow kernels flash-attn in SP by @kashif in #3959
Fix: Conditionally import torch.distributed.algorithms.join in accelerator.py by @0xDELUXA in #3962
Fix is_hf_initialized attribute by @SunMarc in #3976
feat(utils): add max reduction type by @imstevenpmwork in #4027
fix(state): make MLU backend part of the _prepare_backend elif chain by @Anai-Guo in #4057
fix notebook launcher cuda init by @SunMarc in #4059
pytorch-triton-xpu rename to triton-xpu by @sywangyi in #4007
Relax numerical tolerance for XPU in test_big_modeling by @YangKai0616 in #4001
Fix gloo backend error in test_load_checkpoint_and_dispatch_with_broadcast on XPU by @kaixuanliu in #4056
Raise ValueError instead of a bare string in ParallelismConfig.get_device_mesh by @lollinng in #4064
tests: Gracefully handle missing set_device for mps by @booxter in #4028
test: add regression test for no_split_module_classes accepting set type by @UFO0506 in #4048
Fix all tests by @SunMarc in #4072
docs: add aggregate profiler memory example by @aryanputta in #4054
DOC: document missing parameters in load_accelerator_state, find_executable_batch_size, and send_to_device by @kratos0718 in #4051
docs: Fix docstring of fsdp2_prepare_auto_wrap_policy by @slocoro in #4037
Fix DistributedType documentation by @3manifold in #3980
Fix grammar, spelling, and consistency issues across docs and examples by @cihandemir in #3961
docs: fix typos in docstrings, comments, and user docs by @mokashang in #4040
chore: update doc-builder workflow SHA by @rtrompier in #4009
chore: bump doc-builder SHA for main doc build workflow by @rtrompier in #4018
[CI] Bump style-bot SHA + switch to GitHub App by @paulinebm in #4031
Fix TrackioTracker.log() ignoring step parameter by @joshuaswanson in #3975
fix: pass step parameter in TrackioTracker.log() by @liuyun7345 in #3970
fix(tracking): default step=None on tracker.log and accept extra kwargs in MLflowTracker by @1fanwang in #4039
Fix MLflowTracker.store_init_configuration mutating the caller's config dict by @ATOM00blue in #4046
fix(tracker): guard init_trackers and log against None kwargs by @xodn348 in #4026
🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #3992
chore: update build-docker-images-release.yml by @hf-security-analysis[bot] in #4069
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #4049
Bump the actions group with 8 updates by @dependabot[bot] in #4068

Full Changelog: huggingface/accelerate@v1.13.0...v1.14.0

`v1.13.0`: Neuron support, IPEX removal, and distributed training fixes

Compare Source

AWS Neuron support

We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.

Neuron integration by @michaelbenayoun in #3935

XPU Improvements

We've removed the IPEX dependency and made the XPU code more device-agnostic.

using spawn instead of fork for XPU device by @kaixuanliu in #3884
Remove ipex by @yao-matrix in #3883
enhance new codes to XPU, and make them be device agnostic by @yao-matrix in #3890
Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in #3912

FSDP2 Improvements

We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

Upcast FSDP2 parameters only if requires_grad by @ojh31 in #3848
Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in #3878
bug: fsdp cannot load optimizer state using dcp by @flymin in #3904
fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in #3905
Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in #3924

DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.

[SP] fix loss computation example by @kashif in #3858
[SP and CP] error out if both CP and SP enabled by @kashif in #3862
DeepSpeed has its own process group by @kashif in #3916
[Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in #3914
Enable evaluation during deepspeed Sequence Parallel by @jp1924 in #3917

FP8

We've enhanced FP8 training. Thanks @shimizust for fixing torchao support.

Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in #3831
Fix execution with Transformer Engine by @ksivaman in #3852
add MS-AMP deprecation warnings by @neha222222 in #3857

Performance

Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.

Faster import by @SunMarc in #3953
lazy compile disable by @SunMarc in #3947
Disable hook compile by @SunMarc in #3888

Minor fixes

Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in #3850
fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in #3845
Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in #3881
Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in #3895
Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in #3931
Fix stateful dataloader DDP by @SunMarc in #3952
Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in #3886
Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in #3872
Fix logging logic when in_order is set to True by @yuxinyuan in #3280
Fix cpu offload check by @SunMarc in #3946
fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in #3910
Fix async compatibility across python versions by @SunMarc in #3901
fix tp only bug by @sywangyi in #3908
fix parallelism_config None error by @jp1924 in #3927
Np parall fix by @sywangyi in #3900
change the default value of fsdp_min_num_params to int by @CodeMan62 in #3902
Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in #3944
Prepare TP fix by @michaelbenayoun in #3945
feat: added fine tuning example focused on TPUs by @tengomucho in #3847
Remove 8bit force hook for bnb by @SunMarc in #3907
docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in #3929
fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in #3922
Updating support of Megatron-LM by @pengdurice in #3842
Update support of Megatron-LM PR 2 by @pengdurice in #3887
Fix RNG state setting for HPU by @michaelbenayoun in #3936
fix: load the HPU RNG state by @michaelbenayoun in #3937

`v1.12.0`: Deepspeed Ulysses/ALST

Compare Source

Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper https://arxiv.org/abs/2506.13996 or this deepspeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.

To enable Deepspeed Ulysses, first create a ParallelismConfig and set the sp-related arguments:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)

Then, make sure to compute the correct loss, as described in our docs:

        ...
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)
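
The weighting in the snippet above can be checked without a distributed setup. The following is a minimal single-process sketch, assuming each rank reports a mean loss over its own non-ignored tokens. `aggregate_sp_loss` is a hypothetical helper name, and plain Python lists stand in for the gathered tensors.

```python
def aggregate_sp_loss(losses_per_rank, good_tokens_per_rank):
    """Token-weighted mean of per-rank mean losses.

    Hypothetical helper: plain lists stand in for the values that
    all_gather would return on a real sequence-parallel group.
    """
    # Skip ranks with no valid tokens, as the snippet above does;
    # their loss may be NaN.
    total_loss = sum(
        loss * tokens
        for loss, tokens in zip(losses_per_rank, good_tokens_per_rank)
        if tokens > 0
    )
    total_good_tokens = sum(good_tokens_per_rank)
    # max(..., 1) avoids a division by zero when every label is ignored.
    return total_loss / max(total_good_tokens, 1)


# Rank 0 saw 30 valid tokens, rank 1 saw 10, rank 2 saw none.
weighted = aggregate_sp_loss([2.0, 4.0, float("nan")], [30, 10, 0])
print(weighted)  # 2.5
```

A plain average over all three ranks would be NaN here, and averaging only the two valid ranks would give 3.0; weighting by token count gives 2.5.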

Thanks @S1ro1 for starting this work and @stas00 for finishing it. Also thanks @kashif for adding docs and reviewing/testing this PR!

This feature will also be available in HF Trainer thanks to this PR from @stas00: huggingface/transformers#41832

Minor changes

Remove warning for cpu_ram_efficient_loading by @SunMarc in #3816
update typo in bnb quantisation 4bit flag docstring by @hbraith in #3828
ArXiv -> HF Papers by @qgallouedec in #3834
Fix typo in broadcast_object_list docstring by @wsntxxn in #3823
[Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by @naomili0924 in #3835
use self hosted runner by @SunMarc in #3841
device type helper by @kashif in #3843

New Contributors

@hbraith made their first contribution in #3828
@wsntxxn made their first contribution in #3823
@naomili0924 made their first contribution in #3835

Full Changelog: huggingface/accelerate@v1.11.0...v1.12.0

`v1.11.0`: TE MXFP8, FP16/BF16 with MPS, Python 3.10

Compare Source

TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use it, set use_mxfp8_block_scaling in fp8_config. See the NVIDIA docs here: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling

Add support for TE MXFP8 recipe in accelerate by @pstjohn in #3688

FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass mixed_precision = "fp16" or "bf16" when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).
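
As a rough illustration of the version requirements above, the sketch below lists which mixed_precision values would be usable on MPS for a given torch version string. `supported_mps_precisions` is a hypothetical helper, not part of accelerate, and the version parsing is deliberately simple.

```python
def _major_minor(version):
    """Parse strings such as '2.8.0+cpu' into a (major, minor) tuple."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def supported_mps_precisions(torch_version):
    """Hypothetical helper: mixed_precision values usable on MPS."""
    supported = ["no"]
    if _major_minor(torch_version) >= (2, 6):
        supported.append("bf16")  # bf16 requires torch 2.6
    if _major_minor(torch_version) >= (2, 8):
        supported.append("fp16")  # fp16 requires torch 2.8
    return supported


print(supported_mps_precisions("2.7.0"))
```

One of the returned values could then be passed as mixed_precision when constructing the Accelerator.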

Add bf16/fp16 support for amp with mps device by @SunMarc in #3373

FSDP updates

The following PRs add support for ignored_params and no_sync(), respectively, in FSDPv2:

feat: add ignored_params support for fsdp2 by @kmehant in #3731
fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in #3762

Mixed precision can now be passed as a dtype string, either through the accelerate CLI flag or through fsdp_config in the accelerate config file:

feat: allow mixed precision policy as dtype by @kmehant in #3751

Nd-parallel updates

Some minor updates concerning nd-parallelism.

Context Parallelism docs typos fixed by @sergiopaniego in #3761
Feat: add to_json by @S1ro1 in #3743
make torch_native_parallelism examples device agnostic by @yao-matrix in #3759
[ND Parallel] Update examples, cleanup by @S1ro1 in #3737

Bump to Python 3.10

We've dropped support for Python 3.9, as it reached EOL in October.

Bump to python3.10 + update linter by @SunMarc in #3809

Lots of minor fixes:

fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in #3740
xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in #3756
Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in #3744
fix: specify device for process_tensor in example usage by @qgallouedec in #3755
Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in #3776
Fix (skip) cuda cache flush when origin device is cpu and offloaded to meta by @Qubitium in #3796
Fix convert LayerNorm without bias to fp8 by @mjun0812 in #3725
Add optional typing by @cyyever in #3769
refactor: Use with in Accelerator.autocast() instead of __enter__() and __exit__() for more elegant style. by @EquationWalker in #3767
switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in #3773
fix FSDP2 test case failure on XPU by @yao-matrix in #3771
Fix tests by @SunMarc in #3722
Protect import for device_mesh by @SunMarc in #3742
Fix SWANLAB_MODE by @SunMarc in #3808
Fix tracking swanlab by @SunMarc in #3810
refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in #3815
Remove deprecated FindTiedParametersResult by @cyyever in #3786
remove mlflow from testing by @SunMarc in #3783
enable 2 model hook ut cases on XPU by @yao-matrix in #3774
Added Tip for better rendering by @sergiopaniego in #3781
Fix typos by @cyyever in #3753
fix: torch_npu import error in some envs by @yanyongyu in #3764
Fix: typo makes tests fail by @S1ro1 in #3765
fix Muti node CUDA error: invalid device ordinal #3775 by @RicardoDominguez in #3779
use reset_peak_memory_stats on xpu by @yao-matrix in #3772

New Contributors

@mjun0812 made their first contribution in #3725
@sergiopaniego made their first contribution in #3761
@EquationWalker made their first contribution in #3762
@yanyongyu made their first contribution in #3764
@RicardoDominguez made their first contribution in #3779
@SamuelBarryCS made their first contribution in #3776
@Qubitium made their first contribution in #3796

Full Changelog: huggingface/accelerate@v1.10.1...v1.11.0

`v1.10.1`: Patchfix

Compare Source

Feat: add to_json by @S1ro1 in #3743
Protect import for device_mesh by @SunMarc in #3742.

Full Changelog: huggingface/accelerate@v1.10.0...v1.10.1

`v1.10.0`: N-D Parallelism

Compare Source

N-D Parallelism

Training large models across multiple GPUs can be complex, especially when combining different parallelism strategies (e.g., TP, CP, DP). To simplify this process, we've collaborated with Axolotl to introduce an easy-to-use integration that lets you apply any combination of parallelism strategies directly in your training script. Just pass a ParallelismConfig specifying the size of each parallelism type.
Learn more about how it works in our latest blogpost.

parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
    ...
)
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
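
One thing the example above implies is how many processes it needs. The following is a back-of-the-envelope sketch, assuming the device mesh has one dimension per parallelism type so that the world size is the product of the sizes. `required_world_size` is a hypothetical helper, not an accelerate API.

```python
from math import prod


def required_world_size(dp_replicate_size=1, dp_shard_size=1, cp_size=1, tp_size=1):
    """Hypothetical helper: processes needed for a given combination."""
    sizes = (dp_replicate_size, dp_shard_size, cp_size, tp_size)
    if any(size < 1 for size in sizes):
        raise ValueError("every parallelism size must be >= 1")
    # Assumption: one mesh dimension per strategy, so sizes multiply.
    return prod(sizes)


world_size = required_world_size(
    dp_shard_size=2, dp_replicate_size=2, cp_size=2, tp_size=2
)
print(world_size)  # 16
```

Under that assumption, the configuration shown above would need 16 processes.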

Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) by @SalmanMohammadi in #3682
Feat: context parallel v2.0 by @S1ro1 in #3700
set default submesh_tp_size to prevent unset local variable error by @winglian in #3687
Add Parallelism getter property to Accelerator class by @WoosungMyung in #3703
Fix: prepare works even if nothing except tp specified (rare) by @S1ro1 in #3707
Set parallelism_config in constructor due to Trainer reset of State by @winglian in #3713
Fix: tp size wouldn't read from env by @S1ro1 in #3716
Remove ParallelismConfig from PartialState by @SunMarc in #3720

FSDP improvements

We've fixed the ignored modules attribute. With this, it is now possible to train a PEFT model with MoE layers that contain q_proj and v_proj parameters. This is especially important for fine-tuning the gpt-oss model.

ENH: Allow FSDP ignored modules to be regex by @BenjaminBossan in #3698
TST Add test for FSDP ignored_modules as str by @BenjaminBossan in #3719
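
To illustrate what a regex for ignored modules might select, here is a minimal sketch over a list of module names. The names, the pattern, and the `select_ignored` helper are made up for the example, and full-match semantics are an assumption; the accelerate source is the reference for the exact matching rule.

```python
import re


def select_ignored(module_names, pattern):
    """Hypothetical helper: names that a regex would mark as ignored."""
    regex = re.compile(pattern)
    # Assumption: the whole module name must match the pattern.
    return [name for name in module_names if regex.fullmatch(name)]


names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.experts",
    "model.layers.1.mlp.experts",
    "lm_head",
]
ignored = select_ignored(names, r"model\.layers\.\d+\.mlp\.experts")
print(ignored)
```

Here only the two `mlp.experts` entries are selected, so the attention projections stay under FSDP's control.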

Minor improvements

feature: CpuOffload pre_forward don't attempt to move if already on device by @JoeGaffney in #3695
Fix: Ensure environment variable values are case-insensitive in Accelerate by @jp1924 in #3712
remove use_ipex by @SunMarc in #3721

New Contributors

@SalmanMohammadi made their first contribution in #3682
@WoosungMyung made their first contribution in #3703
@jp1924 made their first contribution in #3712
@JoeGaffney made their first contribution in #3695

Full Changelog: huggingface/accelerate@v1.9.0...v1.10.0

`v1.9.0`: Trackio support, Model loading speedup, Minor distributed improvements

Compare Source

Trackio tracker support

We've added support for trackio, a lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.

Main features are:

Local-first design: dashboard runs locally by default. You can also host it on Spaces by specifying a space_id.
Persists logs locally (or in a private Hugging Face Dataset)
Visualize experiments with a Gradio dashboard locally (or on Hugging Face Spaces)
Everything here, including hosting on Hugging Face, is free!

To use it with accelerate, set log_with and initialize the trackers:

from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
config = {"learning_rate": 0.001, "batch_size": 32}

# init_kwargs in order to host the dashboard on spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)

Thanks @pcuenca for the integration !

trackio by @pcuenca in #3669

Model loading speedup when relying on `set_module_tensor_to_device`

Setting a tensor while clearing the cache is very slow, so we added a clear_device option to disable it.
Another small optimization is using non_blocking everywhere and syncing just before returning control to the user. This makes loading slightly faster.

Speedup model loading by 4-5x in Diffusers ⚡ by @a-r-r-o-w in #3674

FSDP, Deepspeed, FP8 minor improvements

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in #3640
Fix FP8 tests, enable FP8 to be used without direct Accelerator() configuring by @pstjohn in #3677
Bunch of FSDP improvements by @S1ro1 in #3671
Fix: properly error when DDP + Dtensor model by @S1ro1 in #3629
Fix fsdp2 example typo by @shimizust in #3657
Added a check in no_sync() to avoid errors when using deepspeed zero2/3 by @xliu0105 in #3656

🚨🚨🚨 Breaking changes 🚨🚨🚨

find_executable_batch_size() no longer halves the batch size after every OOM. Instead, it multiplies the batch size by 0.9. This should help users avoid wasting GPU capacity.

“Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by @SunMarc in #3684
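
To see why the gentler back-off wastes less capacity, the following sketch simulates the batch sizes tried under each factor. `batch_sizes_tried` is a hypothetical simulation, not the library implementation: `largest_that_fits` stands in for the point where a real step stops running out of memory, and truncating to an integer after each reduction is an assumption.

```python
def batch_sizes_tried(starting_batch_size, factor, largest_that_fits):
    """Return every batch size attempted until one fits in memory."""
    if not 0 < factor < 1:
        raise ValueError("factor must be strictly between 0 and 1")
    tried = []
    batch_size = starting_batch_size
    while batch_size > 0:
        tried.append(batch_size)
        if batch_size <= largest_that_fits:
            return tried
        # Simulated OOM: shrink the batch and retry.
        batch_size = int(batch_size * factor)
    raise RuntimeError("no batch size fits in memory")


halving = batch_sizes_tried(128, 0.5, largest_that_fits=100)
gentle = batch_sizes_tried(128, 0.9, largest_that_fits=100)
print(halving)  # [128, 64]
print(gentle)   # [128, 115, 103, 92]
```

In this simulated case halving settles on 64 while the 0.9 factor settles on 92, at the cost of two extra failed attempts.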

What's Changed

[typo] shards instead of shard by @SunMarc in #3645
Docs: Fix typos in gradient accumulation guide by @kilavvy in #3649
xpu enablement on left cases by @yao-matrix in #3654
unpin datasets in examples requirements by @SunMarc in #3681
fix: wandb config not saved in offline mode by @ved1beta in #3648
accelerate/data_loader.py: do not yield if the base_dataloader is empty by @0xnightwind in #3659
warn for invalid keys by @ved1beta in #3613
Update Gaudi runner image to latest SynapseAI and enable previously disabled tests by @IlyasMoutawwakil in #3653

New Contributors

@kilavvy made their first contribution in #3649
@shimizust made their first contribution in #3657
@xliu0105 made their first contribution in #3656
@0xnightwind made their first contribution in #3659

Full Changelog: huggingface/accelerate@v1.8.1...v1.9.0

`v1.8.1`: Patchfix

Compare Source

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in #3640
shards by @SunMarc in #3645

Full Changelog: huggingface/accelerate@v1.8.0...v1.8.1

`v1.8.0`: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation

Compare Source

FSDPv2 refactor + FP8 support

We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!

[FSDP2] Refactor + FP8 by @S1ro1 in #3585

Faster Distributed Training on Intel CPUs

We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.

Set ccl and KMP param in simple launch by @jiqing-feng in #3575

Regional Compilation for DeepSpeed

We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!

Fix deepspeed regional compilation by @IlyasMoutawwakil in #3609

ipex.optimize deprecation

ipex.optimize is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we’ll continue to rely on IPEX for now.

remove ipex.optimize in accelerate by @yao-matrix in #3608

Better XPU Support

We've greatly expanded and stabilized support for Intel XPUs:

enable fsdp2 benchmark on XPU by @yao-matrix in #3590
enable big_model_inference on xpu by @yao-matrix in #3595
enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
enable test_cli & test_example cases on XPU by @yao-matrix in #3578
enable torchao and pippy test cases on XPU by @yao-matrix in #3599
enable regional_compilation benchmark on xpu by @yao-matrix in #3592
fix xpu 8bit value loading by @jiqing-feng in #3623
add device-agnostic GradScaler by @yao-matrix in #3588
add xpu support in TorchTensorParallelPlugin by @yao-matrix in #3627

Trackers

We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution ! We also deferred all tracker initializations to prevent premature setup of distributed environments.

Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in #3581

What's Changed

Fix bf16 training with TP by @SunMarc in #3610
better handle FP8 with and without deepspeed by @IlyasMoutawwakil in #3611
Update Gaudi Runners by @IlyasMoutawwakil in #3593
goodbye torch_ccl by @yao-matrix in #3580
Add support for standalone mode when default port is occupied on single node by @laitifranz in #3576
Resolve logger warnings by @emmanuel-ferdman in #3582
Add kwargs to optimizer, scheduler and dataloader using function accelerator().load_state() by @luiz0992 in #3540
[docs] no hard-coded cuda in the ddp documentation by @faaany in #3589
change to use torch.device by @yao-matrix in #3594
Fix: list object has no attribute keys by @S1ro1 in #3603
Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in #3587
Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in #3619
Add fp8_e5m2 support in dtype_byte_size by @SunMarc in #3625
[Deepspeed] deepspeed auto grad accum by @kashif in #3630
Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in #3631
Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in #3605
Fix Typos in Documentation and Comments by @leopardracer in #3621
feat: use datas

✂ Note

PR body was truncated to here.

Configuration

📅 Schedule: (UTC)

Branch creation
- At any time (no schedule defined)
Automerge
- At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.

If you want to rebase/retry this PR, check this box

To execute skipped test pipelines write comment /ok-to-test.

Documentation

Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.

Signed-off-by: konflux-internal-p02 <170854209+konflux-internal-p02[bot]@users.noreply.github.com>

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 6411323 to 6dda555 Compare November 21, 2025 17:12

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.11.0~~ chore(deps): update dependency accelerate to v1.12.0 Nov 21, 2025

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch 2 times, most recently from c8504f6 to 2467f98 Compare December 18, 2025 18:59

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 2467f98 to 48e29cf Compare January 8, 2026 05:13

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 48e29cf to 97f4b91 Compare January 16, 2026 21:13

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 97f4b91 to b6bd8a2 Compare February 10, 2026 17:35

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.12.0~~ chore(deps): update dependency accelerate to v1.12.0 - autoclosed Feb 18, 2026

konflux-internal-p02 Bot closed this Feb 18, 2026

konflux-internal-p02 Bot deleted the konflux/mintmaker/main/accelerate-1.x branch February 18, 2026 16:59

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.12.0 - autoclosed~~ chore(deps): update dependency accelerate to v1.12.0 Feb 21, 2026

konflux-internal-p02 Bot reopened this Feb 21, 2026

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch 2 times, most recently from b6bd8a2 to 5c4e700 Compare February 21, 2026 01:36

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 5c4e700 to 74510d8 Compare March 4, 2026 21:39

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.12.0~~ chore(deps): update dependency accelerate to v1.13.0 Mar 4, 2026

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 74510d8 to 4423db8 Compare March 11, 2026 17:47

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 4423db8 to 84a870a Compare March 26, 2026 17:49

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 84a870a to c9e9643 Compare April 3, 2026 01:46

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from c9e9643 to 0b3cf29 Compare April 13, 2026 23:07

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 0b3cf29 to d232ac4 Compare May 8, 2026 16:14

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from d232ac4 to 6869687 Compare June 16, 2026 01:00

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.13.0~~ chore(deps): update dependency accelerate to v1.14.0 Jun 16, 2026

konflux-internal-p02 Bot changed the title ~~chore(deps): update dependency accelerate to v1.14.0~~ fix(deps): update dependency accelerate to v1.14.0 Jun 25, 2026

fix(deps): update dependency accelerate to v1.14.0

c73814a

Signed-off-by: konflux-internal-p02 <170854209+konflux-internal-p02[bot]@users.noreply.github.com>

konflux-internal-p02 Bot force-pushed the konflux/mintmaker/main/accelerate-1.x branch from 6869687 to c73814a Compare June 29, 2026 09:47

konflux-internal-p02[bot] wants to merge 1 commit into main from konflux/mintmaker/main/accelerate-1.x
