This isn't a success story yet: I've only just gotten it to start going through the training steps. But so far, here are the changes I made and the issues I hit. This is from the tag v0.2.10.
I have a pair of Intel Arc B580 (Battlemage) cards. To say that the AI journey has been painful is an understatement; a lot of software assumes the only GPUs in existence are CUDA-based. Each card has 12 GB of VRAM, the machine has 64 GB of RAM, and the CPU is a Ryzen 9 5900XT.
This is using wan2.2_t2v_low_noise_14B_fp16.safetensors as the model.
Loading with --fp8_base alone fails because the resulting cast causes all sorts of drama (details below).
Pretraining "hanging", both vae and t5: Simple, add "--device xpu" to the command lines
swap_weight_devices_no_cuda is missing an argument. In src/musubi_tuner/modules/custom_offloading_utils.py, find that function and replace the call synchronize_device() with synchronize_device(device), as shown below.
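The change is a single call; in context it is just this (the rest of the function body is unchanged and omitted):

```python
# inside swap_weight_devices_no_cuda() in
# src/musubi_tuner/modules/custom_offloading_utils.py

# before: synchronize_device()   # fails: it isn't told which device to sync
# after:  pass the device the function already has in scope
synchronize_device(device)
```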
Assumed-CUDA issue: in src/musubi_tuner/wan/modules/t5.py, replace the line device=torch.cuda.current_device(), with device=torch.xpu.current_device(),.
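If you want to keep the CUDA path working at the same time, something along these lines should do it (a sketch, not what is in the repo):

```python
import torch

# pick whichever accelerator backend is actually present instead of
# hard-coding CUDA; fall back to CPU if neither is available
if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu", torch.xpu.current_device())
else:
    device = torch.device("cpu")
```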
--fp8_base: got the error RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Float8_e4m3fn. I removed --fp8_base at first, which got things further, but then it failed with UR_OUT_OF_RESOURCES from the XPU (out of VRAM). It turns out that adding --fp8_scaled as well was what was needed.
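So the fp8 part of the training command ends up being both flags together (just the relevant fragment):

```bash
# --fp8_base alone hits the Float vs Float8_e4m3fn matmul error on XPU;
# adding --fp8_scaled (scaled fp8 quantization of the base weights) is what
# finally both fit in 12 GB and ran
... --fp8_base --fp8_scaled ...
```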
Autocast issues. (This was before I added --fp8_scaled to the command line; I haven't tested reverting this change to see whether it's still needed.)
Unfortunately, autocast on XPU isn't that good, and it's been causing issues in src/musubi_tuner/wan/modules/model.py.
Around line 829
e = self.time_embedding(sinusoidal_embedding_1d(self.freq_dim, t).unflatten(0, (bt, seq_len)).float())
This makes the tensor ops complain about mixed precision (Float and Half). The fix below at least gets past the initial run. I'm a noob in Python, so I'm 100% sure there is a more elegant solution:
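The gist is to stop relying on autocast for this call and align the dtypes explicitly. A minimal sketch of that idea, reusing the names from the quoted line (downstream code may still want e back in float32, so treat this as a starting point rather than the exact patch):

```python
# cast the sinusoidal embedding to the dtype the time_embedding weights
# actually hold, instead of letting XPU autocast reconcile Float vs Half
emb_dtype = next(self.time_embedding.parameters()).dtype
e = self.time_embedding(
    sinusoidal_embedding_1d(self.freq_dim, t)
    .unflatten(0, (bt, seq_len))
    .to(emb_dtype)
)
```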
I was tired when I did it, but at least it works enough to start training.
Other issues: launching through accelerate fails because something in the XPU stack (oneCCL going through libfabric) expects fabric-based networking. I tried various ways to turn it off or to fix it, to no avail. The errors:
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1118 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1118 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1158 atl_ofi_get_prov_list: can't create providers for name sockets
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1158 atl_ofi_get_prov_list: can't create providers for name sockets
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1494 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
fails with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1494 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
fails with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1649 atl_ofi_open_nw_provs: can not open network providers
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1649 atl_ofi_open_nw_provs: can not open network providers
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:1157 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:1157 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:167 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
fails with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:167 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
fails with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:242 init: can't find suitable provider
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:242 init: can't find suitable provider
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_comm.cpp:229 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
failed to initialize ATL
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_comm.cpp:229 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
failed to initialize ATL
For now I just cut accelerate out, so my final command looks something like this:
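(A sketch of the shape of it; the script name is from memory of the repo layout, and the model/dataset arguments shown are placeholders rather than my exact ones.)

```bash
# instead of:  accelerate launch src/musubi_tuner/wan_train_network.py <training args>
# run the trainer directly as a single process on the XPU:
python src/musubi_tuner/wan_train_network.py \
    --dit /path/to/wan2.2_t2v_low_noise_14B_fp16.safetensors \
    --dataset_config dataset.toml \
    --fp8_base --fp8_scaled \
    <the rest of the usual LoRA training arguments>
```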
I'm training on a set of 111 images. Right now I just want to get a LoRA file out of it.
I don't know what the consequences of removing accelerate are, but so far it's working through the training steps.
In terms of speed, I'm currently at 11.80 s/it. There are 1776 steps, and the estimate looks like 5h30m or so.