This isn't a success story yet: I've only just gotten it to start going through the training steps. But so far, here are the changes I made and the issues I hit. This is from the tag v0.2.10.
I have a pair of Intel Arc B580 (Battlemage) cards. To say that the AI journey has been painful is an understatement; a lot of software assumes the only GPUs in existence are CUDA-based. Each card has 12 GB of VRAM, the machine has 64 GB of RAM, and the CPU is a Ryzen 9 5900XT.
This is using wan2.2_t2v_low_noise_14B_fp16.safetensors as the model.
Loading with --fp8_base alone fails because the resulting cast causes all sorts of drama (details below).
Pretraining "hanging", both vae and t5: Simple, add "--device xpu" to the command lines
swap_weight_devices_no_cuda is missing an argument. In src/musubi_tuner/modules/custom_offloading_utils.py, find that function and replace the call synchronize_device() with synchronize_device(device), as shown below.
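The change is a single call; in context it is just this (the rest of the function body is unchanged and omitted):

```python
# inside swap_weight_devices_no_cuda() in
# src/musubi_tuner/modules/custom_offloading_utils.py

# before: synchronize_device()   # fails: it isn't told which device to sync
# after:  pass the device the function already has in scope
synchronize_device(device)
```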
Assumed-CUDA issue: in src/musubi_tuner/wan/modules/t5.py, replace the line device=torch.cuda.current_device(), with device=torch.xpu.current_device(),.
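If you want to keep the CUDA path working at the same time, something along these lines should do it (a sketch, not what is in the repo):

```python
import torch

# pick whichever accelerator backend is actually present instead of
# hard-coding CUDA; fall back to CPU if neither is available
if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu", torch.xpu.current_device())
else:
    device = torch.device("cpu")
```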
--fp8_base: got the error RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Float8_e4m3fn. I removed --fp8_base at first, which got things further, but then it failed with UR_OUT_OF_RESOURCES from the XPU (out of VRAM). It turns out that adding --fp8_scaled as well was what was needed.
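So the fp8 part of the training command ends up being both flags together (just the relevant fragment):

```bash
# --fp8_base alone hits the Float vs Float8_e4m3fn matmul error on XPU;
# adding --fp8_scaled (scaled fp8 quantization of the base weights) is what
# finally both fit in 12 GB and ran
... --fp8_base --fp8_scaled ...
```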
Autocast issues. (This was before I added --fp8_scaled to the command line; I haven't tested reverting this change to see whether it's still needed.)
Unfortunately, autocast on XPU isn't that good, and it's been causing issues in src/musubi_tuner/wan/modules/model.py.
Around line 829
e = self.time_embedding(sinusoidal_embedding_1d(self.freq_dim, t).unflatten(0, (bt, seq_len)).float())
This makes the tensor ops complain about mixed precision (Float and Half). The fix below at least gets past the initial run. I'm a noob in Python, so I'm 100% sure there is a more elegant solution:
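The gist is to stop relying on autocast for this call and align the dtypes explicitly. A minimal sketch of that idea, reusing the names from the quoted line (downstream code may still want e back in float32, so treat this as a starting point rather than the exact patch):

```python
# cast the sinusoidal embedding to the dtype the time_embedding weights
# actually hold, instead of letting XPU autocast reconcile Float vs Half
emb_dtype = next(self.time_embedding.parameters()).dtype
e = self.time_embedding(
    sinusoidal_embedding_1d(self.freq_dim, t)
    .unflatten(0, (bt, seq_len))
    .to(emb_dtype)
)
```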
I was tired when I did it, but at least it works enough to start training.
Other issues: launching through accelerate fails because something in the XPU stack (oneCCL going through libfabric) expects fabric-based networking. I tried various ways to turn it off or to fix it, to no avail. The errors:
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1118 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1118 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1158 atl_ofi_get_prov_list: can't create providers for name sockets
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1158 atl_ofi_get_prov_list: can't create providers for name sockets
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1494 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
fails with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1494 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
fails with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_helper.cpp:1649 atl_ofi_open_nw_provs: can not open network providers
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_helper.cpp:1649 atl_ofi_open_nw_provs: can not open network providers
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:1157 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:1157 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:167 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
fails with status: 1
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:167 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
fails with status: 1
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi.cpp:242 init: can't find suitable provider
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi.cpp:242 init: can't find suitable provider
2025:09:07-16:28:03:27500:[0] |CCL_ERROR| atl_ofi_comm.cpp:229 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
failed to initialize ATL
2025:09:07-16:28:03:27501:[1] |CCL_ERROR| atl_ofi_comm.cpp:229 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
failed to initialize ATL
For now I just cut accelerate out, so my final command looks something like this:
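(A sketch of the shape of it; the script name is from memory of the repo layout, and the model/dataset arguments shown are placeholders rather than my exact ones.)

```bash
# instead of:  accelerate launch src/musubi_tuner/wan_train_network.py <training args>
# run the trainer directly as a single process on the XPU:
python src/musubi_tuner/wan_train_network.py \
    --dit /path/to/wan2.2_t2v_low_noise_14B_fp16.safetensors \
    --dataset_config dataset.toml \
    --fp8_base --fp8_scaled \
    <the rest of the usual LoRA training arguments>
```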
I'm training on a set of 111 images. Right now I just want to get a LoRA file out of it.
I don't know what the consequences of removing accelerate are, but so far it's working through the training steps.
In terms of speed, I'm currently at 11.80 s/it. There are 1776 steps, and the estimate looks like 5h30m or so.