I've been running various Stable Diffusion models, and even a few LLMs, on my trusty and now old Dell G5 SE 5505 laptop since the early days of ROCm 5, back when overriding RDNA 1 to the RDNA 2 ISA with the HSA_OVERRIDE_GFX_VERSION environment variable still worked, and I've got to admit, it has been a long and oftentimes frustrating journey.
Here are my findings compiled over the years of using (and fighting with) ComfyUI on ROCm with gfx1010:
Setup
I'm using system-installed ROCm with the following tweaks/workarounds:
- Built rocBLAS with -DGPU_TARGETS=gfx1010 and installed it over the system-provided rocBLAS
- Replaced every instance of ComposableKernel inside the PyTorch source tree with the upstream ComposableKernel code, which has the bits for gfx1010 in it, since the PyTorch git repository and its submodules still don't
Yeah, I know this is a lot of work, but it was something that I came up with before AMD started shipping nightly ROCm wheels, and even then, you still have to build pytorch from source through their build scripts.
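For reference, a minimal sketch of what the rocBLAS rebuild could look like; the directory layout and exact cmake invocation here are assumptions pieced together from the description above, not the author's actual commands (check the rocBLAS README for your ROCm version's preferred build path, e.g. its install.sh helper):

```shell
# Sketch of a gfx1010 rocBLAS rebuild (assumed layout, not verbatim commands).
ROCBLAS_ARCH="gfx1010"   # RDNA 1: RX 5600M / 5700 / 5700 XT
if [ -d rocBLAS ]; then
    cd rocBLAS
    # GPU_TARGETS controls which GPU ISAs the build generates kernels for.
    cmake -B build -S . -DGPU_TARGETS="$ROCBLAS_ARCH" -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j"$(nproc)"
    sudo cmake --install build   # installs over the system-provided rocBLAS
fi
```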
Running ComfyUI
These are the arguments and environment variables I run ComfyUI with:
- Pinned memory can cause HIP mismatch errors when ComfyUI moves from text encoding to the actual sampling.
- HSA_ENABLE_MWAITX and HSA_ENABLE_SDMA are useful for reducing, or even stopping, GPU hangs and resets.
- Sub-quadratic cross-attention is basically the most stable and best-performing attention method you can use on gfx1010.
- Split cross-attention uses more VRAM, can cause OOMs unless you use --lowvram or --novram, and is generally not faster.
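The original command block didn't survive the page extraction, so here is a hedged reconstruction of a launch script consistent with the notes in this post; every flag and environment-variable value below is an assumption (check `python main.py --help` for your ComfyUI version):

```shell
# Hedged reconstruction of a ComfyUI launch script for gfx1010;
# values and flags are assumptions, not the author's exact settings.
export HSA_ENABLE_SDMA=0      # assumption: disabling SDMA is a common hang workaround
export HSA_ENABLE_MWAITX=1    # assumption: exact value not stated in the post
# Sub-quadratic attention; --lowvram suits the 6GB framebuffer.
COMFY_ARGS="--use-quad-cross-attention --lowvram"
if [ -f main.py ]; then
    python main.py $COMFY_ARGS
fi
```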
ComfyUI Model choices and workarounds
Z-image turbo
I use the Q4 GGUF quantized version from unsloth: https://huggingface.co/unsloth/Z-Image-Turbo-GGUF/tree/main
Unsloth also has the Qwen3-4B text encoder models in excellent Q4 GGUFs that work very well with RAM offloading: https://huggingface.co/unsloth/Qwen3-4B-GGUF/tree/main
Performance:
Z-image-turbo can run at 7 s/it for a 512x512 image.
It can scale up to 2048x2048 without triggering an OOM error, but inference time blows up to around 150 s/it. At that point the GPU starts thermal throttling and generation slows down even further, pushing the overall time from around 1 minute for a 512x512 image to 27 minutes for a 2048x2048 one.
1024x1024 is the sweet spot between resolution and time, at around 30 s/it.
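As a sanity check on those numbers: ~7 s/it with a total of about one minute at 512x512 implies roughly an 8-step sampler run, a plausible step count for a turbo model, though the post doesn't state it:

```shell
# Back-of-the-envelope step count implied by the 512x512 timings above.
SEC_PER_IT=7     # reported s/it at 512x512
TOTAL_SEC=60     # reported ~1 minute total
STEPS=$((TOTAL_SEC / SEC_PER_IT))   # integer division: 8
echo "implied steps: $STEPS"
```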
WAN 2.2
Wan 2.2 is basically pushing the limit of what the RX 5600M can do, and that is squarely down to its 6GB framebuffer.
Here again, GGUF quantizations are your friend, and this one I've found gives the best results: https://huggingface.co/BigDannyPt/Wan-2.2-Remix/tree/main/I2V/v2.0/
The above model has the lightning 4-step LoRA baked in, which is essential for generating any useful video on a sane timescale.
Wan 2.2 works best at 480p with a frame length of 33. For longer videos you can chain it with SVI Pro 2, but at around 500 s/it you'll be spending 1.5 hours to get roughly 10 seconds of video at 480p.
It can be pushed to 77 frames without causing an OOM error, but each iteration then takes over 20 minutes, which makes it unfeasible even as an overnight generation.
Dropping to 360p or 240p basically produces crappy video that upscaling cannot fix, unless you run another inference pass over the frames with either the Wan 2.2 low-noise model or Z-image-turbo, which is painful.
You can push 720p in 33-frame chunks if you force --novram, but you hit the same problem as 480p at 77 frames: the generation time blows up.
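To put those Wan numbers together: assuming the 4-step lightning LoRA means four sampler iterations per 33-frame chunk (an assumption; the post doesn't spell this out), each chunk at ~500 s/it costs about 33 minutes, which is consistent with multi-chunk SVI runs stretching well past an hour:

```shell
# Rough per-chunk cost for Wan 2.2 I2V with the 4-step lightning LoRA.
STEPS=4          # lightning LoRA step count (from the post)
SEC_PER_IT=500   # reported s/it at 480p, 33 frames
CHUNK_SEC=$((STEPS * SEC_PER_IT))
CHUNK_MIN=$((CHUNK_SEC / 60))   # ~33 minutes per 33-frame chunk
echo "per-chunk: ${CHUNK_MIN} min"
```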
Caveats
Higher resolutions and video generation require tiled VAE encoding/decoding. The regular VAE Decode node can automatically fall back to tiled decoding, but that fallback takes a long time on ROCm 7.x, so it is faster to just use the tiled VAE encode/decode nodes for resolutions above 512 pixels.
TTS models and their related custom nodes inside ComfyUI are basically unfeasible at this point.
The models try to load completely into VRAM, and the bnb quantization they offer doesn't work on gfx1010.
So far I've managed to get Qwen3 TTS to work, and even that is quite slow and spits out a steady stream of warnings.
Vega iGPU
For kicks, I tried to run ComfyUI on the Ryzen 5 4600H's Vega integrated GPU. It requires setting HSA_OVERRIDE_GFX_VERSION=9.0.0, since gfx90c is not a default target.
And it is surprisingly usable! The performance is actually quite decent for something with just 6 CUs, mostly thanks to it basically being an extremely tiny Radeon VII - sort of. The 5600M is about 2.6x faster, and probably more if the model fits inside its measly 6GB of VRAM. But for very VRAM-heavy models like Wan, the iGPU has this weird paradox: it can run them without worrying about OOMs, yet it is also super slow. It might actually be feasible for an "overnight" run of something big like Wan infinite talk, but I haven't tried that yet.
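For the iGPU experiment, the override presumably looks like the following; 9.0.0 maps the gfx90c APU to the gfx900 (Vega 10) code object that ROCm does ship kernels for, and the launch flags are an assumption:

```shell
# Run ComfyUI on the gfx90c Vega iGPU by overriding the reported ISA.
# 9.0.0 = gfx900; gfx90c itself is not a default ROCm compile target.
export HSA_OVERRIDE_GFX_VERSION="9.0.0"
if [ -f main.py ]; then
    python main.py --lowvram   # flag choice is an assumption, not from the post
fi
```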