Testing ComfyUI with RDNA1 (Radeon RX 5600M) and with the Vega iGPU #3167

Zakhrov · 2026-01-30T04:55:23Z

Zakhrov
Jan 30, 2026

I've been running various Stable Diffusion and even a few LLM models on my trusty and now old Dell G5 SE 5055 laptop since the early days of ROCm 5 where overriding RDNA 1 to the RDNA 2 ISA using the HSA_OVERRIDE_GFX_VERSION environment variable used to work, and I've got to admit, it has been a long and often times frustrating journey.
Here are my findings compiled over the years of using (and fighting with) ComfyUI on ROCm with gfx1010:

Setup

I'm using system installed ROCm with the following tweaks/workarounds:

Built rocBLAS with -D GPU_TARGETS=gfx1010 and installed it over the system provided rocBLAS
Replaced all of the instances of ComposableKernel inside the pytorch source tree with the upstream ComposaleKernel code which has the bits for gfx1010 in it, since the pytorch git repository and submodules still don't have those
Built pytorch using the following flags:

PYTORCH_ROCM_ARCH=gfx1010 USE_FLASH_ATTENTION=OFF USE_MEM_EFF_ATTENTION=OFF pip install -v --no-build-isolation .

Yeah, I know this is a lot of work, but it was something that I came up with before AMD started shipping nightly ROCm wheels, and even then, you still have to build pytorch from source through their build scripts.

Running ComfyUI

These are the arguments and environment variables I run ComfyUI with:

ROCBLAS_TENSILE_LIBPATH=/opt/rocm/lib/rocblas/library MIOPEN_USER_DB_PATH=~/.comfyui/ MIOPEN_FIND_MODE=TRUST_VERIFY_FULL PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:32 HSA_ENABLE_MWAITX=1 HSA_ENABLE_SDMA=1 HIP_VISIBLE_DEVICES=0 python main.py --use-quad-cross-attention --async-offload 16 --cuda-device 0 --disable-pinned-memory

Pinned memory can cause HIP mismatch errors when it moves from text encoding to the actual sampling.

HSA_ENABLE_MWAITX and HSA_ENABLE_SDMA is useful for reducing or even stopping GPU hangs and resets.

Subquadratic cross attention is basically the most stable and best performing attention method you can use on gfx1010.

Split cross attention uses more VRAM, and can cause OOMs unless you use --lowvram or --novram, and is generally not faster.

ComfyUI Model choices and workarounds

Z-image turbo

I use the Q4 GGUF quantized version from unsloth https://huggingface.co/unsloth/Z-Image-Turbo-GGUF/tree/main

Unsloth also has the Qwen3-4B text encoder models in excellent Q4 GGUFs that work very well with RAM offloading: https://huggingface.co/unsloth/Qwen3-4B-GGUF/tree/main

Performance:

Z-image-turbo can run at 7s/it for a 512x512 image.
It can scale up to 2048x2048 pixels without triggering an OOM error, but the inference time starts to blow up as well to 150s/it, at which point you start thermal throttling and the generation time starts to slowly increase - pushing the overall time from around 1 minute for a 512x512 image to 27 minutes for a 2048x2048 image.
1024x1024 is the "sweet spot" between resolution and time. It takes around 30 s/it

WAN 2.2

Wan 2.2 is basically pushing the limit of what the RX 5600M can do, and that is squarely on the 6GB framebuffer it has.
Here again, GGUF quantizations are your friend, and this one I've found gives the best results:
https://huggingface.co/BigDannyPt/Wan-2.2-Remix/tree/main/I2V/v2.0/

The above model has the lightning 4step LoRa built in, and that is essential for generating any useful video at a sane timescale.

Wan 2.2 works best at a 480p resolution, and at a frame length of 33. For longer videos, you can chain it with SVI Pro 2, but at around 500 s/it you'll be spending 1.5 hours to get around 10 seconds of video at 480p.

It can be pushed to 77 frames without causing an OOM error, but each iteration takes over 20 minutes at that point, which makes it unfeasible even for an overnight generation.

Going to lower resolutions of 360p or 240p basically produces crappy video that upscaling cannot fix unless you do another inference pass on the frames using either the Wan 2.2 low noise, or Z-image-turbo - which is painful

You can push 720p at 33 frame chunks if you force --novram, but you have the same problem of 480p at 77 frames - the generation time blows up.

Caveats

Higher resolutions and video generation requires tiled VAE encoding/decoding, the regular VAE decode node can auto fallback to tiled decoding, but that takes a long time on ROCm 7.x, so it is faster to just use the tiled VAE encode/decode nodes for resolutions above 512 pixels.

TTS models and their related custom nodes inside ComfyUI are basically unfeasible at this point.
The models try to load completely in VRAM , and they offer bnb quantization that doesn't work on gfx1010.
So far, I've managed to get Qwen3 TTS to work, and even that is quite slow, and it spits out these warnings:

MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 79736832, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 79736832, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 79736832, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 79736832, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 199315200, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 199315200, provided ptr: 0 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 199315200, provided ptr: 0 size: 0

Vega iGPU

For kicks, I tried to run ComfyUI with the Ryzen 5 4600H's Vega integrated GPU. It requires overriding the HSA with gfx9.0.0 since gfx90c is not a default target.

And it is surprisingly usable! The performance is actually quite decent for something with just 6 CUs and that's mostly thanks to it basically being an extremely tiny Radeon VII - sort of. The 5600M is about 2.6x faster, and probably more if the model fits inside its measly 6GB VRAM, but for very VRAM heavy models like Wan, the iGPU has this weird paradox of being able to run them without worrying about OOMs, but also being super slow. It might actually be feasible for an "overnight" run of something big like Wan infinite talk, but I haven't tried that yet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing ComfyUI with RDNA1 (Radeon RX 5600M) and with the Vega iGPU #3167

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Testing ComfyUI with RDNA1 (Radeon RX 5600M) and with the Vega iGPU #3167

Uh oh!

Uh oh!

Zakhrov Jan 30, 2026

Setup

Running ComfyUI

ComfyUI Model choices and workarounds

Z-image turbo

WAN 2.2

Caveats

Vega iGPU

Replies: 0 comments

Zakhrov
Jan 30, 2026