This extension empowers users to spread model components across multiple GPUs or offload them to the CPU within a single ComfyUI workflow. It is aimed at complex workloads that would otherwise require sequential model loading/unloading, and provides:
- Full control over model placement
- Support for all major model loader nodes
- NEW! Hunyuan Video UNet model splitting across two devices using DiffSynth block-swapping
EXPERIMENTAL: This extension modifies ComfyUI's memory management behavior. While functional for standard GPU and CPU configurations, edge cases may exist. Use at your own risk.
Note: This enhances memory management, not parallelism. Workflow steps execute sequentially but with components loaded across your specified devices. Performance gains come from avoiding repeated model loading/unloading when VRAM is constrained.
DisTorch nodes are now available, allowing fine-grained control over model layer distribution across multiple devices for GGUF quantized models. Using a simple allocation string (e.g., "cuda:0,0.025;cuda:1,0.05;cpu,0.10"), you can precisely specify how much memory each device should contribute to hosting model layers (see the parsing sketch after the list below). This enables sophisticated memory management strategies like:
- Splitting large models across multiple GPUs with different VRAM capacities
- Utilizing CPU memory alongside GPU VRAM for handling memory-intensive models
- Optimizing layer placement based on your specific hardware configuration
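For illustration, here is a minimal sketch of how such an allocation string could be parsed into per-device memory fractions; `parse_allocations` is a hypothetical helper, not the extension's actual parser:

```python
# Hypothetical sketch: parse a DisTorch-style allocation string into
# per-device memory fractions. Illustrative only; not the extension's parser.

def parse_allocations(alloc: str) -> dict:
    """Split semicolon-separated 'device,fraction' pairs into a dict."""
    allocations = {}
    for entry in alloc.split(";"):
        device, fraction = entry.split(",")
        allocations[device.strip()] = float(fraction)
    return allocations

print(parse_allocations("cuda:0,0.025;cuda:1,0.05;cpu,0.10"))
# -> {'cuda:0': 0.025, 'cuda:1': 0.05, 'cpu': 0.1}
```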
Check out the updated examples hunyuan_gguf_distorch.json and flux1dev_gguf_distorch.json to see DisTorch in action, demonstrating advanced layer distribution across multiple devices.
Installation via ComfyUI-Manager is preferred: simply search for ComfyUI-MultiGPU in the list of nodes and follow the installation instructions. Alternatively, clone this repository inside `ComfyUI/custom_nodes/`.
The extension automatically creates MultiGPU versions of loader nodes. Each MultiGPU node has the same functionality as its original counterpart but adds a `device` parameter that lets you specify which device to use, as sketched below.
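As a rough, hypothetical sketch of how such variants could be generated (the helper below is illustrative, not the extension's actual code), a MultiGPU node can subclass the original loader and extend its INPUT_TYPES with a device dropdown:

```python
# Hypothetical sketch: generate a MultiGPU variant of a ComfyUI loader node
# by adding a `device` dropdown to its inputs. Names are illustrative; this
# is not the extension's actual implementation.
import torch

def make_multigpu_version(loader_cls):
    devices = ["cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]

    class MultiGPULoader(loader_cls):
        @classmethod
        def INPUT_TYPES(cls):
            # Start from the original node's inputs and add a device dropdown.
            inputs = super().INPUT_TYPES()
            inputs.setdefault("required", {})["device"] = (devices,)
            return inputs

    MultiGPULoader.__name__ = loader_cls.__name__ + "MultiGPU"
    return MultiGPULoader
```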
Currently supported nodes (automatically detected if available):
- Standard ComfyUI model loaders:
  - CheckpointLoaderSimpleMultiGPU
  - CLIPLoaderMultiGPU
  - ControlNetLoaderMultiGPU
  - DualCLIPLoaderMultiGPU
  - TripleCLIPLoaderMultiGPU
  - UNETLoaderMultiGPU
  - VAELoaderMultiGPU
- GGUF loaders (requires ComfyUI-GGUF):
  - UnetLoaderGGUFMultiGPU (supports quantized models like flux1-dev-gguf)
  - UnetLoaderGGUFAdvancedMultiGPU
  - CLIPLoaderGGUFMultiGPU
  - DualCLIPLoaderGGUFMultiGPU
  - TripleCLIPLoaderGGUFMultiGPU
- XLabAI FLUX ControlNet (requires x-flux-comfy):
  - LoadFluxControlNetMultiGPU
- Florence2 (requires ComfyUI-Florence2):
  - Florence2ModelLoaderMultiGPU
  - DownloadAndLoadFlorence2ModelMultiGPU
- LTX Video Custom Checkpoint Loader (requires ComfyUI-LTXVideo):
  - LTXVLoaderMultiGPU
- NF4 Checkpoint Format Loader (requires ComfyUI_bitsandbytes_NF4):
  - CheckpointLoaderNF4MultiGPU
- HunyuanVideoWrapper (requires ComfyUI-HunyuanVideoWrapper):
  - HyVideoModelLoaderMultiGPU
  - HyVideoModelLoaderDiffSynthMultiGPU (NEW: MultiGPU-specific node for offloading to an `offload_device` using MultiGPU's device selectors)
  - HyVideoVAELoaderMultiGPU
  - DownloadAndLoadHyVideoTextEncoderMultiGPU
- Native to ComfyUI-MultiGPU:
  - DeviceSelectorMultiGPU (allows users to link loaders together to use the same selected device)
All MultiGPU nodes available for your install can be found in the "multigpu" category in the node menu.
All workflows have been tested on a 2x 3090 Linux setup, a 4070 Windows 11 setup, and a 3090/1070 Ti Linux setup.
- examples/hunyuan_gguf_distorch.json This workflow attaches a HunyuanVideo GGUF-quantized model to `cuda:0` for compute and distributes its UNet across that device, a secondary CUDA device, and the system's main memory (`cpu`) using the new DisTorch distributed-load methodology. The text encoder attaches to `cuda:1` and splits its layers between `cuda:1` and `cpu`, while the VAE is loaded directly on GPU 1 and uses `cuda:1` for compute.
- examples/flux1dev_gguf_distorch.json This workflow loads a FLUX.1-dev model on `cuda:0` for compute and distributes its UNet across multiple CUDA devices using the new DisTorch distributed-load methodology, while the text encoders and VAE are loaded on GPU 1 and use `cuda:1` for compute.
- examples/hunyuanvideowrapper_diffsynth.json This workflow demonstrates DiffSynth's memory optimization strategy enabled in kijai's ComfyUI-HunyuanVideoWrapper UNet loader, splitting the UNet model across two CUDA devices using block-swapping. The main device handles active computations while blocks are swapped to and from the offload device as needed (a minimal sketch of the idea follows at the end of this examples list). As written, the CLIP loads on `cuda:0` and then offloads, and the VAE is loaded after the UNet model has been cleared from memory after generation. This approach enables processing of higher-resolution or longer-duration videos that would exceed a single GPU's memory capacity, though at the cost of additional processing time. Note that an initial OOM error is expected as the workflow calibrates its memory management strategy; simply run the generation again with the same parameters.
- examples/hunyuanvideowrapper_native_vae.json This workflow uses two of this extension's loaders, putting the main video model on `cuda:0` and the CLIP onto the CPU. It uses Comfy's native VAE loader to load the VAE onto a second CUDA device, keeping the model, VAE, and CLIP in their operating memory space at all times. This combines the benefit of kijai's processing node with the flexibility of a MultiGPU setup.
- examples/hunyuanvideowrapper_select_device.json This workflow loads the main video model and VAE onto a CUDA device and the CLIP onto the CPU. The model and VAE are linked in this example because kijai's own extensive memory management assumes the model and VAE are on the same device.
- examples/flux1dev_2gpu.json This workflow loads a FLUX.1-dev model and splits its components across two GPUs. The UNet model is loaded on GPU 1 while the text encoders and VAE are loaded on GPU 0.
- examples/flux1dev_cpu_1gpu_GGUF.json This workflow demonstrates splitting a quantized GGUF FLUX.1-dev model between the CPU and a single GPU. The UNet model is loaded on the GPU, while the VAE and text encoders are handled by the CPU. Requires ComfyUI-GGUF.
- examples/flux1dev_2gpu_GGUF.json This workflow demonstrates using quantized GGUF models split across multiple GPUs for reduced VRAM usage with the UNet on GPU 1, VAE and text encoders on GPU 0. Requires ComfyUI-GGUF.
- examples/hunyuan_cpu_1gpu_GGUF.json This workflow demonstrates using quantized GGUF models for Hunyuan Video split across the CPU and one GPU. In this instance, a quantized video model's UNet and VAE are on GPU 0, whereas one standard and one GGUF text encoder are on the CPU. Requires ComfyUI-GGUF.
- examples/hunyuan_2gpu_GGUF.json This workflow demonstrates using quantized GGUF models for Hunyuan Video split across multiple GPUs. In this instance, a quantized video model's UNet is on GPU 0, whereas the VAE and text encoders are on GPU 1. Requires ComfyUI-GGUF.
- examples/sdxl_2gpu.json This workflow loads two SDXL checkpoints on two different GPUs. The first checkpoint is loaded on GPU 0, and the second checkpoint is loaded on GPU 1.
- examples/flux1dev_sdxl_2gpu.json This workflow loads a FLUX.1-dev model and an SDXL model in the same workflow. The FLUX.1-dev model has its UNet on GPU 1 with VAE and text encoders on GPU 0, while the SDXL model uses separate allocations on GPU 0.
- examples/florence2_flux1dev_ltxv_cpu_2gpu.json This workflow creates an img2txt2img2vid video generation pipeline by:
- Loading the Florence2 model on the CPU and providing a starting image for analysis and generating a text response
- Loading FLUX.1 Dev UNET on GPU 1, with CLIP and VAE on the CPU and generating an image using the Florence2 text as a prompt
- Loading the LTX Video UNet and VAE on GPU 2, and the LTX-encoded CLIP on the CPU, then taking the resulting FLUX.1 image and providing it as the starting image for an LTX Video image-to-video generation
- Generating a 5-second video based on the provided image. All models are distributed across the available CPU and GPUs with no model reloading on dual 3090s. Requires ComfyUI-GGUF and ComfyUI-LTXVideo.
- examples/device_selector_lowvram_flux_controlnet.json This workflow loads a GGUF version of FLUX.1-dev onto a CUDA device and the T5 CLIP onto the CPU. It also loads the FLUX.1-dev fill ControlNet model by alimama-creative (FLUX.1-dev-Controlnet-Inpainting-Alpha), illustrating how to link two loaders together so their resources always remain in sync.
- examples/llamacpp_ltxv_2gpu_GGUF.json This workflow demonstrates:
- Using a local LLM (loaded on the first GPU via llama.cpp) to take a text suggestion and craft an LTX Video prompt
- Feeding the enhanced prompt to LTXVideo (loaded on the second GPU) for video generation. Requires an appropriate LLM and ComfyUI-GGUF.
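For the DiffSynth block-swapping example above, here is a minimal sketch of the general idea, assuming `blocks` is a list of torch.nn.Module transformer blocks; it illustrates the swap pattern only and is not the DiffSynth or HunyuanVideoWrapper implementation:

```python
# Minimal sketch of block-swapping between a main and an offload device.
# Assumes `blocks` is a list of torch.nn.Module transformer blocks.
# Illustrative only; not the DiffSynth/HunyuanVideoWrapper code.
import torch

@torch.no_grad()
def run_blocks(blocks, x, main_device="cuda:0", offload_device="cuda:1"):
    for block in blocks:
        block.to(main_device)      # swap the active block onto the compute GPU
        x = block(x)               # compute on the main device
        block.to(offload_device)   # swap it back out to free VRAM for the next block
    return x
```

The trade-off is extra device-to-device copies per step in exchange for never holding the full UNet on the compute device, which is why the block-swapping workflow runs slower but fits larger generations.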
If you encounter problems, please open an issue. Attach the workflow if possible.
Originally created by Alexander Dzhoganov. Implementation improved by City96. Currently maintained by pollockjj.