AI Model Dynamic Offloader

This project is a PyTorch VRAM allocator that offloads model weights on demand when the primary PyTorch VRAM allocator comes under pressure.

Support:

Nvidia GPUs only
PyTorch 2.6+
CUDA 12.8+
Windows 11+ / Linux, as covered by Python manylinux support

How it works:

The PyTorch application creates a Virtual Base Address Register (VBAR) for a model. Creating a VBAR doesn't cost any VRAM, only GPU virtual address space (which is pretty much free).
The PyTorch application allocates tensors for model weights within the VBAR. These tensors initially have no VRAM behind them and will segfault if touched.
The PyTorch application faults in each tensor using the fault() API at the time the tensor is needed. This is where VRAM actually gets allocated.

If the `fault()` is successful (sufficient VRAM for this tensor):

If the signature returned by fault() has changed or is unknown:
- The application uses tensor.copy_() to populate the weight data on the GPU.
- The application saves the returned signature against this weight for future comparison.
The layer uses the weight tensor.
The application calls unpin() on the tensor to allow it to be freed under pressure later if needed.

If the `fault()` is unsuccessful (offloaded weight):

The application allocates a temporary regular GPU tensor.
The application uses copy_() to populate the weight data on the GPU.
The layer uses the temporary as the weight.
PyTorch garbage collects the temporary when the layer is finished.

See examples/example.py.
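The per-layer flow described above can also be sketched as a small pure-Python simulation. This is only an illustration under stated assumptions: ToyVBAR, run_layer and demo are hypothetical names invented here, the "signature" is modelled as a plain integer, and nothing below is the comfy-aimdo API. The real calls are the ones in examples/example.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyVBAR:
    """Hypothetical stand-in for a VBAR; not the comfy-aimdo API."""
    budget: int                                   # bytes of "VRAM" the toy may commit
    used: int = 0
    sigs: Dict[str, int] = field(default_factory=dict)
    next_sig: int = 1

    def fault(self, name: str, size: int) -> Optional[int]:
        """Return a signature if the weight is resident, None if it is offloaded."""
        if name in self.sigs:                     # still resident from an earlier fault
            return self.sigs[name]
        if self.used + size > self.budget:
            return None                           # no room: caller falls back to a temporary
        self.used += size
        self.sigs[name] = self.next_sig           # fresh backing memory, fresh signature
        self.next_sig += 1
        return self.sigs[name]

    def unpin(self, name: str) -> None:
        """No-op in this toy; stands for 'this weight may be evicted again'."""


def run_layer(vbar: ToyVBAR, name: str, size: int,
              known: Dict[str, int], log: List[Tuple[str, str]]) -> None:
    sig = vbar.fault(name, size)
    if sig is None:
        log.append((name, "temporary copy"))      # offloaded path
        return
    if known.get(name) != sig:
        log.append((name, "copied to VBAR"))      # signature unknown or changed
        known[name] = sig
    else:
        log.append((name, "no load needed"))      # data from a previous run is still valid
    vbar.unpin(name)


def demo() -> List[Tuple[str, str]]:
    vbar = ToyVBAR(budget=100)
    known: Dict[str, int] = {}
    log: List[Tuple[str, str]] = []
    for _ in range(2):                            # two model iterations
        for name, size in [("w0", 60), ("w1", 30), ("w2", 50)]:
            run_layer(vbar, name, size, known, log)
    return log
```

In this toy, the second iteration skips the copy for w0 and w1 because their signatures are unchanged, while w2 never fits and always takes the temporary-tensor path.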

Priorities:

The most recent VBARs are the highest priority and lower addresses in the VBAR take priority over higher addresses.
Applications should order their tensor allocations in the VBAR in load-priority order with the lowest addresses for the highest priority weights.
Calling fault() on a weight that is higher priority than other weights will cause those lower priority weights to get freed to make space.
Having a weight evicted sets that VBAR's watermark to that weight's level. Any weights in the same VBAR above the watermark automatically fail the fault() API. This avoids constantly faulting in all weights each model iteration while allowing the application to just blindly call fault() every layer and check the results. There is no need for the application to manage any VRAM quotas or watermarks.
Existing VBARs can be pushed to top priority with the prioritize() API. This allows reuse of an already loaded or partially loaded model (e.g. using the same model twice in a complex workflow). Using prioritize() resets the offload watermark of that model to no offloading, giving its weights priority over any other currently loaded models.
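The priority and watermark rules can be illustrated with a toy model. This is a sketch under assumptions, not the allocator's implementation: ToyPool and its methods are hypothetical, weights are identified by their index within a VBAR (lower index meaning lower address), and the watermark is assumed to cover the evicted weight itself as well as everything above it.

```python
class ToyPool:
    """Hypothetical model of the priority rules; not the real allocator."""

    def __init__(self, budget):
        self.budget = budget
        self.order = []          # VBAR names, lowest priority first
        self.sizes = {}          # vbar -> list of weight sizes (index = address order)
        self.resident = {}       # vbar -> set of resident weight indices
        self.watermark = {}      # vbar -> lowest index that auto-fails, or None

    def create(self, vbar, sizes):
        self.order.append(vbar)              # the newest VBAR gets top priority
        self.sizes[vbar] = list(sizes)
        self.resident[vbar] = set()
        self.watermark[vbar] = None

    def prioritize(self, vbar):
        self.order.remove(vbar)
        self.order.append(vbar)              # push to top priority
        self.watermark[vbar] = None          # reset to "no offloading"

    def used(self):
        return sum(self.sizes[v][i] for v in self.order for i in self.resident[v])

    def _victims(self, vbar, idx):
        """Resident weights of lower priority than (vbar, idx), lowest priority first."""
        rank = self.order.index(vbar)
        for r, v in enumerate(self.order):
            for i in sorted(self.resident[v], reverse=True):
                if r < rank or (r == rank and i > idx):
                    yield v, i

    def fault(self, vbar, idx):
        if idx in self.resident[vbar]:
            return True
        wm = self.watermark[vbar]
        if wm is not None and idx >= wm:
            return False                     # at or above the watermark: fail fast
        need = self.sizes[vbar][idx]
        for v, i in list(self._victims(vbar, idx)):
            if self.used() + need <= self.budget:
                break
            self.resident[v].discard(i)      # evict a lower-priority weight
            cur = self.watermark[v]
            self.watermark[v] = i if cur is None else min(cur, i)
        if self.used() + need > self.budget:
            return False
        self.resident[vbar].add(idx)
        return True


def demo():
    pool = ToyPool(budget=100)
    pool.create("A", [40, 40, 40])
    out = [pool.fault("A", i) for i in range(3)]   # third weight does not fit
    pool.create("B", [50])
    out.append(pool.fault("B", 0))                 # evicts A's weight 1
    out.append(pool.fault("A", 1))                 # fails on A's watermark
    pool.prioritize("A")
    out.append(pool.fault("A", 1))                 # A now outranks B and evicts it
    out.append(pool.fault("B", 0))                 # fails on B's watermark
    return out
```

Note how the caller never inspects the budget or the watermark: it calls fault() on every weight and branches on the result, which is the usage pattern the rules above are designed for.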

Backend:

VBAR allocation is done with cuMemAddressReserve(), faulting with cuMemCreate() and cuMemMap(), and all frees with the corresponding converse APIs.
For consistency with VBAR memory management, the main PyTorch allocator plugin is also implemented with cuMemAddressReserve -> cuMemCreate -> cuMemMap. This also behaves a lot better on Windows systems with System Memory fallback.
This allocator is incompatible with the PyTorch cudaMallocAsync and expandable segments backends (the plugin interface does not exist on these backends as of this writing).
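Since both incompatible backends are normally selected through the PYTORCH_CUDA_ALLOC_CONF environment variable, an application could check that variable before loading the allocator. The helper below is a hypothetical sketch (the function name is invented here, and whether comfy-aimdo performs any such check itself is not stated in this README). It relies only on PyTorch's documented comma-separated key:value format.

```python
import os


def incompatible_allocator_settings(conf=None):
    """Return the entries of a PYTORCH_CUDA_ALLOC_CONF string that select the
    cudaMallocAsync backend or expandable segments."""
    if conf is None:
        conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    found = []
    for item in conf.split(","):
        item = item.strip()
        key, _, value = item.partition(":")
        key, value = key.strip(), value.strip()
        if key == "backend" and value == "cudaMallocAsync":
            found.append(item)
        elif key == "expandable_segments" and value.lower() == "true":
            found.append(item)
    return found
```

Called with no argument, it inspects the current environment. A non-empty result means one of the backends named above has been requested.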

Caveats:

There is no real way for this allocator to tell the difference between high usage and bad fragmentation in the PyTorch caching allocator. Because it always returns success to the PyTorch caching allocator, the caching allocator experiences no pressure while weights are being offloaded, which means it can run in an extremely fragmented mode. The assumption is that model weight access patterns are reasonably regular over blocks or iterations, so the caching allocator settles on a good set of sizes to cache. What you should generally do, though, is completely flush the PyTorch caching allocator before each new model run, which prevents completely unused reservations from taking priority over the next model's weights.
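A minimal sketch of such a flush, assuming the standard torch.cuda.empty_cache() call is the intended mechanism (the README does not name one). The helper name is hypothetical, and the function degrades to a no-op when PyTorch or a GPU is not available.

```python
import gc


def flush_pytorch_cache():
    """Release cached, unused blocks held by PyTorch's caching allocator.

    Returns True if a flush was issued, False if PyTorch or a GPU is missing.
    """
    gc.collect()                      # drop unreachable tensors first
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()          # hand cached blocks back to the driver
    return True
```

Calling this between model runs, after the previous model's activations have gone out of scope, is the point at which stale reservations would otherwise compete with the next model's weights.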

Experimental Windows ROCm (TheRock) support

This fork provides a batch script and necessary tweaks to allow comfy-aimdo to build and run with ROCm support on Windows.

Prerequisites

ROCm (TheRock) SDK & PyTorch
Visual Studio 2022 with the Desktop development with C++ workload
CUDA toolkit (for hipify)
Git
ComfyUI

Build from source and install

Open PowerShell.
Navigate to the folder where you want to clone the repository.

For example:

cd C:/

Clone this fork:

git clone https://github.com/0xDELUXA/comfy-aimdo_win-rocm.git

Navigate to the win_rocm_patch directory:

cd comfy-aimdo_win-rocm\win_rocm_patch

Activate your ComfyUI virtual environment.
Run the batch script:

.\build-rocm-windows.bat

The script will automatically compile aimdo.dll along with aimdo.lib and modify the necessary files to be compatible with Windows ROCm. During the process, it will prompt you for the locations of the ROCm SDK core, CUDA Toolkit, and Visual Studio.

To install the built files, run:

cd ..
pip install .

After installing, you must manually copy amdhip64_7.dll from your ROCm SDK into your venv\Lib\site-packages\comfy_aimdo\ folder (the script will specify the required location).

Upon completion, comfy-aimdo should work on Windows ROCm.

Additional notes

comfy-aimdo is only loaded and used if ComfyUI is launched with the --fast flag.
If your startup console prints HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll, comfy-aimdo will not work at all (at least on RDNA4 with Adrenalin 26.1.1). This is why the final manual copy step is required.
The "new" Model Initializing... phase introduced by comfy-aimdo can be quite demanding on AMD GPUs, particularly for larger models, and may even hang.
For some reason, in certain workflows, comfy-aimdo breaks triton-windows on AMD with the following error: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?). As a result, we cannot use SageAttention V1 or FlashAttention-2, only SDPA works (for now).
Running pip install -r requirements.txt will always uninstall the Windows ROCm version of comfy-aimdo and install the Nvidia-only version. After running this command, you need to reinstall the ROCm version with:

venv\Scripts\activate
pip uninstall comfy-aimdo -y
cd comfy-aimdo_win-rocm
pip install .

Alternatively, if you use a batch script to start ComfyUI, you can add the following lines to your script as a workaround to prevent reinstalling the Nvidia-only version:

set "COMFY_PATH=%~dp0"
powershell -Command "Get-Content \"$env:COMFY_PATH\requirements.txt\" | Where-Object { $_ -notmatch 'comfy-aimdo' } | Out-File -Encoding ASCII 'temp_reqs-no_aimdo.txt'"
python -m pip install -r temp_reqs-no_aimdo.txt
del temp_reqs-no_aimdo.txt
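For launchers written in Python rather than batch, the same filtering step can be sketched as follows. The function name is hypothetical; like the PowerShell -notmatch filter above, it drops every line that mentions the package, ignoring case.

```python
def drop_requirement(text, package="comfy-aimdo"):
    """Return requirements text with every line mentioning `package` removed."""
    kept = [line for line in text.splitlines()
            if package.lower() not in line.lower()]
    return "\n".join(kept) + "\n"
```

The result can be written to a temporary file and passed to python -m pip install -r, mirroring the batch workaround.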

After you've built and installed it, you can test whether it works on your system using this script: example_hip.py. It should print [No Load Needed], [Offloaded], etc., and not just error out. Here is my local output. Yes, it isn’t flawless, but at least it does something.
Don't expect comfy-aimdo to improve VRAM management or overall performance on AMD compared to not using it. I just got it working by hipifying the CUDA code and adding some workarounds. It would be appreciated if an AMD developer or a community member with deeper insight could further optimize the build script.
According to the ROCm documentation:

Please note, the virtual memory management functions of the HIP runtime API are implemented on Linux and are under development on Windows.

Tested on Windows 11 with the latest versions of TheRock ROCm (7.12.0a20260224) and PyTorch (2.12.0a0+rocm7.12.0a20260224), using an RDNA4 GPU (AMD Radeon RX 9060 XT), in the latest ComfyUI (v0.14.2), launched with the --fast flag. Works with WAN, Qwen, FLUX, SDXL workloads.

This is experimental and may not function as expected. Even after successful build, installation, and loading, occasional GPU hangs, out-of-memory (OOM) errors, and other runtime issues may occur on AMD hardware running Windows.
