Brief instructions for running RFDiffusion on an AMD GPU using DGL ported to HIP (https://github.com/nod-ai/dgl).
The Dockerfile is at [Dockerfile.rocm](./Dockerfile.rocm).
To build the Docker container:

```bash
docker build -t rfdiffusion-dgl-rocm - <Dockerfile.rocm
```

To run the Docker container:

```bash
docker run -it \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --ipc=host \
  --shm-size 16G \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  rfdiffusion-dgl-rocm
```
This will give you a bash shell in the /rfdiffusion directory (a clone of the RFDiffusion git repository).
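To quickly verify that the GPU is visible inside the container, something like the following should work (assuming the container's PyTorch is a ROCm build, on which `torch.cuda.is_available()` returns `True` just as on CUDA):

```bash
# Confirm the AMD GPU is visible to the ROCm runtime and to PyTorch.
rocm-smi
python -c "import torch; print(torch.cuda.is_available())"  # expect: True
```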
The `docker run` command above is just an example. For the official documentation on running ROCm containers, see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html
The Dockerfile will be the most reliable documentation of the dependencies, but generally you need the latest ROCm plus two header-only patches: ROCm/rocPRIM#662 (the test update is not needed) and ROCm/hipCUB#433. If you are setting up a local machine, I highly recommend using a Python virtual environment.
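For a local setup, a minimal sketch of those two steps follows. The checkout locations are arbitrary, and since both fixes are header-only, it is enough to put the patched headers ahead of the system ROCm headers on the include path; Dockerfile.rocm remains the authoritative recipe.

```bash
# Python virtual environment for build and runtime dependencies.
python3 -m venv .venv
source .venv/bin/activate

# Fetch the two header-only fixes as PR branches (GitHub exposes each
# pull request's head as refs/pull/<N>/head).
git clone https://github.com/ROCm/rocPRIM
git -C rocPRIM fetch origin pull/662/head && git -C rocPRIM checkout FETCH_HEAD
git clone https://github.com/ROCm/hipCUB
git -C hipCUB fetch origin pull/433/head && git -C hipCUB checkout FETCH_HEAD
```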
DGL has been ported to work with HIP in https://github.com/nod-ai/dgl/tree/hipify-inplace. Only the PyTorch backend is supported, and distributed multi-GPU execution is not; otherwise all C++ and Python unit tests pass. You can run them as follows:
```bash
cd /dgl

# C++ tests
build/runUnitTests

# Python GPU tests
bash ./script/run_pytest.sh -g \
  tests/python/pytorch/ \
  --ignore=tests/python/pytorch/graphbolt \
  --ignore=tests/python/pytorch/distributed \
  --deselect=tests/python/pytorch/dataloading/test_dataloader.py::test_distributed_dataloaders
bash ./script/run_pytest.sh -g \
  tests/python/common/ \
  --ignore=tests/python/common/test_partition.py

# Python CPU tests
bash ./script/run_pytest.sh -c \
  tests/python/pytorch/ \
  --ignore=tests/python/pytorch/graphbolt \
  --ignore=tests/python/pytorch/distributed \
  --deselect=tests/python/pytorch/dataloading/test_dataloader.py::test_distributed_dataloaders
bash ./script/run_pytest.sh -c \
  tests/python/common/ \
  --ignore=tests/python/common/test_partition.py
```
The Docker container contains a release build (`-DCMAKE_BUILD_TYPE=Release -DBUILD_TYPE=release`); you can base a source build on the command in the Dockerfile. Note the use of clang-19, which is required to pick up llvm/llvm-project#93976. If you need to debug, you may want to do a debug build instead. If you're rebuilding a lot, I recommend ccache: set `-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache`. If building in a virtual environment, I strongly recommend `-DPython3_FIND_VIRTUALENV=ONLY`. A full invocation combining these flags is sketched below.
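This is a sketch only: it assumes clang-19 is installed as `clang-19`/`clang++-19`, and it omits the HIP-specific configuration options; for those, and for the exact command, defer to Dockerfile.rocm.

```bash
# Sketch of a DGL release build with the flags discussed above.
# HIP-specific options are omitted; see Dockerfile.rocm for the real command.
cd /dgl && mkdir -p build && cd build
cmake .. \
  -DCMAKE_C_COMPILER=clang-19 \
  -DCMAKE_CXX_COMPILER=clang++-19 \
  -DCMAKE_BUILD_TYPE=Release -DBUILD_TYPE=release \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DPython3_FIND_VIRTUALENV=ONLY
cmake --build . -j"$(nproc)"
```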
I've added a few minor patches to RFDiffusion: updating SE3Transformer to the latest upstream version and commenting out a couple of NVIDIA-specific dependencies. Running RFDiffusion should work as described in its README. The Conda environment setup there is specific to CUDA (and to a particular CUDA version), so when working locally I just used a virtual environment and installed the dependencies with pip.
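For a local, non-Docker setup, that looks roughly like the sketch below. The ROCm wheel index URL and the checkout paths are assumptions; take the actual package set from the RFDiffusion README rather than from this sketch.

```bash
# Rough local setup (assumes the virtual environment from above is active).
# The rocm6.1 index is an example; pick the index matching your ROCm install.
pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
pip install -e /path/to/dgl/python     # the HIP-ported DGL built above
pip install -e /path/to/RFdiffusion    # plus its pip dependencies per the README
```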
The Docker container should have everything you need, so you should be able to jump straight to the Usage section of the README.