# RFDiffusion on AMD GPU

Brief instructions for running RFDiffusion on an AMD GPU using DGL ported to HIP (https://github.com/nod-ai/dgl).

## Docker

The Dockerfile is at [`Dockerfile.rocm`](./Dockerfile.rocm).

To build the docker container:

```shell
docker build -t rfdiffusion-dgl-rocm - <Dockerfile.rocm
```

To run the docker container:

```shell
docker run -it \
    --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --ipc=host \
    --shm-size 16G \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    rfdiffusion-dgl-rocm
```

This will give you a bash shell in the `/rfdiffusion` directory (a clone of the RFDiffusion git repository).

This is just an example. For official docs, see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html

## Dependencies

The Docker container is going to be the most reliable documentation of the dependencies, but generally you need the latest ROCm plus two header-only patches: ROCm/rocPRIM#662 (the test update is not needed) and ROCm/hipCUB#433. If you are setting up a local machine, I highly recommend using a Python virtual environment.
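One way to grab the two patches is to check out the PR branches locally. This is a hedged sketch, not part of the official instructions: the PR numbers come from the notes above, everything else (helper name, scratch-directory workflow) is an assumption.

```shell
# Sketch: check out a GitHub PR branch of a ROCm component.
# Usage: fetch_pr <repo> <pr-number>, run from a scratch directory
# with network access.
fetch_pr() {
  git clone "https://github.com/ROCm/$1.git" &&
  git -C "$1" fetch origin "pull/$2/head:pr-$2" &&
  git -C "$1" checkout "pr-$2"
}

# The two header-only patches:
#   fetch_pr rocPRIM 662   # ROCm/rocPRIM#662 (the test update is not needed)
#   fetch_pr hipCUB 433    # ROCm/hipCUB#433
```

Since both patches are header-only, you would then point your build at these checkouts' include directories (or copy the headers over your ROCm install); the Dockerfile remains the authoritative recipe.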

## DGL

DGL has been ported to work with HIP in https://github.com/nod-ai/dgl/tree/hipify-inplace. Only the PyTorch backend is supported and distributed multi-GPU execution is not supported, but otherwise all C++ and Python unit tests pass. You can run them as:

```shell
cd /dgl

# C++ tests
build/runUnitTests

# Python GPU tests
bash ./script/run_pytest.sh -g \
    tests/python/pytorch/ \
    --ignore=tests/python/pytorch/graphbolt \
    --ignore=tests/python/pytorch/distributed \
    --deselect=tests/python/pytorch/dataloading/test_dataloader.py::test_distributed_dataloaders
bash ./script/run_pytest.sh -g \
    tests/python/common/ \
    --ignore=tests/python/common/test_partition.py

# Python CPU tests
bash ./script/run_pytest.sh -c \
    tests/python/pytorch/ \
    --ignore=tests/python/pytorch/graphbolt \
    --ignore=tests/python/pytorch/distributed \
    --deselect=tests/python/pytorch/dataloading/test_dataloader.py::test_distributed_dataloaders
bash ./script/run_pytest.sh -c \
    tests/python/common/ \
    --ignore=tests/python/common/test_partition.py
```

The Docker container contains a release build.

## Building from source

You can base a source build on the command in the Docker container. Note the use of clang-19: this is required to pick up llvm/llvm-project#93976. A release build sets `-DCMAKE_BUILD_TYPE=Release -DBUILD_TYPE=release`; if you need to debug, you may want a debug build instead. If you're rebuilding a lot, I recommend using ccache: set `-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache`. If building in a virtual environment, I strongly recommend `-DPython3_FIND_VIRTUALENV=ONLY`.
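Collecting the flags above, a configure step might look like the following. This is a hedged sketch only: the generator, source layout, and any ROCm-specific options are left to the Dockerfile, which is the authoritative build recipe.

```shell
# Configure flags discussed above, gathered in one place.
CMAKE_FLAGS=(
  -DCMAKE_C_COMPILER=clang-19          # required for llvm/llvm-project#93976
  -DCMAKE_CXX_COMPILER=clang++-19
  -DCMAKE_BUILD_TYPE=Release           # switch to Debug when debugging
  -DBUILD_TYPE=release
  -DCMAKE_C_COMPILER_LAUNCHER=ccache   # speeds up frequent rebuilds
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
  -DPython3_FIND_VIRTUALENV=ONLY       # when building inside a venv
)

# From a DGL checkout, something like:
#   cmake -B build "${CMAKE_FLAGS[@]}" && cmake --build build -j"$(nproc)"
```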

## RFDiffusion

I've added a few minor patches to RFDiffusion: updating SE3Transformer to the latest from upstream and commenting out a couple of NVIDIA-specific dependencies. Running RFDiffusion should work as described in the README. The Conda environment setup is specific to CUDA (and a specific version thereof), so when working locally I just used a virtual environment and installed dependencies with pip.
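A local, non-Conda setup along those lines might start like this. The venv steps are standard; the commented install steps are assumptions (the wheel index URL and package paths in particular), and the Dockerfile plus the upstream README remain the authoritative references.

```shell
# Create and activate an isolated environment, as recommended above.
python3 -m venv rfdiffusion-venv
. rfdiffusion-venv/bin/activate

# Assumed install steps -- adjust the rocmX.Y tag to your ROCm release
# and take the real dependency list from the Dockerfile:
#   pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
#   pip install -e ./env/SE3Transformer   # from the RFDiffusion checkout
#   pip install -e .
```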

The Docker container should have everything you need, so you can jump straight to the Usage section of the README.