Fix CUDA plugin CI. #8593

ysiraichi · 2025-01-21T12:41:49Z

This PR reverts #8286 and bumps the CUDA version to 12.3. The bump is needed to compile GPU-dependent source code that uses the CUgraphConditionalHandle driver API typedef, which is not available in CUDA 12.1.
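
As a rough illustration of the version constraint described above, the sketch below encodes a CUDA release the way the `CUDA_VERSION` macro in `cuda.h` does (major * 1000 + minor * 10) and compares a release against 12.3. The helper names are hypothetical, and the 12.3 threshold is taken from this PR description, not checked against the CUDA headers.

```python
# Hypothetical helpers for illustration; only the 12.3 requirement
# comes from this PR description.
MIN_CUDA_FOR_CONDITIONAL_HANDLE = (12, 3)


def encode_cuda_version(major: int, minor: int) -> int:
    """Encode a release the way cuda.h's CUDA_VERSION macro does."""
    return major * 1000 + minor * 10


def has_conditional_handle(major: int, minor: int) -> bool:
    """Return True if the release is at least the assumed minimum (12.3)."""
    return (major, minor) >= MIN_CUDA_FOR_CONDITIONAL_HANDLE


print(encode_cuda_version(12, 3))     # 12030
print(has_conditional_handle(12, 1))  # False
print(has_conditional_handle(12, 3))  # True
```

In C++ sources the equivalent guard would typically be a preprocessor comparison against `CUDA_VERSION`; the Python form here is only meant to make the arithmetic explicit.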

amjames · 2025-01-22T14:21:01Z

Looks like the failing jobs are due to a failed clone from kleidiai's gitlab. Is that a widespread issue or spurious failure?

ysiraichi · 2025-01-22T15:02:06Z

It doesn't look widespread (haven't seen in other PRs). I will try rebasing this PR.

tengyifei · 2025-01-22T18:32:38Z

@ysiraichi from pytorch/pytorch#138609 (comment), it looks like PyTorch upstream decided to release with some specific set of CUDA versions (see issue). Can we use one of their chosen versions, for example CUDA 12.4 instead of CUDA 12.3?

ysiraichi · 2025-01-22T18:51:20Z

Problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems something internal.

tengyifei · 2025-01-22T19:52:55Z

> Problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems something internal.

Could you clarify this challenge? Do you mean that you were hoping to find a torch_xla CUDA 12.4 docker build?

ysiraichi · 2025-01-22T20:02:07Z

As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

xla/.github/workflows/build_and_test.yml

Lines 44 to 52 in fbbdfca

    
    # build-torch-with-cuda:
    #   name: "Build PyTorch with CUDA"
    #   uses: ./.github/workflows/_build_torch_with_cuda.yml
    #   needs: get-torch-commit
    #   with:
    #     # note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
    #     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
    #     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
    #     runner: linux.24xlarge
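
To make the dependency on the image tag concrete, here is a minimal sketch that reads the CUDA version out of a `dev-image` reference such as the one quoted above. It assumes the tag always ends in `_cuda_<major>.<minor>`, which is inferred from this single example; the function name is hypothetical and not part of the PyTorch/XLA tooling.

```python
import re
from typing import Optional, Tuple


def cuda_version_from_image(image: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a tag ending in `_cuda_<major>.<minor>`.

    The tag layout is an assumption based on the one image quoted in
    this thread. Returns None if the reference does not match.
    """
    match = re.search(r"_cuda_(\d+)\.(\d+)$", image)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


image = (
    "us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/"
    "development:3.10_cuda_12.1"
)
print(cuda_version_from_image(image))  # (12, 1)
```

A check like this could be run before a build to confirm that the selected image provides the CUDA version the source code needs.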

tengyifei · 2025-01-23T22:59:09Z

> As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

@ysiraichi got it. thanks for the explanation. I think using CUDA 12.3 for now is a-okay. IIUC, most of the time we're only using torch CPU + torch_xla GPU in any case.

tengyifei · 2025-01-23T22:59:45Z

LMK when I should review. It looks like there are still some failed tests.

This reverts commit da18622.

This reverts commit d8fba62.

This reverts commit 9db596b.

ysiraichi added the xla:gpu label Jan 21, 2025

Bump CUDA version to 12.3.

afc5707

ysiraichi force-pushed the fix-cuda-plugin-compilation branch from b5474c1 to afc5707 Compare January 22, 2025 15:04

Update Github actions.

35cbd57

Fix build YAML.

c3c8aa6

ysiraichi added 8 commits January 28, 2025 18:21

Add torch pin for Sep 30, 2024 commit.

d8fba62

Downgrade CUDA to 12.1.

da18622

Revert "Downgrade CUDA to 12.1."

1e6e694

This reverts commit da18622.

Restrict PyTorch with CUDA build to 12 parallel jobs.

d5be1ec

Dump machine details.

9db596b

Revert "Add torch pin for Sep 30, 2024 commit."

81915e8

This reverts commit d8fba62.

Increase MAX_JOBS to 24.

c3a9f45

Revert "Dump machine details."

44b72a2

This reverts commit 9db596b.

ysiraichi marked this pull request as ready for review January 29, 2025 19:05
