GitHub Actions workflows to support running on k8's #210

Open
okakarpa wants to merge 13 commits into master from gpu-isolation

@okakarpa (Collaborator) commented Dec 5, 2025

This pull request updates the GitHub Actions workflows to improve how GPU devices are configured for Docker containers, especially in Kubernetes (K8s) environments. The main improvement is the introduction of a step that dynamically detects and configures available GPU devices, making the workflows more robust and portable across different environments.

Key improvements to GPU device configuration in CI workflows:

Dynamic GPU Device Detection and Configuration:

  • Added a new Configure GPU devices for K8s step to the .github/workflows/ci.yml, .github/workflows/llama-perf.yml, and .github/workflows/rocm-perf.yml workflows. This step checks for a list of GPU devices provided by the K8s environment or falls back to detecting devices in /dev/dri, and outputs the appropriate Docker --device flags. [1] [2] [3] [4]

Workflow Docker Run Command Updates:

  • Updated all relevant docker run commands in the workflows to use the dynamically generated device flags from the new configuration step instead of hardcoding --device=/dev/dri. This allows the workflows to adapt to the actual devices present in the environment. [1] [2] [3] [4]
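The detection logic described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name `build_device_flags`, the comma-separated format of the device list, and the `card*` / `renderD*` naming used for the `/dev/dri` fallback are all assumptions made for illustration.

```sh
# Sketch only: names and input format are assumptions, not taken from the PR.
#
# Print Docker --device flags on a single line.
#   $1: comma-separated device paths handed over by the environment (may be empty)
#   $2: directory to scan when $1 is empty (defaults to /dev/dri)
#
# The function body runs in a subshell, so the IFS change and all variables
# stay local to the call.
build_device_flags() (
  device_list="${1:-}"
  dri_dir="${2:-/dev/dri}"
  flags=""

  if [ -n "$device_list" ]; then
    # Use exactly the devices the environment provided.
    IFS=','
    for dev in $device_list; do
      if [ -n "$dev" ]; then
        flags="$flags --device=$dev"
      fi
    done
  else
    # Fallback: expose the card and render nodes found in the directory.
    for dev in "$dri_dir"/card* "$dri_dir"/renderD*; do
      if [ -e "$dev" ]; then
        flags="$flags --device=$dev"
      fi
    done
  fi

  # Drop the leading space left by the concatenation above.
  printf '%s\n' "${flags# }"
)
```

In a workflow step, the result could then be appended to `$GITHUB_OUTPUT` (for example as `device_flags=...`) and referenced from the later `docker run` commands in place of the hardcoded `--device=/dev/dri`.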

@charleshofer (Collaborator) left a comment

I'm having trouble understanding what this PR is supposed to do. I understand that this looks for the current If I'm running a container with docker run ... I'm certainly not running it in a Kubernetes pod. Unless you're planning on doing docker-in-docker? Which is very, very messy. The much smarter thing to do, IMO, would be to change the build scripts so that they assume that you're already running in a Manylinux or Ubuntu Docker container, and then kick off the container via GHA's built-in facilities for specifying a Docker container like this. Or we should be using ARC (Actions Runner Controller) to handle Kubernetes runners. I'd like for us to have a design discussion that describes how things are supposed to work before going ahead and implementing anything k8s-related in JAX CI. JAX CI is very much built around the assumption that you're running on a regular old system with Docker installed on it.

Beyond that, have you tried running this on the Kubernetes runner you added? I'd like to see a passing run on the Kubernetes runner before merging anything related to it into our workflows.

Also, the .github/workflows/nightly.yml job would need the same treatment as ci.yml.

Finally, assuming that we do docker-in-docker, you'll need to update the .github/workflows/build-wheels.yml job and its related build scripts (build/ci_build and jax_rocm_plugin/build/rocm/ci_build at the very least) to handle it.

@okakarpa okakarpa requested review from charleshofer and removed request for charleshofer December 9, 2025 02:28
@okakarpa okakarpa closed this Dec 9, 2025
@okakarpa okakarpa reopened this Dec 9, 2025

@gulsumgudukbay (Contributor) left a comment

@okakarpa could you please change the title of this PR to a more descriptive one?

@okakarpa okakarpa changed the title from "adding the changes" to "GitHub Actions workflows to support running on k8's" Dec 10, 2025

3 participants