GitHub Actions workflows to support running on k8's by okakarpa · Pull Request #210 · ROCm/rocm-jax

okakarpa · 2025-12-05T21:54:58Z

This pull request updates the GitHub Actions workflows to improve how GPU devices are configured for Docker containers, especially in Kubernetes (K8s) environments. The main improvement is the introduction of a step that dynamically detects and configures available GPU devices, making the workflows more robust and portable across different environments.

Key improvements to GPU device configuration in CI workflows:

Dynamic GPU Device Detection and Configuration:

Added a new Configure GPU devices for K8s step to the .github/workflows/ci.yml, .github/workflows/llama-perf.yml, and .github/workflows/rocm-perf.yml workflows. This step checks for a list of GPU devices provided by the K8s environment or falls back to detecting devices in /dev/dri, and outputs the appropriate Docker --device flags. [1] [2] [3] [4]

Workflow Docker Run Command Updates:

Updated all relevant docker run commands in the workflows to use the dynamically generated device flags from the new configuration step instead of hardcoding --device=/dev/dri. This allows the workflows to adapt to the actual devices present in the environment. [1] [2] [3] [4]
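The detect-then-fall-back logic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the step as implemented in the PR: the environment variable name `GPU_DEVICES`, the `renderD*` glob pattern, and both function names are hypothetical. The only details taken from the description are the `/dev/dri` fallback and the previously hardcoded `--device=/dev/dri` flag.

```python
import glob
import os


def gpu_device_flags(env=None, dri_dir="/dev/dri"):
    """Return a list of Docker --device flags for the GPUs visible to this job."""
    env = os.environ if env is None else env

    # Assumption: the K8s environment exposes a comma-separated device list
    # in a variable; "GPU_DEVICES" is a hypothetical name, not from the PR.
    listed = env.get("GPU_DEVICES", "").strip()
    if listed:
        devices = [d.strip() for d in listed.split(",") if d.strip()]
    else:
        # Fallback: look for device nodes under /dev/dri. The renderD*
        # pattern is an assumption about which nodes matter.
        devices = sorted(glob.glob(os.path.join(dri_dir, "renderD*")))

    if not devices:
        # Nothing detected: keep the previously hardcoded behaviour.
        devices = [dri_dir]

    return [f"--device={d}" for d in devices]


def build_docker_run(image, command, device_flags):
    """Assemble a docker run argument list using the generated device flags."""
    return ["docker", "run", "--rm", *device_flags, image, *command]
```

A workflow step could print `" ".join(gpu_device_flags())` as a step output and splice that string into each `docker run` invocation in place of the hardcoded flag.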

charleshofer

I'm having trouble understanding what this PR is supposed to do. I understand that this looks for the current If I'm running a container with docker run ... I'm certainly not running it in a Kubernetes pod. Unless you're planning on doing docker-in-docker? Which is very, very messy. The much smarter thing to do, IMO, would be to change the build scripts so that they assume that you're already running in a Manylinux or Ubuntu docker container, and then kick off the container via GHA's built-in facilities for specifying a docker container like this. Or we should be using an ARC to handle Kubernetes runners. I'd like for us to have a design discussion that describes how things are supposed to work before going ahead and implementing anything k8s-related in JAX CI. JAX CI is very much built around the assumption that you're running on a regular old system with Docker installed on it.

Beyond that, have you tried running this on the Kubernetes runner you added? I'd like to see a passing run on the Kubernetes runner before merging anything related to it into our workflows.

Also also, the .github/workflows/nightly.yml job would need the same treatment as ci.yml

Finally, assuming that we do docker-in-docker, you'll need to update the .github/workflows/build-wheels.yml job and its related build scripts (build/ci_build and jax_rocm_plugin/build/rocm/ci_build at the very least) to handle it.

.github/workflows/ci.yml

gulsumgudukbay

@okakarpa could you please change the title of this PR to a more descriptive one?

okakarpa added 5 commits December 5, 2025 21:53

adding the changes

d7f697e

changes

5995061

adding the changes

5d71d44

reverting changes

bff08a2

removeing duplicates

b117998

charleshofer requested changes Dec 8, 2025


okakarpa added 3 commits December 8, 2025 21:51

addressing the comments

3bf5f5d

adding the simple change

09bb077

removing spaces

7723943

okakarpa requested review from charleshofer and removed request for charleshofer December 9, 2025 02:28

okakarpa added 2 commits December 9, 2025 02:29

fixing yaml

0558f35

j

7cd99c9

okakarpa closed this Dec 9, 2025

removing white spaces

b73c2ec

okakarpa reopened this Dec 9, 2025

okakarpa added 2 commits December 9, 2025 03:30

entering device flag

259d3d2

fixing yaml lint

339be4a

gulsumgudukbay requested changes Dec 9, 2025


okakarpa changed the title ~~adding the changes~~ GitHub Actions workflows to support running on k8's Dec 10, 2025

okakarpa wants to merge 13 commits into master from gpu-isolation
