GitHub Actions workflows to support running on k8s #210
Conversation
charleshofer
left a comment
I'm having trouble understanding what this PR is supposed to do. I understand that this looks at the current environment, but if I'm running a container with `docker run ...` I'm certainly not running it in a Kubernetes pod. Unless you're planning on doing docker-in-docker? Which is very, very messy.

The much smarter thing to do, IMO, would be to change the build scripts so that they assume you're already running in a Manylinux or Ubuntu Docker container, and then kick off the container via GHA's built-in facilities for specifying a Docker container like this. Or we should be using ARC to handle Kubernetes runners. I'd like us to have a design discussion that describes how things are supposed to work before going ahead and implementing anything k8s-related in JAX CI. JAX CI is very much built around the assumption that you're running on a regular old system with Docker installed on it.
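For context, the distinction being drawn here can be checked at runtime: a container can make a rough guess that it is inside a Kubernetes pod, since the kubelet injects `KUBERNETES_SERVICE_HOST` and mounts a service-account directory into every pod, while a plain `docker run` container gets neither. This is only a sketch of that heuristic, not code from the PR:

```shell
# Sketch: heuristic check for "am I inside a Kubernetes pod?"
# Relies on the standard kubelet-injected env var and the default
# service-account mount; a plain `docker run` container has neither.
in_k8s_pod() {
  [ -n "${KUBERNETES_SERVICE_HOST:-}" ] || \
    [ -d /var/run/secrets/kubernetes.io/serviceaccount ]
}
```

A workflow step could use `if in_k8s_pod; then ...` to branch between pod-style and docker-in-docker-style behavior, which is part of why the reviewer wants the design settled first.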
Beyond that, have you tried running this on the Kubernetes runner you added? I'd like to see a passing run on the Kubernetes runner before merging anything related to it into our workflows.
Also, the .github/workflows/nightly.yml job would need the same treatment as ci.yml.
Finally, assuming that we do docker-in-docker, you'll need to update the .github/workflows/build-wheels.yml job and its related build scripts (build/ci_build and jax_rocm_plugin/build/rocm/ci_build at the very least) to handle it.
gulsumgudukbay
left a comment
@okakarpa could you please change the title of this PR to a more descriptive one?
This pull request updates the GitHub Actions workflows to improve how GPU devices are configured for Docker containers, especially in Kubernetes (K8s) environments. The main improvement is the introduction of a step that dynamically detects and configures available GPU devices, making the workflows more robust and portable across different environments.
Key improvements to GPU device configuration in CI workflows:
Dynamic GPU Device Detection and Configuration:
Added a "Configure GPU devices for K8s" step to the .github/workflows/ci.yml, .github/workflows/llama-perf.yml, and .github/workflows/rocm-perf.yml workflows. This step checks for a list of GPU devices provided by the K8s environment, or falls back to detecting devices in /dev/dri, and outputs the appropriate Docker --device flags. [1] [2] [3] [4]
Workflow Docker Run Command Updates:
Updated the docker run commands in the workflows to use the dynamically generated device flags from the new configuration step instead of hardcoding --device=/dev/dri. This allows the workflows to adapt to the actual devices present in the environment. [1] [2] [3] [4]
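The detection step described above could look roughly like the following. This is a hedged sketch, not the PR's actual code: the variable name `GPU_DEVICE_LIST` and the output name `DEVICE_FLAGS` are assumptions standing in for whatever the K8s environment and the workflows actually use.

```shell
# Sketch: build Docker --device flags dynamically.
# GPU_DEVICE_LIST is a hypothetical comma-separated list of device paths
# that a K8s device plugin might expose, e.g. "/dev/dri/renderD128,/dev/kfd".
gpu_device_flags() {
  local flags="" dev
  if [ -n "${GPU_DEVICE_LIST:-}" ]; then
    # K8s environment supplied an explicit device list: one flag per device.
    local IFS=','
    for dev in ${GPU_DEVICE_LIST}; do
      flags+=" --device=${dev}"
    done
  elif [ -d /dev/dri ]; then
    # Fallback for a plain Docker host: expose the whole DRI directory,
    # matching the previously hardcoded behavior.
    flags=" --device=/dev/dri"
  fi
  printf '%s' "${flags}"
}
```

A workflow step could then write `echo "DEVICE_FLAGS=$(gpu_device_flags)" >> "$GITHUB_OUTPUT"` and later expand those flags into the `docker run` command in place of the hardcoded `--device=/dev/dri`.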