CI: Use per-machine runner labels for GPU load test #299
```diff
@@ -24,7 +24,7 @@ jobs:
       run: |
         INPUT="${{ inputs.runners || 'all' }}"
         if [ "$INPUT" = "all" ]; then
-          MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+          MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
Suggested change:

```diff
-MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
+MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"atom-mi355-8gpu.predownload"},{"runner":"linux-atom-mi355-8gpu","label":"linux-atom-mi355-8gpu"}]'
```
Copilot AI · Mar 10, 2026:
`docker kill $(docker ps -q)` will terminate all containers running on the self-hosted runner, including unrelated long-lived services/containers that might be required on that machine. Please scope cleanup to containers created by this workflow (e.g., stop/rm only `atom_inference`, or filter by a workflow-specific label) rather than killing everything.
Suggested change:

```diff
-      - name: Kill all Docker containers and clean up workspace
-        run: |
-          echo "=== Cleaning up containers on $(hostname) ==="
-          containers=$(docker ps -q)
+      - name: Kill workflow Docker containers and clean up workspace
+        run: |
+          echo "=== Cleaning up workflow containers on $(hostname) ==="
+          containers=$(docker ps -q --filter "name=atom_inference")
```
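A more robust variant of this suggestion scopes cleanup with a workflow-specific Docker label instead of a container name, so renaming the container later cannot silently break cleanup. This is only a sketch: the label key `ci.job=gpu-load-test` is a hypothetical name, not something the workflow currently sets, and containers must be started with the matching `--label` for the filter to find them.

```yaml
- name: Start inference container
  run: >
    docker run -dt
    --label ci.job=gpu-load-test
    --name atom_inference
    rocm/pytorch:latest

- name: Clean up workflow containers
  if: always()
  run: |
    # -a also lists stopped containers; rm -f removes them in either state
    containers=$(docker ps -aq --filter "label=ci.job=gpu-load-test")
    if [ -n "$containers" ]; then docker rm -f $containers; fi
```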
Copilot AI · Mar 10, 2026:
The cleanup step no longer removes an existing `${{ env.CONTAINER_NAME }}` container. Since `docker ps -q` only lists running containers, a previously-stopped `gpu_load_test` container could remain and make the later `docker run -dt --name ${{ env.CONTAINER_NAME }}` fail with a name conflict. Consider explicitly running `docker rm -f ${{ env.CONTAINER_NAME }}` (or removing all containers you kill) as part of this step.
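One way to implement this is a pre-start removal step, sketched below against the workflow's existing `CONTAINER_NAME` env var. `docker rm -f` handles both running and stopped containers; the `2>/dev/null || true` guards the case where no such container exists, since `docker rm` exits non-zero then.

```yaml
- name: Remove stale container if present
  run: docker rm -f "${{ env.CONTAINER_NAME }}" 2>/dev/null || true

- name: Start container
  run: docker run -dt --name "${{ env.CONTAINER_NAME }}" rocm/pytorch:latest
```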
Copilot AI · Mar 10, 2026:
The workspace cleanup command `rm -rf /workspace/*` will not remove dotfiles (e.g., `.git`, `.github`, `.cache`), so state can leak between runs on self-hosted runners. Consider using a cleanup that also removes dotfiles (or rely on `actions/checkout` with `clean: true`/`git clean -ffdx`) so each run starts from a truly clean workspace.
Suggested change:

```diff
-          docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
+          docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/* /workspace/.[!.]* /workspace/..?*" || true
```
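The glob difference is easy to verify locally; this standalone sketch reproduces the dotfile-leak behavior the comment describes in a temporary directory (nothing here touches Docker):

```shell
# Create a scratch "workspace" containing a regular file and two dotfiles.
workdir=$(mktemp -d)
mkdir -p "$workdir/ws"
touch "$workdir/ws/file.txt" "$workdir/ws/.git" "$workdir/ws/.cache"

# The original cleanup: '*' does not match dotfiles, so .git and .cache survive.
rm -rf "$workdir/ws"/*
ls -A "$workdir/ws"    # .git and .cache are still listed

# The suggested extra globs: '.[!.]*' matches .git and .cache, while '..?*'
# catches names starting with '..'; rm -rf silently ignores a non-match.
rm -rf "$workdir/ws"/.[!.]* "$workdir/ws"/..?*
ls -A "$workdir/ws"    # now empty
```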
`MODEL_LOCAL_PATH` was changed to `/models/...`, but the script's `docker run` later only bind-mounts `/data` into the container. If the model is found locally, `MODEL_PATH` will be a `/models/...` path that won't exist inside the container, causing the inference command to fail. Either mount `/models` into the container as well, or add the same `/models` → `/data` fallback logic used in the GitHub Actions workflow.
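The `/models` → `/data` fallback the comment refers to can be sketched in plain shell. The example value `llama-3-70b` and the exact variable names are assumptions based on the comment, not the script's actual contents:

```shell
# Host-side model location (hypothetical example value).
MODEL_LOCAL_PATH="/models/llama-3-70b"

# If only /data is bind-mounted into the container, rewrite a /models/... host
# path to its /data/... equivalent so MODEL_PATH resolves inside the container.
MODEL_PATH="$MODEL_LOCAL_PATH"
case "$MODEL_LOCAL_PATH" in
  /models/*) MODEL_PATH="/data/${MODEL_LOCAL_PATH#/models/}" ;;
esac

echo "$MODEL_PATH"
```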