Closed

Changes from 3 commits
.github/scripts/gpu_load_test.sh (2 changes: 1 addition & 1 deletion)

@@ -6,7 +6,7 @@
 #############################################

 MODEL_NAME="deepseek-ai/DeepSeek-R1-0528"
-MODEL_LOCAL_PATH="/data/deepseek-ai/DeepSeek-R1-0528"
+MODEL_LOCAL_PATH="/models/deepseek-ai/DeepSeek-R1-0528"
 TENSOR_PARALLEL=8
 KV_CACHE_DTYPE="fp8"
 TEMPERATURE=0
Comment on lines 8 to 12
Copilot AI Mar 10, 2026

MODEL_LOCAL_PATH was changed to /models/..., but the script’s docker run later only bind-mounts /data into the container. If the model is found locally, MODEL_PATH will be a /models/... path that won’t exist inside the container, causing the inference command to fail. Either mount /models into the container as well, or add the same /models/data fallback logic used in the GitHub Actions workflow.
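A minimal sketch of the fallback the comment describes, assuming the same path layout as the script; the workflow's actual fallback logic is not shown in this diff, so this is an illustration rather than a copy of it:

```shell
# Illustrative fallback: prefer /models, fall back to /data when the model
# directory is absent on this host. Paths mirror the script's variables.
MODEL_NAME="deepseek-ai/DeepSeek-R1-0528"
if [ -d "/models/${MODEL_NAME}" ]; then
  MODEL_LOCAL_PATH="/models/${MODEL_NAME}"
else
  MODEL_LOCAL_PATH="/data/${MODEL_NAME}"
fi
echo "MODEL_LOCAL_PATH=${MODEL_LOCAL_PATH}"
# The alternative fix is to bind-mount /models into the container as well,
# e.g. adding `-v /models:/models` to the script's `docker run` invocation.
```

Either approach works; the mount is simpler, while the fallback keeps the script usable on hosts that still stage models under /data.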

.github/workflows/gpu-load-test.yaml (13 changes: 7 additions & 6 deletions)

@@ -24,7 +24,7 @@ jobs:
         run: |
           INPUT="${{ inputs.runners || 'all' }}"
           if [ "$INPUT" = "all" ]; then
-            MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+            MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
Copilot AI Mar 10, 2026


This workflow now uses per-machine runner labels for the default all case, but .github/runner-config.yml still only documents the old shared atom-mi355-8gpu.predownload label and indicates it should be updated when CI runner labels change. To keep the runner→hardware mapping (and any dependent dashboards) accurate, add entries for the new mia1-p01-g* labels there (or otherwise update the mapping source of truth).


Copilot AI Mar 10, 2026


This workflow now targets new runner labels (mia1-p01-g33, g34, etc.). .github/runner-config.yml explicitly says to update the runner→GPU mapping when runner labels in workflows change, but it currently only lists atom-mi355-8gpu.predownload and linux-atom-mi355-*. Please add entries for these new labels (or adjust the workflow to use existing mapped labels) so downstream tooling stays accurate.

Suggested change:
-            MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
+            MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"atom-mi355-8gpu.predownload"},{"runner":"linux-atom-mi355-8gpu","label":"linux-atom-mi355-8gpu"}]'
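If the mapping file does get updated instead, the new entries would presumably look something like the following; the field names and hardware values here are hypothetical, since the actual schema of .github/runner-config.yml is not shown in this diff:

```yaml
# Hypothetical sketch; match the real schema used in .github/runner-config.yml.
runners:
  mia1-p01-g33:
    gpu: MI355
    gpu_count: 8
  mia1-p01-g34:
    gpu: MI355
    gpu_count: 8
  # ...one entry per new mia1-p01-g* label used in the workflow matrix
```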

           else
             MATRIX="["
             SEP=""
@@ -55,16 +55,17 @@ jobs:
       KV_CACHE_DTYPE: "fp8"

     steps:
-      - name: Checkout code
-        uses: actions/checkout@v4
-
-      - name: Clean up existing containers
+      - name: Kill all Docker containers and clean up workspace
         run: |
+          echo "=== Cleaning up containers on $(hostname) ==="
+          containers=$(docker ps -q)
Comment on lines +48 to 51

Copilot AI Mar 10, 2026


docker kill $(docker ps -q) will terminate all containers running on the self-hosted runner, including unrelated long-lived services/containers that might be required on that machine. Please scope cleanup to containers created by this workflow (e.g., stop/rm only atom_inference, or filter by a workflow-specific label) rather than killing everything.

Suggested change:
-      - name: Kill all Docker containers and clean up workspace
-        run: |
-          echo "=== Cleaning up containers on $(hostname) ==="
-          containers=$(docker ps -q)
+      - name: Kill workflow Docker containers and clean up workspace
+        run: |
+          echo "=== Cleaning up workflow containers on $(hostname) ==="
+          containers=$(docker ps -q --filter "name=atom_inference")
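The comment's other alternative, a workflow-specific label, could look roughly like this; the label name is an assumption for illustration:

```shell
# Hypothetical label-based scoping: tag containers at creation, then clean up
# only containers carrying that label. The label name is illustrative.
WORKFLOW_LABEL="workflow=gpu-load-test"
# At container creation (not executed here):
#   docker run -dt --label "$WORKFLOW_LABEL" --name atom_inference <image>
# At cleanup, kill only labeled containers; tolerate hosts without docker:
containers=$(docker ps -q --filter "label=$WORKFLOW_LABEL" 2>/dev/null || true)
if [ -n "$containers" ]; then
  docker kill $containers || true
fi
echo "cleanup done"
```

Labels survive container renames, so this stays correct even if the container name changes in a later revision of the workflow.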

           if [ -n "$containers" ]; then
             docker kill $containers || true
           fi
-          docker rm -f ${{ env.CONTAINER_NAME }} 2>/dev/null || true
+          docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
Comment on lines +48 to +55

Copilot AI Mar 10, 2026


The cleanup step no longer removes an existing ${{ env.CONTAINER_NAME }} container. Since docker ps -q only lists running containers, a previously-stopped gpu_load_test container could remain and make the later docker run -dt --name ${{ env.CONTAINER_NAME }} fail with a name conflict. Consider explicitly docker rm -f ${{ env.CONTAINER_NAME }} (or removing all containers you kill) as part of this step.
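A minimal sketch of the explicit removal the comment asks for; the container name is taken from the comment, and `2>/dev/null || true` keeps the step green when the container, or docker itself, is absent:

```shell
# Explicitly remove a same-named container, running *or* stopped, so a later
# `docker run --name` cannot hit a name conflict. Idempotent and safe to rerun.
CONTAINER_NAME="gpu_load_test"
docker rm -f "$CONTAINER_NAME" 2>/dev/null || true
# Equivalent alternative inside the existing kill loop: use `docker ps -aq`
# (all containers, including stopped ones) instead of `docker ps -q`
# (running only), and `docker rm -f` whatever it returns.
echo "removed (or no-op): $CONTAINER_NAME"
```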


Copilot AI Mar 10, 2026


The workspace cleanup command rm -rf /workspace/* will not remove dotfiles (e.g., .git, .github, .cache), so state can leak between runs on self-hosted runners. Consider using a cleanup that also removes dotfiles (or rely on actions/checkout with clean: true/git clean -ffdx) so each run starts from a truly clean workspace.

Suggested change:
-          docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
+          docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/* /workspace/.[!.]* /workspace/..?*" || true
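The difference between the two globs can be checked without docker; this sketch builds a throwaway directory and shows that `rm -rf dir/*` leaves dotfiles behind while the extended patterns remove them:

```shell
# Demonstrate that `rm -rf dir/*` misses dotfiles, while adding the
# `.[!.]*` and `..?*` patterns also removes them (without touching . or ..).
ws=$(mktemp -d)
touch "$ws/file" "$ws/.hidden"
rm -rf "$ws"/*                            # removes 'file' but not '.hidden'
leftover=$(ls -A "$ws")
echo "after plain glob: $leftover"
rm -rf "$ws"/* "$ws"/.[!.]* "$ws"/..?*    # -f ignores patterns with no match
remaining=$(ls -A "$ws")
echo "after extended glob: '$remaining'"
rmdir "$ws"
```

The `git clean -ffdx` route the comment mentions achieves the same end without glob subtleties, at the cost of requiring a checkout to exist first.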

Copilot uses AI. Check for mistakes.

+      - name: Checkout code
+        uses: actions/checkout@v4
+
       - name: GPU status
         run: |