CI: Use per-machine runner labels for GPU load test#299

Closed
gyohuangxin wants to merge 7 commits into main from xihuang/gpu-load-test-workflow
Conversation

@gyohuangxin (Member)

Summary

  • Use per-machine runner labels (mia1-p01-g33, g34, g40, g42, g45, g64) instead of the shared atom-mi355-8gpu.predownload label
  • This ensures every physical machine runs the GPU load test exactly once with no duplication
  • Labels have been added to each runner via the GitHub API

Follows up on #298

Copilot AI review requested due to automatic review settings March 10, 2026 09:55
@gyohuangxin changed the title from "Use per-machine runner labels for GPU load test" to "CI: Use per-machine runner labels for GPU load test" on Mar 10, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Updates the manual GPU load test workflow to target specific self-hosted runner labels per physical machine, avoiding duplicate runs caused by a shared runner label.

Changes:

  • Replaced the single shared runner label in the default (“all”) runner matrix with a fixed list of per-machine runner labels.


Copilot AI review requested due to automatic review settings March 10, 2026 12:08
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Comment on lines +58 to +65

```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
    if [ -n "$containers" ]; then
      docker kill $containers || true
    fi
    docker rm -f ${{ env.CONTAINER_NAME }} 2>/dev/null || true
    docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
Copilot AI Mar 10, 2026

The cleanup step no longer removes an existing ${{ env.CONTAINER_NAME }} container. Since docker ps -q only lists running containers, a previously-stopped gpu_load_test container could remain and make the later docker run -dt --name ${{ env.CONTAINER_NAME }} fail with a name conflict. Consider explicitly docker rm -f ${{ env.CONTAINER_NAME }} (or removing all containers you kill) as part of this step.

```diff
 INPUT="${{ inputs.runners || 'all' }}"
 if [ "$INPUT" = "all" ]; then
-  MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+  MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
Copilot AI Mar 10, 2026

This workflow now uses per-machine runner labels for the default all case, but .github/runner-config.yml still only documents the old shared atom-mi355-8gpu.predownload label and indicates it should be updated when CI runner labels change. To keep the runner→hardware mapping (and any dependent dashboards) accurate, add entries for the new mia1-p01-g* labels there (or otherwise update the mapping source of truth).

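
As a side note, the hand-written per-machine matrix could also be generated from a plain label list, which avoids JSON-quoting typos when runners are added or removed. A sketch only; the variable names are illustrative and not taken from the workflow:

```shell
# Build the GitHub Actions matrix JSON from a space-separated runner label list.
RUNNERS="mia1-p01-g33 mia1-p01-g34 mia1-p01-g40 mia1-p01-g42 mia1-p01-g45 mia1-p01-g64"
MATRIX="["
sep=""
for r in $RUNNERS; do
  # Each runner hostname doubles as its display label.
  MATRIX="$MATRIX$sep{\"runner\":\"$r\",\"label\":\"$r\"}"
  sep=","
done
MATRIX="$MATRIX]"
echo "$MATRIX"
```

The resulting string can be fed to `fromJSON` in the workflow's matrix strategy exactly like the hard-coded literal.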
Comment on lines 8 to 12

```diff
 MODEL_NAME="deepseek-ai/DeepSeek-R1-0528"
-MODEL_LOCAL_PATH="/data/deepseek-ai/DeepSeek-R1-0528"
+MODEL_LOCAL_PATH="/models/deepseek-ai/DeepSeek-R1-0528"
 TENSOR_PARALLEL=8
 KV_CACHE_DTYPE="fp8"
 TEMPERATURE=0
```
Copilot AI Mar 10, 2026

MODEL_LOCAL_PATH was changed to /models/..., but the script’s docker run later only bind-mounts /data into the container. If the model is found locally, MODEL_PATH will be a /models/... path that won’t exist inside the container, causing the inference command to fail. Either mount /models into the container as well, or add the same /models/data fallback logic used in the GitHub Actions workflow.

grep -c outputs '0' to stdout and then exits with code 1 when no matches are found. The '|| echo 0' fallback then outputs a second '0', making LOAD_COUNT='0\n0', which breaks the integer comparison. Use '|| true' to suppress the error exit and default to 0 if empty.
Move all analysis logic into gpu_load_test.sh and remove the redundant analysis step from the workflow. The script now handles the GPU status check, model loading, timing analysis, and detailed diagnostics (including problem analysis and recommended actions for high variance).
Copilot AI review requested due to automatic review settings March 10, 2026 14:47
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.



Comment on lines +48 to 51

```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
```
Copilot AI Mar 10, 2026

docker kill $(docker ps -q) will terminate all containers running on the self-hosted runner, including unrelated long-lived services/containers that might be required on that machine. Please scope cleanup to containers created by this workflow (e.g., stop/rm only atom_inference, or filter by a workflow-specific label) rather than killing everything.

Suggested change

```diff
-- name: Kill all Docker containers and clean up workspace
-  run: |
-    echo "=== Cleaning up containers on $(hostname) ==="
-    containers=$(docker ps -q)
+- name: Kill workflow Docker containers and clean up workspace
+  run: |
+    echo "=== Cleaning up workflow containers on $(hostname) ==="
+    containers=$(docker ps -q --filter "name=atom_inference")
```

```shell
MOUNT_FLAGS=""
[ -d "/data" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /data:/data"
[ -d "/models" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /models:/models"
docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
Copilot AI Mar 10, 2026

The workspace cleanup command rm -rf /workspace/* will not remove dotfiles (e.g., .git, .github, .cache), so state can leak between runs on self-hosted runners. Consider using a cleanup that also removes dotfiles (or rely on actions/checkout with clean: true/git clean -ffdx) so each run starts from a truly clean workspace.

Suggested change

```diff
-docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
+docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/* /workspace/.[!.]* /workspace/..?*" || true
```

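
A quick sketch of the dotfile gap, using a throwaway directory: `rm -rf dir/*` leaves hidden entries behind, while adding the `.[!.]*` and `..?*` globs catches them without matching the special `.` or `..` entries.

```shell
tmp=$(mktemp -d)
touch "$tmp/file.txt" "$tmp/.hidden"

rm -rf "$tmp"/*                # the bare glob * does not match dotfiles
ls -A "$tmp"                   # .hidden is still there

# .[!.]* matches .hidden; ..?* covers names like ..cache; neither matches . or ..
rm -rf "$tmp"/* "$tmp"/.[!.]* "$tmp"/..?* 2>/dev/null
rmdir "$tmp"                   # succeeds only if the directory is truly empty
```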
```diff
 INPUT="${{ inputs.runners || 'all' }}"
 if [ "$INPUT" = "all" ]; then
-  MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+  MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
Copilot AI Mar 10, 2026

This workflow now targets new runner labels (mia1-p01-g33, g34, etc.). .github/runner-config.yml explicitly says to update the runner→GPU mapping when runner labels in workflows change, but it currently only lists atom-mi355-8gpu.predownload and linux-atom-mi355-*. Please add entries for these new labels (or adjust the workflow to use existing mapped labels) so downstream tooling stays accurate.

Suggested change

```diff
-MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
+MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"atom-mi355-8gpu.predownload"},{"runner":"linux-atom-mi355-8gpu","label":"linux-atom-mi355-8gpu"}]'
```

```shell
echo "═══════════════════════════════════════════════════════"
echo ""
echo "1. 🔄 RUN TEST AGAIN to see if same GPUs are slow:"
echo "   ./test_gpu_load.sh"
```
Copilot AI Mar 10, 2026

The diagnostic output suggests re-running ./test_gpu_load.sh, but this repository script is .github/scripts/gpu_load_test.sh (and the workflow executes that). Please update the printed command/path to the actual script name/location to avoid confusion during manual troubleshooting.

Suggested change

```diff
-echo "   ./test_gpu_load.sh"
+echo "   ./.github/scripts/gpu_load_test.sh"
```


```diff
 # Identify slow GPUs
-SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
+SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
```
Copilot AI Mar 10, 2026

awk is passed -v max="$MAX_TIME" here, but max is never referenced in the awk program. Please drop the unused variable to avoid confusion (or use it if it was intended for the threshold logic).

Suggested change

```diff
-SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
+SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
```

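
To illustrate why only min is needed for a slowness threshold, here is a self-contained sketch; the two-column data format, sample values, and the 20% cutoff are assumptions for illustration, not taken from the script:

```shell
# Two columns: GPU index and model-load time in seconds (sample data).
LOAD_DATA='0 10.0
1 12.5
2 10.2'
MIN_TIME=10.0

# Flag any GPU more than 20% slower than the fastest; max plays no role here.
SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '$2 > min * 1.2 {print $1}')
echo "$SLOW_GPUS"   # GPU 1 (12.5s) exceeds 10.0 * 1.2 = 12.0s
```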