
Add Intel nightly tests for XPU and CPU platforms#22677

Open
MingxuZh wants to merge 15 commits into sgl-project:main from MingxuZh:main

Conversation


@MingxuZh MingxuZh commented Apr 13, 2026

Summary

This PR adds comprehensive nightly testing infrastructure for Intel platforms (XPU and CPU) and improves the existing XPU CI configuration.

Changes

1. Nightly Test Workflow (.github/workflows/nightly-test-intel.yml)

  • Schedule: Runs daily at 23:00 Beijing time (15:00 UTC)
  • Two parallel jobs:
    • nightly-test-xpu: Runs on sglang-bmg runner with Intel Arc B580 GPUs
    • nightly-test-cpu: Runs on xeon-gnr runner with Intel Xeon CPU
  • Docker configuration: Fixed multi-GPU support with proper /dev/dri and /dev/dri/by-path volume mounts for oneCCL communication
  • Permissions: Added render group (GID 992) for GPU device access
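The schedule and container settings above can be sketched as a workflow fragment. The job names, runner labels, cron time, device paths, and render-group GID follow the PR description; the image name and test command are illustrative assumptions, not the actual file contents.

```yaml
# Illustrative sketch only -- not the actual nightly-test-intel.yml.
name: Nightly Test (Intel)

on:
  schedule:
    - cron: "0 15 * * *"   # 15:00 UTC == 23:00 Beijing time

jobs:
  nightly-test-xpu:
    runs-on: sglang-bmg     # runner with Intel Arc B580 GPUs
    steps:
      - uses: actions/checkout@v4
      - name: Run XPU tests in container
        run: |
          docker run --rm \
            --device /dev/dri \
            -v /dev/dri/by-path:/dev/dri/by-path \
            --group-add 992 \
            sglang-xpu-ci:latest python3 -m pytest test/
  nightly-test-cpu:
    runs-on: xeon-gnr       # runner with Intel Xeon CPU
    steps:
      - uses: actions/checkout@v4
```

The `--group-add 992` flag adds the container user to the render group so the test process can open the `/dev/dri` render nodes, and mounting `/dev/dri/by-path` exposes stable per-GPU device paths for oneCCL's multi-GPU communication.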

2. PR Test Workflow Updates (.github/workflows/pr-test-xpu.yml)

  • Applied the same Docker mount fixes for consistent behavior between PR and nightly tests

3. New XPU Test Files

  • test_llama_tp.py: Llama 3.2 3B model with TP=2 for multi-GPU testing (nightly only)
  • test_deepseek_ocr.py: Added --mem-fraction-static 0.7 to prevent OOM
  • test_deepseek_ocr_triton.py: Adjusted est_time=400 for proper test ordering
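The endpoint tests listed above presumably follow the common pattern of launching a local server (with `--tp 2` for the multi-GPU case) and posting JSON to its HTTP endpoints. A minimal, self-contained sketch of that request pattern, using only the standard library; the base URL and payload field names are assumptions, not the PR's actual test code:

```python
# Hypothetical sketch of the request pattern used by the XPU endpoint tests.
# The payload shape and server URL are assumptions, not the PR's actual code.
import json
import urllib.request


def build_generate_payload(prompt: str, max_new_tokens: int = 32) -> dict:
    """Build a /generate-style request body with greedy sampling."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.0,
        },
    }


def post_json(url: str, payload: dict) -> dict:
    """POST JSON and parse the response; urllib raises HTTPError on non-2xx."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Usage would look like `post_json("http://127.0.0.1:30000/generate", build_generate_payload("The capital of France is"))` against a server launched with `--tp 2`, which exercises the tensor-parallel path across both GPUs.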

Test Configuration

| Test | Model | TP | Suite | Notes |
|---|---|---|---|---|
| DeepSeek-OCR | deepseek-ai/DeepSeek-OCR | 1 | per-commit, nightly | With triton backend |
| Llama TP=2 | meta-llama/Llama-3.2-3B-Instruct | 2 | nightly | Multi-GPU validation |

Hardware Requirements

  • XPU: Intel Arc B580 (4x 12GB) on sglang-bmg runner
  • CPU: Intel Xeon on xeon-gnr runner


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces XPU (Intel GPU) support to the CI registration system and adds several XPU-specific tests, including a new multi-GPU tensor parallelism test for Llama 3.2. Key changes include the definition of register_xpu_ci, updates to the test suite runner to recognize XPU backends, and memory limit adjustments for DeepSeek OCR tests. Feedback focuses on improving test robustness by checking HTTP response statuses in the new Llama test and enabling strict suite validation for the XPU platform in the test runner.

Comment on lines +66 to +68

```python
HWBackend.XPU: [
    "per-commit-xpu",
],
```


medium

While adding HWBackend.XPU to the suite mappings is correct, it should also be added to the _SUITE_CHECKED_BACKENDS set (around line 136) to enable strict suite validation for the XPU platform. Currently, validation is skipped for XPU tests, which could lead to incorrectly registered tests going unnoticed during CI runs.
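The bot's point can be illustrated with a minimal model of the registration check. The names `HWBackend`, `_SUITE_CHECKED_BACKENDS`, and `"per-commit-xpu"` mirror those quoted in the comment; the validation logic itself is an illustrative assumption, not the repo's actual implementation.

```python
# Minimal model of the strict-suite-validation idea from the review comment.
# Names mirror the comment; the logic is an illustrative assumption.
from enum import Enum, auto


class HWBackend(Enum):
    CUDA = auto()
    XPU = auto()


# Valid suite names per backend.
SUITES = {
    HWBackend.XPU: ["per-commit-xpu"],
}

# Backends in this set get strict validation. Including XPU here makes a
# mistyped suite name fail loudly instead of being silently accepted.
_SUITE_CHECKED_BACKENDS = {HWBackend.CUDA, HWBackend.XPU}


def validate_suite(backend: HWBackend, suite: str) -> bool:
    """Return True if the suite name is valid (or validation is skipped)."""
    if backend not in _SUITE_CHECKED_BACKENDS:
        return True  # validation skipped for unchecked backends
    return suite in SUITES.get(backend, [])
```

With XPU left out of `_SUITE_CHECKED_BACKENDS`, `validate_suite(HWBackend.XPU, "per-commit-typo")` would return `True` and the misregistered test would go unnoticed; with it included, the check fails as intended.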

Comment on lines +83 to +84

```python
)
ret = response.json()
```


medium

It is recommended to call response.raise_for_status() before attempting to parse the JSON response. This ensures that if the server returns an error (e.g., 500 Internal Server Error), the test fails with a clear HTTP error message rather than a potentially confusing KeyError or JSONDecodeError later.

Suggested change

```diff
 )
+response.raise_for_status()
 ret = response.json()
```

Comment on lines +102 to +103

```python
)
ret = response.json()
```


medium

Similar to the /generate endpoint, adding response.raise_for_status() here will improve the robustness of the test by providing immediate feedback if the chat completion request fails.

Suggested change

```diff
 )
+response.raise_for_status()
 ret = response.json()
```
