Add Intel nightly tests for XPU and CPU platforms #22677
MingxuZh wants to merge 15 commits into sgl-project:main
Code Review
This pull request introduces XPU (Intel GPU) support to the CI registration system and adds several XPU-specific tests, including a new multi-GPU tensor parallelism test for Llama 3.2. Key changes include the definition of register_xpu_ci, updates to the test suite runner to recognize XPU backends, and memory limit adjustments for DeepSeek OCR tests. Feedback focuses on improving test robustness by checking HTTP response statuses in the new Llama test and enabling strict suite validation for the XPU platform in the test runner.
    HWBackend.XPU: [
        "per-commit-xpu",
    ],
While adding HWBackend.XPU to the suite mappings is correct, it should also be added to the _SUITE_CHECKED_BACKENDS set (around line 136) to enable strict suite validation for the XPU platform. Currently, validation is skipped for XPU tests, which could lead to incorrectly registered tests going unnoticed during CI runs.
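A minimal sketch of what strict suite validation could look like once XPU is in the checked set. Everything here is a hypothetical stand-in rather than the project's actual code: the `HWBackend` enum, the `SUITES` mapping, the `SUITE_CHECKED_BACKENDS` set, and the `validate_suite` helper are illustrative names, and only the `"per-commit-xpu"` suite name comes from the diff above.

```python
from enum import Enum, auto


class HWBackend(Enum):
    """Hypothetical stand-in for the project's backend enum."""
    CUDA = auto()
    XPU = auto()


# Suites each backend may register into. Only the XPU entry is taken
# from the diff; the CUDA entry is a placeholder for illustration.
SUITES = {
    HWBackend.CUDA: ["per-commit-cuda"],
    HWBackend.XPU: ["per-commit-xpu"],
}

# Backends whose registrations are strictly validated. Leaving XPU out
# of this set would make validate_suite() silently accept any suite name.
SUITE_CHECKED_BACKENDS = {HWBackend.CUDA, HWBackend.XPU}


def validate_suite(backend: HWBackend, suite: str) -> None:
    """Raise ValueError if a checked backend registers an unknown suite."""
    if backend not in SUITE_CHECKED_BACKENDS:
        return  # validation is skipped for unchecked backends
    allowed = SUITES.get(backend, [])
    if suite not in allowed:
        raise ValueError(
            f"suite {suite!r} is not valid for {backend.name}; "
            f"expected one of {allowed}"
        )
```

With XPU in the checked set, a misspelled suite such as `"per-comit-xpu"` fails at registration time instead of being silently dropped from CI runs.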
    )
    ret = response.json()
It is recommended to call response.raise_for_status() before attempting to parse the JSON response. This ensures that if the server returns an error (e.g., 500 Internal Server Error), the test fails with a clear HTTP error message rather than a potentially confusing KeyError or JSONDecodeError later.
Suggested change:
    Before:
        )
        ret = response.json()
    After:
        )
        response.raise_for_status()
        ret = response.json()
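The failure mode described in this comment can be illustrated without a running server. The sketch below uses a hypothetical `FakeResponse` class that mimics only the two methods involved (`raise_for_status()` and `json()`); it is a stand-in for an HTTP client's response object, not the test's actual code, and `parse_checked` is an illustrative helper name.

```python
import json


class HTTPError(Exception):
    """Hypothetical stand-in for an HTTP client's status error."""


class FakeResponse:
    """Minimal response stub: status code plus raw body text."""

    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self) -> None:
        # Mirror the usual client behavior: 4xx and 5xx codes raise.
        if self.status_code >= 400:
            raise HTTPError(f"HTTP {self.status_code}: {self.text}")

    def json(self):
        return json.loads(self.text)


def parse_checked(response):
    """Check the status first so failures surface as HTTP errors."""
    response.raise_for_status()
    return response.json()


ok = FakeResponse(200, '{"text": "Paris"}')
bad = FakeResponse(500, "Internal Server Error")

print(parse_checked(ok))  # parsed body of the successful response

try:
    parse_checked(bad)
except HTTPError as exc:
    print("clear failure:", exc)

try:
    bad.json()  # without the status check, the error is a parse error
except json.JSONDecodeError as exc:
    print("confusing failure:", type(exc).__name__)
```

The second `try` block shows why the ordering matters: skipping the status check turns a server-side error into a JSON parsing error that says nothing about the HTTP status.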
Summary
This PR adds comprehensive nightly testing infrastructure for Intel platforms (XPU and CPU) and improves the existing XPU CI configuration.
Changes
1. Nightly Test Workflow (.github/workflows/nightly-test-intel.yml)
- nightly-test-xpu: Runs on sglang-bmg runner with Intel Arc B580 GPUs
- nightly-test-cpu: Runs on xeon-gnr runner with Intel Xeon CPU
- /dev/dri and /dev/dri/by-path volume mounts for oneCCL communication
- render group (GID 992) for GPU device access
2. PR Test Workflow Updates (.github/workflows/pr-test-xpu.yml)
3. New XPU Test Files
- test_llama_tp.py: Llama 3.2 3B model with TP=2 for multi-GPU testing (nightly only)
- test_deepseek_ocr.py: Added --mem-fraction-static 0.7 to prevent OOM
- test_deepseek_ocr_triton.py: Adjusted est_time=400 for proper test ordering
Test Configuration
Hardware Requirements
- sglang-bmg runner
- xeon-gnr runner
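The container settings listed under the nightly workflow (the two /dev/dri volume mounts and the render group with GID 992) can be sketched as a `docker run` argument list. This is an illustration under assumptions, not the workflow's actual configuration: the helper name `xpu_docker_args` and the image name are hypothetical, and the PR does not state whether the real workflow also passes `--device` or `--privileged`.

```python
def xpu_docker_args(image: str, render_gid: int = 992) -> list[str]:
    """Build a docker run command with the mounts and group from the PR.

    The image argument is a placeholder; the default render_gid of 992
    is the GID quoted in the PR description.
    """
    return [
        "docker", "run", "--rm",
        # Volume mounts described as needed for oneCCL communication.
        "-v", "/dev/dri:/dev/dri",
        "-v", "/dev/dri/by-path:/dev/dri/by-path",
        # Add the container user to the render group for GPU device access.
        "--group-add", str(render_gid),
        image,
    ]


# Hypothetical image name, used only to show the resulting command line.
print(" ".join(xpu_docker_args("example/sglang-xpu:nightly")))
```

Passing the GID as a parameter keeps the sketch usable on runners where the render group has a different numeric ID.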