Skip to content

Commit b6e9fc7

Browse files
committed
test
Signed-off-by: Zhewen Li <zhewenli@meta.com>
1 parent d3c686b commit b6e9fc7

File tree

1 file changed

+166
-0
lines changed

1 file changed

+166
-0
lines changed

NIGHTLY_NOTIFICATION_FIX.md

Lines changed: 166 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,166 @@
1+
# Nightly Tests Failure Notification Fix
2+
3+
## Problem
4+
5+
The "Nightly Tests Failure Notification" step was not working properly during nightly builds because:
6+
7+
1. The original wait step used `wait: ~` which waits for **ALL** previous pipeline steps
8+
2. Many tests had manual block steps that required human approval:
9+
- `block-build-cu118` - CUDA 11.8 image build
10+
- `block-arm64-cpu-img-build` - ARM64 CPU image build
11+
- `block-intel-cpu` - Intel CPU test
12+
- `block-ibm-power` - IBM Power test
13+
- `block-ibm-ppc64-test` - IBM Power ppc64le test
14+
- `block-ibm-s390x` - IBM Z test
15+
16+
3. During nightly runs with `NIGHTLY="1"` and `RUN_ALL="1"`, these blocks were never manually approved, causing the wait to hang forever
17+
4. The notification step never executed because it was waiting for blocked steps to complete
18+
19+
## Solution
20+
21+
### Strategy: Explicit Wait Dependencies with Checkpoints
22+
23+
Instead of using `wait: ~` (waits for everything), we implemented strategic checkpoint wait steps and made the final wait depend only on those checkpoints.
24+
25+
### Changes Made
26+
27+
#### 1. Checkpoint after Main GPU Tests (line ~488-492)
28+
29+
```jinja
30+
{% if nightly == "1" %}
31+
# Checkpoint: wait for all main GPU tests to complete before continuing
32+
- wait:
33+
key: main-tests-complete
34+
continue_on_failure: true
35+
{% endif %}
36+
```
37+
38+
This checkpoint is placed after the `{% for step in steps %}` loop completes, ensuring all main GPU tests are done.
39+
40+
#### 2. Checkpoint after AMD Tests (line ~637-642)
41+
42+
```jinja
43+
{% if nightly == "1" %}
44+
# Checkpoint: wait for all AMD tests to complete
45+
- wait:
46+
key: amd-tests-complete
47+
continue_on_failure: true
48+
{% endif %}
49+
```
50+
51+
This checkpoint is placed after the AMD test group completes.
52+
53+
#### 3. Key Added to GH200 Test (line ~752)
54+
55+
```jinja
56+
- label: "GH200 Test"
57+
key: gh200-test
58+
depends_on: ~
59+
soft_fail: true
60+
agents:
61+
queue: gh200_queue
62+
command: nvidia-smi && bash .buildkite/scripts/hardware_ci/run-gh200-test.sh
63+
```
64+
65+
Added explicit `key: gh200-test` so we can reference it in the wait dependencies.
66+
67+
#### 4. Final Wait with Explicit Dependencies (line ~773-778)
68+
69+
```jinja
70+
- wait:
71+
depends_on:
72+
- main-tests-complete
73+
- amd-tests-complete
74+
- gh200-test
75+
continue_on_failure: true
76+
77+
- label: "Nightly Tests Failure Notification"
78+
soft_fail: true
79+
agents:
80+
queue: small_cpu_queue
81+
commands: |
82+
# Gather test results...
83+
```
84+
85+
The notification wait now depends ONLY on:
86+
- Main GPU tests checkpoint
87+
- AMD tests checkpoint
88+
- GH200 test
89+
90+
## How It Works
91+
92+
### During Nightly Runs (`NIGHTLY="1"`):
93+
94+
1. **Main GPU tests execute** → checkpoint wait is created at the end
95+
2. **AMD tests execute** → checkpoint wait is created at the end
96+
3. **GH200 test executes** → has explicit key
97+
4. **Blocked steps exist but are ignored** (CUDA 11.8 build, ARM64 build, Intel/IBM tests)
98+
5. **Final wait** triggers when the three checkpoints complete
99+
6. **Notification executes** and gathers results from all completed tests
100+
101+
### Tests Included in Notification:
102+
103+
The notification checks results from the `steps` variable (main test suite from `.buildkite/test-pipeline.yaml`), which includes:
104+
- All main GPU tests (L4, A100, H100, H200, B200)
105+
- Multi-GPU distributed tests
106+
- Various vLLM feature tests
107+
108+
### Tests Excluded from Wait (Won't Block Notification):
109+
110+
- CUDA 11.8 image build (blocked)
111+
- ARM64 CPU image build (blocked)
112+
- Torch nightly tests (in separate group with `depends_on: ~`)
113+
- Intel CPU/HPU/GPU tests
114+
- IBM Power tests
115+
- IBM Z tests
116+
- Neuron tests
117+
- Ascend NPU tests
118+
119+
These tests can exist and run (if manually approved or have agents), but they won't prevent the notification from running.
120+
121+
## Benefits
122+
123+
1. ✅ Nightly notification runs as soon as core tests complete
124+
2. ✅ No hanging on manual approval blocks
125+
3. ✅ Blocked builds/tests don't interfere with notification
126+
4. ✅ Original test structure preserved (no blocks removed)
127+
5. ✅ Explicit control over which test groups to wait for
128+
129+
## Technical Notes
130+
131+
### Why Not `wait: ~`?
132+
133+
Buildkite's `wait` step has only two modes:
134+
- `wait: ~` - waits for ALL previous steps (including blocks)
135+
- `wait` with `depends_on: [keys]` - waits ONLY for specified keys
136+
137+
There's no "exclude" or "skip" pattern available, so we must use explicit dependencies.
138+
139+
### Why Checkpoints?
140+
141+
Individual test steps in the `{% for step in steps %}` loop don't have explicit keys by default. Rather than adding keys to hundreds of tests, we place a single checkpoint wait after the loop completes. This checkpoint implicitly waits for all tests in that section.
142+
143+
### Nightly Environment Variables
144+
145+
```bash
146+
AMD_MIRROR_HW="amdexperimental"
147+
NIGHTLY="1"
148+
RUN_ALL="1"
149+
```
150+
151+
These variables ensure:
152+
- All optional tests run (no `ns.blocked == 1` conditions)
153+
- Checkpoints are created (only when `nightly == "1"`)
154+
- AMD tests execute on experimental hardware
155+
156+
## Testing
157+
158+
To test these changes:
159+
1. Create a build with environment variable: `VLLM_CI_BRANCH=<your-branch>`
160+
2. Set `NIGHTLY="1"` and `RUN_ALL="1"`
161+
3. Verify the notification step runs after main/AMD/GH200 tests complete
162+
4. Verify it doesn't wait for blocked Intel/IBM tests
163+
164+
## Files Modified
165+
166+
- `buildkite/test-template-ci.j2` - Jinja2 template for CI pipeline

0 commit comments

Comments
 (0)