Commit 94f9489

[https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back (#12159)

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
1 parent c0cf5a3 commit 94f9489

File tree

5 files changed: +166 −4 lines changed

jenkins/L0_Test.groovy

Lines changed: 10 additions & 2 deletions
```diff
@@ -3396,15 +3396,15 @@ def launchTestJobs(pipeline, testFilter)
         "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge",
         "auto:gb200-flex",
         "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2",
-        2,
+        3,
         8,
         2
     )
     multiNodesSBSAConfigs += buildStageConfigs(
         "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge",
         "auto:gb200-flex",
         "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4",
-        3,
+        4,
         8,
         2
     )
@@ -3417,6 +3417,14 @@ def launchTestJobs(pipeline, testFilter)
         2
     )
     // 3 Nodes
+    multiNodesSBSAConfigs += buildStageConfigs(
+        "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge",
+        "auto:gb200-flex",
+        "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8",
+        1,
+        12,
+        3
+    )
     multiNodesSBSAConfigs += buildStageConfigs(
         "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge",
         "auto:gb200-flex",
```
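The stage names above redundantly encode `gpuCount` and `nodeCount`, so a mismatch between the name and the arguments is a plausible review-time slip. As a sketch (a hypothetical helper, not part of the repo), the consistency check can be expressed in Python:

```python
import re

def check_stage_config(stage_name, gpu_count, node_count):
    """Check that a GB200 Disagg-PerfSanity stage name agrees with the
    gpuCount/nodeCount arguments passed to buildStageConfigs().
    Hypothetical helper for illustration only."""
    m = re.match(r"GB200-(\d+)_GPUs-(\d+)_Nodes-", stage_name)
    if not m:
        raise ValueError(f"unrecognized stage name: {stage_name}")
    name_gpus, name_nodes = int(m.group(1)), int(m.group(2))
    return name_gpus == gpu_count and name_nodes == node_count

# The new 3-node stage from this commit: 12 GPUs across 3 nodes.
print(check_stage_config(
    "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-"
    "CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge",
    12, 3))  # True
```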

jenkins/scripts/perf/README.md

Lines changed: 34 additions & 0 deletions
````diff
@@ -124,6 +124,40 @@ When modifying disaggregated SLURM scripts, keep these invariants:
 3. **Non-MPI roles must stay MPI-free**: The disagg server and benchmark steps must
    not receive `--mpi` flags. If adding a new srun step, consider whether it needs MPI.
 
+## Adding or Re-enabling Perf Sanity Tests in CI
+
+When adding or re-enabling perf sanity tests, two files must be updated:
+
+1. **Test-db YAML** in `tests/integration/test_lists/test-db/` — add or uncomment the test case line
+2. **`jenkins/L0_Test.groovy`** — update or add the CI stage in `launchTestJobs()`
+
+### Where to Find CI Stage Definitions
+
+In `jenkins/L0_Test.groovy`, search for `launchTestJobs`. Perf sanity stages are grouped by test type:
+
+| Config Variable | Test Type | Platform |
+|-----------------|-----------|----------|
+| `x86SlurmTestConfigs` | Single-node aggregated perf sanity (x86) | `"auto:h100-cr-x8"` etc. |
+| `SBSASlurmTestConfigs` | Single-node aggregated perf sanity (SBSA/Grace) | `"auto:gb200-x4"` etc. |
+| `multiNodesSBSAConfigs` | Multi-node aggregated **and** disaggregated perf sanity | `"auto:gb200-flex"` etc. |
+
+### `buildStageConfigs` Function
+
+Disaggregated and multi-node perf sanity stages use `buildStageConfigs()`:
+
+```groovy
+def buildStageConfigs(stageName, platform, testlist, testCount, gpuCount, nodeCount, runWithSbatch=false)
+```
+
+- `testlist`: test-db YAML filename without `.yml` extension
+- `testCount`: must equal the number of **active (uncommented)** tests in the test-db file (each disagg test gets its own CI stage)
+- `gpuCount`: total GPUs allocated per stage = `total_nodes * gpus_per_node`
+- `nodeCount`: total SLURM nodes per stage
+
+When adding a test, either increment `testCount` on an existing entry or add a new `buildStageConfigs` block. Stages are grouped by node count (2 Nodes, 3 Nodes, 4 Nodes, etc.).
+
+For the full step-by-step guide including how to derive test-db filenames and GPU/node counts from disaggregated config YAMLs, see [`tests/scripts/perf-sanity/README.md`](../../tests/scripts/perf-sanity/README.md) ("Step-by-Step: Adding or Re-enabling Disaggregated Perf Sanity Tests").
+
 ## Post-Processing and Triage
 
 ### `get_pre_merge_html.py`
````

tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2:
   tests:
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
-  # - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
+  - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
```

tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4:
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
-  # - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
+  - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
```
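Each uncommented line in these test-db files must be reflected in the `testCount` argument of the matching `buildStageConfigs` entry. A minimal sketch of how that count could be derived mechanically (a hypothetical helper, not repo code):

```python
def count_active_tests(testdb_lines):
    """Count active (uncommented) test entries in a test-db YAML.
    Commented-out entries start with '# -' and are excluded.
    Hypothetical helper illustrating how testCount is derived."""
    return sum(
        1 for line in testdb_lines
        if line.lstrip().startswith("- perf/")
    )

# Sample mirroring the structure of the files in this commit:
# three active gen_only tests plus one still-commented e2e test.
sample = """\
  tests:
  - perf/test_perf_sanity.py::test_e2e[a] TIMEOUT (120)
  - perf/test_perf_sanity.py::test_e2e[b] TIMEOUT (120)
  - perf/test_perf_sanity.py::test_e2e[c] TIMEOUT (120)
  # - perf/test_perf_sanity.py::test_e2e[d] TIMEOUT (120)
""".splitlines()

print(count_active_tests(sample))  # 3
```

This matches the commit's bookkeeping: uncommenting one test raised the first file's active count from 2 to 3 and the second's from 3 to 4.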

tests/scripts/perf-sanity/README.md

Lines changed: 120 additions & 0 deletions
````diff
@@ -233,3 +233,123 @@ When working with perf sanity tests, use these paths:
 | Local submit (all) | `jenkins/scripts/perf/local/submit.py` |
 | Jenkins pipeline | `jenkins/L0_Test.groovy` |
 | Test database | `tests/integration/test_lists/test-db/` |
+| Test waives | `tests/integration/test_lists/waives.txt` |
+
+## Step-by-Step: Adding or Re-enabling Disaggregated Perf Sanity Tests
+
+When adding a new disaggregated perf sanity test (or uncommenting an existing one), you must update **two files**: the test-db YAML and `jenkins/L0_Test.groovy`. This section describes how to locate and edit each one.
+
+### Step 1: Identify the Disaggregated Config YAML
+
+Config files live in `tests/scripts/perf-sanity/disaggregated/`. The filename encodes the GPU type and test parameters:
+
+```
+{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml
+```
+
+Example: `gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml`
+
+The **base name** (filename without `.yaml`) is used as the test case ID in the test-db.
+
+### Step 2: Calculate Resource Requirements from Config YAML
+
+Read the config YAML and extract these fields:
+
+```yaml
+hardware:
+  gpus_per_node: 4       # GPUs per physical node
+  num_ctx_servers: 1     # Number of context workers
+  num_gen_servers: 1     # Number of generation workers
+worker_config:
+  ctx:
+    tensor_parallel_size: 1    # GPUs per ctx worker
+  gen:
+    tensor_parallel_size: 8    # GPUs per gen worker
+```
+
+Calculate:
+
+| Value | Formula | Example (ctx_tp=1, gen_tp=8, gpus_per_node=4) |
+|-------|---------|-----------------------------------------------|
+| Nodes per ctx worker | `ceil(ctx_tp / gpus_per_node)` | `ceil(1/4) = 1` |
+| Nodes per gen worker | `ceil(gen_tp / gpus_per_node)` | `ceil(8/4) = 2` |
+| Total nodes | `(nodes_per_ctx * num_ctx) + (nodes_per_gen * num_gen)` | `1*1 + 2*1 = 3` |
+| Total GPUs | `total_nodes * gpus_per_node` | `3 * 4 = 12` |
+
+### Step 3: Find the Test-db YAML File
+
+The test-db file name follows this pattern (all in `tests/integration/test_lists/test-db/`):
+
+```
+l0_{gpu_type}_multi_nodes_perf_sanity_ctx{num_ctx}_node{nodes_per_ctx}_gpu{ctx_tp}_gen{num_gen}_node{nodes_per_gen}_gpu{gen_tp}.yml
+```
+
+Example: for ctx_tp=1, gen_tp=8, 1 ctx worker, 1 gen worker on GB200:
+
+```
+l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8.yml
+```
+
+The `system_gpu_count` in the test-db condition section equals the total GPUs calculated above.
+
+### Step 4: Add or Uncomment the Test in the Test-db File
+
+Each disagg test line in the test-db file follows one of these formats:
+
+```yaml
+# gen_only test:
+- perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-{config_base_name}] TIMEOUT (120)
+
+# e2e test:
+- perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-{config_base_name}] TIMEOUT (120)
+
+# ctx_only test (placed in aggregated test-db files, not disagg ones):
+- perf/test_perf_sanity.py::test_e2e[aggr_upload-ctx_only-{config_base_name}] TIMEOUT (120)
+```
+
+- If the test line already exists but is **commented out** (prefixed with `# `), remove the `# ` prefix.
+- If the test line does not exist, add it to the `tests` list.
+- Count the total number of **active (uncommented) tests** in the file — you will need this count for Step 5.
+
+### Step 5: Update `jenkins/L0_Test.groovy`
+
+Open `jenkins/L0_Test.groovy` and search for the `multiNodesSBSAConfigs` section inside `launchTestJobs()`. Disaggregated perf sanity stages are added via `buildStageConfigs()`:
+
+```groovy
+def buildStageConfigs(stageName, platform, testlist, testCount, gpuCount, nodeCount, runWithSbatch=false)
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `stageName` | CI stage name prefix (see naming convention below) |
+| `platform` | Hardware platform, e.g., `"auto:gb200-flex"` |
+| `testlist` | Test-db filename **without** `.yml`, e.g., `"l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4"` |
+| `testCount` | Number of **active (uncommented)** tests in the test-db file. Each disagg test gets its own stage, so `testCount` must equal the number of active tests. |
+| `gpuCount` | Total GPUs from Step 2 (= `total_nodes * gpus_per_node`) |
+| `nodeCount` | Total nodes from Step 2 |
+
+**Stage naming convention:**
+
+```
+GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
+```
+
+**If a `buildStageConfigs` entry already exists** for the test-db file: update `testCount` to match the new total number of active tests.
+
+**If no entry exists** for the test-db file: add a new `buildStageConfigs` block. Insert it in the correct section sorted by node count (2 Nodes, 3 Nodes, 4 Nodes, etc.).
+
+### Step 6: Check Waives
+
+Search `tests/integration/test_lists/waives.txt` for the exact test case string. If the test is listed there with a `SKIP` directive, remove that line (otherwise the test will be skipped even if present in the test-db).
+
+### Worked Example
+
+Adding back `qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX` as a gen_only test:
+
+1. Config file: `tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml`
+2. From config: `gpus_per_node=4`, `ctx_tp=1`, `gen_tp=4`, `num_ctx=1`, `num_gen=1`
+3. Nodes per ctx = `ceil(1/4)=1`, nodes per gen = `ceil(4/4)=1`, total nodes = 2, total GPUs = 8
+4. Test-db file: `l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml`
+5. Uncomment the line: `- perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)`
+6. Count active tests in that file (now 4)
+7. In `L0_Test.groovy`, find the existing `buildStageConfigs` for `l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4`, update `testCount` from 3 to 4
+8. Check `waives.txt` — no matching entry, done
````
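The Step 2 arithmetic added by this commit can be sketched directly in Python (hypothetical helper mirroring the documented formulas, not repo code):

```python
import math

def disagg_resources(gpus_per_node, ctx_tp, gen_tp, num_ctx=1, num_gen=1):
    """Derive (total_nodes, total_gpus) for a disaggregated perf sanity
    stage, following the formulas documented in Step 2 above."""
    nodes_per_ctx = math.ceil(ctx_tp / gpus_per_node)
    nodes_per_gen = math.ceil(gen_tp / gpus_per_node)
    total_nodes = nodes_per_ctx * num_ctx + nodes_per_gen * num_gen
    total_gpus = total_nodes * gpus_per_node
    return total_nodes, total_gpus

# Worked example: ctx_tp=1, gen_tp=4, gpus_per_node=4 -> 2 nodes, 8 GPUs
print(disagg_resources(4, 1, 4))  # (2, 8)

# The new 3-node stage in this commit: ctx_tp=1, gen_tp=8 -> 3 nodes, 12 GPUs,
# matching its gpuCount=12 / nodeCount=3 arguments in L0_Test.groovy.
print(disagg_resources(4, 1, 8))  # (3, 12)
```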
