Commit 94f9489

[https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back (#12159)

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
1 parent c0cf5a3 commit 94f9489

File tree

5 files changed: +166 −4 lines changed

jenkins/L0_Test.groovy

Lines changed: 10 additions & 2 deletions
```diff
@@ -3396,15 +3396,15 @@ def launchTestJobs(pipeline, testFilter)
         "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge",
         "auto:gb200-flex",
         "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2",
-        2,
+        3,
         8,
         2
     )
     multiNodesSBSAConfigs += buildStageConfigs(
         "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge",
         "auto:gb200-flex",
         "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4",
-        3,
+        4,
         8,
         2
     )
@@ -3417,6 +3417,14 @@ def launchTestJobs(pipeline, testFilter)
         2
     )
     // 3 Nodes
+    multiNodesSBSAConfigs += buildStageConfigs(
+        "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge",
+        "auto:gb200-flex",
+        "l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8",
+        1,
+        12,
+        3
+    )
     multiNodesSBSAConfigs += buildStageConfigs(
         "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge",
         "auto:gb200-flex",
```
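The stage names above redundantly encode `gpuCount` and `nodeCount`, so a mismatch between the name and the arguments is a plausible review-time slip. As a sketch (a hypothetical helper, not part of the repo), the consistency check can be expressed in Python:

```python
import re

def check_stage_config(stage_name, gpu_count, node_count):
    """Check that a GB200 Disagg-PerfSanity stage name agrees with the
    gpuCount/nodeCount arguments passed to buildStageConfigs().
    Hypothetical helper for illustration only."""
    m = re.match(r"GB200-(\d+)_GPUs-(\d+)_Nodes-", stage_name)
    if not m:
        raise ValueError(f"unrecognized stage name: {stage_name}")
    name_gpus, name_nodes = int(m.group(1)), int(m.group(2))
    return name_gpus == gpu_count and name_nodes == node_count

# The new 3-node stage from this commit: 12 GPUs across 3 nodes.
print(check_stage_config(
    "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-"
    "CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge",
    12, 3))  # True
```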

jenkins/scripts/perf/README.md

Lines changed: 34 additions & 0 deletions
````diff
@@ -124,6 +124,40 @@ When modifying disaggregated SLURM scripts, keep these invariants:
 3. **Non-MPI roles must stay MPI-free**: The disagg server and benchmark steps must
    not receive `--mpi` flags. If adding a new srun step, consider whether it needs MPI.
 
+## Adding or Re-enabling Perf Sanity Tests in CI
+
+When adding or re-enabling perf sanity tests, two files must be updated:
+
+1. **Test-db YAML** in `tests/integration/test_lists/test-db/` — add or uncomment the test case line
+2. **`jenkins/L0_Test.groovy`** — update or add the CI stage in `launchTestJobs()`
+
+### Where to Find CI Stage Definitions
+
+In `jenkins/L0_Test.groovy`, search for `launchTestJobs`. Perf sanity stages are grouped by test type:
+
+| Config Variable | Test Type | Platform |
+|-----------------|-----------|----------|
+| `x86SlurmTestConfigs` | Single-node aggregated perf sanity (x86) | `"auto:h100-cr-x8"` etc. |
+| `SBSASlurmTestConfigs` | Single-node aggregated perf sanity (SBSA/Grace) | `"auto:gb200-x4"` etc. |
+| `multiNodesSBSAConfigs` | Multi-node aggregated **and** disaggregated perf sanity | `"auto:gb200-flex"` etc. |
+
+### `buildStageConfigs` Function
+
+Disaggregated and multi-node perf sanity stages use `buildStageConfigs()`:
+
+```groovy
+def buildStageConfigs(stageName, platform, testlist, testCount, gpuCount, nodeCount, runWithSbatch=false)
+```
+
+- `testlist`: test-db YAML filename without `.yml` extension
+- `testCount`: must equal the number of **active (uncommented)** tests in the test-db file (each disagg test gets its own CI stage)
+- `gpuCount`: total GPUs allocated per stage = `total_nodes * gpus_per_node`
+- `nodeCount`: total SLURM nodes per stage
+
+When adding a test, either increment `testCount` on an existing entry or add a new `buildStageConfigs` block. Stages are grouped by node count (2 Nodes, 3 Nodes, 4 Nodes, etc.).
+
+For the full step-by-step guide including how to derive test-db filenames and GPU/node counts from disaggregated config YAMLs, see [`tests/scripts/perf-sanity/README.md`](../../tests/scripts/perf-sanity/README.md) ("Step-by-Step: Adding or Re-enabling Disaggregated Perf Sanity Tests").
+
 ## Post-Processing and Triage
 
 ### `get_pre_merge_html.py`
````

tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2:
   tests:
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
-  # - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
+  - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
```

tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4:
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
-  # - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
+  - perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
   # - perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)
```
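Each uncommented line in these test-db files must be reflected in the `testCount` argument of the matching `buildStageConfigs` entry. A minimal sketch of how that count could be derived mechanically (a hypothetical helper, not repo code):

```python
def count_active_tests(testdb_lines):
    """Count active (uncommented) test entries in a test-db YAML.
    Commented-out entries start with '# -' and are excluded.
    Hypothetical helper illustrating how testCount is derived."""
    return sum(
        1 for line in testdb_lines
        if line.lstrip().startswith("- perf/")
    )

# Sample mirroring the structure of the files in this commit:
# three active gen_only tests plus one still-commented e2e test.
sample = """\
  tests:
  - perf/test_perf_sanity.py::test_e2e[a] TIMEOUT (120)
  - perf/test_perf_sanity.py::test_e2e[b] TIMEOUT (120)
  - perf/test_perf_sanity.py::test_e2e[c] TIMEOUT (120)
  # - perf/test_perf_sanity.py::test_e2e[d] TIMEOUT (120)
""".splitlines()

print(count_active_tests(sample))  # 3
```

This matches the commit's bookkeeping: uncommenting one test raised the first file's active count from 2 to 3 and the second's from 3 to 4.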

tests/scripts/perf-sanity/README.md

Lines changed: 120 additions & 0 deletions
````diff
@@ -233,3 +233,123 @@ When working with perf sanity tests, use these paths:
 | Local submit (all) | `jenkins/scripts/perf/local/submit.py` |
 | Jenkins pipeline | `jenkins/L0_Test.groovy` |
 | Test database | `tests/integration/test_lists/test-db/` |
+| Test waives | `tests/integration/test_lists/waives.txt` |
+
+## Step-by-Step: Adding or Re-enabling Disaggregated Perf Sanity Tests
+
+When adding a new disaggregated perf sanity test (or uncommenting an existing one), you must update **two files**: the test-db YAML and `jenkins/L0_Test.groovy`. This section describes how to locate and edit each one.
+
+### Step 1: Identify the Disaggregated Config YAML
+
+Config files live in `tests/scripts/perf-sanity/disaggregated/`. The filename encodes the GPU type and test parameters:
+
+```
+{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml
+```
+
+Example: `gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml`
+
+The **base name** (filename without `.yaml`) is used as the test case ID in the test-db.
+
+### Step 2: Calculate Resource Requirements from Config YAML
+
+Read the config YAML and extract these fields:
+
+```yaml
+hardware:
+  gpus_per_node: 4       # GPUs per physical node
+  num_ctx_servers: 1     # Number of context workers
+  num_gen_servers: 1     # Number of generation workers
+worker_config:
+  ctx:
+    tensor_parallel_size: 1    # GPUs per ctx worker
+  gen:
+    tensor_parallel_size: 8    # GPUs per gen worker
+```
+
+Calculate:
+
+| Value | Formula | Example (ctx_tp=1, gen_tp=8, gpus_per_node=4) |
+|-------|---------|-----------------------------------------------|
+| Nodes per ctx worker | `ceil(ctx_tp / gpus_per_node)` | `ceil(1/4) = 1` |
+| Nodes per gen worker | `ceil(gen_tp / gpus_per_node)` | `ceil(8/4) = 2` |
+| Total nodes | `(nodes_per_ctx * num_ctx) + (nodes_per_gen * num_gen)` | `1*1 + 2*1 = 3` |
+| Total GPUs | `total_nodes * gpus_per_node` | `3 * 4 = 12` |
+
+### Step 3: Find the Test-db YAML File
+
+The test-db file name follows this pattern (all in `tests/integration/test_lists/test-db/`):
+
+```
+l0_{gpu_type}_multi_nodes_perf_sanity_ctx{num_ctx}_node{nodes_per_ctx}_gpu{ctx_tp}_gen{num_gen}_node{nodes_per_gen}_gpu{gen_tp}.yml
+```
+
+Example: for ctx_tp=1, gen_tp=8, 1 ctx worker, 1 gen worker on GB200:
+
+```
+l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8.yml
+```
+
+The `system_gpu_count` in the test-db condition section equals the total GPUs calculated above.
+
+### Step 4: Add or Uncomment the Test in the Test-db File
+
+Each disagg test line in the test-db file follows one of these formats:
+
+```yaml
+# gen_only test:
+- perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-{config_base_name}] TIMEOUT (120)
+
+# e2e test:
+- perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-{config_base_name}] TIMEOUT (120)
+
+# ctx_only test (placed in aggregated test-db files, not disagg ones):
+- perf/test_perf_sanity.py::test_e2e[aggr_upload-ctx_only-{config_base_name}] TIMEOUT (120)
+```
+
+- If the test line already exists but is **commented out** (prefixed with `# `), remove the `# ` prefix.
+- If the test line does not exist, add it to the `tests` list.
+- Count the total number of **active (uncommented) tests** in the file — you will need this count for Step 5.
+
+### Step 5: Update `jenkins/L0_Test.groovy`
+
+Open `jenkins/L0_Test.groovy` and search for the `multiNodesSBSAConfigs` section inside `launchTestJobs()`. Disaggregated perf sanity stages are added via `buildStageConfigs()`:
+
+```groovy
+def buildStageConfigs(stageName, platform, testlist, testCount, gpuCount, nodeCount, runWithSbatch=false)
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `stageName` | CI stage name prefix (see naming convention below) |
+| `platform` | Hardware platform, e.g., `"auto:gb200-flex"` |
+| `testlist` | Test-db filename **without** `.yml`, e.g., `"l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4"` |
+| `testCount` | Number of **active (uncommented)** tests in the test-db file. Each disagg test gets its own stage, so `testCount` must equal the number of active tests. |
+| `gpuCount` | Total GPUs from Step 2 (= `total_nodes * gpus_per_node`) |
+| `nodeCount` | Total nodes from Step 2 |
+
+**Stage naming convention:**
+
+```
+GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
+```
+
+**If a `buildStageConfigs` entry already exists** for the test-db file: update `testCount` to match the new total number of active tests.
+
+**If no entry exists** for the test-db file: add a new `buildStageConfigs` block. Insert it in the correct section sorted by node count (2 Nodes, 3 Nodes, 4 Nodes, etc.).
+
+### Step 6: Check Waives
+
+Search `tests/integration/test_lists/waives.txt` for the exact test case string. If the test is listed there with a `SKIP` directive, remove that line (otherwise the test will be skipped even if present in the test-db).
+
+### Worked Example
+
+Adding back `qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX` as a gen_only test:
+
+1. Config file: `tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml`
+2. From config: `gpus_per_node=4`, `ctx_tp=1`, `gen_tp=4`, `num_ctx=1`, `num_gen=1`
+3. Nodes per ctx = `ceil(1/4)=1`, nodes per gen = `ceil(4/4)=1`, total nodes = 2, total GPUs = 8
+4. Test-db file: `l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml`
+5. Uncomment the line: `- perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX] TIMEOUT (120)`
+6. Count active tests in that file (now 4)
+7. In `L0_Test.groovy`, find the existing `buildStageConfigs` for `l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4`, update `testCount` from 3 to 4
+8. Check `waives.txt` — no matching entry, done
````
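The Step 2 arithmetic added by this commit can be sketched directly in Python (hypothetical helper mirroring the documented formulas, not repo code):

```python
import math

def disagg_resources(gpus_per_node, ctx_tp, gen_tp, num_ctx=1, num_gen=1):
    """Derive (total_nodes, total_gpus) for a disaggregated perf sanity
    stage, following the formulas documented in Step 2 above."""
    nodes_per_ctx = math.ceil(ctx_tp / gpus_per_node)
    nodes_per_gen = math.ceil(gen_tp / gpus_per_node)
    total_nodes = nodes_per_ctx * num_ctx + nodes_per_gen * num_gen
    total_gpus = total_nodes * gpus_per_node
    return total_nodes, total_gpus

# Worked example: ctx_tp=1, gen_tp=4, gpus_per_node=4 -> 2 nodes, 8 GPUs
print(disagg_resources(4, 1, 4))  # (2, 8)

# The new 3-node stage in this commit: ctx_tp=1, gen_tp=8 -> 3 nodes, 12 GPUs,
# matching its gpuCount=12 / nodeCount=3 arguments in L0_Test.groovy.
print(disagg_resources(4, 1, 8))  # (3, 12)
```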
