
Revert "Fix e2e-tests dropping highest concurrency benchmark configs"#1021

Closed
Ankur-singh wants to merge 1 commit into main from
revert-1020-nv/fix-e2e-run-test-config-diff

Conversation

@Ankur-singh
Collaborator

Reverts #1020

Comment on lines 1582 to 1584
"dsr1-fp8-h200-trt",
"gptoss-fp8-b200-sglang",
]
Contributor


🔴 This revert deletes TestMarkEvalEntries—the only tests that directly exercised mark_eval_entries()—even though the function is still a live production function called in main(), leaving it with zero unit test coverage. If this revert is intentional, TestMarkEvalEntries should be preserved (it does not depend on the reverted eval-only mechanism) and new tests should be added to verify the restored run-eval-based SINGLE filter behavior.

Extended reasoning...

What the bug is

The revert deletes two test classes: TestE2EConfigSplitting and TestMarkEvalEntries. The deletion of TestE2EConfigSplitting is arguably justified: those tests used a _split_e2e_configs helper that replicated the eval-only-based SINGLE filter (not x.get('eval-only', False)), which is exactly the behavior being reverted. Keeping them would test a code path that no longer exists. However, the deletion of TestMarkEvalEntries is not justified and creates a real coverage gap.

The specific coverage gap

mark_eval_entries() still exists verbatim and is still called in main(). Its job is to scan all single-node entries, group them by (model, runner, framework, precision, isl, osl, spec-decoding, dp-attn), and mark the highest and median concurrency entries with run-eval=True. These marks drive both the EVALS split and, via the regression in bug_001, inadvertently drop entries from the SINGLE split. Yet after this revert the function is neither imported nor exercised in the test file.
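A hedged sketch of what `mark_eval_entries()` does, reconstructed from the description above; `GROUP_KEYS`, the `conc` field name, and the `median_low` tie-breaking are assumptions, not the actual source:

```python
from statistics import median_low

# Assumed grouping key, per the description above (not the actual source).
GROUP_KEYS = ("model", "runner", "framework", "precision",
              "isl", "osl", "spec-decoding", "dp-attn")

def mark_eval_entries(entries):
    """Mark the highest- and median-concurrency entry of each single-node
    group with run-eval=True; skip multinode (prefill) entries."""
    groups = {}
    for entry in entries:
        if "prefill" in entry:  # multinode-exclusion guard
            continue
        key = tuple(entry.get(k) for k in GROUP_KEYS)
        groups.setdefault(key, []).append(entry)
    for group in groups.values():
        conc_values = sorted(e["conc"] for e in group)
        targets = {max(conc_values), median_low(conc_values)}
        for e in group:
            e["run-eval"] = e["conc"] in targets
    return entries
```

After this revert, nothing in the test file calls anything resembling this function, so its grouping, median selection, highest selection, and prefill guard are all unverified.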

Why existing tests do not cover it

All test classes test generate_full_sweep and generate_runner_model_sweep_config directly. None of those functions call mark_eval_entries; it is only called from main(), which has no dedicated test coverage. The refutation argument that mark_eval_entries has behavioral coverage through integration-style tests is incorrect: no executed test path reaches mark_eval_entries after this deletion.

Step-by-step proof

  1. test_generate_sweep_configs.py imports: generate_full_sweep, generate_runner_model_sweep_config, apply_node_type_defaults, expand_config_keys. mark_eval_entries is NOT imported (removed by this diff).
  2. All test classes invoke one of those four imported functions.
  3. generate_full_sweep sets run-eval=False on every entry and never calls mark_eval_entries.
  4. mark_eval_entries is only reachable through main(), which is not tested.
  5. Therefore, no test in the suite exercises mark_eval_entries, its grouping logic, its median-conc selection, its highest-conc selection, or its multinode-exclusion guard. Coverage = 0.

Why the deleted tests were valid independent of the revert

The four TestMarkEvalEntries test cases called mark_eval_entries directly with synthesized entry dicts. None of them referenced eval-only, _split_e2e_configs, or any other mechanism introduced by PR #1020. They would remain correct and valuable after the revert with no modification needed.

Impact and fix

Without these tests, the bug introduced by the reverted SINGLE filter (eval-marked entries silently dropped from benchmarks in default mode) cannot be caught by CI. The fix is to restore TestMarkEvalEntries as-is, re-add the mark_eval_entries import, and optionally add a new test class modeling the current run-eval-based SINGLE filter.

CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
- SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('eval-only', False)]))")
+ SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
Contributor


🔴 This revert re-introduces a regression where the SINGLE filter uses not x.get('run-eval', False), which silently excludes the highest and median concurrency 8k1k entries from benchmark jobs in default mode. In default mode, mark_eval_entries() still runs and marks those entries with run-eval=True—meaning they are only dispatched as evals, and their benchmark data is never collected. To fix this, restore the eval-only flag distinction from PR #1020: use not x.get('eval-only', False) in the SINGLE filter so that entries marked for eval in default mode are still benchmarked.

Extended reasoning...

What the bug is and how it manifests

After this revert, the SINGLE job filter on line 57 of .github/workflows/e2e-tests.yml reads:

SINGLE = [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

Any entry with run-eval=True is excluded from the SINGLE (benchmark) job and routed exclusively to the EVALS job. In default mode (no --no-evals or --evals-only flags), generate_sweep_configs.py calls mark_eval_entries() unconditionally (the guard is if not args.no_evals), which sets run-eval=True on the highest-concurrency and median-concurrency 8k1k single-node entries. Those entries therefore never reach the benchmark job in default mode—their throughput data is silently dropped.

Specific code path that triggers it

  1. main() in generate_sweep_configs.py calls mark_eval_entries(matrix_values) whenever --no-evals is absent.
  2. mark_eval_entries marks the highest and median concurrency entries for each 8k1k group with run-eval=True.
  3. Back in the workflow, the SINGLE filter not x.get('run-eval', False) rejects those entries entirely.
  4. EVALS receives them (via x.get('run-eval', False)), but EVALS is invoked with run-eval: true and eval-only: true—meaning only quality evaluations run, not throughput benchmarks.

Why existing code does not prevent it

Before PR #1020, there was no way to distinguish default mode (should also benchmark) from evals-only mode (benchmark skipped). PR #1020 solved this by introducing an eval-only field: mark_eval_entries sets eval-only=True only in the --evals-only path, and the SINGLE filter was updated to not x.get('eval-only', False). This revert removes both, collapsing the two distinct cases so the SINGLE filter can no longer tell them apart.

Impact

In any default-mode CI run (the most common case), the highest and median concurrency benchmark configurations for all 8k1k workloads are silently omitted from throughput results. Performance regressions at high load will be invisible.

Step-by-step proof with a concrete example

Suppose a config has three 8k1k single-node entries: conc=32, conc=128, conc=512.

  1. mark_eval_entries computes conc_values = [32, 128, 512]. Median index=1 → conc=128. Highest=512. Both get run-eval=True.
  2. Output (default mode): {conc:32, run-eval:false}, {conc:128, run-eval:true}, {conc:512, run-eval:true}.
  3. SINGLE filter not x.get('run-eval', False) → only conc=32 passes. Benchmark job runs one configuration.
  4. EVALS filter → conc=128 and conc=512 go to evals only (eval-only:true in the workflow call).
  5. Result: throughput benchmark data for the two highest concurrency points is never collected.
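The five steps above can be run end to end as a small self-contained sketch; the grouping is collapsed to a single 8k1k group, and the field names (`conc`, `run-eval`) follow the discussion rather than the actual source:

```python
from statistics import median_low

# One 8k1k group with three single-node entries.
entries = [{"conc": c, "run-eval": False} for c in (32, 128, 512)]

# Steps 1-2: default mode marks the median and highest concurrency entries.
conc_values = sorted(e["conc"] for e in entries)
targets = {median_low(conc_values), max(conc_values)}
for e in entries:
    e["run-eval"] = e["conc"] in targets

# Step 3: the reverted SINGLE filter rejects every marked entry.
single = [x for x in entries
          if "prefill" not in x and not x.get("run-eval", False)]
# Step 4: EVALS receives them, but runs quality evals only.
evals = [x for x in entries if x.get("run-eval", False)]

print([e["conc"] for e in single])  # [32]
print([e["conc"] for e in evals])   # [128, 512]
```

The benchmark job ends up with a single low-concurrency configuration, matching step 5.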

How to fix it

Restore the changes from PR #1020: (a) re-add EVAL_ONLY to the Fields enum and SingleNodeMatrixEntry, (b) re-add the loop in main() that sets eval-only=True only in the --evals-only path, and (c) update the SINGLE filter to not x.get('eval-only', False).
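A minimal sketch of the restored behavior, assuming the flag semantics described for PR #1020 (`eval-only` set only on the `--evals-only` path); the function and variable names here are illustrative, not the actual code:

```python
def split_single_and_evals(entries, evals_only):
    # eval-only is set only when the generator runs with --evals-only;
    # in default mode every entry keeps eval-only=False.
    for e in entries:
        e["eval-only"] = bool(evals_only and e.get("run-eval", False))
    # The restored SINGLE filter keys on eval-only, not run-eval.
    single = [x for x in entries
              if "prefill" not in x and not x.get("eval-only", False)]
    evals = [x for x in entries if x.get("run-eval", False)]
    return single, evals

entries = [{"conc": 32, "run-eval": False},
           {"conc": 128, "run-eval": True},
           {"conc": 512, "run-eval": True}]

# Default mode: all three are benchmarked; two are also dispatched as evals.
single, evals = split_single_and_evals(entries, evals_only=False)
assert [e["conc"] for e in single] == [32, 128, 512]

# Evals-only mode: marked entries are excluded from the benchmark job.
single, _ = split_single_and_evals(entries, evals_only=True)
assert [e["conc"] for e in single] == [32]
```

With this split, default-mode runs keep their high-concurrency throughput data while still dispatching the marked entries to the evals job.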

@Oseltamivir
Collaborator

@Ankur-singh Took another look and I believe the current fix should work.

Thanks for catching it early! 🫡
