Revert "Fix e2e-tests dropping highest concurrency benchmark configs" #1021
Ankur-singh wants to merge 1 commit into main from
Conversation
```
    "dsr1-fp8-h200-trt",
    "gptoss-fp8-b200-sglang",
]
```
🔴 This revert deletes TestMarkEvalEntries—the only tests that directly exercised mark_eval_entries()—even though the function is still a live production function called in main(), leaving it with zero unit test coverage. If this revert is intentional, TestMarkEvalEntries should be preserved (it does not depend on the reverted eval-only mechanism) and new tests should be added to verify the restored run-eval-based SINGLE filter behavior.
Extended reasoning...
What the bug is
The revert deletes two test classes: TestE2EConfigSplitting and TestMarkEvalEntries. The deletion of TestE2EConfigSplitting is arguably justified: those tests used a _split_e2e_configs helper that replicated the eval-only-based SINGLE filter (not x.get('eval-only', False)), which is exactly the behavior being reverted. Keeping them would test a code path that no longer exists. However, the deletion of TestMarkEvalEntries is not justified and creates a real coverage gap.
The specific coverage gap
mark_eval_entries() still exists verbatim and is still called in main(). Its job is to scan all single-node entries, group them by (model, runner, framework, precision, isl, osl, spec-decoding, dp-attn), and mark the highest and median concurrency entries with run-eval=True. These marks drive the EVALS split and, via the regression flagged in bug_001, also cause entries to be dropped from the SINGLE split. Yet after this revert the function is neither imported nor exercised in the test file.
Why existing tests do not cover it
All test classes test generate_full_sweep and generate_runner_model_sweep_config directly. None of those functions call mark_eval_entries; it is only called from main(), which has no dedicated test coverage. The refutation argument that mark_eval_entries has behavioral coverage through integration-style tests is incorrect: no executed test path reaches mark_eval_entries after this deletion.
Step-by-step proof
- `test_generate_sweep_configs.py` imports: `generate_full_sweep`, `generate_runner_model_sweep_config`, `apply_node_type_defaults`, `expand_config_keys`. `mark_eval_entries` is NOT imported (removed by this diff).
- All test classes invoke one of those four imported functions.
- `generate_full_sweep` sets `run-eval=False` on every entry and never calls `mark_eval_entries`.
- `mark_eval_entries` is only reachable through `main()`, which is not tested.
- Therefore, no test in the suite exercises `mark_eval_entries`, its grouping logic, its median-concurrency selection, its highest-concurrency selection, or its multinode-exclusion guard. Coverage = 0.
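To make the untested behavior concrete, here is a minimal sketch of the grouping and marking logic as described in this review; the function body, the `conc`/`multinode` field names, and the multinode guard are assumptions reconstructed from the text above, not the actual source.

```python
from collections import defaultdict

# Grouping keys as described in the review; assumed, not copied from the source.
GROUP_KEYS = ("model", "runner", "framework", "precision",
              "isl", "osl", "spec-decoding", "dp-attn")

def mark_eval_entries(entries):
    """Mark the highest- and median-concurrency entry of each group with run-eval=True."""
    groups = defaultdict(list)
    for entry in entries:
        if entry.get("multinode"):  # assumed multinode-exclusion guard
            continue
        groups[tuple(entry.get(k) for k in GROUP_KEYS)].append(entry)
    for members in groups.values():
        members.sort(key=lambda e: e["conc"])
        members[len(members) // 2]["run-eval"] = True  # median concurrency
        members[-1]["run-eval"] = True                 # highest concurrency
    return entries
```

A function of this shape is trivially unit-testable with synthesized entry dicts, which is exactly what the deleted tests reportedly did.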
Why the deleted tests were valid independent of the revert
The four TestMarkEvalEntries test cases called mark_eval_entries directly with synthesized entry dicts. None of them referenced eval-only, _split_e2e_configs, or any other mechanism introduced by PR #1020. They would remain correct and valuable after the revert with no modification needed.
Impact and fix
Without these tests, the bug introduced by the reverted SINGLE filter (eval-marked entries silently dropped from benchmarks in default mode) cannot be caught by CI. The fix is to restore TestMarkEvalEntries as-is, re-add the mark_eval_entries import, and optionally add a new test class modeling the current run-eval-based SINGLE filter.
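As a sketch of what restoring that coverage could look like: the stand-in function below implements only the documented median/highest marking (it is not the real implementation), and the test class name and entry fields follow the description above.

```python
import unittest

def mark_eval_entries(entries):
    # Stand-in for the real function under test, implementing only the
    # documented behavior: mark the median and highest concurrency entries.
    ordered = sorted(entries, key=lambda e: e["conc"])
    ordered[len(ordered) // 2]["run-eval"] = True
    ordered[-1]["run-eval"] = True
    return entries

class TestMarkEvalEntries(unittest.TestCase):
    def test_marks_highest_and_median_concurrency(self):
        entries = [{"conc": c, "run-eval": False} for c in (32, 128, 512)]
        mark_eval_entries(entries)
        marked = sorted(e["conc"] for e in entries if e["run-eval"])
        self.assertEqual(marked, [128, 512])
```

In the real test file the import would come from `generate_sweep_configs` rather than a local stand-in.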
```diff
  CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
    ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
- SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('eval-only', False)]))")
+ SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
```
🔴 This revert re-introduces a regression where the SINGLE filter uses not x.get('run-eval', False), which silently excludes the highest and median concurrency 8k1k entries from benchmark jobs in default mode. In default mode, mark_eval_entries() still runs and marks those entries with run-eval=True—meaning they are only dispatched as evals, and their benchmark data is never collected. To fix this, restore the eval-only flag distinction from PR #1020: use not x.get('eval-only', False) in the SINGLE filter so that entries marked for eval in default mode are still benchmarked.
Extended reasoning...
What the bug is and how it manifests
After this revert, the SINGLE job filter on line 57 of .github/workflows/e2e-tests.yml reads:

```python
SINGLE = [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]
```

Any entry with `run-eval=True` is excluded from the SINGLE (benchmark) job and routed exclusively to the EVALS job. In default mode (no `--no-evals` or `--evals-only` flags), generate_sweep_configs.py calls `mark_eval_entries()` unconditionally (the guard is `if not args.no_evals`), which sets `run-eval=True` on the highest-concurrency and median-concurrency 8k1k single-node entries. Those entries therefore never reach the benchmark job in default mode; their throughput data is silently dropped.
Specific code path that triggers it
- `main()` in `generate_sweep_configs.py` calls `mark_eval_entries(matrix_values)` whenever `--no-evals` is absent.
- `mark_eval_entries` marks the highest and median concurrency entries for each 8k1k group with `run-eval=True`.
- Back in the workflow, the SINGLE filter `not x.get('run-eval', False)` rejects those entries entirely.
- EVALS receives them (via `x.get('run-eval', False)`), but EVALS is invoked with `run-eval: true` and `eval-only: true`, meaning only quality evaluations run, not throughput benchmarks.
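That split can be reproduced in a few lines with the same filter expressions the workflow uses (the matrix entries here are illustrative, not real configs):

```python
import json

# Illustrative matrix entries; field names follow the workflow filters quoted above.
config_json = json.dumps([
    {"conc": 32,  "run-eval": False},
    {"conc": 128, "run-eval": True},   # median concurrency, marked in default mode
    {"conc": 512, "run-eval": True},   # highest concurrency, marked in default mode
    {"conc": 8,   "prefill": "h200"},  # disagg prefill entry, never in SINGLE
])

d = json.loads(config_json)
single = [x for x in d if "prefill" not in x and not x.get("run-eval", False)]
evals = [x for x in d if x.get("run-eval", False)]

print([x["conc"] for x in single])  # [32]
print([x["conc"] for x in evals])   # [128, 512]
```

Only the unmarked conc=32 entry reaches the benchmark job; the two marked entries are dispatched solely as evals.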
Why existing code does not prevent it
Before PR #1020, there was no way to distinguish default mode (should also benchmark) from evals-only mode (benchmark skipped). PR #1020 solved this by introducing an eval-only field: mark_eval_entries sets eval-only=True only in the --evals-only path, and the SINGLE filter was updated to not x.get('eval-only', False). This revert removes both, collapsing the two distinct cases so the SINGLE filter can no longer tell them apart.
Impact
In any default-mode CI run (the most common case), the highest and median concurrency benchmark configurations for all 8k1k workloads are silently omitted from throughput results. Performance regressions at high load will be invisible.
Step-by-step proof with a concrete example
Suppose a config has three 8k1k single-node entries: conc=32, conc=128, conc=512.
- `mark_eval_entries` computes conc_values = [32, 128, 512]. Median index = 1 → conc=128. Highest → conc=512. Both get `run-eval=True`.
- Output (default mode): {conc:32, run-eval:false}, {conc:128, run-eval:true}, {conc:512, run-eval:true}.
- SINGLE filter `not x.get('run-eval', False)` → only conc=32 passes. The benchmark job runs one configuration.
- EVALS filter → conc=128 and conc=512 go to evals only (eval-only:true in the workflow call).
- Result: throughput benchmark data for the two highest concurrency points is never collected.
How to fix it
Restore the changes from PR #1020: (a) re-add EVAL_ONLY to the Fields enum and SingleNodeMatrixEntry, (b) re-add the loop in main() that sets eval-only=True only in the --evals-only path, and (c) update the SINGLE filter to not x.get('eval-only', False).
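A compact sketch of the restored behavior under those assumptions; `mark_and_split` and its field names are illustrative stand-ins for the actual enum/dataclass and workflow changes, not the real code.

```python
def mark_and_split(entries, evals_only=False):
    """Mark eval entries, then split into SINGLE and EVALS lists.

    In default mode, run-eval=True entries still benchmark; only in
    --evals-only mode does eval-only=True pull them out of SINGLE.
    """
    ordered = sorted(entries, key=lambda e: e["conc"])
    for marked in (ordered[len(ordered) // 2], ordered[-1]):
        marked["run-eval"] = True
        if evals_only:
            marked["eval-only"] = True  # set only in the --evals-only path

    single = [x for x in entries
              if "prefill" not in x and not x.get("eval-only", False)]
    evals = [x for x in entries if x.get("run-eval", False)]
    return single, evals
```

With this shape, default mode benchmarks all three concurrency points while still routing the marked ones to evals, which is the distinction the revert collapses.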
@Ankur-singh Took another look and I believe the current fix should work. Thanks for catching it early! 🫡
Reverts #1020