Revert "Fix e2e-tests dropping highest concurrency benchmark configs" #1021
@@ -9,7 +9,6 @@
     generate_runner_model_sweep_config,
     apply_node_type_defaults,
     expand_config_keys,
-    mark_eval_entries,
 )
@@ -1580,166 +1579,6 @@
     assert result == [
         "dsr1-fp4-b200-sglang",
         "dsr1-fp8-mi300x-sglang",
         "dsr1-fp8-h200-trt",
         "gptoss-fp8-b200-sglang",
     ]

Check failure on line 1584 in utils/matrix_logic/test_generate_sweep_configs.py
Comment on lines 1582 to 1584
Contributor

🔴 This revert deletes TestMarkEvalEntries, the only tests that directly exercised mark_eval_entries(), even though the function is still a live production function called in main(), leaving it with zero unit test coverage. If this revert is intentional, TestMarkEvalEntries should be preserved (it does not depend on the reverted eval-only mechanism) and new tests should be added to verify the restored run-eval-based SINGLE filter behavior.

Extended reasoning...

What the bug is

The revert deletes two test classes: TestE2EConfigSplitting and TestMarkEvalEntries. The deletion of TestE2EConfigSplitting is arguably justified: those tests used a _split_e2e_configs helper that replicated the eval-only-based SINGLE filter (not x.get('eval-only', False)), which is exactly the behavior being reverted. Keeping them would test a code path that no longer exists. However, the deletion of TestMarkEvalEntries is not justified and creates a real coverage gap.

The specific coverage gap

mark_eval_entries() still exists verbatim and is still called in main(). Its job is to scan all single-node entries, group them by (model, runner, framework, precision, isl, osl, spec-decoding, dp-attn), and mark the highest and median concurrency entries with run-eval=True. These marks drive both the EVALS split and, via the regression in bug_001, inadvertently drop entries from the SINGLE split. Yet after this revert the function is neither imported nor exercised in the test file.

Why existing tests do not cover it

All test classes test generate_full_sweep and generate_runner_model_sweep_config directly. None of those functions call mark_eval_entries; it is only called from main(), which has no dedicated test coverage. The refutation argument that mark_eval_entries has behavioral coverage through integration-style tests is incorrect: no executed test path reaches mark_eval_entries after this deletion.

Why the deleted tests were valid independent of the revert

The four TestMarkEvalEntries test cases called mark_eval_entries directly with synthesized entry dicts. None of them referenced eval-only, _split_e2e_configs, or any other mechanism introduced by PR #1020. They would remain correct and valuable after the revert with no modification needed.

Impact and fix

Without these tests, the bug introduced by the reverted SINGLE filter (eval-marked entries silently dropped from benchmarks in default mode) cannot be caught by CI. The fix is to restore TestMarkEvalEntries as-is, re-add the mark_eval_entries import, and optionally add a new test class modeling the current run-eval-based SINGLE filter.
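The grouping-and-marking behavior described in this comment can be sketched as follows. This is a hypothetical reimplementation inferred from the comment's description, not the actual code in generate_sweep_configs.py; in particular, the median choice (index len // 2 after sorting by conc) is an assumption, chosen to match the deleted test's expectation that for conc values [32, 128, 512] the entries at index 1 and the highest concurrency are marked.

```python
from collections import defaultdict


def mark_eval_entries_sketch(entries):
    """Sketch of the described behavior: for each group of single-node
    8k1k entries, mark the highest- and median-concurrency entries with
    run-eval=True. Group key follows the comment's description."""
    groups = defaultdict(list)
    for e in entries:
        # Only single-node entries (top-level 'tp') at isl=8192/osl=1024 qualify.
        if 'tp' in e and e.get('isl') == 8192 and e.get('osl') == 1024:
            key = (e['model'], e['runner'], e['framework'], e['precision'],
                   e['isl'], e['osl'], e['spec-decoding'], e['dp-attn'])
            groups[key].append(e)
    for group in groups.values():
        group.sort(key=lambda e: e['conc'])
        group[-1]['run-eval'] = True                # highest concurrency
        group[len(group) // 2]['run-eval'] = True   # median concurrency
    return entries
```

With three 8k1k entries at conc 32, 128, and 512, this marks the 128 and 512 entries and leaves conc=32 unmarked, matching the deleted test's expectations.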
# =============================================================================
# Tests for e2e-tests.yml workflow config splitting
# =============================================================================


def _split_e2e_configs(data):
    """Replicate the splitting logic from e2e-tests.yml get-jobs step.

    Returns (SINGLE, MULTI, EVALS) lists matching the workflow filters.
    """
    single = [x for x in data if 'prefill' not in x and not x.get('eval-only', False)]
    multi = [x for x in data if 'prefill' in x]
    evals = [x for x in data if 'prefill' not in x and x.get('run-eval', False)]
    return single, multi, evals


class TestE2EConfigSplitting:
    """Verify the e2e-tests.yml config splitting logic handles all flag
    combinations correctly: default, --no-evals, and --evals-only."""

    @pytest.fixture
    def mixed_entries(self):
        """Simulates default mode output: single-node (some eval-marked),
        plus multi-node entries."""
        return [
            {'exp-name': 'a', 'isl': 1024, 'osl': 1024, 'conc': 64, 'tp': 2, 'run-eval': False},
            {'exp-name': 'b', 'isl': 1024, 'osl': 1024, 'conc': 128, 'tp': 2, 'run-eval': False},
            {'exp-name': 'c', 'isl': 8192, 'osl': 1024, 'conc': 256, 'tp': 2, 'run-eval': True},
            {'exp-name': 'd', 'isl': 8192, 'osl': 1024, 'conc': 512, 'tp': 2, 'run-eval': True},
            {'exp-name': 'e', 'conc': 64, 'prefill': {'tp': 2, 'num-worker': 1}},
        ]

    def test_default_mode_benchmarks_all_single_node(self, mixed_entries):
        """Default: all single-node entries (including eval-marked) are benchmarked."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(single) == 4
        assert all('prefill' not in x for x in single)

    def test_default_mode_evals_only_eval_marked(self, mixed_entries):
        """Default: only eval-marked entries go to EVALS."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(evals) == 2
        assert all(x['run-eval'] for x in evals)

    def test_default_mode_eval_marked_in_both(self, mixed_entries):
        """Default: eval-marked entries appear in BOTH single and evals."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        eval_names = {x['exp-name'] for x in evals}
        single_names = {x['exp-name'] for x in single}
        assert eval_names.issubset(single_names)

    def test_no_evals_all_benchmarked(self):
        """--no-evals: mark_eval_entries is skipped, no run-eval=True entries."""
        data = [
            {'exp-name': 'a', 'conc': 64, 'tp': 2, 'run-eval': False},
            {'exp-name': 'b', 'conc': 128, 'tp': 2, 'run-eval': False},
            {'exp-name': 'c', 'conc': 256, 'tp': 2, 'run-eval': False},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 3
        assert len(evals) == 0

    def test_evals_only_no_benchmarks(self):
        """--evals-only: entries have eval-only flag, SINGLE must be empty."""
        data = [
            {'exp-name': 'c', 'conc': 256, 'tp': 2, 'run-eval': True, 'eval-only': True},
            {'exp-name': 'd', 'conc': 512, 'tp': 2, 'run-eval': True, 'eval-only': True},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 0, "evals-only should not trigger benchmarks"
        assert len(evals) == 2

    def test_empty_config(self):
        """Empty input produces empty outputs."""
        single, multi, evals = _split_e2e_configs([])
        assert single == [] and multi == [] and evals == []

    def test_all_eval_marked_without_eval_only_flag_still_benchmarked(self):
        """Default mode where mark_eval_entries marks every entry (e.g. only
        8k1k with single conc). Without eval-only flag, SINGLE must still
        include them for benchmarking."""
        data = [
            {'exp-name': 'a', 'conc': 64, 'tp': 2, 'run-eval': True},
            {'exp-name': 'b', 'conc': 64, 'tp': 4, 'run-eval': True},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 2, "all-eval-marked entries must still be benchmarked in default mode"
        assert len(evals) == 2

    def test_prefill_entries_never_in_single_or_evals(self, mixed_entries):
        """Prefill (multi-node) entries only appear in MULTI."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(multi) == 1
        assert all('prefill' in x for x in multi)
        assert all('prefill' not in x for x in single)
        assert all('prefill' not in x for x in evals)


class TestMarkEvalEntries:
    """Verify mark_eval_entries only marks highest/median concurrency at 8k1k."""

    def test_marks_highest_and_median_conc(self):
        """Should mark highest and median concurrency for 8k1k entries."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 32,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 128,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        # conc values: [32, 128, 512]. median=128 (index 1), highest=512
        assert result[0]['run-eval'] is False  # conc=32
        assert result[1]['run-eval'] is True   # conc=128 (median)
        assert result[2]['run-eval'] is True   # conc=512 (highest)

    def test_non_8k1k_never_marked(self):
        """Entries with non-8k1k seq lengths should never be eval-marked."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 1024, 'osl': 1024, 'tp': 2, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        assert result[0]['run-eval'] is False

    def test_multinode_entries_never_marked(self):
        """Entries without top-level tp (multi-node) should never be eval-marked."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False,
             'prefill': {'tp': 2, 'num-worker': 1}},
        ]
        result = mark_eval_entries(entries)
        assert result[0]['run-eval'] is False

    def test_never_marks_all_entries(self):
        """mark_eval_entries should never mark every single-node entry,
        ensuring the e2e splitting logic can distinguish default from evals-only."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': c,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False}
            for c in [32, 64, 128, 256, 512]
        ] + [
            # Non-8k1k entry that should never be marked
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 1024, 'osl': 1024, 'tp': 2, 'conc': 64,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        non_prefill = [x for x in result if 'prefill' not in x]
        assert not all(x['run-eval'] for x in non_prefill), \
            "mark_eval_entries must not mark all entries, which would break e2e splitting"
🔴 This revert re-introduces a regression where the SINGLE filter uses not x.get('run-eval', False), which silently excludes the highest and median concurrency 8k1k entries from benchmark jobs in default mode. In default mode, mark_eval_entries() still runs and marks those entries with run-eval=True, meaning they are only dispatched as evals and their benchmark data is never collected. To fix this, restore the eval-only flag distinction from PR #1020: use not x.get('eval-only', False) in the SINGLE filter so that entries marked for eval in default mode are still benchmarked.

Extended reasoning...

What the bug is and how it manifests

After this revert, the SINGLE job filter on line 57 of .github/workflows/e2e-tests.yml excludes any entry with run-eval=True from the SINGLE (benchmark) job and routes it exclusively to the EVALS job. In default mode (no --no-evals or --evals-only flags), generate_sweep_configs.py calls mark_eval_entries() unconditionally (the guard is `if not args.no_evals`), which sets run-eval=True on the highest-concurrency and median-concurrency 8k1k single-node entries. Those entries therefore never reach the benchmark job in default mode; their throughput data is silently dropped.

Specific code path that triggers it

1. main() in generate_sweep_configs.py calls mark_eval_entries(matrix_values) whenever --no-evals is absent.
2. mark_eval_entries marks the highest and median concurrency entries for each 8k1k group with run-eval=True.
3. The SINGLE filter not x.get('run-eval', False) rejects those entries entirely.
4. The EVALS filter accepts them (x.get('run-eval', False)), but EVALS is invoked with run-eval: true and eval-only: true, meaning only quality evaluations run, not throughput benchmarks.

Why existing code does not prevent it

Before PR #1020, there was no way to distinguish default mode (should also benchmark) from evals-only mode (benchmark skipped). PR #1020 solved this by introducing an eval-only field: mark_eval_entries sets eval-only=True only in the --evals-only path, and the SINGLE filter was updated to not x.get('eval-only', False). This revert removes both, collapsing the two distinct cases so the SINGLE filter can no longer tell them apart.

Impact

In any default-mode CI run (the most common case), the highest and median concurrency benchmark configurations for all 8k1k workloads are silently omitted from throughput results. Performance regressions at high load will be invisible.

Step-by-step proof with a concrete example

Suppose a config has three 8k1k single-node entries: conc=32, conc=128, conc=512. mark_eval_entries computes conc_values = [32, 128, 512]; the median is index 1 (conc=128) and the highest is conc=512, so both get run-eval=True. The SINGLE filter not x.get('run-eval', False) then passes only conc=32, and the benchmark job runs one configuration instead of three.

How to fix it

Restore the changes from PR #1020: (a) re-add EVAL_ONLY to the Fields enum and SingleNodeMatrixEntry, (b) re-add the loop in main() that sets eval-only=True only in the --evals-only path, and (c) update the SINGLE filter to not x.get('eval-only', False).
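To make the filter difference concrete, here is a minimal sketch contrasting the reverted run-eval-based SINGLE filter with the eval-only-based filter from PR #1020. The entry data is hypothetical; both filter expressions are the ones quoted in this review.

```python
# Three single-node entries after mark_eval_entries in default mode:
# the median (conc=128) and highest (conc=512) are eval-marked.
data = [
    {'exp-name': 'a', 'conc': 32,  'run-eval': False},
    {'exp-name': 'b', 'conc': 128, 'run-eval': True},
    {'exp-name': 'c', 'conc': 512, 'run-eval': True},
]

# Reverted (buggy) SINGLE filter: eval-marked entries vanish from benchmarks.
single_reverted = [x for x in data
                   if 'prefill' not in x and not x.get('run-eval', False)]
assert [x['exp-name'] for x in single_reverted] == ['a']

# PR #1020 filter: eval-only is set only in --evals-only mode, so in
# default mode (no eval-only key) all entries are still benchmarked.
single_fixed = [x for x in data
                if 'prefill' not in x and not x.get('eval-only', False)]
assert [x['exp-name'] for x in single_fixed] == ['a', 'b', 'c']
```

The assertions pass: the reverted filter benchmarks one of three configurations, while the eval-only-based filter benchmarks all three and still lets the EVALS filter pick up the run-eval=True entries.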