2 changes: 1 addition & 1 deletion .github/workflows/e2e-tests.yml
@@ -54,7 +54,7 @@
pip install pydantic
CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
-SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('eval-only', False)]))")
+SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")

Check failure on line 57 in .github/workflows/e2e-tests.yml

Claude / Claude Code Review

SINGLE filter incorrectly excludes eval-marked entries from benchmarks in default mode

🔴 This revert re-introduces a regression where the SINGLE filter uses `not x.get('run-eval', False)`, which silently excludes the highest and median concurrency 8k1k entries from benchmark jobs in default mode. In default mode, `mark_eval_entries()` still runs and marks those entries with `run-eval=True`—meaning they are only dispatched as evals, and their benchmark data is never collected. To fix this, restore the `eval-only` flag distinction from PR #1020: use `not x.get('eval-only', False)` in the SINGLE filter so that entries marked for eval in default mode are still benchmarked.

Extended reasoning...

What the bug is and how it manifests

After this revert, the SINGLE job filter on line 57 of .github/workflows/e2e-tests.yml reads:

SINGLE = [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

Any entry with run-eval=True is excluded from the SINGLE (benchmark) job and routed exclusively to the EVALS job. In default mode (no --no-evals or --evals-only flags), generate_sweep_configs.py calls mark_eval_entries() unconditionally (the guard is if not args.no_evals), which sets run-eval=True on the highest-concurrency and median-concurrency 8k1k single-node entries. Those entries therefore never reach the benchmark job in default mode—their throughput data is silently dropped.

Specific code path that triggers it

  1. main() in generate_sweep_configs.py calls mark_eval_entries(matrix_values) whenever --no-evals is absent.
  2. mark_eval_entries marks the highest and median concurrency entries for each 8k1k group with run-eval=True.
  3. Back in the workflow, the SINGLE filter not x.get('run-eval', False) rejects those entries entirely.
  4. EVALS receives them (via x.get('run-eval', False)), but EVALS is invoked with run-eval: true and eval-only: true—meaning only quality evaluations run, not throughput benchmarks.

Why existing code does not prevent it

Before PR #1020, there was no way to distinguish default mode (should also benchmark) from evals-only mode (benchmark skipped). PR #1020 solved this by introducing an eval-only field: mark_eval_entries sets eval-only=True only in the --evals-only path, and the SINGLE filter was updated to not x.get('eval-only', False). This revert removes both, collapsing the two distinct cases so the SINGLE filter can no longer tell them apart.

Impact

In any default-mode CI run (the most common case), the highest and median concurrency benchmark configurations for all 8k1k workloads are silently omitted from throughput results. Performance regressions at high load will be invisible.

Step-by-step proof with a concrete example

Suppose a config has three 8k1k single-node entries: conc=32, conc=128, conc=512.

  1. mark_eval_entries computes conc_values = [32, 128, 512]. Median index=1 → conc=128. Highest=512. Both get run-eval=True.
  2. Output (default mode): {conc:32, run-eval:false}, {conc:128, run-eval:true}, {conc:512, run-eval:true}.
  3. SINGLE filter not x.get('run-eval', False) → only conc=32 passes. Benchmark job runs one configuration.
  4. EVALS filter → conc=128 and conc=512 go to evals only (eval-only:true in the workflow call).
  5. Result: throughput benchmark data for the two highest concurrency points is never collected.
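The five steps above can be replayed with a short, self-contained simulation (a hedged sketch: the entry dicts are synthesized here, and the marking from step 1 is applied by hand rather than by calling the real `mark_eval_entries`):

```python
# Step 2: default-mode output after mark_eval_entries (flags set per the proof above)
entries = [
    {'exp-name': 'a', 'conc': 32,  'run-eval': False},
    {'exp-name': 'b', 'conc': 128, 'run-eval': True},   # median concurrency
    {'exp-name': 'c', 'conc': 512, 'run-eval': True},   # highest concurrency
]

# Steps 3-4: the reverted workflow filters
single = [x for x in entries if 'prefill' not in x and not x.get('run-eval', False)]
evals = [x for x in entries if 'prefill' not in x and x.get('run-eval', False)]

# Step 5: only conc=32 is benchmarked; the two highest-load points are dropped
assert [x['conc'] for x in single] == [32]
assert [x['conc'] for x in evals] == [128, 512]
```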

How to fix it

Restore the changes from PR #1020: (a) re-add EVAL_ONLY to the Fields enum and SingleNodeMatrixEntry, (b) re-add the loop in main() that sets eval-only=True only in the --evals-only path, and (c) update the SINGLE filter to not x.get('eval-only', False).
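With the `eval-only` distinction restored, the two modes become distinguishable again. A minimal sketch (synthetic entries; field names as described in PR #1020):

```python
def split_single(entries):
    # Restored SINGLE filter from PR #1020: exclude only entries explicitly
    # flagged eval-only, which is set solely in the --evals-only path
    return [x for x in entries
            if 'prefill' not in x and not x.get('eval-only', False)]

# Default mode: eval-marked entries carry run-eval=True but no eval-only flag,
# so they are still benchmarked
default_mode = [{'conc': 128, 'run-eval': True}, {'conc': 512, 'run-eval': True}]
assert len(split_single(default_mode)) == 2

# --evals-only mode: eval-only=True is set, so SINGLE is empty
evals_only = [{'conc': 128, 'run-eval': True, 'eval-only': True}]
assert split_single(evals_only) == []
```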

MULTI=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' in x]))")
EVALS=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('run-eval', False)]))")
echo "single-node-config=$SINGLE" >> $GITHUB_OUTPUT
2 changes: 0 additions & 2 deletions utils/matrix_logic/generate_sweep_configs.py
@@ -934,8 +934,6 @@ def main():
        matrix_values = mark_eval_entries(matrix_values)
    if args.evals_only:
        matrix_values = [e for e in matrix_values if e.get(Fields.RUN_EVAL.value, False)]
-       for e in matrix_values:
-           e[Fields.EVAL_ONLY.value] = True

    print(json.dumps(matrix_values))
    return matrix_values
Expand Down
161 changes: 0 additions & 161 deletions utils/matrix_logic/test_generate_sweep_configs.py
@@ -9,7 +9,6 @@
    generate_runner_model_sweep_config,
    apply_node_type_defaults,
    expand_config_keys,
-   mark_eval_entries,
)


@@ -1580,166 +1579,6 @@
    assert result == [
        "dsr1-fp4-b200-sglang",
        "dsr1-fp8-mi300x-sglang",
        "dsr1-fp8-h200-trt",
        "gptoss-fp8-b200-sglang",
    ]

Check failure on line 1584 in utils/matrix_logic/test_generate_sweep_configs.py

Claude / Claude Code Review

Deletion of TestMarkEvalEntries and TestE2EConfigSplitting leaves mark_eval_entries with zero test coverage

Comment on lines 1582 to 1584
🔴 This revert deletes `TestMarkEvalEntries`—the only tests that directly exercised `mark_eval_entries()`—even though the function is still a live production function called in `main()`, leaving it with zero unit test coverage. If this revert is intentional, `TestMarkEvalEntries` should be preserved (it does not depend on the reverted `eval-only` mechanism) and new tests should be added to verify the restored `run-eval`-based SINGLE filter behavior.

Extended reasoning...

What the bug is

The revert deletes two test classes: TestE2EConfigSplitting and TestMarkEvalEntries. The deletion of TestE2EConfigSplitting is arguably justified: those tests used a _split_e2e_configs helper that replicated the eval-only-based SINGLE filter (not x.get('eval-only', False)), which is exactly the behavior being reverted. Keeping them would test a code path that no longer exists. However, the deletion of TestMarkEvalEntries is not justified and creates a real coverage gap.

The specific coverage gap

mark_eval_entries() still exists verbatim and is still called in main(). Its job is to scan all single-node entries, group them by (model, runner, framework, precision, isl, osl, spec-decoding, dp-attn), and mark the highest and median concurrency entries with run-eval=True. These marks drive both the EVALS split and, via the regression in bug_001, inadvertently drop entries from the SINGLE split. Yet after this revert the function is neither imported nor exercised in the test file.

Why existing tests do not cover it

All test classes test generate_full_sweep and generate_runner_model_sweep_config directly. None of those functions call mark_eval_entries; it is only called from main(), which has no dedicated test coverage. The refutation argument that mark_eval_entries has behavioral coverage through integration-style tests is incorrect: no executed test path reaches mark_eval_entries after this deletion.

Step-by-step proof

  1. test_generate_sweep_configs.py imports: generate_full_sweep, generate_runner_model_sweep_config, apply_node_type_defaults, expand_config_keys. mark_eval_entries is NOT imported (removed by this diff).
  2. All test classes invoke one of those four imported functions.
  3. generate_full_sweep sets run-eval=False on every entry and never calls mark_eval_entries.
  4. mark_eval_entries is only reachable through main(), which is not tested.
  5. Therefore, no test in the suite exercises mark_eval_entries, its grouping logic, its median-conc selection, its highest-conc selection, or its multinode-exclusion guard. Coverage = 0.
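The selection rule those tests pinned down can be sketched as a hedged reimplementation (the real function lives in generate_sweep_configs.py; its exact grouping key and median convention may differ from this simplification):

```python
from collections import defaultdict

def mark_eval_entries_sketch(entries):
    """Sketch of the rule described above: within each 8k1k single-node
    group, mark the median- and highest-concurrency entries with run-eval."""
    groups = defaultdict(list)
    for e in entries:
        # Multi-node entries (with a 'prefill' block) and non-8k1k shapes are skipped
        if 'prefill' in e or (e.get('isl'), e.get('osl')) != (8192, 1024):
            continue
        groups[(e['model'], e['runner'], e['framework'], e['precision'])].append(e)
    for group in groups.values():
        concs = sorted(x['conc'] for x in group)
        targets = {concs[len(concs) // 2], concs[-1]}  # median and highest
        for x in group:
            if x['conc'] in targets:
                x['run-eval'] = True
    return entries

entries = [{'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
            'isl': 8192, 'osl': 1024, 'conc': c, 'run-eval': False}
           for c in (32, 128, 512)]
marked = mark_eval_entries_sketch(entries)
assert [e['run-eval'] for e in marked] == [False, True, True]
```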

Why the deleted tests were valid independent of the revert

The four TestMarkEvalEntries test cases called mark_eval_entries directly with synthesized entry dicts. None of them referenced eval-only, _split_e2e_configs, or any other mechanism introduced by PR #1020. They would remain correct and valuable after the revert with no modification needed.

Impact and fix

Without these tests, the bug introduced by the reverted SINGLE filter (eval-marked entries silently dropped from benchmarks in default mode) cannot be caught by CI. The fix is to restore TestMarkEvalEntries as-is, re-add the mark_eval_entries import, and optionally add a new test class modeling the current run-eval-based SINGLE filter.



# =============================================================================
# Tests for e2e-tests.yml workflow config splitting
# =============================================================================

def _split_e2e_configs(data):
    """Replicate the splitting logic from e2e-tests.yml get-jobs step.

    Returns (SINGLE, MULTI, EVALS) lists matching the workflow filters.
    """
    single = [x for x in data if 'prefill' not in x and not x.get('eval-only', False)]
    multi = [x for x in data if 'prefill' in x]
    evals = [x for x in data if 'prefill' not in x and x.get('run-eval', False)]
    return single, multi, evals


class TestE2EConfigSplitting:
    """Verify the e2e-tests.yml config splitting logic handles all flag
    combinations correctly: default, --no-evals, and --evals-only."""

    @pytest.fixture
    def mixed_entries(self):
        """Simulates default mode output: single-node (some eval-marked),
        plus multi-node entries."""
        return [
            {'exp-name': 'a', 'isl': 1024, 'osl': 1024, 'conc': 64, 'tp': 2, 'run-eval': False},
            {'exp-name': 'b', 'isl': 1024, 'osl': 1024, 'conc': 128, 'tp': 2, 'run-eval': False},
            {'exp-name': 'c', 'isl': 8192, 'osl': 1024, 'conc': 256, 'tp': 2, 'run-eval': True},
            {'exp-name': 'd', 'isl': 8192, 'osl': 1024, 'conc': 512, 'tp': 2, 'run-eval': True},
            {'exp-name': 'e', 'conc': 64, 'prefill': {'tp': 2, 'num-worker': 1}},
        ]

    def test_default_mode_benchmarks_all_single_node(self, mixed_entries):
        """Default: all single-node entries (including eval-marked) are benchmarked."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(single) == 4
        assert all('prefill' not in x for x in single)

    def test_default_mode_evals_only_eval_marked(self, mixed_entries):
        """Default: only eval-marked entries go to EVALS."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(evals) == 2
        assert all(x['run-eval'] for x in evals)

    def test_default_mode_eval_marked_in_both(self, mixed_entries):
        """Default: eval-marked entries appear in BOTH single and evals."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        eval_names = {x['exp-name'] for x in evals}
        single_names = {x['exp-name'] for x in single}
        assert eval_names.issubset(single_names)

    def test_no_evals_all_benchmarked(self):
        """--no-evals: mark_eval_entries is skipped, no run-eval=True entries."""
        data = [
            {'exp-name': 'a', 'conc': 64, 'tp': 2, 'run-eval': False},
            {'exp-name': 'b', 'conc': 128, 'tp': 2, 'run-eval': False},
            {'exp-name': 'c', 'conc': 256, 'tp': 2, 'run-eval': False},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 3
        assert len(evals) == 0

    def test_evals_only_no_benchmarks(self):
        """--evals-only: entries have eval-only flag, SINGLE must be empty."""
        data = [
            {'exp-name': 'c', 'conc': 256, 'tp': 2, 'run-eval': True, 'eval-only': True},
            {'exp-name': 'd', 'conc': 512, 'tp': 2, 'run-eval': True, 'eval-only': True},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 0, "evals-only should not trigger benchmarks"
        assert len(evals) == 2

    def test_empty_config(self):
        """Empty input produces empty outputs."""
        single, multi, evals = _split_e2e_configs([])
        assert single == [] and multi == [] and evals == []

    def test_all_eval_marked_without_eval_only_flag_still_benchmarked(self):
        """Default mode where mark_eval_entries marks every entry (e.g. only
        8k1k with single conc). Without eval-only flag, SINGLE must still
        include them for benchmarking."""
        data = [
            {'exp-name': 'a', 'conc': 64, 'tp': 2, 'run-eval': True},
            {'exp-name': 'b', 'conc': 64, 'tp': 4, 'run-eval': True},
        ]
        single, multi, evals = _split_e2e_configs(data)
        assert len(single) == 2, "all-eval-marked entries must still be benchmarked in default mode"
        assert len(evals) == 2

    def test_prefill_entries_never_in_single_or_evals(self, mixed_entries):
        """Prefill (multi-node) entries only appear in MULTI."""
        single, multi, evals = _split_e2e_configs(mixed_entries)
        assert len(multi) == 1
        assert all('prefill' in x for x in multi)
        assert all('prefill' not in x for x in single)
        assert all('prefill' not in x for x in evals)


class TestMarkEvalEntries:
    """Verify mark_eval_entries only marks highest/median concurrency at 8k1k."""

    def test_marks_highest_and_median_conc(self):
        """Should mark highest and median concurrency for 8k1k entries."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 32,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 128,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        # conc values: [32, 128, 512]. median=128 (index 1), highest=512
        assert result[0]['run-eval'] is False  # conc=32
        assert result[1]['run-eval'] is True   # conc=128 (median)
        assert result[2]['run-eval'] is True   # conc=512 (highest)

    def test_non_8k1k_never_marked(self):
        """Entries with non-8k1k seq lengths should never be eval-marked."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 1024, 'osl': 1024, 'tp': 2, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        assert result[0]['run-eval'] is False

    def test_multinode_entries_never_marked(self):
        """Entries without top-level tp (multi-node) should never be eval-marked."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'conc': 512,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False,
             'prefill': {'tp': 2, 'num-worker': 1}},
        ]
        result = mark_eval_entries(entries)
        assert result[0]['run-eval'] is False

    def test_never_marks_all_entries(self):
        """mark_eval_entries should never mark every single-node entry,
        ensuring the e2e splitting logic can distinguish default from evals-only."""
        entries = [
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 8192, 'osl': 1024, 'tp': 2, 'conc': c,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False}
            for c in [32, 64, 128, 256, 512]
        ] + [
            # Non-8k1k entry that should never be marked
            {'model': 'm', 'runner': 'r', 'framework': 'f', 'precision': 'fp8',
             'isl': 1024, 'osl': 1024, 'tp': 2, 'conc': 64,
             'spec-decoding': False, 'dp-attn': False, 'run-eval': False},
        ]
        result = mark_eval_entries(entries)
        non_prefill = [x for x in result if 'prefill' not in x]
        assert not all(x['run-eval'] for x in non_prefill), \
            "mark_eval_entries must not mark all entries — would break e2e splitting"
2 changes: 0 additions & 2 deletions utils/matrix_logic/validation.py
@@ -53,7 +53,6 @@ class Fields(Enum):

    # Eval
    RUN_EVAL = 'run-eval'
-   EVAL_ONLY = 'eval-only'


"""
@@ -90,7 +89,6 @@ class SingleNodeMatrixEntry(BaseModel):
    exp_name: str = Field(alias=Fields.EXP_NAME.value)
    disagg: bool
    run_eval: bool = Field(alias=Fields.RUN_EVAL.value)
-   eval_only: bool = Field(alias=Fields.EVAL_ONLY.value, default=False)


class WorkerConfig(BaseModel):