Summary
--balance-data currently behaves as if each data-parallel rank must receive exactly the same number of samples. In practice, this causes a hard failure when the valid rollout sample count is not divisible by the data-parallel size.
This appears inconsistent with the documented behavior of --balance-data, which says it should repartition each rollout batch so ranks get a similar total token count via Karmarkar-Karp. That wording implies token-balanced partitioning, not exact equal sample counts.
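For reference, a minimal sketch of what token-balanced (rather than exact-count) partitioning could look like. This is an illustrative greedy approximation, not the actual Karmarkar-Karp implementation and not the real Miles code path; the function name and signature are made up for this example.

```python
# Illustrative only: greedy token-balanced partitioning across DP ranks.
# Simplified stand-in for the Karmarkar-Karp partitioning the docs describe;
# names and signature are hypothetical, not the real implementation.
import heapq

def balanced_partition(token_lengths, dp_size):
    """Assign sample indices to dp_size buckets with similar total token count."""
    # Min-heap of (total_tokens, rank): always give the next-largest sample
    # to the currently lightest rank.
    heap = [(0, rank) for rank in range(dp_size)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(dp_size)]
    order = sorted(range(len(token_lengths)),
                   key=lambda i: token_lengths[i], reverse=True)
    for i in order:
        total, rank = heapq.heappop(heap)
        buckets[rank].append(i)
        heapq.heappush(heap, (total + token_lengths[i], rank))
    return buckets
```

Note that with this kind of scheme the per-rank sample counts may differ; only the per-rank token totals are kept close.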
What happened
A rollout batch produced invalid samples, and those invalid samples were dropped before training conversion. After filtering, the remaining valid sample count was not divisible by the data-parallel size.
At that point, the run failed with an assertion equivalent to:
206 % 4 != 0
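The failing check is presumably an exact-size split along these lines (hypothetical sketch; the real code and message differ):

```python
# Hypothetical sketch of an exact-size DP split that fails on uneven counts.
def split_exact(samples, dp_size):
    assert len(samples) % dp_size == 0, f"{len(samples)} % {dp_size} != 0"
    per_rank = len(samples) // dp_size
    return [samples[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]

# With 206 valid samples and dp_size=4, the assert trips: 206 % 4 != 0.
```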
Expected behavior
With --balance-data enabled, the remaining valid samples should still be repartitioned across data-parallel ranks based on similar total token count, even when the sample count is not divisible by the number of ranks.
The flag should not require exact equal sample counts unless that behavior is explicitly documented somewhere else.
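For example, 206 samples over 4 ranks can only split as unevenly as 52/52/51/51 by count, which is perfectly workable for token balancing. Reusing the hypothetical balanced_partition sketch above (made-up token lengths, not real data):

```python
# Demo with made-up token lengths; assumes the balanced_partition sketch above.
import random

random.seed(0)
token_lengths = [random.randint(100, 2000) for _ in range(206)]
buckets = balanced_partition(token_lengths, dp_size=4)
for rank, bucket in enumerate(buckets):
    print(rank, len(bucket), sum(token_lengths[i] for i in bucket))
# Per-rank sample counts need not be equal, but token totals stay close.
```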
Actual behavior
The current implementation treats --balance-data as exact-size partitioning in the rollout DP split path, which triggers a hard assertion failure when:
- invalid samples are dropped, and
- the remaining sample count is not divisible by the DP size
Why this is a bug
The documented semantics are about balancing token load, not enforcing exact sample-count equality.
As a result, a run can fail even though there are still enough valid samples to continue training with a reasonable token-balanced repartition.
Minimal fix
Use the existing non-exact balancing mode for the rollout DP split path when --balance-data is enabled.
In other words (a rough sketch follows this list):
- keep token-balanced repartitioning
- stop requiring exact divisibility by DP size
- do not crash solely because the valid sample count is uneven
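As a sketch of the shape of the fix, assuming a hypothetical split_for_dp entry point plus the balanced_partition sketch above (none of these names come from the actual codebase):

```python
# Hypothetical sketch of the rollout DP split with --balance-data honored as
# token balancing rather than exact-size partitioning.
def split_for_dp(samples, token_lengths, dp_size, balance_data):
    if balance_data:
        # Non-exact mode: similar token totals per rank, uneven counts allowed.
        buckets = balanced_partition(token_lengths, dp_size)
        return [[samples[i] for i in bucket] for bucket in buckets]
    # Legacy exact mode keeps the divisibility requirement.
    assert len(samples) % dp_size == 0, f"{len(samples)} % {dp_size} != 0"
    per_rank = len(samples) // dp_size
    return [samples[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]
```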
Environment
Observed against a current Miles-based training stack in April 2026.