
--balance-data incorrectly requires exact divisibility by data-parallel size after invalid rollout samples are dropped #969

@mvillmow

Description

Summary

--balance-data currently behaves as if each data-parallel rank must receive exactly the same number of samples. In practice, this causes a hard failure when the valid rollout sample count is not divisible by the data-parallel size.

This appears inconsistent with the documented behavior of --balance-data, which says the flag should repartition each rollout batch so that ranks receive a similar total token count via the Karmarkar-Karp algorithm. That wording implies token-balanced partitioning, not exactly equal sample counts.

What happened

A rollout batch produced invalid samples, and those invalid samples were dropped before training conversion. After filtering, the remaining valid sample count was not divisible by the data-parallel size.

At that point, the run failed with an assertion equivalent to:

206 % 4 != 0
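For illustration, a minimal sketch of the kind of exact-divisibility check that produces this failure. All names here are hypothetical; this is not the actual Miles code path, just the failure pattern:

```python
# Hypothetical sketch of an exact DP split with a hard divisibility assert.
def split_exact(samples, dp_size):
    # This is the failure mode: the split refuses uneven sample counts.
    assert len(samples) % dp_size == 0, f"{len(samples)} % {dp_size} != 0"
    per_rank = len(samples) // dp_size
    return [samples[i * per_rank:(i + 1) * per_rank] for i in range(dp_size)]

# 206 valid samples remain after dropping invalid rollouts; DP size is 4.
try:
    split_exact(list(range(206)), 4)
except AssertionError as e:
    print(e)  # 206 % 4 != 0
```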

Expected behavior

With --balance-data enabled, the remaining valid samples should still be repartitioned across data-parallel ranks based on similar total token count, even when the sample count is not divisible by the number of ranks.

The flag should not require exactly equal sample counts unless that behavior is explicitly documented elsewhere.
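Token-balanced repartitioning does not need equal sample counts. A simple greedy longest-first partition (a cruder relative of Karmarkar-Karp, used here only to illustrate the expected behavior, not the actual implementation) already handles an uneven count:

```python
import heapq

def balance_by_tokens(sample_lengths, dp_size):
    """Assign samples to dp_size ranks so total token counts are similar.

    Greedy longest-first assignment; sample counts per rank may differ.
    """
    # Min-heap of (total_tokens, rank_index) so we always pick the lightest rank.
    heap = [(0, rank) for rank in range(dp_size)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(dp_size)]
    # Place the longest samples first, each onto the currently lightest rank.
    for idx in sorted(range(len(sample_lengths)),
                      key=lambda i: -sample_lengths[i]):
        total, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (total + sample_lengths[idx], rank))
    return assignment

# 206 samples with varying lengths split across 4 ranks without crashing.
lengths = [100 + (i * 37) % 400 for i in range(206)]
parts = balance_by_tokens(lengths, 4)
totals = [sum(lengths[i] for i in part) for part in parts]
print(sorted(len(p) for p in parts))  # per-rank counts may be uneven
print(max(totals) - min(totals))      # token totals stay close
```

The point of the sketch is that uneven sample counts are a normal output of token balancing, not an error condition.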

Actual behavior

The current implementation treats --balance-data as exact-size partitioning in the rollout DP split path, which triggers a hard assertion failure when:

  • invalid samples are dropped
  • the remaining sample count is not divisible by DP size

Why this is a bug

The documented semantics are about balancing token load, not enforcing exact sample-count equality.

As a result, a run can fail even though there are still enough valid samples to continue training with a reasonable token-balanced repartition.

Minimal fix

Use the existing non-exact balancing mode for the rollout DP split path when --balance-data is enabled.

In other words:

  • keep token-balanced repartitioning
  • stop requiring exact divisibility by DP size
  • do not crash solely because the valid sample count is uneven
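The sample-count side of this fix is analogous to the difference between numpy's exact and non-exact splits; this is an analogy, not the actual code path:

```python
import numpy as np

samples = np.arange(206)

# Exact split: fails because 206 is not divisible by 4.
try:
    np.split(samples, 4)
except ValueError as e:
    print("exact split failed:", e)

# Non-exact split: distributes the remainder instead of crashing.
chunks = np.array_split(samples, 4)
print([len(c) for c in chunks])  # [52, 52, 51, 51]
```

With --balance-data enabled, the non-exact behavior (plus token-count balancing) is what the documented semantics suggest.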

Environment

Observed against a current Miles-based training stack in April 2026.
