
--balance-data incorrectly requires exact divisibility by data-parallel size after invalid rollout samples are dropped #969

@mvillmow

Description

Summary

--balance-data currently behaves as if each data-parallel rank must receive exactly the same number of samples. In practice, this causes a hard failure when the valid rollout sample count is not divisible by the data-parallel size.

This appears inconsistent with the documented behavior of --balance-data, which says the flag should repartition each rollout batch so that ranks receive a similar total token count via the Karmarkar-Karp algorithm. That wording implies token-balanced partitioning, not exactly equal sample counts.

What happened

A rollout batch produced invalid samples, and those invalid samples were dropped before training conversion. After filtering, the remaining valid sample count was not divisible by the data-parallel size.

At that point, the run failed with an assertion equivalent to:

206 % 4 != 0
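For illustration, a minimal sketch of the kind of exact-divisibility check that produces this failure. All names here are hypothetical; this is not the actual Miles code path, just the failure pattern:

```python
# Hypothetical sketch of an exact DP split with a hard divisibility assert.
def split_exact(samples, dp_size):
    # This is the failure mode: the split refuses uneven sample counts.
    assert len(samples) % dp_size == 0, f"{len(samples)} % {dp_size} != 0"
    per_rank = len(samples) // dp_size
    return [samples[i * per_rank:(i + 1) * per_rank] for i in range(dp_size)]

# 206 valid samples remain after dropping invalid rollouts; DP size is 4.
try:
    split_exact(list(range(206)), 4)
except AssertionError as e:
    print(e)  # 206 % 4 != 0
```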

Expected behavior

With --balance-data enabled, the remaining valid samples should still be repartitioned across data-parallel ranks based on similar total token count, even when the sample count is not divisible by the number of ranks.

The flag should not require exactly equal sample counts unless that behavior is explicitly documented elsewhere.
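Token-balanced repartitioning does not need equal sample counts. A simple greedy longest-first partition (a cruder relative of Karmarkar-Karp, used here only to illustrate the expected behavior, not the actual implementation) already handles an uneven count:

```python
import heapq

def balance_by_tokens(sample_lengths, dp_size):
    """Assign samples to dp_size ranks so total token counts are similar.

    Greedy longest-first assignment; sample counts per rank may differ.
    """
    # Min-heap of (total_tokens, rank_index) so we always pick the lightest rank.
    heap = [(0, rank) for rank in range(dp_size)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(dp_size)]
    # Place the longest samples first, each onto the currently lightest rank.
    for idx in sorted(range(len(sample_lengths)),
                      key=lambda i: -sample_lengths[i]):
        total, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (total + sample_lengths[idx], rank))
    return assignment

# 206 samples with varying lengths split across 4 ranks without crashing.
lengths = [100 + (i * 37) % 400 for i in range(206)]
parts = balance_by_tokens(lengths, 4)
totals = [sum(lengths[i] for i in part) for part in parts]
print(sorted(len(p) for p in parts))  # per-rank counts may be uneven
print(max(totals) - min(totals))      # token totals stay close
```

The point of the sketch is that uneven sample counts are a normal output of token balancing, not an error condition.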

Actual behavior

The current implementation treats --balance-data as exact-size partitioning in the rollout DP split path, which triggers a hard assertion failure when:

  • invalid samples are dropped
  • the remaining sample count is not divisible by DP size

Why this is a bug

The documented semantics are about balancing token load, not enforcing exact sample-count equality.

As a result, a run can fail even though there are still enough valid samples to continue training with a reasonable token-balanced repartition.

Minimal fix

Use the existing non-exact balancing mode for the rollout DP split path when --balance-data is enabled.

In other words:

  • keep token-balanced repartitioning
  • stop requiring exact divisibility by DP size
  • do not crash solely because the valid sample count is uneven
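The sample-count side of this fix is analogous to the difference between numpy's exact and non-exact splits; this is an analogy, not the actual code path:

```python
import numpy as np

samples = np.arange(206)

# Exact split: fails because 206 is not divisible by 4.
try:
    np.split(samples, 4)
except ValueError as e:
    print("exact split failed:", e)

# Non-exact split: distributes the remainder instead of crashing.
chunks = np.array_split(samples, 4)
print([len(c) for c in chunks])  # [52, 52, 51, 51]
```

With --balance-data enabled, the non-exact behavior (plus token-count balancing) is what the documented semantics suggest.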

Environment

Observed against a current Miles-based training stack in April 2026.
