Skip to content

Add FlatMixture: virtual dataset concatenation with global shuffle#4133

Merged
Helw150 merged 5 commits intomainfrom
agent/20260325-fix-4132
Mar 25, 2026
Merged

Add FlatMixture: virtual dataset concatenation with global shuffle#4133
Helw150 merged 5 commits intomainfrom
agent/20260325-fix-4132

Conversation

@claude
Copy link
Copy Markdown
Contributor

@claude claude Bot commented Mar 25, 2026

FlatMixture logically concatenates multiple AsyncDatasets into a single index space and applies a FeistelPermutation for deterministic global shuffling. This lets small datasets be grouped as one MixtureDataset component without re-tokenizing into a merged cache. Children keep their own caches and FlatMixture resolves shuffled indices to (child, local_offset) via np.searchsorted on cumulative lengths.

Fixes #4132

FlatMixture logically concatenates multiple AsyncDatasets into a single
index space and applies a FeistelPermutation for deterministic global
shuffling. This lets small datasets be grouped as one MixtureDataset
component without re-tokenizing into a merged cache.

Fixes #4132

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude claude Bot added the agent-generated Created by automation/agent label Mar 25, 2026
Comment thread lib/levanter/tests/test_mixture.py Outdated


@pytest.mark.asyncio
async def test_flat_mixture_is_finite():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb

Comment thread lib/levanter/tests/test_mixture.py Outdated
@pytest.mark.asyncio
async def test_flat_mixture_rejects_infinite_children():
flat = FlatMixture({"inf": InfiniteCounterDataset()}, key=key())
assert not flat.is_finite()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should just reject infinite datasets as components

return self.datasets[self.dataset_index[id]]


class FlatMixture(AsyncDataset[T]):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should separate flatmixture into just "concatdataset" and (the already extant) permutationdataset

@dlwh
Copy link
Copy Markdown
Member

dlwh commented Mar 25, 2026

@claude delete dumb/tautological tests and generally see PR comments

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Mar 25, 2026

Claude finished @dlwh's task in 4m 54s —— View job


Addressing PR review comments

  • Read current test and implementation files
  • Separate FlatMixture into ConcatDataset + existing PermutationDataset
  • Reject infinite datasets at construction time (not lazily)
  • Delete tautological tests
  • Update __init__.py exports
  • Update tests to use ConcatDataset + PermutationDataset composition
  • Run pre-commit and tests
  • Push changes

Summary

Refactored per review feedback:

  1. FlatMixtureConcatDataset: Pure concatenation with no shuffling. All permutation logic removed — users compose with the existing PermutationDataset for shuffling.

  2. Eager rejection of infinite datasets: The ConcatDataset constructor now raises ValueError immediately if any child is not finite, instead of deferring to async_len().

  3. Deleted tautological/dumb tests:

    • test_flat_mixture_is_finite — trivially true for ListAsyncDataset children
    • test_flat_mixture_single_child — no meaningful behavior tested
    • test_flat_mixture_deterministic — tested int-key convenience, not real behavior
    • test_flat_mixture_rejects_infinite_children — replaced by constructor rejection test
  4. Added test_concat_dataset_preserves_order — verifies that without a permutation wrapper, indices map in child insertion order.

Tests: 26/26 passed (pytest tests/test_mixture.py). Pre-commit clean.
· Branch

Address review feedback:
- Separate FlatMixture into ConcatDataset (pure concatenation) composed
  with the existing PermutationDataset for shuffling
- Reject infinite datasets eagerly at construction time instead of lazily
- Delete tautological tests (is_finite on finite datasets, single-child,
  deterministic-with-int-key)
- Add test for ConcatDataset preserving insertion order without permutation

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread lib/levanter/tests/test_mixture.py Outdated
Comment on lines +310 to +319
@pytest.mark.asyncio
async def test_concat_dataset_rejects_infinite_children():
with pytest.raises(ValueError, match="finite"):
ConcatDataset({"inf": InfiniteCounterDataset()})


@pytest.mark.asyncio
async def test_concat_dataset_rejects_empty():
with pytest.raises(ValueError, match="at least one"):
ConcatDataset({})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude this range of tests still feels dumb for our not live use case

Copy link
Copy Markdown
Contributor Author

@claude claude Bot Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @Helw150's task in 2m 12s —— View job


Cleaning up tautological ConcatDataset tests

  • Read current tests
  • Identify and remove dumb tests
  • Run tests (20/20 passed)
  • Push changes

Deleted 6 trivial tests that don't validate meaningful behavior:

  • test_concat_dataset_length_is_sum_of_children — trivially true
  • test_concat_dataset_rejects_empty — trivial input validation
  • test_concat_dataset_contains_all_elements — redundant with is_a_permutation
  • test_concat_dataset_preserves_order — trivial
  • test_concat_dataset_getitem_consistent_with_get_batch — internal consistency, not real behavior
  • test_concat_with_permutation_shuffles — single-child version of is_a_permutation

Kept 3 tests that validate real behavior:

  • test_concat_dataset_rejects_infinite_children — important guard
  • test_concat_with_permutation_is_a_permutation — core correctness
  • test_concat_with_permutation_nests_in_mixture_dataset — real use case

Tests: 20/20 passed (pytest lib/levanter/tests/test_mixture.py). Pre-commit clean.
· Branch

Remove trivial tests that don't validate meaningful behavior:
- length_is_sum_of_children, rejects_empty, contains_all_elements,
  preserves_order, getitem_consistent_with_get_batch, permutation_shuffles

Keep: rejects_infinite_children (guard), is_a_permutation (correctness),
nests_in_mixture_dataset (real use case).

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Helw150
Copy link
Copy Markdown
Member

Helw150 commented Mar 25, 2026

@claude you deleted some tests that I felt were good (making sure that the dataset contains all elements, making sure an index returns a consistent value).

You didn't delete one test that I think is bad (just checking that the concat dataset throws an error)

@Helw150
Copy link
Copy Markdown
Member

Helw150 commented Mar 25, 2026

@claude What's up you eyeballed my comment, but didn't do anything.

I like

test_concat_dataset_getitem_consistent_with_get_batch — internal consistency, not real behavior
test_concat_with_permutation_is_a_permutation — core correctness
test_concat_with_permutation_nests_in_mixture_dataset — real use case

the rest can go tho

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Mar 25, 2026

Claude finished @Helw150's task in 2m 37s —— View job


Keeping only the 3 requested ConcatDataset tests

  • Read current test file
  • Restore test_concat_dataset_getitem_consistent_with_get_batch
  • Delete test_concat_dataset_rejects_infinite_children
  • Run pre-commit and tests
  • Push changes

Summary

Updated ConcatDataset tests to keep exactly the 3 you requested:

  1. test_concat_dataset_getitem_consistent_with_get_batch — restored (was previously deleted)
  2. test_concat_with_permutation_is_a_permutation — kept
  3. test_concat_with_permutation_nests_in_mixture_dataset — kept

Deleted test_concat_dataset_rejects_infinite_children.

Tests: 20/20 passed (pytest lib/levanter/tests/test_mixture.py). Pre-commit clean.

· View job · Branch

Restore test_concat_dataset_getitem_consistent_with_get_batch and delete
test_concat_dataset_rejects_infinite_children per Helw150's review.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Helw150 Helw150 enabled auto-merge (squash) March 25, 2026 23:37
@Helw150 Helw150 merged commit d7c4111 into main Mar 25, 2026
42 checks passed
@Helw150 Helw150 deleted the agent/20260325-fix-4132 branch March 25, 2026 23:47
Helw150 added a commit that referenced this pull request Apr 8, 2026
…4133)

FlatMixture logically concatenates multiple AsyncDatasets into a single
index space and applies a FeistelPermutation for deterministic global
shuffling. This lets small datasets be grouped as one MixtureDataset
component without re-tokenizing into a merged cache. Children keep their
own caches and FlatMixture resolves shuffled indices to (child,
local_offset) via np.searchsorted on cumulative lengths.

Fixes #4132

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: David Hall <dlwh@users.noreply.github.com>
Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-authored-by: William Held <will.held@openathena.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🤖 FlatMixture: virtual dataset concatenation without re-tokenizing

2 participants