Add FlatMixture: virtual dataset concatenation with global shuffle #4133
FlatMixture logically concatenates multiple AsyncDatasets into a single index space and applies a FeistelPermutation for deterministic global shuffling. This lets small datasets be grouped as one MixtureDataset component without re-tokenizing into a merged cache.

Fixes #4132

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
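
To make the "deterministic global shuffling" idea concrete, here is a minimal sketch of a Feistel-style index permutation with cycle-walking. It is not Levanter's FeistelPermutation: the names `feistel_permute` and `_round_value`, the hash-based round function, and the default round count are all assumptions chosen for illustration.

```python
import hashlib


def _round_value(right: int, key: int, round_index: int, mask: int) -> int:
    """Keyed round function (hypothetical); any deterministic function works."""
    digest = hashlib.blake2b(
        f"{key}:{round_index}:{right}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") & mask


def feistel_permute(index: int, length: int, key: int, rounds: int = 4) -> int:
    """Map an index in [0, length) to a unique shuffled index in [0, length)."""
    if not 0 <= index < length:
        raise IndexError(index)
    # Work in a power-of-two domain of 2 * half_bits bits that covers `length`.
    half_bits = max(1, ((length - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            # One Feistel round: swap halves and mix the old left half.
            left, right = right, left ^ _round_value(right, key, r, mask)
        x = (left << half_bits) | right
        # Cycle-walking: re-apply until the value lands back inside [0, length).
        if x < length:
            return x
```

Because each Feistel round is invertible, the mapping is a bijection on the padded domain, and cycle-walking restricts it to a bijection on `[0, length)`. No shuffled index array is stored, which is what makes this kind of permutation attractive for a virtual (unmaterialized) concatenation.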
@pytest.mark.asyncio
async def test_flat_mixture_is_finite():
@pytest.mark.asyncio
async def test_flat_mixture_rejects_infinite_children():
    flat = FlatMixture({"inf": InfiniteCounterDataset()}, key=key())
    assert not flat.is_finite()
should just reject infinite datasets as components
    return self.datasets[self.dataset_index[id]]


class FlatMixture(AsyncDataset[T]):
I think we should separate FlatMixture into just a ConcatDataset and the (already extant) PermutationDataset.
@claude delete dumb/tautological tests and generally see PR comments
Claude finished @dlwh's task in 4m 54s
Addressing PR review comments
Summary
Refactored per review feedback.
Tests: 26/26 passed
Address review feedback:
- Separate FlatMixture into ConcatDataset (pure concatenation) composed with the existing PermutationDataset for shuffling
- Reject infinite datasets eagerly at construction time instead of lazily
- Delete tautological tests (is_finite on finite datasets, single-child, deterministic-with-int-key)
- Add test for ConcatDataset preserving insertion order without permutation

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
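
The "reject eagerly at construction time" point can be sketched as follows. This is an illustrative stand-in, not the PR's ConcatDataset: `ConcatSketch`, `FiniteList`, and `InfiniteCounter` are hypothetical classes, and the only interface assumed is a synchronous `is_finite()` method, as used in the test hunk earlier in this conversation.

```python
class FiniteList:
    """Hypothetical finite child dataset."""

    def __init__(self, items):
        self.items = list(items)

    def is_finite(self) -> bool:
        return True


class InfiniteCounter:
    """Hypothetical unbounded child dataset."""

    def is_finite(self) -> bool:
        return False


class ConcatSketch:
    """Validates its children in __init__ rather than on first access."""

    def __init__(self, datasets: dict):
        if not datasets:
            raise ValueError("ConcatSketch needs at least one dataset")
        infinite = [name for name, ds in datasets.items() if not ds.is_finite()]
        if infinite:
            # Fail at construction so a bad config surfaces before training starts.
            raise ValueError(f"all children must be finite, got infinite: {infinite}")
        self.datasets = dict(datasets)
```

Validating in the constructor means a misconfigured mixture fails immediately, instead of reporting `is_finite() == False` and leaving the caller to notice later.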
@pytest.mark.asyncio
async def test_concat_dataset_rejects_infinite_children():
    with pytest.raises(ValueError, match="finite"):
        ConcatDataset({"inf": InfiniteCounterDataset()})


@pytest.mark.asyncio
async def test_concat_dataset_rejects_empty():
    with pytest.raises(ValueError, match="at least one"):
        ConcatDataset({})
@claude this range of tests still feels dumb for our not live use case
Claude finished @Helw150's task in 2m 12s
Cleaning up tautological ConcatDataset tests
- Read current tests
- Identify and remove dumb tests
- Run tests (20/20 passed)
- Push changes
Deleted 6 trivial tests that don't validate meaningful behavior:
- test_concat_dataset_length_is_sum_of_children: trivially true
- test_concat_dataset_rejects_empty: trivial input validation
- test_concat_dataset_contains_all_elements: redundant with is_a_permutation
- test_concat_dataset_preserves_order: trivial
- test_concat_dataset_getitem_consistent_with_get_batch: internal consistency, not real behavior
- test_concat_with_permutation_shuffles: single-child version of is_a_permutation
Kept 3 tests that validate real behavior:
- test_concat_dataset_rejects_infinite_children: important guard
- test_concat_with_permutation_is_a_permutation: core correctness
- test_concat_with_permutation_nests_in_mixture_dataset: real use case
Tests: 20/20 passed (pytest lib/levanter/tests/test_mixture.py). Pre-commit clean.
Remove trivial tests that don't validate meaningful behavior:
- length_is_sum_of_children, rejects_empty, contains_all_elements, preserves_order, getitem_consistent_with_get_batch, permutation_shuffles

Keep: rejects_infinite_children (guard), is_a_permutation (correctness), nests_in_mixture_dataset (real use case).

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude you deleted some tests that I felt were good (making sure that the dataset contains all elements, making sure an index returns a consistent value). You didn't delete one test that I think is bad (just checking that the concat dataset throws an error).

@claude What's up? You eyeballed my comment but didn't do anything. I like test_concat_dataset_getitem_consistent_with_get_batch (the one you labeled "internal consistency, not real behavior"); the rest can go tho.
Claude finished @Helw150's task in 2m 37s
Keeping only the 3 requested ConcatDataset tests
Summary
Updated ConcatDataset tests to keep exactly the 3 you requested.
Tests: 20/20 passed
Restore test_concat_dataset_getitem_consistent_with_get_batch and delete test_concat_dataset_rejects_infinite_children per Helw150's review.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
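
For readers unfamiliar with what a "getitem consistent with get_batch" check looks like, here is a rough, self-contained sketch. It does not reproduce the restored test: `ListDataset` and `ConcatSketch` are hypothetical stand-ins, and the method names `get_batch` and `getitem_async` are assumed for illustration only.

```python
import asyncio
from bisect import bisect_right
from itertools import accumulate


class ListDataset:
    """Hypothetical in-memory child with an async batch read."""

    def __init__(self, items):
        self.items = list(items)

    async def get_batch(self, indices):
        return [self.items[i] for i in indices]


class ConcatSketch:
    """Hypothetical concatenation over children, addressed by global index."""

    def __init__(self, children):
        self.children = list(children)
        # Cumulative end offsets of each child in the global index space.
        self.ends = list(accumulate(len(c.items) for c in self.children))

    async def get_batch(self, indices):
        out = []
        for i in indices:
            child = bisect_right(self.ends, i)
            start = self.ends[child - 1] if child else 0
            out.extend(await self.children[child].get_batch([i - start]))
        return out

    async def getitem_async(self, index):
        return (await self.get_batch([index]))[0]


async def check_consistency():
    ds = ConcatSketch([ListDataset("abc"), ListDataset("de")])
    indices = [4, 0, 3]
    batch = await ds.get_batch(indices)
    singles = [await ds.getitem_async(i) for i in indices]
    return batch, singles


batch, singles = asyncio.run(check_consistency())
assert batch == singles
```

In this sketch the single-item path is defined in terms of the batch path, so the check mostly guards against the two diverging in a later refactor.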
…4133)

FlatMixture logically concatenates multiple AsyncDatasets into a single index space and applies a FeistelPermutation for deterministic global shuffling. This lets small datasets be grouped as one MixtureDataset component without re-tokenizing into a merged cache. Children keep their own caches and FlatMixture resolves shuffled indices to (child, local_offset) via np.searchsorted on cumulative lengths.

Fixes #4132

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: David Hall <dlwh@users.noreply.github.com>
Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-authored-by: William Held <will.held@openathena.ai>
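
The index resolution described in the commit message above can be sketched with the standard library. This sketch uses `bisect.bisect_right` as a stand-in for `np.searchsorted(..., side="right")` to stay dependency-free; the function name `resolve` and the example lengths are assumptions, not the PR's code.

```python
from bisect import bisect_right
from itertools import accumulate


def resolve(global_index: int, child_lengths: list[int]) -> tuple[int, int]:
    """Return (child_position, local_offset) for an index into the concatenation."""
    # Cumulative end offsets: lengths [3, 2, 4] give [3, 5, 9].
    ends = list(accumulate(child_lengths))
    if not ends or not 0 <= global_index < ends[-1]:
        raise IndexError(global_index)
    # First child whose end offset is strictly greater than the index.
    child = bisect_right(ends, global_index)
    start = ends[child - 1] if child > 0 else 0
    return child, global_index - start


# With children of lengths 3, 2 and 4, global index 5 is the first item of child 2.
print(resolve(5, [3, 2, 4]))  # (2, 0)
```

Using the "right" side of the search matters at boundaries: an index equal to a child's end offset belongs to the next child, and zero-length children are skipped automatically.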