What is the correct and optimal way to do a full shuffle? #5906
Unanswered · conceptofmind asked this question in Q&A
Replies: 3 comments, 6 replies
Maybe:

```python
import daft
import numpy as np
from daft import DataType, Series, col

@daft.func.batch(return_dtype=DataType.float64())
def get_random(series: Series) -> Series:
    # One random float per row, used as a sort key.
    rng = np.random.default_rng()
    return Series.from_numpy(rng.random(len(series)))

ddf = ddf.with_column("rand_key", get_random(col("text")))
ddf = ddf.sort(col("rand_key"))
```
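The sort-by-random-key approach above is a standard way to get a uniform shuffle. A minimal NumPy sketch of the same idea, independent of Daft, just to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)

# Assign each row an independent random key, then sort by the key.
# Sorting by i.i.d. random keys yields a uniform random permutation.
keys = rng.random(len(data))
shuffled = data[np.argsort(keys)]

# All original elements are still present exactly once.
print(sorted(shuffled.tolist()))  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```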
Or is doing a repartition the best?

```python
import daft

ddf = daft.read_parquet("/common-pile_sampled/raw/**")
ddf = ddf.repartition(15565)
ddf.write_parquet("/common-pile_sampled/shuffled")
```
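One caveat worth noting about repartitioning as a shuffle: if row placement is a deterministic function of the data (e.g. hash partitioning on a column), rerunning the job sends the same rows to the same partitions, whereas a random sort key gives a genuinely different order each run. A small illustration of deterministic hash bucketing (plain Python, not the Daft implementation, whose partitioning strategy is an assumption here):

```python
import zlib

def bucket(value: str, n_partitions: int) -> int:
    # Deterministic hash bucketing: the same value always lands
    # in the same partition, run after run.
    return zlib.crc32(value.encode()) % n_partitions

rows = ["a", "b", "c", "d"]
first = [bucket(r, 4) for r in rows]
second = [bucket(r, 4) for r in rows]
print(first == second)  # prints True: deterministic, not random
```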
I guess this is mostly covered by the title: I need to shuffle around 8 TB of data.