What is the correct and optimal way to do a full shuffle? #5906
Unanswered · conceptofmind asked this question in Q&A
Replies: 3 comments, 6 replies
Maybe:

```python
import daft
import numpy as np
from daft import DataType, Series, col

@daft.func.batch(return_dtype=DataType.float64())
def get_random(series: Series) -> Series:
    # One random float per row, used as a sort key.
    rng = np.random.default_rng()
    return Series.from_numpy(rng.random(len(series)))

ddf = ddf.with_column("rand_key", get_random(col("text")))
ddf = ddf.sort(col("rand_key"))
```
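The sort-by-random-key approach above is a standard way to get a uniform shuffle. A minimal NumPy sketch of the same idea, independent of Daft, just to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)

# Assign each row an independent random key, then sort by the key.
# Sorting by i.i.d. random keys yields a uniform random permutation.
keys = rng.random(len(data))
shuffled = data[np.argsort(keys)]

# All original elements are still present exactly once.
print(sorted(shuffled.tolist()))  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```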
Or is doing a repartition the best?

```python
import daft

ddf = daft.read_parquet("/common-pile_sampled/raw/**")
ddf = ddf.repartition(15565)
ddf.write_parquet("/common-pile_sampled/shuffled")
```
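One caveat worth noting about repartitioning as a shuffle: if row placement is a deterministic function of the data (e.g. hash partitioning on a column), rerunning the job sends the same rows to the same partitions, whereas a random sort key gives a genuinely different order each run. A small illustration of deterministic hash bucketing (plain Python, not the Daft implementation, whose partitioning strategy is an assumption here):

```python
import zlib

def bucket(value: str, n_partitions: int) -> int:
    # Deterministic hash bucketing: the same value always lands
    # in the same partition, run after run.
    return zlib.crc32(value.encode()) % n_partitions

rows = ["a", "b", "c", "d"]
first = [bucket(r, 4) for r in rows]
second = [bucket(r, 4) for r in rows]
print(first == second)  # prints True: deterministic, not random
```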
I guess this is mostly covered by the title: I need to shuffle around 8 TB of data.