Fix IndexError that arises when partition ids are not continuous with multiple executors per machine #742
Conversation
Force-pushed from c1e902d to 4b235cb
Diff excerpt from gen_data_distributed.py:

    logging.warning("cupy import failed; falling back to numpy.")
    ...
    partition_index = pyspark.TaskContext().partitionId()
    my_seed = partition_seeds[partition_index % len(partition_seeds)]
Good finding on this, but is there a way to guarantee unique seeds across the partitions? With this, we could have two partition indexes that differ by a multiple of len(partition_seeds), and then they would get the same seed.
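To illustrate the concern, a minimal sketch with a hypothetical seed list of length 4; indexes 1 and 5 differ by len(partition_seeds) and collapse onto the same seed:

    # Hypothetical: 4 intended partitions, so 4 precomputed seeds.
    partition_seeds = [101, 202, 303, 404]

    # Two partition indexes that differ by a multiple of len(partition_seeds).
    idx_a, idx_b = 1, 5

    seed_a = partition_seeds[idx_a % len(partition_seeds)]  # 202
    seed_b = partition_seeds[idx_b % len(partition_seeds)]  # 202
    assert seed_a == seed_b  # same seed -> two partitions generate identical data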
Maybe we can set the task id as the random seed. However, the generated dataset would then be non-deterministic because the set of task ids changes between runs.
Will think about whether there is a way to guarantee both unique seeds and a deterministic dataset.
Will you be revising this PR?
Revised to "partition_seed = global_random_state + partition id".
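A minimal sketch of that revised scheme, assuming global_random_state is the generator's configured integer seed (the helper name is illustrative, not the PR's code):

    import numpy as np
    import pyspark

    def partition_rng(global_random_state: int) -> np.random.RandomState:
        # Offset the global seed by the partition id: ids are unique per
        # partition, so seeds cannot collide, and reruns with the same
        # global_random_state reproduce the same data. Must run inside a
        # Spark task for the TaskContext to be populated.
        partition_id = pyspark.TaskContext().partitionId()
        return np.random.RandomState(global_random_state + partition_id)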
This is tricky. I think the main problem is that the final RDD, when the computation happens, can have a different number of partitions than the intended num_partitions.
Maybe we can do this a different way and simply run a double-precision random number generator, starting from an initial deterministic seed, partitionId number of times to get the partition's seed. Then there is no need to know the total number of partitions, and partition seeds will be unique with high probability.
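A rough sketch of that idea, assuming the final double draw is mapped to an integer before being used as a seed (names here are illustrative, not the PR's code):

    import numpy as np
    import pyspark

    def derive_partition_seed(global_seed: int) -> int:
        # Advance a deterministically seeded generator partition_id + 1 times;
        # each partition lands on a different point of the stream, so seeds are
        # distinct with high probability and reproducible across runs.
        partition_id = pyspark.TaskContext().partitionId()
        rng = np.random.RandomState(global_seed)
        draws = rng.random_sample(partition_id + 1)
        # Convert the final double in [0, 1) to an integer seed.
        return int(draws[-1] * (2**31 - 1))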
The first stage has set numPartitions explicitly, so even if there is a repartition, the number of partitions of the shuffle-write stage should be equal to numPartitions, which should not be a problem here.

    # Initial DataFrame with only row numbers
    init = spark.range(rows, numPartitions=num_partitions)
    res = init.mapInPandas(make_sparse_regression_udf, schema)
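For context, a rough sketch of how the partition-level seed might be consumed inside the mapInPandas UDF; the body of make_sparse_regression_udf below is an assumption for illustration, not the repository's implementation:

    import numpy as np
    import pandas as pd
    import pyspark

    global_random_state = 42  # illustrative global seed

    def make_sparse_regression_udf(batch_iter):
        # Seed once per partition using the scheme discussed above, then
        # generate feature/label values for every incoming batch of row numbers.
        partition_id = pyspark.TaskContext().partitionId()
        rng = np.random.RandomState(global_random_state + partition_id)
        for pdf in batch_iter:
            n = len(pdf)
            yield pd.DataFrame({
                "feature": rng.random_sample(n),
                "label": rng.random_sample(n),
            })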
build |
Does "double precision random number" mean integer random number generator? The random seed expects an integer value. |
👍
…rminism of SparseRegressionDataGen (#894)

Issue: #892
Relevant PR: #742

Signed-off-by: Jinfeng <[email protected]>
The partition ids are not continuous when Spark standalone mode launches multiple executors per machine, and this triggers an IndexError in gen_data_distributed.py.
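A minimal sketch of the failure mode, assuming the pre-fix code indexed a seed list of length num_partitions directly by partitionId():

    # Four intended partitions, so four precomputed seeds.
    partition_seeds = [11, 22, 33, 44]

    # In standalone mode with multiple executors per machine, the observed
    # partition ids may be non-continuous, e.g. 0, 1, 4, 5.
    partition_index = 5

    # Direct indexing (assumed pre-fix behavior) overflows the list ...
    try:
        seed = partition_seeds[partition_index]
    except IndexError:
        print("IndexError: id 5 vs. len(partition_seeds) == 4")

    # ... while the fixed lookup wraps the index:
    seed = partition_seeds[partition_index % len(partition_seeds)]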