
Conversation

@lijinf2 lijinf2 commented Sep 23, 2024

Partition ids are not contiguous when Spark standalone mode launches multiple executors per machine, and this triggers an IndexError in gen_data_distributed.py.

logging.warning("cupy import failed; falling back to numpy.")

partition_index = pyspark.TaskContext().partitionId()
my_seed = partition_seeds[partition_index % len(partition_seeds)]
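# Note: with non-contiguous partition ids, partitionId() can exceed
# len(partition_seeds) - 1, so the modulo keeps the lookup in range
# (at the cost of possibly repeating seeds, as discussed below).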
Collaborator

Good finding on this, but is there a way to guarantee unique seeds across the partitions? With this change, two partition indexes that differ by a multiple of len(partition_seeds) would end up with the same seed.

Collaborator Author

Maybe we can use the task id as the random seed. However, the generated dataset would then be non-deterministic, because the set of task ids changes across runs.

Will think about whether there is a way to guarantee both unique seeds and a deterministic dataset.
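For reference, a minimal sketch of that alternative, assuming numpy is available on the workers; it uses TaskContext.taskAttemptId(), which never collides within an application but differs between runs, so the output would not be reproducible:

import numpy as np
import pyspark

# Unique per task attempt, so no two partitions share a seed ...
my_seed = pyspark.TaskContext().taskAttemptId()
# ... but attempt ids change from run to run (and on task retries),
# so the generated dataset is not deterministic.
rng = np.random.default_rng(my_seed)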

Collaborator

Will you be revising this PR?

Collaborator Author

Revised to "partition_seed = global_random_state + partition id".
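A minimal sketch of the revised scheme, assuming the data generator receives an integer global_random_state; the variable names here are illustrative rather than the exact code in this PR:

import numpy as np
import pyspark

global_random_state = 42  # deterministic base seed chosen by the caller

partition_id = pyspark.TaskContext().partitionId()
# Unique per partition and stable across runs, even when partition ids
# are not contiguous.
partition_seed = global_random_state + partition_id
rng = np.random.default_rng(partition_seed)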

@eordentlich
Collaborator

This is tricky. I think the main problem is that the final RDD, by the time the computation happens, can have a different number of partitions than the intended num_partitions.

@eordentlich
Collaborator

Maybe we can do this a different way: simply run a double-precision random number generator, started from a deterministic initial seed, partitionId number of times to get the partition's seed. Then there is no need to know the total number of partitions, and the partition seeds will be unique with high probability.
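A hedged sketch of that idea, assuming numpy's Generator and a fixed base seed; the helper name and the int64 scaling are illustrative choices, not a committed implementation:

import numpy as np

def partition_seed_from_base(base_seed: int, partition_id: int) -> int:
    rng = np.random.default_rng(base_seed)
    # Draw partition_id + 1 doubles so partition 0 also gets a draw; the
    # last value is scaled to an integer seed. The total number of
    # partitions never needs to be known, and collisions are improbable.
    draws = rng.random(partition_id + 1)
    return int(draws[-1] * np.iinfo(np.int64).max)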

wbo4958 commented Sep 24, 2024

The first stage sets numPartitions explicitly, so even if there is a repartition, the number of partitions in the shuffle-write stage should equal numPartitions; this should not be a problem here.

# Initial DataFrame with only row numbers
init = spark.range(rows, numPartitions=num_partitions)

res = init.mapInPandas(make_sparse_regression_udf, schema)
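# mapInPandas is a per-partition (narrow) transformation, so res keeps the
# num_partitions partitions created by spark.range above.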

lijinf2 commented Sep 25, 2024

build

lijinf2 commented Sep 25, 2024

"Maybe we can do this a different way: simply run a double-precision random number generator, started from a deterministic initial seed, partitionId number of times to get the partition's seed. Then there is no need to know the total number of partitions, and the partition seeds will be unique with high probability."

Does "double precision random number" mean integer random number generator? The random seed expects an integer value.

@eordentlich eordentlich left a comment

👍

@lijinf2 lijinf2 merged commit 3bb8a51 into NVIDIA:branch-24.10 Sep 25, 2024
2 checks passed
@lijinf2 lijinf2 deleted the qa_indexerror branch September 25, 2024 21:21
lijinf2 added a commit that referenced this pull request Apr 25, 2025
…rminism of SparseRegressionDataGen (#894)

Issue: #892
Relevant PR: #742

---------

Signed-off-by: Jinfeng <[email protected]>