🌀 Correctly manage random seeds to improve reproducibility #70

@yamilbknsu

We currently provide an option to set a random seed to guarantee reproducibility. However, the seed is applied with a single call to np.random.seed(), which is problematic for two main reasons:

  • Every draw advances the same global state, so the outcome depends on the order of the random operations. A call that draws a dynamic (data-dependent) number of values also shifts every draw that comes after it
  • Some randomness does not follow numpy's seed (pandas.DataFrame.sample, for example)
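A minimal sketch of the first point, assuming a hypothetical helper named later_draw that stands in for a model step: the global state is seeded identically both times, yet the value drawn at the end changes as soon as an earlier step consumes a different number of draws.

```python
import numpy as np


def later_draw(n_earlier_draws: int) -> float:
    """Seed the global state, consume some draws, then draw once more."""
    np.random.seed(12345)
    # Hypothetical earlier step whose number of draws depends on the data.
    np.random.random(n_earlier_draws)
    # The value we would like to be reproducible.
    return float(np.random.random())


no_earlier_draws = later_draw(0)
three_earlier_draws = later_draw(3)

# Same seed in both runs, but the earlier step shifted the shared global stream.
print(no_earlier_draws == three_earlier_draws)
```

The seed alone therefore only pins the result when the full sequence of random calls, including their sizes, is identical between runs.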

Ideally, we would have a single instance of np.random.SeedSequence (reference) and spawn a separate random stream from it for each source of randomness. The master seed sequence could be injected using orca or made globally accessible through a public class.

Example pseudo-code for usage:

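import numpy as np

# One master SeedSequence built from the user-provided seed
master_ss = np.random.SeedSequence(12345)

# Spawn one independent child SeedSequence per process
child_ss = master_ss.spawn(2)

# Build a Generator for each child; draws on one do not affect the other
rng_P1 = np.random.default_rng(child_ss[0])
rng_P2 = np.random.default_rng(child_ss[1])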
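
One way the "public class" option could look is sketched below. Everything named here (SeedRegistry, the stream names "relocation" and "sampling", the household_id column) is hypothetical and only for illustration; it is not existing project code, and the orca injection route is not shown. The sketch also passes a Generator to pandas.DataFrame.sample through its random_state parameter, so the sample is tied to its own stream rather than to global state.

```python
from typing import Sequence

import numpy as np
import pandas as pd


class SeedRegistry:
    """Hypothetical helper: one independent Generator per named source."""

    def __init__(self, master_seed: int, names: Sequence[str]) -> None:
        master_ss = np.random.SeedSequence(master_seed)
        # The order of `names` decides which child each source receives,
        # so the list should be fixed and not built dynamically.
        children = master_ss.spawn(len(names))
        self._rngs = {
            name: np.random.default_rng(child)
            for name, child in zip(names, children)
        }

    def rng(self, name: str) -> np.random.Generator:
        return self._rngs[name]


def run(n_unrelated_draws: int) -> pd.DataFrame:
    registry = SeedRegistry(12345, ["relocation", "sampling"])
    # A different source of randomness with a data-dependent number of draws.
    registry.rng("relocation").random(n_unrelated_draws)
    households = pd.DataFrame({"household_id": range(10)})
    # The sample draws only from its own stream.
    return households.sample(n=3, random_state=registry.rng("sampling"))


# The sampled rows do not depend on how many draws the other source made.
print(run(0).equals(run(5)))
```

Keying the streams by name keeps each source reproducible on its own, at the cost of requiring a stable, agreed list of source names.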
