
256 fix seed #261

Open
marianovitasari20 wants to merge 14 commits into main from 256-fix-seed

Conversation

@marianovitasari20
Contributor

No description provided.

@marianovitasari20
Contributor Author

I’ve merged the latest changes from main and accepted the main version due to conflicts. Since the function I modified in this branch was changed or deleted in main, I now need to reapply my own changes, so I’m working on it again.

@github-actions

github-actions Bot commented Apr 29, 2026

Coverage report

File                                    Lines missing coverage (new statements)
icenet_mp/model_service.py              28-32, 194-208
icenet_mp/losses/weighted_mse_loss.py   23-26, 44-48

This report was generated by python-coverage-comment-action

@marianovitasari20
Contributor Author

It’s becoming nondeterministic again after I merge main into this branch. I’ve been investigating it since yesterday. I’ve only run it locally and haven’t tried it on Isambard yet.

Member

@jemrobinson left a comment

I'm not totally convinced that this is a thing we want to be doing.

If we believe that the real performance of a model training has quite a wide distribution of performance (i.e. a large spread of values across any particular metric) then what do we achieve if we manage to make this deterministic? If we want to compare "model configuration A" to "model configuration B" we would still need to run each of these 10-20 times with different seeds in order to convince ourselves that we understand the model uncertainty.

Would we be better working on running model ensembles instead?

Comment thread icenet_mp/model_service.py Outdated


class _DeterministicInterpolate:
    """Monkey-patch F.interpolate to strip antialias=True for deterministic CUDA backward."""
Member

I'm not sure this is something we want to do. We previously needed the antialias argument to avoid artifacts when resizing.

seed = int(seed)
seed_everything(seed, workers=True)
os.environ["PYTHONHASHSEED"] = str(seed)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
Member

Reference https://docs.nvidia.com/cuda/cublas/ and explain what this means
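For reference, a minimal sketch of what a documented version of this setup could look like. The helper name and Lightning import path are assumptions rather than code from this branch; the key point is that PyTorch only allows torch.use_deterministic_algorithms(True) on CUDA 10.2+ when CUBLAS_WORKSPACE_CONFIG is ":4096:8" or ":16:8", because this fixes the cuBLAS workspace configuration so that reductions become reproducible.

import os

import torch
from pytorch_lightning import seed_everything  # or lightning.pytorch, depending on version


def enable_deterministic_mode(seed: int) -> None:
    """Sketch of a fully seeded, deterministic training setup (hypothetical helper)."""
    # Fix the cuBLAS workspace configuration so matrix reductions are bitwise
    # reproducible; required by torch.use_deterministic_algorithms(True) on
    # CUDA >= 10.2. See https://docs.nvidia.com/cuda/cublas/.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Only affects subprocesses (e.g. spawned dataloader workers); the hash seed
    # of the current interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Seed Python, NumPy, torch and (with workers=True) dataloader workers.
    seed_everything(seed, workers=True)
    # Raise an error whenever a nondeterministic CUDA kernel would be used.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which can select different kernels per run.
    torch.backends.cudnn.benchmark = False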

@IFenton
Contributor

IFenton commented Apr 30, 2026

If we want to compare "model configuration A" to "model configuration B" we would still need to run each of these 10-20 times with different seeds in order to convince ourselves that we understand the model uncertainty.

My understanding was that it would allow us to compare two model configurations better, e.g. if model config A always has a lower error than model config B when they are run with the same seed then we can be more confident that model config A is a sensible thing to do. You'd still need to run each pairing a number of times with different seeds to get more of an idea of spread (/do model ensembles), but it seems like it allows a more like-for-like comparison of two configurations. Is that logic wrong @jemrobinson?

Would we be better working on running model ensembles instead?
This feels like something we should be doing going forward, but as well as, rather than instead of, the seeded comparisons.

@marianovitasari20
Contributor Author

marianovitasari20 commented Apr 30, 2026

I'm not totally convinced that this is a thing we want to be doing.

If we believe that the real performance of a model training has quite a wide distribution of performance (i.e. a large spread of values across any particular metric) then what do we achieve if we manage to make this deterministic? If we want to compare "model configuration A" to "model configuration B" we would still need to run each of these 10-20 times with different seeds in order to convince ourselves that we understand the model uncertainty.

Would we be better working on running model ensembles instead?

I think these are two separate things, @jemrobinson. First, we want to fairly investigate why a particular method behaves the way it does, and for that we need a fair, controlled comparison to draw meaningful conclusions. The deterministic mode with a fixed seed is just a tool for that; it doesn't remove uncertainty from the model itself.
Regarding ensembles, we can absolutely still do that later. Adding a seed here doesn't prevent us from running ensemble experiments; the seed is just for controlled comparisons during development. Under normal settings, we wouldn't use a seed at all.

Yes, I’m aware of the effect of removing antialiasing. I thought we could still use this to compare runs in a fully deterministic mode, and we wouldn't need to use a seed under normal settings. I take your point that ensembles could also be used for uncertainty quantification, and that they can be applied under normal settings without seeds later. However, I’m still inclined to keep the deterministic approach, as it makes it easier to compare different configurations, such as methods or loss functions. The purpose is to ensure a fair comparison. We will revert to running without a seed once we have decided which methods (e.g., loss functions or approaches) are better.

@jemrobinson
Member

Hi @marianovitasari20. I guess my underlying point is this: what does comparing two different configurations with the same seed actually tell us? Each one is a single draw from an underlying distribution with large variance, so we can't simply compare e.g. MAE for the two configurations and say that one is performing better than the other.

What's the thing you want to compare that is made easier by this?

@jemrobinson
Member

My understanding was that it would allow us to compare two model configurations better, e.g. if model config A always has a lower error than model config B when they are run with the same seed then we can be more confident that model config A is a sensible thing to do. You'd still need to run each pairing a number of times with different seeds to get more of an idea of spread (/do model ensembles), but it seems like it allows a more like-for-like comparison of two configurations. Is that logic wrong @jemrobinson?

Yes, if model A always has lower error than model B across N runs, then we can conclude that it's probably performing better. However, I think this applies whether or not you use the same seed.

@marianovitasari20
Contributor Author

My understanding was that it would allow us to compare two model configurations better, e.g. if model config A always has a lower error than model config B when they are run with the same seed then we can be more confident that model config A is a sensible thing to do. You'd still need to run each pairing a number of times with different seeds to get more of an idea of spread (/do model ensembles), but it seems like it allows a more like-for-like comparison of two configurations. Is that logic wrong @jemrobinson?

Yes, if model A always has lower error than model B across N runs, then we can conclude that it's probably performing better. However, I think this applies whether or not you use the same seed.

I understand your point; that’s why we want to run A vs. B with the same seed, and then repeat with different seeds. This still gives multiple runs, but in a fairer way.
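Concretely, the paired comparison I have in mind looks something like the sketch below; the configuration dictionaries and the train_and_evaluate helper are placeholders for illustration, not functions in this repo.

config_a = {"loss": "l1"}            # placeholder configuration A
config_b = {"loss": "weighted_mse"}  # placeholder configuration B

seeds = [0, 1, 2, 3, 4]
results = {"A": [], "B": []}
for seed in seeds:
    for name, config in [("A", config_a), ("B", config_b)]:
        # Both configurations are trained under the same seed in each repeat, so a
        # per-seed difference in the metric reflects the configuration rather than
        # the random draw; the spread across seeds still shows model uncertainty.
        results[name].append(train_and_evaluate(config, seed=seed))  # hypothetical helper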

@marianovitasari20
Contributor Author

Just as an example, I ran these experiments using the same seed (while still in fully deterministic mode, before merging main into this branch) to compare the effect of using L1 loss vs. weighted MSE loss. I still need to run multiple experiments with different seeds before drawing conclusions later, but this is a fairer way.
#258

@jemrobinson
Member

I don't quite see why it's fairer to compare different models with the same seed rather than different seeds. They're going to be drawing a different number of random numbers and using them for different purposes, so I think you're still comparing a single sample from a distribution of trained models against a single sample from a different distribution of trained models.

@marianovitasari20
Contributor Author

Discussed further during the coworking meeting with @IFenton and @jemrobinson today.

@marianovitasari20
Contributor Author

Removing _DeterministicInterpolate, the team agreed not to go with this approach. Keeping a note here for reference: if anyone wants fully deterministic behaviour in future, this function shows one way to do it. It does come with a performance cost, though personally I think it's acceptable for comparison purposes.
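A minimal sketch of the idea for future reference. This is not the exact code that was removed, just the shape of the monkey-patch; it could be used as a context manager around training, e.g. with _DeterministicInterpolate(): trainer.fit(model).

import functools

import torch.nn.functional as F


class _DeterministicInterpolate:
    """Sketch: patch F.interpolate to drop antialias=True for deterministic CUDA backward."""

    def __enter__(self):
        self._original = F.interpolate

        @functools.wraps(self._original)
        def _patched(*args, **kwargs):
            # Remove the antialias flag so the default (no antialiasing) is used;
            # the antialiased backward pass has no deterministic CUDA implementation.
            kwargs.pop("antialias", None)
            return self._original(*args, **kwargs)

        # Rebind the module attribute, so callers using torch.nn.functional.interpolate
        # (or "import torch.nn.functional as F") pick up the patched version.
        F.interpolate = _patched
        return self

    def __exit__(self, exc_type, exc, tb):
        F.interpolate = self._original  # always restore the original implementation
        return False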

@jemrobinson
Member

Removing _DeterministicInterpolate, the team agreed not to go with this approach. Keeping a note here for reference: if anyone wants fully deterministic behaviour in future, this function shows one way to do it. It does come with a performance cost, though personally I think it's acceptable for comparison purposes.

To be clear - performance cost isn't a problem, but removing the "antialiasing" argument means that the behaviour of the model is changed.

@marianovitasari20
Contributor Author

marianovitasari20 commented Apr 30, 2026

Removing _DeterministicInterpolate, the team agreed not to go with this approach. Keeping a note here for reference: if anyone wants fully deterministic behaviour in future, this function shows one way to do it. It does come with a performance cost, though personally I think it's acceptable for comparison purposes.

To be clear - performance cost isn't a problem, but removing the "antialiasing" argument means that the behaviour of the model is changed.

Sorry for the confusion: by "performance cost" I meant model quality, not compute. Stripping antialias changes the interpolation behaviour, which could affect results. So yes, we're agreeing on the same thing.

@marianovitasari20
Contributor Author

I've removed it now.

antialias=True → better signal processing, less aliasing, but nondeterministic
antialias=False → slightly worse interpolation quality, but deterministic

In most ML cases, the difference in quality is negligible compared to reproducibility and stable debugging.
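As an illustration of the trade-off (assuming bilinear downsampling on a CUDA device; not the exact call used in model_service.py):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)

# antialias=True: better signal processing when downsampling (less aliasing),
# but the CUDA backward pass is nondeterministic, so repeated runs with the
# same seed can still diverge.
y_aa = F.interpolate(x, scale_factor=0.5, mode="bilinear", antialias=True)

# antialias=False (the default): slightly worse resizing quality, but compatible
# with torch.use_deterministic_algorithms(True).
y_det = F.interpolate(x, scale_factor=0.5, mode="bilinear", antialias=False)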

After more testing, I’m actually still not fully convinced we should remove this, but we can merge it for now rather than delay things further due to disagreement.

@jemrobinson
Member

Can you add some plots/metrics here for discussion?
