256 fix seed #261
Conversation
Force-pushed from dc512e0 to 5e221b0
I’ve merged the latest changes from main and accepted the main version due to conflicts. Since the function I modified in this branch was changed or deleted in main, I now need to reapply my own changes, so I’m working on that again.
Coverage report (generated by python-coverage-comment-action; table omitted)
It’s becoming nondeterministic again after merging.
jemrobinson left a comment:
I'm not totally convinced that this is a thing we want to be doing.
If we believe that the real performance of a model training has quite a wide distribution of performance (i.e. a large spread of values across any particular metric) then what do we achieve if we manage to make this deterministic? If we want to compare "model configuration A" to "model configuration B" we would still need to run each of these 10-20 times with different seeds in order to convince ourselves that we understand the model uncertainty.
Would we be better working on running model ensembles instead?
class _DeterministicInterpolate:
    """Monkey-patch F.interpolate to strip antialias=True for deterministic CUDA backward."""
I'm not sure this is something we want to do. We previously needed the antialias argument to avoid artifacts when resizing.
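For context, the mechanism such a patch relies on can be sketched without torch at all. The sketch below is an illustrative stand-in, not the branch's actual implementation: a context manager that temporarily replaces `module.attr` with a wrapper dropping one keyword argument, which is how a `_DeterministicInterpolate`-style patch would strip `antialias=True` from `F.interpolate` calls (the antialiased backward pass has no deterministic CUDA implementation). The class name and the dummy module are assumptions for illustration.

```python
import functools
import types


class StripKwargPatch:
    """Temporarily monkey-patch module.attr so calls drop one keyword argument.

    Illustrative sketch only: a _DeterministicInterpolate-style patch would
    target torch.nn.functional.interpolate and the 'antialias' kwarg.
    """

    def __init__(self, module, attr, kwarg):
        self.module, self.attr, self.kwarg = module, attr, kwarg

    def __enter__(self):
        self._original = getattr(self.module, self.attr)

        @functools.wraps(self._original)
        def patched(*args, **kwargs):
            kwargs.pop(self.kwarg, None)  # silently drop the offending kwarg
            return self._original(*args, **kwargs)

        setattr(self.module, self.attr, patched)
        return self

    def __exit__(self, *exc):
        # Always restore the original, even if the body raised.
        setattr(self.module, self.attr, self._original)
        return False


# Usage with a dummy module standing in for torch.nn.functional:
F = types.SimpleNamespace(interpolate=lambda x, **kw: sorted(kw))
with StripKwargPatch(F, "interpolate", "antialias"):
    print(F.interpolate([1, 2], antialias=True, mode="bilinear"))
```

Restoring the original in `__exit__` keeps the patch scoped to the deterministic run only, so normal runs keep the antialiased behaviour.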
seed_everything(int(seed), workers=True)
seed = int(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
Reference https://docs.nvidia.com/cuda/cublas/ and explain what this means
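For reference: per the cuBLAS documentation, `CUBLAS_WORKSPACE_CONFIG` takes the form `:[SIZE]:[COUNT]`, so `:4096:8` restricts cuBLAS (CUDA 10.2+) to a fixed set of 8 workspace buffers of 4096 KiB each per stream, making reduction order reproducible at some cost in GPU memory; PyTorch's `torch.use_deterministic_algorithms(True)` requires it (or `:16:8`) to be set before the first cuBLAS call. A minimal sketch of the setup, with the helper name and call order as illustrative assumptions rather than the branch's exact code:

```python
import os


def configure_deterministic_run(seed: int) -> int:
    """Sketch of a deterministic-run setup (function name is illustrative).

    Must run before any CUDA/cuBLAS initialisation, because cuBLAS reads
    CUBLAS_WORKSPACE_CONFIG once at startup.
    """
    seed = int(seed)
    # Stable hash randomisation; note this only affects *subprocesses*,
    # since the current interpreter's hash seed is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Fixed cuBLAS workspace (8 buffers x 4096 KiB), required by
    # torch.use_deterministic_algorithms(True) on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # In the branch this is followed by (not run in this sketch):
    #   seed_everything(seed, workers=True)        # Lightning helper
    #   torch.use_deterministic_algorithms(True)
    return seed


configure_deterministic_run(42)
```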
My understanding was that it would allow us to compare two model configurations better, e.g. if model config A always has a lower error than model config B when they are run with the same seed then we can be more confident that model config A is a sensible thing to do. You'd still need to run each pairing a number of times with different seeds to get more of an idea of spread (/do model ensembles), but it seems like it allows a more like-for-like comparison of two configurations. Is that logic wrong @jemrobinson?
I think these are two separate things, @jemrobinson. First, we want to investigate fairly why a particular method behaves the way it does, and for that we need a controlled comparison to draw meaningful conclusions. Deterministic mode with a fixed seed is just a tool for that; it doesn't remove uncertainty from the model itself. Yes, I’m aware of the effect of removing antialiasing. I thought we could still use this to compare runs in a fully deterministic mode, and we wouldn't need a seed under normal settings. I take your point that ensembles could also be used for uncertainty quantification, and that they can be applied under normal settings without seeds later. However, I’m still inclined to keep the deterministic approach, as it makes it easier to compare different configurations, such as methods or loss functions. The purpose is to ensure a fair comparison. We will revert to running without a seed once we have decided which methods (e.g., loss functions or approaches) are better.
Hi @marianovitasari20. I guess my underlying point is this: what does comparing two different configurations with the same seed actually tell us? Each one is a single draw from an underlying distribution with large variance, so we can't simply compare e.g. MAE for the two configurations and say that one is performing better than the other. What's the thing you want to compare that is made easier by this?
Yes, if model A always has lower error than model B across N runs, then we can conclude that it's probably performing better. However, I think this applies whether or not you use the same seed.
I understand your point; that’s why we want to run A vs. B with the same seed, and then repeat with different seeds. This still gives multiple runs, but in a fairer way.
Just as an example, I ran these experiments using the same seed (while still in fully deterministic mode, before merging main into this branch) to compare the effect of using L1 loss vs. weighted MSE loss. I still need to run multiple experiments with different seeds before drawing conclusions, but this is a fairer way to compare.
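The "same seed for A and B, repeated over several seeds" protocol amounts to a paired design. Below is a toy sketch (the run functions and offsets are invented stand-ins for real training runs, not anyone's actual results) of what pairing buys when the seed-driven noise is shared between configurations. Note the caveat raised in this thread: real models consume randomness in different amounts and for different purposes per configuration, so the cancellation is rarely this clean.

```python
import random
import statistics


def paired_seed_comparison(run_a, run_b, seeds):
    """Mean and stdev of per-seed metric differences (lower metric is better).

    Pairing by seed cancels whatever seed-to-seed variance the two
    configurations share, so the differences d_s = A(s) - B(s) can have far
    lower variance than comparing two independently seeded sets of runs.
    """
    diffs = [run_a(s) - run_b(s) for s in seeds]
    return statistics.mean(diffs), statistics.stdev(diffs)


def make_toy_run(offset):
    """Toy 'training run': metric = config effect + seed-driven noise."""
    def run(seed):
        # Idealisation: both configs draw identical noise from the same seed.
        # Real runs make different numbers of random draws, so in practice
        # the cancellation is only partial.
        return offset + random.Random(seed).gauss(0.0, 1.0)
    return run


l1_run, wmse_run = make_toy_run(1.0), make_toy_run(1.2)  # invented offsets
mean_d, sd_d = paired_seed_comparison(l1_run, wmse_run, seeds=range(10))
print(mean_d, sd_d)  # paired differences isolate the small config effect
```

In this idealised case the seed noise cancels exactly and the config effect is recovered from a handful of runs; the open question in the thread is how much of that cancellation survives when the two configurations diverge in how they use the random stream.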
I don't quite see why it's fairer to compare different models with the same seed rather than different seeds. They're going to be drawing a different number of random numbers and using them for different purposes, so I think you're still comparing a single sample from a distribution of trained models against a single sample from a different distribution of trained models.
Discussed further during the coworking meeting with @IFenton and @jemrobinson today.
Removing _DeterministicInterpolate, as the team agreed not to go with this approach. Keeping a note here for reference: if anyone wants fully deterministic behaviour in future, this function shows one way to do it. It does come with a performance cost, though personally I think it's acceptable for comparison purposes.
To be clear - performance cost isn't a problem, but removing the "antialiasing" argument means that the behaviour of the model is changed. |
Sorry for the confusion, by "performance cost" I meant model quality, not compute. Stripping antialias changes the interpolation behaviour which could affect results. So yes, we're agreeing on the same thing. |
I've removed it now. antialias=True gives better signal processing (less aliasing) but is nondeterministic; in most ML cases the difference in quality is negligible compared to the value of reproducibility and stable debugging. After more testing I’m actually still not fully convinced we should remove this, but we can merge it for now rather than delay things further over the disagreement.
Can you add some plots/metrics here for discussion? |