Hi all,
I've been using the air pollution fine-tune of Aurora and found that values in many variables explode and then collapse to NaN after roughly 60 to 80 rollout steps.
For context, I'm researching the model for use in long-term climate applications, so the rollouts I'm running far exceed the model's intended 5-day horizon. I've seen that the paper caveats the air pollution fine-tune in several ways, so perhaps this result isn't unexpected, but I've encountered no such issue with AuroraPretrained over rollouts exceeding one year (1,460 rollout steps).
Starting from the same initial state, explosion and collapse always occurred earlier with simulate_indexing_bug set to True than with it set to False. The location of collapse also changed: with True, increasingly extreme values appeared over the Himalayas / Tibet, and with False, over Northern China and South America. For a given setting, the same pixels appear to trigger the explosion every time, with gradually more extreme values propagating outwards from them over 10-20 steps to produce the blocky artefacts seen in the below images.
Thus far, I've found the only solution to be filtering of the 2t surface variable. With a Gaussian or uniform filter applied to 2t in each rollout step and the result assigned to the prediction Batch object's .surf_vars["2t"] attribute, I can run rollouts of arbitrary length.
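For anyone wanting to reproduce the workaround, here is a minimal sketch of the per-step filtering idea using plain NumPy on a 2D (lat, lon) array. It is not my exact code and not Aurora's API: step_fn and rollout_with_filter are hypothetical stand-ins for one model step and the rollout loop, the state is a plain dict rather than a Batch, and the 3x3 window is an arbitrary choice. With the real model the filtered field would be written back to the prediction's .surf_vars["2t"] instead.

```python
import numpy as np


def uniform_filter_2d(field, size=3):
    """Box-filter a 2D (lat, lon) field.

    Longitude is treated as periodic (wrap padding); latitude rows are
    repeated at the poles (edge padding). `size` must be odd.
    """
    if size % 2 != 1:
        raise ValueError("size must be odd")
    r = size // 2
    padded = np.pad(field, ((r, r), (0, 0)), mode="edge")   # latitude
    padded = np.pad(padded, ((0, 0), (r, r)), mode="wrap")  # longitude
    h, w = field.shape
    out = np.zeros_like(field, dtype=float)
    # Sum the size*size shifted copies of the field, then average.
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + h, dj:dj + w]
    return out / size**2


def rollout_with_filter(step_fn, state, steps, var="2t", size=3):
    """Hypothetical rollout loop: smooth `var` after every step.

    `step_fn` stands in for one model step and maps a dict of 2D arrays
    to a new dict of 2D arrays. The filtered field is fed into the next step.
    """
    for _ in range(steps):
        state = dict(step_fn(state))
        state[var] = uniform_filter_2d(state[var], size)
        yield state
```

The key point is that the smoothed 2t is what gets passed into the next step, so isolated extreme pixels are damped before they can feed back.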
What else I've tried:
- Used CAMS data from the train and test periods, as well as data outside of these - no effect
- Replaced CAMS climatic variables with coarsened ERA5 0.25° data - no effect
- Clamped 2t to sensible values (record global extremes) - no effect
- Re-used initial state data per variable (i.e. replaced each rollout step's prediction for a given variable with that variable's initial state) - doing this with 2t alone resolved the issue
- Reduced window size - produced less variation in predictions between steps and accelerated collapse
- Changed timestep - 6 hour timestep accelerated collapse
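To make the clamping and initial-state re-use experiments above concrete, here is a rough sketch on a dict of NumPy arrays. The helper names are hypothetical, the state is a plain dict rather than a Batch, and the Kelvin bounds are approximate values for the record surface temperature extremes, included only for illustration.

```python
import numpy as np

# Approximate record surface air temperatures in Kelvin (illustrative only).
T2M_MIN_K = 183.95  # about -89.2 degrees C
T2M_MAX_K = 329.85  # about 56.7 degrees C


def clamp_variable(state, var, lo, hi):
    """Return a copy of `state` with `var` clipped to [lo, hi]."""
    out = dict(state)
    out[var] = np.clip(state[var], lo, hi)
    return out


def reset_to_initial(state, initial, var):
    """Return a copy of `state` with `var` replaced by its initial-state field."""
    out = dict(state)
    out[var] = initial[var].copy()
    return out
```

Either helper would be applied to the prediction after each rollout step, before it is fed back in as the next input.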
Just looking for your perspectives or thoughts on this, thanks!

