Aurora Air Pollution value explosion and collapse #161

@duncanmartyn

Hi all,

I've been using the air pollution fine-tune of Aurora and found that values in many variables explode and then collapse into NaN after 60 to 80 rollout steps.

For context, I'm researching the model for use in long-term climate applications, so the rollouts I'm running far exceed the model's intended 5-day horizon. I've seen that the paper caveats the air pollution fine-tune in several ways, so perhaps this result isn't unexpected, but I've encountered no such issue with AuroraPretrained over rollouts exceeding one year (1,460 rollout steps).

Starting from the same initial state, explosion and collapse always occurred earlier with simulate_indexing_bug=True than with simulate_indexing_bug=False. The location of collapse also changed: True produced increasingly extreme values over the Himalayas / Tibet, while False produced them over Northern China and South America. In all cases, the same pixels appear to trigger the explosion every time, with gradually more extreme values propagating outwards from them over 10-20 steps to produce the blocky artefacts seen in the images below.

simulate_indexing_bug=True:
Image

simulate_indexing_bug=False:
Image

Thus far, I've found the only solution to be filtering of the 2t surface variable. With a Gaussian or uniform filter applied to 2t in each rollout step and the result assigned to the prediction Batch object's .surf_vars["2t"] attribute, I can run rollouts of arbitrary length.
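
A minimal sketch of the filtering workaround described above, using NumPy only. It is not the code from my runs: `uniform_filter_latlon`, `rollout_with_filter` and `step_fn` are hypothetical names, a plain dict of arrays stands in for the Batch object's `.surf_vars`, and the edge handling (wrap in longitude, replicate in latitude) is an assumption.

```python
import numpy as np


def uniform_filter_latlon(field: np.ndarray, size: int = 3) -> np.ndarray:
    """Box-filter the last two axes (lat, lon) of `field`."""
    if size % 2 != 1:
        raise ValueError("size must be odd")
    pad = size // 2
    lead = [(0, 0)] * (field.ndim - 2)
    # Assumption: replicate edge rows in latitude, wrap around in longitude.
    padded = np.pad(field, lead + [(pad, pad), (0, 0)], mode="edge")
    padded = np.pad(padded, lead + [(0, 0), (pad, pad)], mode="wrap")
    n_lat, n_lon = field.shape[-2:]
    out = np.zeros(field.shape, dtype=float)
    # Sum the size x size shifted views, then divide to get the window mean.
    for di in range(size):
        for dj in range(size):
            out += padded[..., di:di + n_lat, dj:dj + n_lon]
    return out / (size * size)


def rollout_with_filter(step_fn, surf_vars, steps, size=3):
    """Yield surface variables per step, smoothing "2t" before it is fed back.

    `step_fn` is a hypothetical stand-in for one model step: it maps a dict
    of surface variables to the next dict.
    """
    for _ in range(steps):
        surf_vars = dict(step_fn(surf_vars))
        surf_vars["2t"] = uniform_filter_latlon(surf_vars["2t"], size)
        yield surf_vars
```

The key point is that the smoothed 2t replaces the prediction before the next step consumes it, so the filter acts on the fed-back state rather than only on saved output.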

What else I've tried:

  • Using CAMS data from the train and test periods, as well as data outside of these - no effect
  • Replacing CAMS climatic variables with coarsened 0.25° ERA5 data - no effect
  • Clamping 2t to sensible values (record global extremes) - no effect
  • Per-variable re-use of initial state data (i.e. replacing each rollout step's prediction for a given variable with that variable's initial state) - doing this with 2t alone resolved the issue
  • Reducing the window size - produced less variation in predictions between steps and accelerated collapse
  • Changing the timestep - a 6-hour timestep accelerated collapse
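
For reference, the clamping and initial-state re-use experiments from the list above can be sketched as follows. The helper names are hypothetical, a dict of arrays again stands in for `.surf_vars`, and the bounds are assumed values in kelvin near the recorded global extremes rather than the exact numbers from my runs.

```python
import numpy as np

# Assumed bounds in kelvin, roughly the recorded global extremes
# (about -89.2 degC and 56.7 degC).
T2M_MIN_K = 184.0
T2M_MAX_K = 330.0


def clamp_2t(surf_vars: dict) -> dict:
    """Return a copy of `surf_vars` with "2t" clipped to the assumed bounds."""
    out = dict(surf_vars)
    out["2t"] = np.clip(out["2t"], T2M_MIN_K, T2M_MAX_K)
    return out


def reuse_initial(surf_vars: dict, initial: dict, names=("2t",)) -> dict:
    """Return a copy of `surf_vars` with the named variables reset to `initial`."""
    out = dict(surf_vars)
    for name in names:
        out[name] = initial[name].copy()
    return out
```

Either helper would be applied to each step's prediction before it is fed back into the model. Note that `np.clip` leaves NaN values as NaN, so clamping cannot repair a field that has already collapsed.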

Just looking for your perspectives or thoughts on this, thanks!
