
Conversation

@NickGeneva
Collaborator

Earth2Studio Pull Request

Description

A PR to discuss the proposal of adding a new type of core component/model to the package: an analysis model.
A markdown file in this PR discusses the general idea, but the more immediate use would be to better support things like TC trackers and other non-AI processes that can be executed in pipelines and modularized, but don't really fit the mold of producing ND arrays.

Related PR: #485

Hopefully this could also open the package up to supporting in-pipeline impact models, or at least provide easy-to-use APIs to leverage them.

More information to be added.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Analysis models do not need to inherit this protocol; it is simply used to define the
required APIs.
The key distinguishing feature is that the `__call__` method returns `None`, making
these models inference sinks rather than data sources or transformers.
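A minimal sketch of what such a protocol might look like, using `typing.Protocol`; the class names and call signatures here are assumptions for illustration, not the actual Earth2Studio API.

```python
# Hypothetical sketch of an analysis-model protocol; names and signatures
# are illustrative, not the Earth2Studio implementation.
from collections import OrderedDict
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class AnalysisModel(Protocol):
    """A pipeline endpoint: consumes data and returns nothing."""

    def __call__(self, x: np.ndarray, coords: OrderedDict) -> None:
        """Consume `x` and its coordinate system; no data is returned."""
        ...


class SaveToDisk:
    """Toy analysis model that satisfies the protocol structurally."""

    def __call__(self, x: np.ndarray, coords: OrderedDict) -> None:
        np.save("output.npy", x)  # pure sink: write out, return nothing
```

Because the protocol is `runtime_checkable`, structural conformance can be checked with `isinstance` without requiring inheritance, matching the "do not need to inherit" note above.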
Collaborator

I guess my one complaint about this being a requirement is that in the more interactive/notebook setting, it's nicer to return some analysis "result" directly rather than dumping it to disk and then re-reading it back in later. But I'm not sure if there is a great way to handle arbitrary return types.

Collaborator

e.g., taking the two use cases you mentioned, the "validate" one is easily handled: you could just have a console log/print statement show the result. But for something like a TC track/location, it'd be nice to be able to do something like `tracks = TCTracks(c, coords)` and have `tracks` contain the identified TC tracks. Obviously this is more of a nit/preference thing, though; there are pretty easy workarounds.
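The interactive pattern described here might look like the following sketch, where a hypothetical `TCTracks` analysis model returns its result in-memory rather than writing to disk. The class name and the toy "detection" logic are illustrative assumptions, not the actual Earth2Studio API.

```python
# Hypothetical TCTracks-style model that returns its analysis result
# directly, for interactive/notebook use. All names are illustrative.
from collections import OrderedDict

import numpy as np


class TCTracks:
    def __call__(self, x: np.ndarray, coords: OrderedDict) -> list[dict]:
        # Stand-in "detection": report the grid-point maximum per lead time
        tracks = []
        for i, lead in enumerate(coords.get("lead_time", range(x.shape[0]))):
            j = int(np.argmax(x[i]))
            tracks.append({"lead_time": lead, "flat_index": j})
        return tracks


x = np.random.rand(4, 32, 64)
coords = OrderedDict(lead_time=[0, 6, 12, 18], lat=range(32), lon=range(64))
tracks = TCTracks()(x, coords)  # result is available in-memory, no disk I/O
```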

Collaborator

I might take the stronger view that it is not appropriate to have a "model" that returns None; this is really a signal that something else is wrong with our approach to these types of models, and the None return is just a band-aid. I also don't really know how this generalizes beyond TC tracks/locations, which I would argue are fine to organize as ND arrays anyway. Perhaps we need to better understand this class of models and what an appropriate output type might be (a dataframe, for example).

Collaborator

I would be in favor of having the model either return the input data (passthrough) or the analysis result (e.g., pd.DataFrame). If the model returns the input data, one could do something like:

for model in pipeline:
    x, coords = model(x, coords)

without having to check the model type (AnalysisModel or not). I believe returning some actual output data (e.g., pd.DataFrame) would be preferable, though. Returning only None does not seem very convenient in any way. Since, according to the current description, an AnalysisModel is a designated end-of-pipeline element, I don't think the return type would be a major issue: the returned data does not have to be compatible with any downstream pipeline elements.
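The passthrough option above can be sketched as follows, assuming hypothetical stand-in models (`Scale`, `PrintMean`) rather than actual Earth2Studio components: the analysis model returns its input unchanged, so the generic pipeline loop needs no type check.

```python
# Sketch of the passthrough idea: an analysis model returns its input
# unchanged, so the loop treats every model uniformly. Names are
# illustrative, not the Earth2Studio API.
from collections import OrderedDict

import numpy as np


class Scale:
    """Stand-in transform model."""

    def __call__(self, x, coords):
        return x * 2.0, coords


class PrintMean:
    """Stand-in analysis model: logs a statistic, passes data through."""

    def __call__(self, x, coords):
        print(f"mean = {x.mean():.3f}")  # side effect only
        return x, coords  # passthrough keeps the loop uniform


x = np.ones((2, 2))
coords = OrderedDict(lat=[0, 1], lon=[0, 1])
pipeline = [Scale(), PrintMean(), Scale()]
for model in pipeline:
    x, coords = model(x, coords)  # no isinstance(model, AnalysisModel) check
```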

Collaborator

Being generally open regarding the return type would already help a lot, especially for cases in which the shape of the returned data is not known a priori.
Object tracking is an obvious example: right now the pressing issue is around TC trackers, but with more regional and high-resolution models, cloud tracking could become interesting at some point as well. And I'm sure there are other occasions that are not on my radar.

However, to make the situation even more complicated, analysis models might be lightweight enough that they can run on the host while the GPU continues with the forecast.
This is now the case with the TempestExtremes implementation: in that mode it can run on top of the inference with effectively zero overhead, compressing PBs of data into MBs, but the result is returned at an arbitrary point in time, making it hard to catch in an inference loop. What could be returned is either the Future (as is done now) or the input data, as Stefan suggested. That would at least avoid returning None, but it's not especially helpful for the remainder of the inference.
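The Future-returning pattern might be sketched like this, with a thread-pool worker standing in for the host-side tracker; the class name and "tracking" logic are illustrative assumptions, not the TempestExtremes implementation.

```python
# Sketch of host-side analysis overlapping the forecast loop: the model
# submits work to a worker and returns a Future instead of None.
# Names and logic are illustrative.
from concurrent.futures import Future, ThreadPoolExecutor

import numpy as np

executor = ThreadPoolExecutor(max_workers=1)


class AsyncTracker:
    def __call__(self, x: np.ndarray, coords) -> Future:
        # Copy to host memory, then submit; the forecast loop is not blocked
        return executor.submit(self._track, np.asarray(x).copy())

    @staticmethod
    def _track(x: np.ndarray) -> dict:
        return {"max": float(x.max())}  # stand-in for the real tracker


tracker = AsyncTracker()
futures = [tracker(np.random.rand(8, 8), None) for _ in range(3)]
results = [f.result() for f in futures]  # collect at end of inference
executor.shutdown(wait=True)
```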

Overall, data sinks could also nudge users towards plugging their downstream models directly into the pipeline, in cases where that makes sense from a compute-resources perspective.

Analysis models in Earth2Studio provide a set of models that serve as inference sinks
and pipeline endpoints.
Unlike {ref}`diagnostic_model_userguide` and {ref}`prognostic_model_userguide`, which
transform and return data, analysis models consume data without returning modified
Collaborator

The "modified" does not make clear whether the model returns anything at all.


5 participants