Analysis model proposal #494
Conversation
> Analysis models do not need to inherit this protocol, this is simply used to define the
> required APIs.
> The key distinguishing feature is that the `__call__` method returns `None`, making
> these models inference sinks rather than data sources or transformers.
I guess my one complaint about this being a requirement is that in the more interactive/notebook setting, it's nicer to return some analysis "result" directly rather than dumping it to disk and then re-reading it back in later. But I'm not sure if there is a great way to handle arbitrary return types.
e.g., taking the two use cases you mentioned: the "validate" one is easily handled, since a console log/print statement could show the result. But for something like TC track/location, it would be nice to be able to write `tracks = TCTracks(c, coords)` and have `tracks` contain the identified TC tracks. Obviously this is more of a nit/preference thing though, and there are pretty easy workarounds.
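The interactive usage described here could look roughly like the following. The tracker class and its local-maximum "tracking" are stand-ins, not the real TC-tracking algorithm:

```python
class TCTracker:
    """Hypothetical analysis model that returns its result directly
    instead of writing it to disk."""

    def __call__(self, x, coords):
        # Stand-in for real tracking: indices of local maxima as "track points".
        return [
            i
            for i in range(1, len(x) - 1)
            if x[i] > x[i - 1] and x[i] > x[i + 1]
        ]


tracks = TCTracker()([0.1, 0.8, 0.3, 0.9, 0.2], {"lat": [], "lon": []})
print(tracks)  # [1, 3]
```

The point is only the calling convention: the result is a plain in-memory object the notebook user can inspect immediately.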
I might take the stronger view that it is not appropriate to have a "model" that returns `None`; this is really a signal that something else is wrong with our approach to these types of models, and the current design is just a band-aid. I also don't really know how this generalizes beyond TC tracks/locations, which I would argue are fine to organize as ND arrays anyway. Perhaps we need to better understand this class of models and what an appropriate output type might be (a dataframe, for example).
I would be in favor of having the model either return the input data (passthrough) or the analysis result (e.g., `pd.DataFrame`). If the model returns the input data, one could do something like:

```python
for model in pipeline:
    x, coords = model(x, coords)
```

without having to check the model type (`AnalysisModel` or not). I believe returning some actual output data (e.g., `pd.DataFrame`) would be preferable, though. Only returning `None` does not seem to be very convenient in any way. Since, according to the current description, `AnalysisModel` is a designated "end-of-pipeline", I don't think the return type would be a major issue: the returned data does not have to be compatible with any downstream pipeline elements.
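A passthrough sink along these lines might look as follows (class name and the "analysis" itself are illustrative): it records a side-effect result for later inspection but returns `(x, coords)` unchanged, so the generic pipeline loop above keeps working without type checks.

```python
class PassthroughAnalysis:
    """Hypothetical analysis step that is transparent to the pipeline."""

    def __init__(self):
        self.summaries = []

    def __call__(self, x, coords):
        # Side-effect: store a summary statistic (here, a simple mean).
        self.summaries.append(sum(x) / len(x))
        # Passthrough: downstream models see the data untouched.
        return x, coords


pipeline = [PassthroughAnalysis()]
x, coords = [1.0, 2.0, 3.0], {"time": [0]}
for model in pipeline:
    x, coords = model(x, coords)

print(x)  # unchanged: [1.0, 2.0, 3.0]
print(pipeline[0].summaries)  # [2.0]
```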
Being generally open regarding the return type would already help a lot, especially for cases in which the shape of the returned data is not known a priori.
Object tracking would be an obvious example here. Right now the pressing issue is around TC trackers, but I could imagine that, with more regional and high-res models, cloud tracking could become interesting as well at some point. And I'm sure there are other occasions which are not on my radar.
However, to make the situation even more complicated, analysis models might be lightweight enough to run on the host while the GPU continues with the forecast.
This is now the case with the TempestExtremes implementation. In that mode it can run on top of the inference with effectively zero overhead, compressing PBs of data into MBs, but the result is returned at an arbitrary point in time, making it hard to catch in an inference loop. What could be returned is either the Future (as is done now) or the input data, as Stefan suggested. That would at least avoid returning `None`, but it's not especially helpful for the remainder of the inference.
Overall, data sinks could also nudge users towards plugging their downstream models directly into the pipeline, when it makes sense from a compute-resources perspective.
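The Future-returning pattern described above can be sketched with the standard library's `concurrent.futures`. This is only an illustration of the calling convention, not the actual TempestExtremes integration, and `AsyncTracker` is a hypothetical name:

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncTracker:
    """Hypothetical host-side analysis model: submits work to a thread pool
    so the main (GPU) loop can continue, and returns a Future immediately."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _analyze(self, x):
        # Stand-in for an expensive tracking routine running on the host.
        return max(x)

    def __call__(self, x, coords):
        return self._pool.submit(self._analyze, x)


tracker = AsyncTracker()
future = tracker([0.2, 0.9, 0.4], {"time": [0]})
# ... the inference loop continues here; collect the result at the end:
print(future.result())  # 0.9
```

The awkward part the comment points out is visible here: the caller must hold on to the `Future` and decide when to block on `.result()`, which does not fit naturally into a loop that expects `(x, coords)` back from every model.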
> Analysis models in Earth2Studio provide a set of models that serve as inference sinks
> and pipeline endpoints.
> Unlike {ref}`diagnostic_model_userguide` and {ref}`prognostic_model_userguide`, which
> transform and return data, analysis models consume data without returning modified
The "modified" does not make clear whether the model returns anything at all.
Earth2Studio Pull Request
Description
A PR to discuss the proposal of adding a new type of core component / model to the package, an analysis model.
There is a markdown file that discusses the general idea, but the more immediate usage would be to better support things like TC trackers / non-AI processes that can be executed in pipelines and modularized, but don't really fit the mold of producing ND arrays.
Related PR: #485
Hopefully this could also open the package up to supporting in-pipeline impact models, or at least provide easy-to-use APIs to leverage them.
More information to be added.
Checklist
Dependencies