Description
This code is focused on a specific part of the workflow folks may need to do -- but we are also providing tools and utilities for other bits. So I think it's helpful to document the suggested workflow, and that will also help us determine where to put code.
My first draft:
Goal:
Starting Point:
User has a set of data that can be loaded into xarray: could be files on disk, or files on AWS, or a Kerchunked zarr dataset, or ....
User needs a subset of that data:
- Restricted to:
- a polygon in space
- particular time frame
- either a single vertical layer or all vertical layers (proper vertical subsetting can wait ...)
- only the variables they need.
Outcome:
An xarray Dataset all ready to save to netcdf, or .....
That Dataset contains only what the user wants -- and is as similar to the original as possible, e.g. same names for all variables, maybe some additional metadata.
Workflow:
Step 1:
User does any pre-processing required to get their data into a single, conforming dataset.
In many cases, there's nothing to be done, but in some cases, there may be work to do:
- The grid and data variables are in multiple files, and need to be combined into one dataset
- There are "troublesome" variables that need fixing -- e.g. time coordinates that aren't correct, etc.
As a rule, this will be model specific, maybe even implementation-of-model specific.
This package can't provide all of that, but it can (and should) provide a few examples for common cases.
e.g. SCHISM (STOFS), maybe FVCOM fixing the time variable (some use single precision float days :-()
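To make the two cases above concrete, here is a minimal sketch using plain xarray and numpy. Everything specific in it is made up for illustration: the variable names (`lon`, `lat`, `zeta`), the in-memory stand-ins for a grid file and a data file, and the reference date for the float-days time variable. A real model would need its own version of this.

```python
import numpy as np
import xarray as xr

# Stand-ins for a separate grid file and data file (hypothetical contents).
grid = xr.Dataset(
    {
        "lon": ("node", np.array([-70.0, -69.5, -69.0])),
        "lat": ("node", np.array([41.0, 41.5, 42.0])),
    }
)
data = xr.Dataset(
    {"zeta": (("time", "node"), np.zeros((2, 3)))},
    # Time stored as single precision float days since some reference date.
    coords={"time": ("time", np.array([0.0, 0.5], dtype=np.float32))},
)

# Combine the grid and data variables into one dataset.
combined = xr.merge([grid, data])

# Convert float days to datetime64. The reference date is an assumption here;
# rounding to whole seconds hides single precision noise, which assumes the
# model output times really do fall on whole seconds.
reference = np.datetime64("2020-01-01T00:00:00", "ns")
days = combined["time"].values.astype("float64")
seconds = np.round(days * 86400.0).astype("int64")
fixed_time = reference + seconds.astype("timedelta64[s]")
combined = combined.assign_coords(time=("time", fixed_time))
```

With real files the two stand-in datasets would come from something like `xr.open_dataset`, but the merge and time fix would look much the same.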
Step 2:
The user processes the Dataset to make it CF compliant (or enough so that the subsetting code can work)
This package will contain utilities to do that, e.g.
ugrid.assign_ugrid_topology()
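I'm not reproducing the actual signature of `ugrid.assign_ugrid_topology()` here. As a rough idea of what such a utility has to accomplish, this is a hand-rolled sketch that attaches UGRID-style mesh topology metadata to a tiny triangular mesh. The variable names (`lon`, `lat`, `nv`, `mesh`) are assumptions for the example.

```python
import numpy as np
import xarray as xr

# A tiny mesh: four nodes and two triangles (hypothetical variable names).
ds = xr.Dataset(
    {
        "lon": ("node", np.array([0.0, 1.0, 1.0, 0.0])),
        "lat": ("node", np.array([0.0, 0.0, 1.0, 1.0])),
        # Zero-based node indices for each face.
        "nv": (("face", "vertex"), np.array([[0, 1, 2], [0, 2, 3]])),
    }
)

# UGRID describes the mesh with a scalar "mesh topology" variable whose
# attributes point at the coordinate and connectivity variables by name.
mesh_attrs = {
    "cf_role": "mesh_topology",
    "topology_dimension": 2,
    "node_coordinates": "lon lat",
    "face_node_connectivity": "nv",
}
ds["mesh"] = xr.DataArray(np.int32(0), attrs=mesh_attrs)

# Mark the connectivity array and record that its indices start at zero.
ds["nv"].attrs.update({"cf_role": "face_node_connectivity", "start_index": 0})
```

Once the dataset carries this metadata, subsetting code can find the grid by looking for the `cf_role` attributes rather than guessing at model-specific names.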
Step 3:
The Dataset can be queried by the user to find out what they need to know in order to specify a subset:
- what variables are in the dataset
- what timespan is covered
- what region is covered (maybe?)
- whether it's 2D or 3D ?
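The questions above could be answered by a small helper along these lines. This is only a sketch: `describe` is a hypothetical function, not something the package is known to provide, and the coordinate names (`time`, `lon`, `lat`) and the vertical dimension name (`siglay`) are assumptions.

```python
import numpy as np
import xarray as xr

# Small example dataset with one 2D and one 3D variable (made-up names).
ds = xr.Dataset(
    {
        "zeta": (("time", "node"), np.zeros((3, 4))),
        "temp": (("time", "siglay", "node"), np.zeros((3, 2, 4))),
    },
    coords={
        "time": np.array(
            ["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]"
        ),
        "lon": ("node", np.array([-70.0, -69.5, -69.0, -68.5])),
        "lat": ("node", np.array([41.0, 41.5, 42.0, 42.5])),
    },
)


def describe(ds, vertical_dims=("siglay",)):
    """Summarize what a user needs to know before asking for a subset."""
    times = ds["time"].values
    return {
        "variables": sorted(ds.data_vars),
        "timespan": (times.min(), times.max()),
        # Bounding box as (min lon, min lat, max lon, max lat).
        "bounds": (
            float(ds["lon"].min()),
            float(ds["lat"].min()),
            float(ds["lon"].max()),
            float(ds["lat"].max()),
        ),
        # Treated as 3D if any known vertical dimension is present.
        "is_3d": any(dim in ds.sizes for dim in vertical_dims),
    }


info = describe(ds)
```

A bounding box is only a rough answer to "what region is covered", but it is cheap to compute and probably enough for picking a polygon.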
Step 4:
The user makes a request for a subset.
Result -- a subset Dataset.
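To show what such a request boils down to, here is a sketch in plain xarray and numpy that restricts a dataset by variable, time frame, and polygon. It is not the package's subsetting code: `points_in_polygon` is a hypothetical helper, the variable names are made up, and it only keeps nodes inside the polygon without touching face connectivity, which a real unstructured-grid subset would also have to re-index.

```python
import numpy as np
import xarray as xr


def points_in_polygon(x, y, polygon):
    """Even-odd ray casting test; polygon is a list of (x, y) vertices."""
    inside = np.zeros(x.shape, dtype=bool)
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Edges that straddle the horizontal line through each point.
        crosses = (y0 > y) != (y1 > y)
        # Horizontal edges divide by zero here, but never count as crossing.
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (x < x_cross)
    return inside


ds = xr.Dataset(
    {
        "zeta": (("time", "node"), np.arange(12.0).reshape(3, 4)),
        "salt": (("time", "node"), np.zeros((3, 4))),
    },
    coords={
        "time": np.array(
            ["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]"
        ),
        "lon": ("node", np.array([0.5, 1.5, 2.5, 3.5])),
        "lat": ("node", np.array([0.5, 0.5, 0.5, 0.5])),
    },
)

# The request: one variable, two days, and a square polygon.
polygon = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
mask = points_in_polygon(ds["lon"].values, ds["lat"].values, polygon)

subset = (
    ds[["zeta"]]
    .sel(time=slice("2020-01-02", "2020-01-03"))
    .isel(node=np.flatnonzero(mask))
)
```

The result keeps the original variable and coordinate names, which matches the goal of staying as close to the source dataset as possible.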