-
Notifications
You must be signed in to change notification settings - Fork 32
Description
Consider the common pattern for using interface.get_metadata():
metadata = interface.get_metadata() # source metadata plus defaults
# user can modify metadata here
interface.add_to_nwbfile(nwbfile, metadata=metadata) # this would fail if we did not have defaultsCurrently, this method returns both the metadata extracted from the source format and the defaults required to ensure that add_to_nwbfile does not fail. This is convenient because “the code just works.” However, this convenience has a downside: by adding defaults, we hide from users the fact that some required metadata fields are missing and have been filled with placeholders (see #1525).
After the refactor in #1511, this convenience is no longer necessary, since defaults are now added explicitly and locally when constructing the neurodata type objects. We could therefore make get_metadata() a fully provenance-based method that returns only what is available in the source format. This approach is safer, as it avoids fabricating information and reduces the risk of missing metadata due to omission.
However, returning more complete metadata also has advantages. It gives users a kind of template they can easily modify. It is more intuitive to see a field like emission_lambda and update it than to see only the source metadata and later discover that other required fields are missing (often only after add_to_nwbfile fails).
This leads to a design decision about which role get_metadata() should play. Ordered from most provenance-faithful to most convenient, the options are:
- Option 1 (Provenance): Return only metadata extractable from the source format, without any defaults. This clearly reflects what came from the source files and nothing else.
- Option 2 (Current behavior): Return source metadata merged with defaults for required values to prevent failures. This reflects what will actually be written to the NWB file.
- Option 3 (Discovery): Return all possible NWB fields for the modality as a complete schema template. This maximizes discoverability by showing every field users could provide.
I believe Option 1 (Provenance) is the right choice. The convenience of Options 2 and 3 is no longer necessary after #1511, and for a library focused on data conversion and provenance, we should avoid fabricating metadata whenever possible. A provenance-first approach ensures users know exactly what came from their data versus what they need to supply explicitly, following the principle that explicit is better than implicit.
To make it easier for the users, we could either provide modality-specific helpers such as get_ophys_metadata() and get_ecephys_metadata() that return full metadata templates for users to complete, or include a documentation page offering downloadable YAML templates for each modality that serve the same purpose. These approaches preserve the provenance-first philosophy while giving users clear, explicit tools for building complete metadata (see discussion in [PR #1448] (#1448 (comment)))