Provenance vs convenience: What should  `get_metadata()` return?

Consider the common pattern for using `interface.get_metadata()`:

```python
metadata = interface.get_metadata()  # source metadata plus defaults
# user can modify metadata here
interface.add_to_nwbfile(nwbfile, metadata=metadata) # this would fail if we did not have defaults
```

Currently, this method returns **both** the metadata extracted from the source format and the defaults required to ensure that `add_to_nwbfile` does not fail. This is convenient because “the code just works.” However, this convenience has a downside: by adding defaults, we hide from users the fact that some required metadata fields are missing and have been filled with placeholders (see #1525).

After the refactor in #1511, this convenience is no longer necessary, since defaults are now added [explicitly and locally](https://github.com/catalystneuro/neuroconv/blob/6f10c989000e14e4016cbec1dceef1316776304e/src/neuroconv/tools/roiextractors/roiextractors.py#L286-L290) when constructing the neurodata type objects. We could therefore make `get_metadata()` a fully provenance-based method that returns only what is available in the source format. This approach is safer, as it avoids fabricating information and reduces the risk of missing metadata due to omission.

However, returning more complete metadata also has advantages. It gives users a kind of template they can easily modify. It is more intuitive to see a field like `emission_lambda` and update it than to see only the source metadata and later discover that other required fields are missing (often only after `add_to_nwbfile` fails).

This leads to a design decision about which role `get_metadata()` should play. Ordered from most provenance-faithful to most convenient, the options are:

- **Option 1 (Provenance):** Return only metadata extractable from the source format, without any defaults. This clearly reflects what came from the source files and nothing else.  
- **Option 2 (Current behavior):** Return source metadata merged with defaults for required values to prevent failures. This reflects what will actually be written to the NWB file.  
- **Option 3 (Discovery):** Return all possible NWB fields for the modality as a complete schema template. This maximizes discoverability by showing every field users could provide.  

I believe **Option 1 (Provenance)** is the right choice. The convenience of Options 2 and 3 is no longer necessary after #1511, and for a library focused on data conversion and provenance, we should avoid fabricating metadata whenever possible. A provenance-first approach ensures users know exactly what came from their data versus what they need to supply explicitly, following the principle that *explicit is better than implicit*.

To make it easier for the users, we could either provide modality-specific helpers such as `get_ophys_metadata()` and `get_ecephys_metadata()` that return full metadata templates for users to complete, or include a documentation page offering downloadable YAML templates for each modality that serve the same purpose. These approaches preserve the provenance-first philosophy while giving users clear, explicit tools for building complete metadata  (see discussion in [PR #1448] (https://github.com/catalystneuro/neuroconv/pull/1448#issuecomment-3120540732))


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Provenance vs convenience: What should `get_metadata()` return? #1557

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Provenance vs convenience: What should get_metadata() return? #1557

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Provenance vs convenience: What should `get_metadata()` return? #1557