Better uncertainties for yadg-7.0 #240

@PeterKraus

The current way of handling uncertainties in yadg is not optimal for the following reasons:

  • we always store absolute uncertainties for each datapoint in each data_var, sometimes at each uts, leading to a huge amount of duplicate data
  • the provenance of the uncertainty is unclear: how is it determined? Is it the str->float rounding error? Is it the int->float scale factor? Is it from instrument spec sheets?
  • other properties of the uncertainty are not recorded: what is the distribution (normal, rectangular)? What is the coverage factor?

There is very little prior art on how to systematically annotate this:

  • The NetCDF CF Metadata Standards propose annotating the nominal value as follows:
    float q(time) ;
        q:standard_name = "specific_humidity" ;
        q:units = "g/g" ;
        q:ancillary_variables = "q_error_limit q_detection_limit" ;
    float q_error_limit(time) ;
        q_error_limit:standard_name = "specific_humidity standard_error" ;
        q_error_limit:units = "g/g" ;
    
    The only additional piece of metadata discussed is a standard_error_multiplier attribute of the standard error ancillary variable (here q_error_limit), which is the coverage factor.
  • In NeXus, at NIAC2014 the following uncertainty proposal was not accepted:
    NXroot
      NXentry
        NXdata
           @signal="data"
           @data_axes="xy"
           @data_uncertainty="esd"
           @esd_uncertainty_components="esd_uncertainties"
           data: float[300, 300]
           xy: float[300, 300]
           esd: float[300, 300]
           esd_uncertainties:NXuncertainty
              electronic : float[300, 300]
                 @basis="Johnson noise"
              counting_statistics: float[300, 300]
                 @basis="shot noise"
              secondary_standard: float[300, 300]
                 @basis="esd"
    
    The NXuncertainty class definition is not (or no longer?) available.
  • Also in NeXus, the NXaberration and NXem definitions may contain uncertainty (NX_FLOAT) and uncertainty_model (NX_CHAR) attributes.
  • FAIRmat NeXus reserves the _errors suffix for uncertainty data. Note that "The dimensions of the FIELDNAME_errors field must match the dimensions of the corresponding FIELDNAME field."
  • In the h5rdmtoolbox, an uncertainty dataset can be attached to its parent via field.ancillary_datasets, but there is no further convention.
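The CF-style linkage from the first bullet above can be sketched in a few lines of Python. This is a minimal illustration only: plain dictionaries stand in for whatever container holds the variables, the helper name `standard_errors` is hypothetical, and the multiplier of 2 and all data values are made up for the example (the CF snippet above does not set them). It follows each name in ancillary_variables, keeps those whose standard_name carries the standard_error modifier, and divides by standard_error_multiplier to recover one standard error.

```python
def standard_errors(variables, name):
    """Return {ancillary name: values scaled to one standard error}."""
    attrs = variables[name]["attrs"]
    result = {}
    # ancillary_variables is a blank-separated list of variable names
    for anc_name in attrs.get("ancillary_variables", "").split():
        anc = variables.get(anc_name)
        if anc is None:
            continue  # listed but not present in this container
        std_name = anc["attrs"].get("standard_name", "")
        if not std_name.endswith(" standard_error"):
            continue  # some other kind of ancillary data
        # The stored values are this multiple of one standard error.
        k = anc["attrs"].get("standard_error_multiplier", 1)
        result[anc_name] = [value / k for value in anc["data"]]
    return result


# Illustrative data only; the multiplier of 2 is an assumption.
variables = {
    "q": {
        "attrs": {
            "standard_name": "specific_humidity",
            "units": "g/g",
            "ancillary_variables": "q_error_limit q_detection_limit",
        },
        "data": [0.010, 0.012],
    },
    "q_error_limit": {
        "attrs": {
            "standard_name": "specific_humidity standard_error",
            "units": "g/g",
            "standard_error_multiplier": 2,
        },
        "data": [0.002, 0.004],
    },
}

errors = standard_errors(variables, "q")
```

Nothing beyond the coverage factor can be recovered this way, which is the gap the rest of this issue is about.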

Since there's no standard way of annotating uncertainty metadata in either NetCDF or NeXus, it looks like we'll have to roll our own:

  • annotate the nominal variable using its ancillary_variables; the name of the ancillary is not strictly regulated, but uncertainty is preferred to error:

    float val(uts) ;
      val.units = "..."
      val.ancillary_variables = "val_uncertainty ..."
    

    As the space " " character is used as the separator in the ancillary_variables field, nominal variables with whitespace in their names have to be clobbered, i.e. renamed so that the name contains no whitespace.

  • annotate the uncertainty variable using the NetCDF conventions to indicate that it is an uncertainty, and attach other metadata:

    float val_uncertainty(...) ;
      val_uncertainty.units = "..."
      val_uncertainty.standard_name = "val standard_error"
      val_uncertainty.standard_error_multiplier = 1
      val_uncertainty.comment = "..."
      val_uncertainty.references = "..."
      val_uncertainty.yadg_uncertainty_absolute = {0, 1}
      val_uncertainty.yadg_uncertainty_distribution = {"normal", "rectangular", ...}
      val_uncertainty.yadg_uncertainty_source = {"sigfig", "scaling", "datasheets", "explicit", ...}
    

    Here, we introduce the following three yadg-specific metadata fields:

    • The yadg_uncertainty_absolute indicates whether the uncertainty is absolute (1) or relative (0). If the uncertainty is relative, the val_uncertainty.units should be dimensionless (%, ppm or similar).
    • The yadg_uncertainty_distribution indicates whether the underlying distribution is normal (most common) or rectangular (e.g. from rounding); further options are to be defined later as necessary.
    • The yadg_uncertainty_source indicates the origin of the uncertainty, where sigfig means str->float conversion, scaling means int->float conversion, datasheets means from datasheets programmed into yadg, and explicit means explicitly specified in the source data.

    The other "standard" NetCDF metadata has its usual meaning. The comment and references fields in particular can be used to provide more information about where the uncertainty determination comes from.
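To make the proposal concrete, here is a rough Python sketch of how the two sets of attributes could be assembled. It is not yadg code: `clobber` and `annotate` are hypothetical helpers, replacing whitespace with underscores is only one assumed way of clobbering a name, and the allowed values are limited to the ones listed above (the "..." entries remain open).

```python
import re

# Only the values named in the proposal; further options are still open.
DISTRIBUTIONS = {"normal", "rectangular"}
SOURCES = {"sigfig", "scaling", "datasheets", "explicit"}


def clobber(name):
    """Replace runs of whitespace with underscores (assumed replacement)."""
    return re.sub(r"\s+", "_", name.strip())


def annotate(nominal_attrs, name, units, absolute, distribution, source,
             multiplier=1):
    """Return the clobbered name, updated nominal attrs, and uncertainty attrs."""
    if distribution not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    safe = clobber(name)
    unc_name = f"{safe}_uncertainty"
    # ancillary_variables is blank-separated, so names must be whitespace-free.
    linked = nominal_attrs.get("ancillary_variables", "").split()
    if unc_name not in linked:
        linked.append(unc_name)
    nominal = dict(nominal_attrs, ancillary_variables=" ".join(linked))
    unc_attrs = {
        "units": units,
        "standard_name": f"{safe} standard_error",
        "standard_error_multiplier": multiplier,
        "yadg_uncertainty_absolute": 1 if absolute else 0,
        "yadg_uncertainty_distribution": distribution,
        "yadg_uncertainty_source": source,
    }
    return safe, nominal, unc_name, unc_attrs


# Illustrative call: an absolute rounding uncertainty on a voltage column.
safe, nominal, unc_name, unc_attrs = annotate(
    {"units": "V"}, "Cell voltage", "V",
    absolute=True, distribution="rectangular", source="sigfig",
)
```

The comment and references attributes are left out of the sketch; they would simply be added to the uncertainty attributes as free-text strings.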
