Skip to content

Specification should provide guidance on non-required elements: flow argument in values/variances and axis edges #58

Open
@jpivarski

Description

@jpivarski

From scikit-hep/uproot5#580 (comment).

Uproot and hist both define a flow argument in values and variances (though the UHI specification does not require it). They are in agreement with how they use it when they produce histograms. Uproot can only produce histograms that have built-in overflow and underflow, and when either library has a histogram with a built-in flow, flow=True includes those flow bins and flow=False does not include those flow bins in the output. (That is, flow affects the shape of the return value of values and variances.)

When a hist histogram does not have built-in flow bins, the flow argument does nothing: flow=True and flow=False both return an array of the same shape, representing the bin contents without flow bins.

This is not what Uproot assumes when it consumes histograms:

https://github.com/scikit-hep/uproot4/blob/c968c5fa26047f0e21614854b98e07de7c4789c8/src/uproot/writing/identify.py#L261-L280

In the above, the try block does the wrong thing for obj without flow bins, because I assumed that obj.values(flow=True) would create flow bins of value zero. (That's what the fallback code does for cases in which the flow argument does not exist and values() is equivalent to values(flow=False).) Because of that mismatch of assumptions, @andrzejnovak noticed that saving hist histograms without flow bins corrupted them and submitted PR scikit-hep/uproot5#580.

Along the way, we found another discrepancy between Uproot and hist: Uproot's axes[0].edges is a method that takes a values argument and hist's axis[0].edges is a property that acts like Uproot's axis[0].edges(flow=False). This discrepancy is unconnected to the first.

I propose the following: even though the UHI specification does not require a flow argument in values/variances or axes[0].edges, it should say what they should be if you do choose to implement them. They remain optional, but their behavior, if implemented, would be specified. I also propose that they be specified this way:

  • If values/variances has a flow argument, then flow=True should return an array with flow bins and flow=False should return an array without flow bins regardless of whether the object has built-in flow bins. This choice minimizes the information a user (such as Uproot) needs to know about the object to use it correctly. When the object does not have built-in flow values, flow=True should create them with value equal to zero. (That's a weaker preference: value equal to nan would also make sense, though nans are contagious.)
  • If the axis object has an edges (or labels, intervals, centers, or widths), they should be properties without flow bins. This is how hist is implemented, not how Uproot is currently implemented. The reason I'm deciding against Uproot's implementation is (a) the extra bins that are added when flow=True are constants: -inf, inf, the word "underflow", or "overflow", and (b) it would be complicated to specify "If implementing edges AND implementing flow in values/variances, THEN implement flow in edges in such-and-such a way." Conceding to hist's behavior where Uproot and hist diverge makes the specification simpler.

And then we can think about backward compatibility. This is an API breaking change, so it at least needs a minor version number.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions