Description
From scikit-hep/uproot5#580 (comment).
Uproot and hist both define a flow
argument in values
and variances
(though the UHI specification does not require it). They are in agreement with how they use it when they produce histograms. Uproot can only produce histograms that have built-in overflow and underflow, and when either library has a histogram with a built-in flow, flow=True
includes those flow bins and flow=False
does not include those flow bins in the output. (That is, flow
affects the shape
of the return value of values
and variances
.)
When a hist histogram does not have built-in flow bins, the flow
argument does nothing: flow=True
and flow=False
both return an array of the same shape
, representing the bin contents without flow bins.
This is not what Uproot assumes when it consumes histograms:
In the above, the try
block does the wrong thing for obj
without flow bins, because I assumed that obj.values(flow=True)
would create flow bins of value zero. (That's what the fallback code does for cases in which the flow
argument does not exist and values()
is equivalent to values(flow=False)
.) Because of that mismatch of assumptions, @andrzejnovak noticed that saving hist histograms without flow bins corrupted them and submitted PR scikit-hep/uproot5#580.
Along the way, we found another discrepancy between Uproot and hist: Uproot's axes[0].edges
is a method that takes a values
argument and hist's axis[0].edges
is a property that acts like Uproot's axis[0].edges(flow=False)
. This discrepancy is unconnected to the first.
I propose the following: even though the UHI specification does not require a flow
argument in values
/variances
or axes[0].edges
, it should say what they should be if you do choose to implement them. They remain optional, but their behavior, if implemented, would be specified. I also propose that they be specified this way:
- If
values
/variances
has aflow
argument, thenflow=True
should return an array with flow bins andflow=False
should return an array without flow bins regardless of whether the object has built-in flow bins. This choice minimizes the information a user (such as Uproot) needs to know about the object to use it correctly. When the object does not have built-in flow values,flow=True
should create them with value equal to zero. (That's a weaker preference: value equal to nan would also make sense, though nans are contagious.) - If the axis object has an
edges
(or labels, intervals, centers, or widths), they should be properties without flow bins. This is how hist is implemented, not how Uproot is currently implemented. The reason I'm deciding against Uproot's implementation is (a) the extra bins that are added whenflow=True
are constants:-inf
,inf
, the word"underflow"
, or"overflow"
, and (b) it would be complicated to specify "If implementingedges
AND implementingflow
invalues
/variances
, THEN implementflow
inedges
in such-and-such a way." Conceding to hist's behavior where Uproot and hist diverge makes the specification simpler.
And then we can think about backward compatibility. This is an API breaking change, so it at least needs a minor version number.