Skip to content

[BUG] Histogram with IntCategory and StrCategory axes can be zeroed by out-of-bounds fill values. #960

Open
@NJManganelli

Description

@NJManganelli

Describe the bug

When an IntCategory axis with growth=False and overflow=False is filled with values that fall outside the defined bins (which should be a valid combination, and is useful in certain contexts to 'ignore' invalid values without explicit masking of them), the result can be 0 for the entire histogram, at least when a StrCategory axis and Weight storage are used. This directly appears with using named versions of these axes types in scikit-hep/hist, and produces a stranger error in dask_histogram.

Steps to reproduce

This also includes the reproducer for hist and dask_histogram, to demonstrate some of the discrepancies

import awkward as ak
import dask_awkward as dak
import numpy as np
import hist
from hist.hist import Hist as CHist
from hist.dask import Hist as DHist
import boost_histogram as bh
from boost_histogram import Histogram as BHist
import dask

include_out_of_bounds_bin = False

mpc = ak.from_iter([-1, 93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]*1000)
ptc = ak.Array(np.random.normal(loc=200, scale=40, size=len(mpc)))


model_pointlazy = dak.from_awkward(mpc, npartitions=33)
ptlazy = dak.from_awkward(ptc, npartitions=33)
wgtlazy = ak.ones_like(model_pointlazy, dtype=float)

if include_out_of_bounds_bin:
    axvals = [-1, 93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]
else:
    axvals =     [93, 185, 74, 48, 67, 231, 128, 170, 213, 120, 17]

common = [
    hist.axis.StrCategory(['nominal'], name='systematic', label='Systematic'),
    #hist.axis.Regular(100, 20, 520, name='pt', label=r'1st Jet $p_T$ [GeV]'),
    hist.storage.Weight(),
    ]

axesstor = [
    hist.axis.IntCategory(axvals,
                          name="model_point",
                          label="{'uncategorized': -1, 'TChiZH_200_74': 93, 'TChiZH_250_124': 185, 'TChiZH_300_150': 74, 'TChiZH_400_200': 48, 'TChiZH_450_300': 67, 'TChiZH_600_1': 231, 'TChiZH_600_400': 128, 'TChiZH_750_450': 170, 'TChiZH_800_200': 213, 'TChiZH_900_1': 120, 'TChiZH_1500_550': 17}::Gen Model",
                          growth=False,
                          flow=True),
] + common
axesstornoflow = [
    hist.axis.IntCategory(axvals,
                          name="model_point",
                          label="{'uncategorized': -1, 'TChiZH_200_74': 93, 'TChiZH_250_124': 185, 'TChiZH_300_150': 74, 'TChiZH_400_200': 48, 'TChiZH_450_300': 67, 'TChiZH_600_1': 231, 'TChiZH_600_400': 128, 'TChiZH_750_450': 170, 'TChiZH_800_200': 213, 'TChiZH_900_1': 120, 'TChiZH_1500_550': 17}::Gen Model",
                          growth=False,
                          flow=False),
] + common 

bint = bh.axis.IntCategory(axvals, growth=False, overflow=True)
bintnoflow = bh.axis.IntCategory(axvals, growth=False, overflow=False)
bstr = bh.axis.StrCategory(['nominal'])
bstor = bh.storage.Weight()


bhistunmasked, bhistmasked = BHist(bint, bstr, storage=bstor), BHist(bint, bstr, storage=bstor)
bhistunmaskednoflow, bhistmaskednoflow = BHist(bintnoflow, bstr, storage=bstor), BHist(bintnoflow, bstr, storage=bstor)
chistunmasked, chistmasked = CHist(*axesstor), CHist(*axesstor)
chistunmaskednoflow, chistmaskednoflow = CHist(*axesstornoflow), CHist(*axesstornoflow)
dhistunmasked, dhistmasked = DHist(*axesstor), DHist(*axesstor)
dhistunmaskednoflow, dhistmaskednoflow = DHist(*axesstornoflow), DHist(*axesstornoflow)


msklazy = (model_pointlazy > -1)
dhistmasked.fill(**{
    "model_point": model_pointlazy[msklazy],
    "systematic": "nominal",
    "weight": wgtlazy[msklazy],
})
dhistunmasked.fill(**{
    "model_point": model_pointlazy,
    "systematic": "nominal",
    "weight": wgtlazy,
})
dhistmaskednoflow.fill(**{
    "model_point": model_pointlazy[msklazy],
    "systematic": "nominal",
    "weight": wgtlazy[msklazy],
})
dhistunmaskednoflow.fill(**{
    "model_point": model_pointlazy,
    "systematic": "nominal",
    "weight": wgtlazy,
})

tocompute = {"dhistmasked": dhistmasked, "dhistunmasked": dhistunmasked,
             "dhistmaskednoflow": dhistmaskednoflow, "dhistunmaskednoflow": dhistunmaskednoflow,
             "pt": ptlazy, "model_point": model_pointlazy, "wgt": wgtlazy}
comped = dask.compute(tocompute)[0]

model_point, pt, wgt = comped["model_point"], comped["pt"], comped["wgt"]

msk = (model_point > -1)
chistmasked.fill(**{
    "model_point": model_point[msk],
    "systematic": "nominal",
    "weight": wgt[msk],
})
chistunmasked.fill(**{
    "model_point": model_point,
    "systematic": "nominal",
    "weight": wgt,
})
chistmaskednoflow.fill(**{
    "model_point": model_point[msk],
    "systematic": "nominal",
    "weight": wgt[msk],
})
chistunmaskednoflow.fill(**{
    "model_point": model_point,
    "systematic": "nominal",
    "weight": wgt,
})

bhistmasked.fill(
    model_point[msk],
    "nominal",
    weight=wgt[msk],
)
bhistunmasked.fill(
    model_point,
    "nominal",
    weight=wgt,
)
bhistmaskednoflow.fill(
    model_point[msk],
    "nominal",
    weight=wgt[msk],
)
bhistunmaskednoflow.fill(
    model_point,
    "nominal",
    weight=wgt,
)

print_flow = False

print("histogram | BoostHistogram | Hist | DaskHistogram")
for key in ["histmasked",  "histmaskednoflow", "histunmasked", "histunmaskednoflow"]:
    ckey = f"c{key}"
    dkey = f"d{key}"
    bkey = f"b{key}"
    b = eval(bkey)
    c = eval(ckey)
    d = comped[dkey]
    print(key, b.sum().value, c.sum(flow=print_flow).value, d.sum(flow=print_flow).value)

This produces the following output, where the unmasked fill into the histogram without an overflow on the IntCategory produces 0 sum for (boost-)hist; and a junk-like value in DaskHistogram (presumably the partitioned fill plays an additional role here)

histogram | BoostHistogram | Hist | DaskHistogram
histmasked 11000.0 11000.0 11000.0
histmaskednoflow 11000.0 11000.0 11000.0
histunmasked 11000.0 11000.0 11000.0
histunmaskednoflow 0.0 0.0 7337.0

This bug was reproduced in 1.4.1 of boost-histogram, on an M1 Max chip, using MacOS 14.4; it was originally discovered on AlmaLinux8 (on Fermilab's LPC cluster) using a coffea container. Other principle libraries used are a mix of development versions corresponding mostly to the latest git version from approximately a month ago, and can be provided if necessary, but given the reproduction across different environments, it seems to be a relatively 'stable' bug

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions