[RF] Support combined Chi2 fits with correlated binned data

### Feature description

This was requested by @TomasDado, and is indeed a feature that is not available in RooFit yet. The motivation is that nowadays, fits with the same dataset projected to many observables are getting increasingly common, as fully multidimensional models are very hard to validate but one still wants to still use a maximum of information from the data.

In a combined fit with the same dataset projected on multiple 1D distributions that represent the different channels, one has to be careful that the projections are not correlated. Otherwise one "double counts information" when summing the log-likelihood over channels. In practice, that often meas that information has to be *thrown away* because one has to avoid wrong results from correlated corrections.

One possible remedy is to do a chi2 approximation where the yield in each bin is Gaussian, and then one can do a multivariate Gaussian fit, considering the covariance between data bins.

However, RooFit/HistFactory is not made for correlated data bins. Every bin in a dataset is considered independent. It is possible to mock up such a fit using a `RooMultiVarGaussian` and a dataset with a single entry and a variable for each bin, but this is highly inefficient and doesn't correspond to HistFactory models at all.

We need to find a solution to fit HistFactory models to datasets on 1D projections and their correlations.

To illustrate the mathematical concept, here a toy example where two channels with two bins each are assumed to be 100 % correlated. In this case, the simultaneous fit *should* give the same result as fitting a single channel, because the second channel contains no more information. However, one can see that the parameter uncertainties reduce as if the sample size has doubled.

To account for this, one can instead fit a RooMultiVarGaussian with the following correlation matrix. The indices 0 and 1 correspond to the expected values in the two bins of the first channel, and 2 and 3 to the bins in the second channel. The off-diagonal terms fully correlate corresponding bins over the two channels. Note that the off diagonal terms need to be smaller than one by some epsilon in practice, to avoid a singular matrix.
```txt
4x4 matrix is as follows

     |      0    |      1    |      2    |      3    |
---------------------------------------------------------
   0 |          1           0           1           0
   1 |          0           1           0           1
   2 |          1           0           1           0
   3 |          0           1           0           1
```

Below is some demo code that shows the concept, with the following output:
```txt
Processing script.C...
Fitting single channel:
2000 +/- 44.7209
Fitting both channels with RooSimultaneous:
2000 +/- 31.6226

Fitting uncorrelated Gaussian approximation:
2000 +/- 31.6226

Fitting 50 % correlated Gaussian approximation:
2000 +/- 38.7296

Fitting 100 % correlated Gaussian approximation:
2000 +/- 44.7209
```
One can see that by tweaking the correlation, the uncertainty blends between the case where you assume the channels to have no correlation to 100 % correlation.
```c++
// script.C

TMatrixDSym corrToCov(const TMatrixDSym &corr, const std::vector<double> &var)
{
   int n = corr.GetNrows();
   TMatrixDSym cov(n);

   for (int i = 0; i < n; ++i) {
      for (int j = 0; j <= i; ++j) {
         double value = corr(i, j) * std::sqrt(var[j] * var[i]);
         cov(i, j) = cov(j, i) = value;
      }
   }

   return cov;
}

void createTemplateModel(RooWorkspace &ws, std::vector<double> vals, int index)
{
   auto name = [&](std::string s) { return s + std::to_string(index); };

   RooRealVar x{name("x").c_str(), "", 0., 0., 2.};
   x.setBins(2);

   TH1F h(name("h").c_str(), "", 2, 0.0, 2.0);
   h.SetBinContent(1, vals[0]);
   h.SetBinContent(2, vals[1]);

   // Convert TH1 to a simple template model
   RooDataHist dh(name("dh").c_str(), "", x, &h);
   RooHistPdf histPdf(name("histPdf").c_str(), "", x, dh);
   double n = h.Integral();
   RooRealVar nHist("nHist", "", n, 0.0, 10. * n);
   RooExtendPdf model(name("model").c_str(), "", histPdf, nHist);

   // Let's go Asimov for the dataset
   RooDataHist data{name("data").c_str(), "", x, RooFit::Import(h)};

   ws.import(model, RooFit::Silence());
   ws.import(data, RooFit::Silence());
}

void createCombinedTemplateModel(RooWorkspace &ws)
{
   using namespace RooFit;
   createTemplateModel(ws, {500, 1500}, 0);
   createTemplateModel(ws, {500, 1500}, 1);

   RooCategory sample("sample", "sample", {{"cat0", 0}, {"cat1", 1}});

   RooSimultaneous simPdf("simPdf", "simultaneous pdf", sample);
   simPdf.addPdf(*ws.pdf("model0"), "cat0");
   simPdf.addPdf(*ws.pdf("model1"), "cat1");
   ws.import(simPdf, Silence());

   RooArgSet params{*ws.var("nHist")};
   RooArgSet observables{*ws.var("x0"), *ws.var("x1")};

   ws.saveSnapshot("parameters", params);
   ws.saveSnapshot("observables", observables);

   // Construct combined dataset in (x,sample)
   std::map<std::string, RooDataHist *> dataMap{{"cat0", static_cast<RooDataHist *>(ws.data("data0"))},
                                                {"cat1", static_cast<RooDataHist *>(ws.data("data1"))}};
   RooDataSet combData("combData", "", observables, Index(sample), Import(dataMap));
   ws.import(combData, Silence());
}

void createCombinedCorrelatedModel(RooWorkspace &ws, double correlation)
{

   RooRealVar x1("x1", "x1", 500, 0, 10000);
   RooRealVar x2("x2", "x2", 1500, 0, 10000);
   RooRealVar x3("x3", "x3", 500, 0, 10000);
   RooRealVar x4("x4", "x4", 1500, 0, 10000);

   RooArgSet xset(x1, x2, x3, x4);

   RooRealVar nHist("nHist", "nHist", 2000, 0, 20000);

   RooFormulaVar m1("m1", "nHist * 1./4", nHist);
   RooFormulaVar m2("m2", "nHist * 3./4", nHist);
   RooFormulaVar m3("m3", "nHist * 1./4", nHist);
   RooFormulaVar m4("m4", "nHist * 3./4", nHist);

   RooArgList meanList(m1, m2, m3, m4);

   TMatrixDSym corr(4);
   corr(0, 0) = 1.;
   corr(1, 1) = 1.;
   corr(2, 2) = 1.;
   corr(3, 3) = 1.;
   corr(0, 2) = correlation;
   corr(2, 0) = correlation;
   corr(1, 3) = correlation;
   corr(3, 1) = correlation;

   RooMultiVarGaussian mvGauss("mvGauss", "", xset, meanList, corrToCov(corr, {500, 1500, 500, 1500}));
   RooDataSet data{"data", "", xset};
   data.add(xset);

   ws.import(mvGauss);
   ws.import(data);
   ws.saveSnapshot("parameters", *ws.var("nHist"));
}

void chi2fit(RooWorkspace &ws, std::string const &modelName, std::string const &dataName)
{
   using namespace RooFit;
   ws.pdf(modelName)->chi2FitTo(static_cast<RooDataHist&>(*ws.data(dataName)), PrintLevel(-1));
   std::cout << ws.var("nHist")->getVal() << " +/- " << ws.var("nHist")->getError() << std::endl;
   ws.loadSnapshot("parameters");
}

void fit(RooWorkspace &ws, std::string const &modelName, std::string const &dataName)
{
   using namespace RooFit;
   ws.pdf(modelName)->fitTo(static_cast<RooDataHist&>(*ws.data(dataName)), PrintLevel(-1));
   std::cout << ws.var("nHist")->getVal() << " +/- " << ws.var("nHist")->getError() << std::endl;
   ws.loadSnapshot("parameters");
}

void script()
{
   auto changeMsgLvl = std::make_unique<RooHelpers::LocalChangeMsgLevel>(RooFit::WARNING);
   {

      RooWorkspace ws{"ws"};
      createCombinedTemplateModel(ws);

      std::cout << "Fitting single channel:" << std::endl;
      chi2fit(ws, "model1", "data1");
      std::cout << "Fitting both channels with RooSimultaneous:" << std::endl;
      chi2fit(ws, "simPdf", "combData");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      // zero correlation, equivalent to simultaneous template fit
      createCombinedCorrelatedModel(ws, /*correlation=*/0.0);

      std::cout << "Fitting uncorrelated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      createCombinedCorrelatedModel(ws, /*correlation=*/0.5);

      std::cout << "Fitting 50 \% correlated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      // almost fully correlated to avoid singular matrix
      createCombinedCorrelatedModel(ws, /*correlation=*/1. - 1e-6);

      std::cout << "Fitting 100 \% correlated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RF] Support combined Chi2 fits with correlated binned data #20557

Feature description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RF] Support combined Chi2 fits with correlated binned data #20557

Description

Feature description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions