
Parametrized DNN, mass fitting, Datacard #26


Open
wants to merge 95 commits into base: main

95 commits
2c296ad
added QCD Samples and its cross-section
raj2022 Jan 8, 2025
afba296
Added QCD samples in the analyzer and further plotting.c
raj2022 Jan 8, 2025
1754edd
Updated hhbbgg_Plotter
raj2022 Jan 9, 2025
b90b0eb
restructured folders and files
raj2022 Jan 14, 2025
73fd653
restructured folder
raj2022 Jan 14, 2025
1bd3777
Added parametrized dnn files python
raj2022 Jan 15, 2025
3eea284
organization
raj2022 Jan 15, 2025
cc2f42a
added fitting notebook
Jan 17, 2025
6541ff8
updated signal study
Jan 18, 2025
b7e6e9f
updated folders
Jan 18, 2025
d868ea3
updated folder
Jan 19, 2025
943a669
stat study
Jan 21, 2025
798c995
added stat study
Jan 22, 2025
b7c0b26
Added variation json file
Jan 22, 2025
3f751cc
added signal
Jan 22, 2025
98f5c4d
updated signal mass fitting
Jan 22, 2025
59c7a62
updated stat study
Jan 24, 2025
7d81582
updated correlation plot
Jan 24, 2025
5473ea4
For hhbbgg_Analyzer added parquet reading as well
raj2022 Jan 24, 2025
7680173
modified: hhbbgg_Analyzer.py
raj2022 Jan 24, 2025
17fcfc5
updated to avoid memory issues.
raj2022 Jan 24, 2025
a1edbc6
corrected Analyzer to get parquet files
raj2022 Jan 24, 2025
ef84f52
updated variable reading
Jan 25, 2025
9ba94b6
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
Jan 25, 2025
2e89aa7
updated variable reading
Jan 25, 2025
54f77c3
updated the Analyzer
raj2022 Jan 25, 2025
971a4e6
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
raj2022 Jan 25, 2025
f34952f
updated bin
Jan 27, 2025
fb2642c
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
Jan 27, 2025
95ef386
added README stats
Jan 28, 2025
fb121be
added a new file for parquet reading
raj2022 Jan 28, 2025
3caad11
updated bin
Jan 25, 2025
edc8b47
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
raj2022 Jan 29, 2025
f61c1a2
updated bin
raj2022 Jan 29, 2025
09c74f0
updated and restructured parquet file reading
raj2022 Jan 29, 2025
286a747
finishing merging
raj2022 Jan 29, 2025
36995e2
updated parquet file reading
raj2022 Jan 29, 2025
4a7b47b
added stat study
Jan 29, 2025
a58612a
parquet file reading
raj2022 Jan 29, 2025
f3b2d38
Reading parquet file
raj2022 Jan 29, 2025
3f491db
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
raj2022 Jan 29, 2025
b4fee4a
updated stat study signal
Jan 30, 2025
8bc9e74
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
Jan 30, 2025
41de0f1
restructured signal study
Jan 30, 2025
60830e2
updated stat study for signal
Jan 30, 2025
3ca2b1f
updated README
Jan 30, 2025
9346589
updated README
Jan 30, 2025
2eae630
updated README
Jan 30, 2025
a0235ea
updated signal stats study
Jan 31, 2025
8a8da8c
updated dibjets
Feb 3, 2025
efdd894
updated dibjets
Feb 3, 2025
261b3ba
updated dijets plot
Feb 3, 2025
555c6aa
updated dijets plot
Feb 3, 2025
c75cf18
updated dijets script
Feb 3, 2025
53fea0b
added stats reading background
Feb 5, 2025
6d8f923
updated resonant and non-resonant backgrounds
Feb 5, 2025
e8936e0
added datacard folder
raj2022 Feb 6, 2025
b3783ff
updated datacard reading
Feb 9, 2025
c0fa4de
added datacard
Feb 9, 2025
5a63ff3
updated all
raj2022 Feb 12, 2025
0c5c51e
removed GGJETs reading
raj2022 Feb 12, 2025
f390c38
updated Plotter with working files
raj2022 Feb 13, 2025
01ee97f
updated hhbbgg plotter
raj2022 Feb 13, 2025
21451d5
Removed error signal sample in plotter
raj2022 Feb 13, 2025
4e12eab
Update .gitignore
raj2022 Feb 14, 2025
69cee63
updated parametrized DNN
Feb 14, 2025
30d5016
added parametrized DNN
Feb 18, 2025
d8ddb74
updated pDNN
Feb 18, 2025
52c9df2
updated all signal files pDNN
raj2022 Feb 18, 2025
ad69e1d
trying to fix overtraining issues
raj2022 Feb 18, 2025
8329de8
removed device error
raj2022 Feb 19, 2025
97dbdf8
updated pDNN
Feb 19, 2025
7cbfdcb
updated stats dof
Feb 25, 2025
6513042
signal efficiency:
Feb 27, 2025
e77aca8
updated signal efficiency
raj2022 Feb 28, 2025
ac0693e
updated combine limit study
Mar 7, 2025
679932b
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
Mar 7, 2025
ff83b22
changed binning
raj2022 Mar 20, 2025
6a73ce4
updated binning
raj2022 Mar 22, 2025
7bacfdd
new folder 1D
raj2022 Mar 22, 2025
44674a3
updated 1D README
raj2022 Mar 22, 2025
8ec9642
updated README
raj2022 Mar 22, 2025
ce90e2b
updated README
raj2022 Mar 24, 2025
ebb55fc
Update .gitignore
raj2022 Mar 24, 2025
d3b3b19
added 1D fitting
Mar 24, 2025
39e18a8
updated pDNN README
raj2022 Mar 24, 2025
ac57ffe
updated 1D
Mar 28, 2025
c8f58d3
restructured folder
Mar 28, 2025
076c2ec
updated 1D mass fitting
raj2022 Apr 8, 2025
6fa4267
updated signal README
raj2022 Apr 8, 2025
641b0b2
updated mass grid
Apr 28, 2025
e565095
Merge branch 'QCD' of github.com:raj2022/hhbbgg_AwkwardAnalyzer into QCD
Apr 28, 2025
79582e9
updated v3 README
May 27, 2025
5943a76
v3 README updated
May 27, 2025
4f49827
updated README for v3 with error
raj2022 May 28, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -10,3 +10,5 @@ outputfiles/*.ipynb
**.vscode
bdt/notebook/**/*.pt
bdt/notebook/**/*.pth
stats_study/CMSSW_*/
stats_study/datacards/CMSSW*
File renamed without changes.
File renamed without changes.
24 changes: 24 additions & 0 deletions ML_Application/parametrized_DNN/README.md
@@ -0,0 +1,24 @@
# Parametrized DNN on the v2 HiggsDNA files
For the plain DNN we trained a separate network for each mass point. For the parametrized DNN, the mass point is provided to the model as an additional input during training, so a single training covers all samples at once. We divided the mass points into different ranges based on their kinematics.

The parametrized DNN can be implemented on the dataset as follows:
* The pNN method is employed for a wide range of mass points
* Train on all signal MC {m1, m2, m3, ...}
* Give background MC random values of mass from {m1, m2, m3, ...}
* Provide the same input variables as the DNN
* Split the MC signal in half, with one half used as input for the classifier, and the other half (weight ×2) used for the final signal model construction
$$
f(\vec{x}; m) =
\begin{cases}
f^1(\vec{x}) & \text{if } m = m_1 \\
f^2(\vec{x}) & \text{if } m = m_2 \\
\vdots
\end{cases}
$$
The conditional form above is taken from this presentation: [here](https://indico.cern.ch/event/1507349/contributions/6364202/attachments/3009726/5317821/preapproval.pdf)
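As a rough sketch of the input construction described above (toy NumPy data with hypothetical array names, not the analysis code): the mass-parametrization amounts to appending the mass point as an extra feature column, where signal events carry their true mass and background events are assigned a random mass from the same grid, so the mass column alone cannot separate the classes:

```python
import numpy as np

rng = np.random.default_rng(42)
mass_points = [300, 400, 500, 550, 600, 650, 700, 900]

# Toy stand-ins for the 25 kinematic input variables (hypothetical data).
n_sig, n_bkg, n_feat = 1000, 1000, 25
sig_x = rng.normal(size=(n_sig, n_feat))
bkg_x = rng.normal(size=(n_bkg, n_feat))

# Signal events keep their generated mass point...
sig_m = rng.choice(mass_points, size=n_sig)
# ...while background events draw a random mass from the same grid.
bkg_m = rng.choice(mass_points, size=n_bkg)

# The parametrized input is simply [features, m]: one network f(x; m)
# then covers the whole mass grid instead of one classifier per point.
X = np.vstack([np.column_stack([sig_x, sig_m]),
               np.column_stack([bkg_x, bkg_m])])
y = np.concatenate([np.ones(n_sig), np.zeros(n_bkg)])
```

At evaluation time the same network is queried at a fixed mass hypothesis by setting the last column to that value for every event.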


## Ref
1. https://link.springer.com/article/10.1140/epjc/s10052-016-4099-4
2. https://arxiv.org/pdf/2202.00424
5 changes: 5 additions & 0 deletions ML_Application/parametrized_DNN/output.log
@@ -0,0 +1,5 @@
nohup: ignoring input
/eos/home-s/sraj/Work_/CUA_20--/Analysis/hhbbgg_AwkwardAnalyzer/ML_Application/parametrized_DNN/pDNN.py:121: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.
df_balanced = pd.concat([df_majority_downsampled, df_minority])
/cvmfs/sft.cern.ch/lcg/views/LCG_105_cuda/x86_64-el9-gcc11-opt/lib/python3.9/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /build/jenkins/workspace/lcg_release_pipeline/build/pyexternals/torch-2.1.1/src/torch/2.1.1/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
295 changes: 295 additions & 0 deletions ML_Application/parametrized_DNN/pDNN.py
@@ -0,0 +1,295 @@
import os
import pandas as pd
import uproot
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import Adam
from torch.nn import BCEWithLogitsLoss



# Taking mass X and corresponding Y mass points
mass_points = [300, 400, 500, 550, 600, 650, 700, 900] # Example mass points
y_values = [100, 125, 150, 200, 300, 400, 500, 600] # Example Y values

# Initialize list to store data and a dictionary for missing files
signal_data = []
missing_files = {}

# Load signal data from Parquet files
for mass in mass_points:
    for y in y_values:
        file_path = f"../../../output_parquet/final_production_Syst/merged/NMSSM_X{mass}_Y{y}/nominal/NOTAG_merged.parquet"

        if os.path.exists(file_path):  # Check if file exists
            try:
                df = pd.read_parquet(file_path)  # Load the Parquet file
                df["mass"] = mass
                df["y_value"] = y  # Store Y value if needed
                df["label"] = 1  # Signal label
                signal_data.append(df)
            except Exception as e:
                print(f"Warning: Could not read {file_path}. Error: {e}")
        else:
            print(f"Warning: File {file_path} does not exist.")
            # Track missing files
            if mass not in missing_files:
                missing_files[mass] = []
            missing_files[mass].append(y)

# Combine all signal data into a single DataFrame
signal_df = pd.concat(signal_data, ignore_index=True) if signal_data else pd.DataFrame()

# print the missing files
if missing_files:
    print("Missing files for the following mass points and Y values:")
    for mass, ys in missing_files.items():
        print(f"Mass point {mass} is missing Y values: {ys}")

print(f"Signal shape is {signal_df.shape}")

# Reading background files
# Load background data from ROOT files
background_files = [
("../../outputfiles/hhbbgg_analyzer-v2-trees.root", "/GGJets/preselection"),
("../../outputfiles/hhbbgg_analyzer-v2-trees.root", "/GJetPt20To40/preselection"),
("../../outputfiles/hhbbgg_analyzer-v2-trees.root", "/GJetPt40/preselection"),
]
background_data = []
for file_path, tree_name in background_files:
    try:
        with uproot.open(file_path) as file:
            tree = file[tree_name]
            df = tree.arrays(library="pd")
            df["mass"] = np.random.choice(mass_points, len(df))  # Random mass assignment
            df["label"] = 0
            background_data.append(df)
    except Exception as e:
        print(f"Warning: Could not read {file_path}. Error: {e}")

df_background = pd.concat(background_data, ignore_index=True) if background_data else pd.DataFrame()

# Define features and labels
features = [
'bbgg_eta', 'bbgg_phi', 'lead_pho_phi', 'sublead_pho_eta',
'sublead_pho_phi', 'diphoton_eta', 'diphoton_phi', 'dibjet_eta', 'dibjet_phi',
'lead_bjet_pt', 'sublead_bjet_pt', 'lead_bjet_eta', 'lead_bjet_phi', 'sublead_bjet_eta',
'sublead_bjet_phi', 'sublead_bjet_PNetB', 'lead_bjet_PNetB', 'CosThetaStar_gg',
'CosThetaStar_jj', 'CosThetaStar_CS', 'DeltaR_jg_min', 'pholead_PtOverM',
'phosublead_PtOverM', 'lead_pho_mvaID', 'sublead_pho_mvaID'
]

# Reduce background dataset size by random sampling
background_fraction = 0.2 # 20% of the background
df_background = df_background.sample(frac=background_fraction, random_state=42)

# Combine signal and background
df_combined = pd.concat([signal_df, df_background], ignore_index=True)

# Ensure df_combined is not empty
if df_combined.empty:
    raise ValueError("Error: Combined DataFrame is empty. Check input files.")

# Convert feature data to DataFrame to prevent AttributeError
df_features = df_combined[features]

# Fill missing values with column mean
df_features = df_features.fillna(df_features.mean())

# Extract features (X) and labels (y)
X = df_features.values
y = df_combined["label"].values

print(f"Total features: {df_features.shape}")

# Undersampling the Majority Class

from sklearn.utils import resample

df_majority = df_combined[df_combined["label"] == 0]
df_minority = df_combined[df_combined["label"] == 1]

df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

# NOTE: df_balanced is currently unused; X and y above are built from df_combined.
df_balanced = pd.concat([df_majority_downsampled, df_minority])


# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Check for GPU
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# print(f"Using device: {device}")

# Move data to GPU
# X_tensor = X_tensor.to(device)
# y_tensor = y_tensor.to(device)

# Create DataLoader
dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Checking class imbalance
class_counts = np.bincount(y)
print(f"Class distribution: {dict(enumerate(class_counts))}")



class ParameterizedDNN(nn.Module):
    def __init__(self, input_dim):
        super(ParameterizedDNN, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),

            nn.Linear(64, 1)  # Output layer: raw logits, no activation
        )

    def forward(self, x):
        return self.model(x)  # No sigmoid here; BCEWithLogitsLoss expects logits



# Initialize model
input_dim = X.shape[1]
model = ParameterizedDNN(input_dim)
# criterion = nn.BCEWithLogitsLoss() # Expecting raw logits
# criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([weight]))
optimizer = Adam(model.parameters(), lr=0.0001, weight_decay=1e-5) # Reduce learning rate
# Compute class weights
pos_weight = torch.tensor([class_counts[0] / class_counts[1]], dtype=torch.float32)

# Update loss function
criterion = BCEWithLogitsLoss(pos_weight=pos_weight)


from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

num_epochs = 100
train_losses = []
train_accuracies = []
train_aucs = []
fpr_all, tpr_all, thresholds_all = [], [], []

for epoch in range(num_epochs):
    epoch_loss = 0
    y_true = []
    y_pred = []

    model.train()  # Set to training mode
    for batch in dataloader:
        X_batch, y_batch = batch
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)  # Move data to device

        optimizer.zero_grad()
        outputs = model(X_batch).squeeze()  # Raw logits

        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

        # Store predictions for accuracy & AUC calculation
        y_true.extend(y_batch.cpu().numpy())  # True labels
        y_pred.extend(torch.sigmoid(outputs).detach().cpu().numpy())  # Sigmoid applied only for metrics

    # Compute metrics
    avg_loss = epoch_loss / len(dataloader)
    y_pred_binary = [1 if p > 0.5 else 0 for p in y_pred]  # Convert to 0/1 labels
    accuracy = accuracy_score(y_true, y_pred_binary)
    auc = roc_auc_score(y_true, y_pred)  # Use probabilities, not logits

    # Store metrics
    train_losses.append(avg_loss)
    train_accuracies.append(accuracy)
    train_aucs.append(auc)

    # Compute ROC curve for the current epoch (for plotting)
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)
    fpr_all.append(fpr)
    tpr_all.append(tpr)
    thresholds_all.append(thresholds)

    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}, AUC: {auc:.4f}")




# Plot Loss
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.plot(range(1, num_epochs+1), train_losses, marker='o', linestyle='-', color='blue')
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss vs. Epochs")


plt.tight_layout()
plt.savefig("loss_vs_epochs.png")
plt.savefig("loss_vs_epochs.pdf")


# Plot Accuracy
plt.subplot(1, 3, 2)
plt.plot(range(1, num_epochs+1), train_accuracies, marker='o', linestyle='-', color='green')
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. Epochs")

plt.tight_layout()
plt.savefig("accuracy_vs_epochs.png")
plt.savefig("accuracy_vs_epochs.pdf")


# Plot AUC


# Plot the final ROC curve
# Select the ROC curve from the last epoch
fpr_last = fpr_all[-1]
tpr_last = tpr_all[-1]

plt.figure(figsize=(10, 6))
plt.plot(fpr_last, tpr_last, color='darkorange', lw=2, label=f'ROC curve (AUC = {train_aucs[-1]:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') # Random classifier line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Final ROC Curve (AUC = {train_aucs[-1]:.2f})')
plt.legend(loc="lower right")
plt.savefig("AUC.png")
plt.savefig("AUC.pdf")

