RDataFrame based NanoAOD dataset skimmer and processing

alaha999/nanoRDF

NanoRDF framework

The following folders and scripts are useful.

| Folder / Script | Description |
| --- | --- |
| crab_setup/ | Processing NanoAOD files using RDataFrame-based framework. Follow the instructions in readme.txt inside. |
| input/ | Input sample list from CRAB job outputs in .py or .json format, organized by campaign. |
| createInputSample.py | Automates creation of input sample configs from CRAB output for running the analysis. |
| createJson.py | Automates creation of JSON files from DAS dataset strings using dasgoclient. |
| database.py, database.yaml | Sample database in Python or YAML format containing sample and dataset metadata, including DAS strings. |
| printDatabase.py | Prints the database.yaml file in a nice, colorful format on the terminal. |
| nanoRDF.py | Main NanoAOD analysis script using RDataFrame. |
| submitCondor.py | Condor submission script for launching jobs on the CERN Condor system. |

CRAB Jobs Output Location

crab output path: /eos/user/a/alaha/nanoRDFjobs/

Instructions

Database

The database is a YAML file that holds campaign and sample information, plus a few additional fields required for an analysis. Update it whenever new samples or campaigns are needed or released.

database format

Follow the structure below when adding entries to the sample database.

DYM50_AMC:
  - dasno: 1010
    samplename: DYto2L-M50-amc
    das:
      - Run3Summer22: /DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM
      - Run3Summer22EE: None
    
  - dasno: 1011
    samplename: DYto2L-M50-amc_ext
    das:
      - Run3Summer22: /DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v1/NANOAODSIM
      - Run3Summer22EE: None
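
As an illustration of how this structure can be consumed, here is a minimal sketch that walks the parsed database and collects the samples available for one campaign. The function name `samples_for_campaign` is hypothetical and not part of the repository. The dictionary is written out by hand as the form a YAML loader such as PyYAML's `yaml.safe_load` would return; with PyYAML, a bare `None` in the file is read as the string `"None"` rather than a null value, so the sketch checks for both.

```python
# Parsed form of the two YAML entries above (hand-written stand-in for
# the result of a YAML loader). A bare `None` in YAML arrives as the
# string "None" with PyYAML, not as Python's None.
database = {
    "DYM50_AMC": [
        {
            "dasno": 1010,
            "samplename": "DYto2L-M50-amc",
            "das": [
                {"Run3Summer22": "/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM"},
                {"Run3Summer22EE": "None"},
            ],
        },
        {
            "dasno": 1011,
            "samplename": "DYto2L-M50-amc_ext",
            "das": [
                {"Run3Summer22": "/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v1/NANOAODSIM"},
                {"Run3Summer22EE": "None"},
            ],
        },
    ],
}


def samples_for_campaign(database, campaign):
    """Return (group, dasno, samplename, das_string) for every sample
    that has a DAS string defined for the given campaign."""
    rows = []
    for group, samples in database.items():
        for sample in samples:
            for entry in sample["das"]:        # each entry is a one-key dict
                das = entry.get(campaign)
                if das in (None, "None"):      # campaign absent or placeholder
                    continue
                rows.append((group, sample["dasno"], sample["samplename"], das))
    return rows


for row in samples_for_campaign(database, "Run3Summer22"):
    print(row[0], row[1], row[2])
```

Calling the function with `"Run3Summer22EE"` returns an empty list for these two entries, because both carry the `None` placeholder for that campaign.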

Print the database in a readable format to inspect its contents:

python3 printDatabase.py

Output of printDatabase.py on terminal:

No DASno   Group           Sample                         Campaign           events/files/size       DAS               
1  1010    DYM50_AMC       DYto2L-M50-amc                 Run3Summer22       65997137/115/90.5GB     /DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
2  1011    DYM50_AMC       DYto2L-M50-amc_ext             Run3Summer22       97500666/259/134.0GB    /DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v1/NANOAODSIM 
3  1020    TTto2L2Nu_POW   TTto2L2Nu-pow                  Run3Summer22       23778148/67/54.3GB      /TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
4  1030    TTtoLNu2Q_POW   TTtoLNu2Q-pow                  Run3Summer22       76955324/716/180.0GB    /TTtoLNu2Q_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v2/NANOAODSIM 
5  1040    WtoLNu_AMC      WtoLNu-2Jets_amc               Run3Summer22       84739011/148/95.8GB     /WtoLNu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
6  1040    WZto3LNu_POW    WZto3LNu-pow                   Run3Summer22       2776339/25/4.3GB        /WZto3LNu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
7  1041    WZto3LNu_POW    WZto3LNu-pow_ext               Run3Summer22       8876662/84/13.7GB       /WZto3LNu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v3/NANOAODSIM 
8  1050    WWto2L2Nu_POW   WWto2L2Nu-pow                  Run3Summer22       6135192/40/9.6GB        /WWto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
9  1061    ZZto4L_AMC      ZZto4L-amc                     Run3Summer22       3554880/45/5.6GB        /ZZto4L-1Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v2/NANOAODSIM 
10 1070    ZZto4L_POW      ZZto4L-pow                     Run3Summer22       14629101/82/22.0GB      /ZZto4L_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
11 1071    ZZto4L_POW      ZZto4L-pow_ext                 Run3Summer22       14458880/270/22.3GB     /ZZto4L_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v2/NANOAODSIM 
12 1080    ZZto2L2Nu_POW   ZZto2L2Nu-pow                  Run3Summer22       14555802/118/20.7GB     /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM 
13 1081    ZZto2L2Nu_POW   ZZto2L2Nu-pow_ext              Run3Summer22       16858024/277/24.4GB     /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v2/NANOAODSIM 
14 100     SingleMuon      Run3Summer22_EraC_SingleMuon   Run3Summer22       20162441/35/16.9GB      /SingleMuon/Run2022C-22Sep2023-v1/NANOAOD 
15 110     Muon            Run3Summer22_EraC_Muon         Run3Summer22       138427345/124/113.7GB   /Muon/Run2022C-22Sep2023-v1/NANOAOD 
16 120     Muon            Run3Summer22_EraD_Muon         Run3Summer22       75468381/82/61.8GB      /Muon/Run2022D-22Sep2023-v1/NANOAOD 

dataset:  {'DYM50_AMC', 'WWto2L2Nu_POW', 'TTto2L2Nu_POW', 'ZZto4L_POW', 'WtoLNu_AMC', 'SingleMuon', 'TTtoLNu2Q_POW', 'ZZto4L_AMC', 'Muon', 'ZZto2L2Nu_POW', 'WZto3LNu_POW'}
     

Sample Configuration file

The CERN Condor submission uses this config file to submit jobs with the required sample information.

Example input sample config

example file: input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py

samplegroup = 'DYM50_AMC'
samplename = 'DYto2L-M50-amc'
sampletype = 'MC'
dasno      = 1010
campaign   = 'Run3Summer22'
fileN = 58
files = [
'/eos/user/a/alaha/nanoRDFjobs/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/nanoRDF_DYto2L-M50-amc/250515_205342/0000/ntuple_skim_1.root',
'/eos/user/a/alaha/nanoRDFjobs/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/nanoRDF_DYto2L-M50-amc/250515_205342/0000/ntuple_skim_10.root',
'/eos/user/a/alaha/nanoRDFjobs/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/nanoRDF_DYto2L-M50-amc/250515_205342/0000/ntuple_skim_11.root',
'/eos/user/a/alaha/nanoRDFjobs/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/nanoRDF_DYto2L-M50-amc/250515_205342/0000/ntuple_skim_12.root',
'/eos/user/a/alaha/nanoRDFjobs/DYto2L-2Jets_MLL-50_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/nanoRDF_DYto2L-M50-amc/250515_205342/0000/ntuple_skim_13.root',
.
.
.
]

How to create a sample config file

To create input sample configs from the CRAB output:

python3 createInputSample.py --crabdir /eos/user/a/alaha/nanoRDFjobs/ --campaign Run3Summer22 --database database.yaml --save True
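
For orientation, a config in the format shown earlier could be written with a few lines of Python. The sketch below is not the code of createInputSample.py; the function `render_sample_config` is hypothetical, and it assumes that `fileN` is the number of listed files, that the data label is spelled `'Data'`, and that `sampletype` follows the dasno convention (>1000: MC).

```python
def render_sample_config(samplegroup, samplename, dasno, campaign, files):
    """Return the text of a sample config in the format shown above."""
    # Assumption: dasno above 1000 marks MC, otherwise data.
    sampletype = "MC" if dasno > 1000 else "Data"
    lines = [
        f"samplegroup = {samplegroup!r}",
        f"samplename = {samplename!r}",
        f"sampletype = {sampletype!r}",
        f"dasno      = {dasno}",
        f"campaign   = {campaign!r}",
        f"fileN = {len(files)}",   # assumption: fileN is the file count
        "files = [",
    ]
    # One quoted path per line, in lexicographic order
    lines += [f"{path!r}," for path in sorted(files)]
    lines.append("]")
    return "\n".join(lines) + "\n"


print(render_sample_config(
    "DYM50_AMC", "DYto2L-M50-amc", 1010, "Run3Summer22",
    ["ntuple_skim_10.root", "ntuple_skim_1.root"],
))
```

Because the result is plain Python source, it can be saved under input/<campaign>/ and imported like the example config.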

CERN condor instructions

Setting up the environment: add the following line to your execution file to load all the required packages.

source /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el8-gcc11-opt/setup.sh

The arguments of submitCondor.py are:

#Mandatory
parser.add_argument('--jobname'   ,type=str,required=True ,default="condorjob" ,help='AnalysisName: Such as VLL2018_Mar1_v0')
parser.add_argument('--script'    ,type=str,required=True                      ,help='Give path to your VLLAna.C directory')
parser.add_argument('--config'    ,type=str,required=True                      ,help='Input sample config file')

#Optional
parser.add_argument('--treeN'     ,type=int ,required=False,default=10000       ,help='Mention no of trees for simple test run')
parser.add_argument('--bunch'     ,type=int ,required=False,default=1           ,help='No of root files per job')
parser.add_argument('--dryrun'    ,type=bool,required=False                     ,help='Check before submitting jobs')
parser.add_argument('--cmsdir'    ,type=str ,required=False                     ,help='Give path to your CMSSW_x/src/ directory')
parser.add_argument('--chainAdd'  ,type=str ,required=False                     ,help='Chain any other config file to extend the filelist')
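
One detail of the `--dryrun` declaration is worth knowing: with `type=bool`, argparse calls `bool()` on the raw string, so any non-empty value (including `False`) parses as true. The sketch below demonstrates this with the same declaration style and contrasts it with a hypothetical `str2bool` helper, which is not part of submitCondor.py.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: map common true/false spellings to a bool."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Same declaration style as --dryrun above: bool() is applied to the string.
plain = argparse.ArgumentParser()
plain.add_argument("--dryrun", type=bool, required=False)

# Alternative using the hypothetical helper.
strict = argparse.ArgumentParser()
strict.add_argument("--dryrun", type=str2bool, default=False)

print(plain.parse_args(["--dryrun", "False"]).dryrun)   # True: "False" is a non-empty string
print(strict.parse_args(["--dryrun", "False"]).dryrun)  # False
print(plain.parse_args([]).dryrun)                      # None: option not given
```

In practice this means that, with the declaration as listed, omitting `--dryrun` is the way to perform a real submission.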

To submit jobs on all input files:

python3 submitCondor.py --jobname test_condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py

To submit jobs on a few input files:

python3 submitCondor.py --jobname test_condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py --treeN 5

To submit jobs in chunks:

python3 submitCondor.py --jobname test_condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py --bunch 10

In each condor job, 10 files will be processed (use --bunch to control this).
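
The grouping can be pictured as slicing the file list into consecutive chunks. The following is a sketch of that idea, not the code of submitCondor.py; `split_into_jobs` is a hypothetical name, and it assumes that a remainder smaller than `--bunch` becomes one final, shorter job.

```python
def split_into_jobs(files, bunch=1):
    """Split the file list into consecutive groups of at most `bunch` files."""
    if bunch < 1:
        raise ValueError("bunch must be at least 1")
    return [files[i:i + bunch] for i in range(0, len(files), bunch)]


# 58 files (the fileN of the example config) with --bunch 10
files = [f"ntuple_skim_{i}.root" for i in range(1, 59)]
jobs = split_into_jobs(files, bunch=10)
print(len(jobs), len(jobs[-1]))   # 6 8
```

Under that assumption, the example config with 58 files and `--bunch 10` yields five jobs of 10 files and one job of 8.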

Add multiple additional config files using --chainAdd:

python3 submitCondor.py --jobname test_condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py --chainAdd input/Run3Summer22/DYto2L-M50-amc_ext_Run3Summer22.py,input/Run3Summer22/DYto2L-M50-amc_ext_ext_Run3Summer22.py

Check parameters without submitting using --dryrun option:

python3 submitCondor.py --jobname test_condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py --chainAdd input/Run3Summer22/DYto2L-M50-amc_ext_Run3Summer22.py --dryrun True

Important configuration imported from --config or sample config

inputConfig = importlib.import_module(args.config.replace('/','.').split('.py')[0])  # --config path -> dotted module name
samplename  = inputConfig.samplename  # Unique sample name
sampletype  = inputConfig.sampletype  # Data or MC
dasno       = inputConfig.dasno       # Integer ID assigned to each sample for later use (similar to a PDG ID). >1000: MC, <1000: Data
campaign    = inputConfig.campaign    # Campaigns, Run3Summer22, Run3Summer22EE, Run3Summer23, Run3Summer23BPix, 2018, 2017, 2016preVFP, 2016postVFP
files       = inputConfig.files       # List of input files with full path

NB: Extra sample config files added with --chainAdd do not affect the parameters listed above; those are always taken from --config. Only the files list is extended by the --chainAdd configs.
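
The rule in this note can be sketched as follows. This is an illustration under stated assumptions, not the code of submitCondor.py: `config_to_module` and `merge_configs` are hypothetical names, and `SimpleNamespace` objects stand in for the imported config modules so that the example runs on its own.

```python
from types import SimpleNamespace


def config_to_module(path):
    """Turn a config path into a dotted module name for importlib."""
    return path.replace("/", ".").split(".py")[0]


def merge_configs(main_cfg, extra_cfgs):
    """Take sample metadata from the main config only; extend only `files`."""
    merged = SimpleNamespace(
        samplename=main_cfg.samplename,
        sampletype=main_cfg.sampletype,
        dasno=main_cfg.dasno,
        campaign=main_cfg.campaign,
        files=list(main_cfg.files),     # copy, so the original is untouched
    )
    for cfg in extra_cfgs:
        merged.files.extend(cfg.files)  # metadata of extra configs is ignored
    return merged


# Stand-ins for the modules imported from --config and --chainAdd
main = SimpleNamespace(samplename="DYto2L-M50-amc", sampletype="MC",
                       dasno=1010, campaign="Run3Summer22", files=["a.root"])
ext = SimpleNamespace(samplename="DYto2L-M50-amc_ext", sampletype="MC",
                      dasno=1011, campaign="Run3Summer22",
                      files=["b.root", "c.root"])

merged = merge_configs(main, [ext])
print(merged.samplename, merged.dasno, merged.files)
# DYto2L-M50-amc 1010 ['a.root', 'b.root', 'c.root']
```

A comma-separated `--chainAdd` value, as in the example command, would be split on `,` and each path passed through `config_to_module` before importing.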

Each Condor job runs the following command, with the five positional arguments filled in from argumentString:

>> python3 nanoRDF.py $1 $2 $3 $4 $5
>> argumentString = f"{inputFiles} {outputFileName} {dasno} {samplename} {campaign}"
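
On the receiving side, a script called this way has to unpack five positional arguments in the order given by argumentString. The sketch below shows one way to do that; it is not taken from nanoRDF.py. The function `parse_job_args` is hypothetical, and it assumes that several input files are passed as a single comma-separated string.

```python
import sys


def parse_job_args(argv):
    """Unpack the five positional arguments passed to the job script."""
    if len(argv) != 5:
        raise SystemExit(
            "usage: nanoRDF.py inputFiles outputFileName dasno samplename campaign"
        )
    input_files, output_file, dasno, samplename, campaign = argv
    return {
        # Assumption: several input files arrive as one comma-separated string.
        "input_files": [name for name in input_files.split(",") if name],
        "output_file": output_file,
        "dasno": int(dasno),
        "samplename": samplename,
        "campaign": campaign,
        # dasno > 1000 marks MC, following the sample config convention
        "is_mc": int(dasno) > 1000,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    print(parse_job_args(sys.argv[1:]))
```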

To create a bash file submitting condor jobs in bulk for all configs from Run3Summer22/ campaign, as an example:

ls input/Run3Summer22/*.py | awk '{printf "python3 submitCondor.py --jobname condorjob --script nanoRDF.py --config %s --bunch 10\n", $1}' > submit_all.sh

Example contents of submit_all.sh:

python3 submitCondor.py --jobname condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_Run3Summer22.py --bunch 10
python3 submitCondor.py --jobname condorjob --script nanoRDF.py --config input/Run3Summer22/DYto2L-M50-amc_ext_Run3Summer22.py --bunch 10
python3 submitCondor.py --jobname condorjob --script nanoRDF.py --config input/Run3Summer22/Run3Summer22_EraC_Muon_Run3Summer22.py --bunch 10
python3 submitCondor.py --jobname condorjob --script nanoRDF.py --config input/Run3Summer22/ZZto4L-pow_ext_Run3Summer22.py --bunch 10
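
The same list of commands can also be produced from Python, which is convenient when the job name or bunch size should vary per campaign. This is a sketch equivalent to the `ls | awk` one-liner above; `build_submit_lines` is a hypothetical helper, not a script in the repository.

```python
import glob


def build_submit_lines(campaign_dir, jobname="condorjob", bunch=10):
    """Return one submitCondor.py command per .py config in campaign_dir."""
    configs = sorted(glob.glob(f"{campaign_dir}/*.py"))
    return [
        f"python3 submitCondor.py --jobname {jobname} --script nanoRDF.py "
        f"--config {cfg} --bunch {bunch}"
        for cfg in configs
    ]


# Print the commands; redirect the output to submit_all.sh to save them.
for line in build_submit_lines("input/Run3Summer22"):
    print(line)
```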
