Reduce peak memory in PostProcess.py PDF weight extraction #334
Open
dprim7 wants to merge 4 commits into LPC-HH:main from
Conversation
…t contract

- calculate_trigger_weights: data returns ones, MC/signal apply SF with uncertainty
- calculate_txbb_weights: per-jet SF multiplication, data/signal/single-jet paths
- discretize_var: default/custom bins, clipping, integer output
- Output contract: define required columns per sample type (data/signal/ttbar/bg), verify event list columns are a subset, cross-check consistency

These tests guard against regressions when refactoring load_process_run3_samples for memory optimization (early events_dict deletion).
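The discretize_var contract described in that commit (default or custom bins, clipping, integer output) might look roughly like the following NumPy sketch. The function name comes from the commit message, but the signature and the default binning here are assumptions, not the repository's actual code:

```python
import numpy as np

def discretize_var(vals, bins=None):
    """Hypothetical reconstruction: map values to integer bin indices,
    clipping out-of-range values into the first/last bin."""
    if bins is None:
        bins = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # assumed default binning
    idx = np.digitize(vals, bins) - 1
    # clip so underflow lands in bin 0 and overflow in the last bin
    return np.clip(idx, 0, len(bins) - 2).astype(np.int64)
```

A test of this contract would check all three properties: out-of-range inputs are clipped rather than dropped, custom bins override the default, and the output dtype is integer.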
Per-column df[i].to_numpy() called 101 times in a loop creates 17+ GB of transient allocations due to pandas column-access overhead. Bulk df.to_numpy() then column slicing avoids this entirely.

Isolated benchmark (vbfhh4b-k2v0, 232K events, 101 PDF columns):
- Before: 19.0 GB peak RSS
- After: 2.2 GB peak RSS (-88%)

Full pipeline (2022, 20 samples):
- Before: 23.0 GB peak RSS
- After: 19.3 GB peak RSS (-16%)

Template output verified bit-identical (90/90 histograms).
Problem
load_process_run3_samples() consumes 17+ GB of transient RSS while extracting PDF weight columns for signal samples, contributing to OOM crashes during template production.

Cause
The PDF weights are pulled out one column at a time:
```python
{f"pdf_weights_{i}": events_dict["pdf_weights"][i].to_numpy() for i in range(n_pdf_weights)}
```

For a 232K-event sample with 101 PDF columns, the per-column df[i].to_numpy() calls pile up enough pandas intermediates to push RSS to 19 GB before the GC catches up. The actual data is ~188 MB.

Fix
Materialize the DataFrame once, then slice columns out of the numpy array:
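The diff itself isn't rendered above, so here is a minimal sketch of the bulk-materialization pattern, assuming events_dict["pdf_weights"] is a pandas DataFrame with one column per PDF variation (the helper name and surrounding structure are illustrative, not the merged code):

```python
import numpy as np
import pandas as pd

def extract_pdf_weights(events_dict, n_pdf_weights):
    # One to_numpy() call materializes the whole DataFrame as a single
    # contiguous ndarray, instead of 101 per-column to_numpy() calls
    # each carrying pandas block-manager overhead.
    pdf_arr = events_dict["pdf_weights"].to_numpy()
    # Column slices of an ndarray are views, not copies.
    return {f"pdf_weights_{i}": pdf_arr[:, i] for i in range(n_pdf_weights)}
```

Because each pdf_arr[:, i] is a view into the one materialized array, the resulting dict costs essentially nothing beyond the single bulk conversion.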
Same pattern applied to scale_weights. +4 / -7 lines.

Benchmarks
Isolated, single sample (vbfhh4b-k2v0, 232K events, 101 PDF columns):

| | Peak RSS |
| --- | --- |
| Before | 19.0 GB |
| After | 2.2 GB (-88%) |

Full pipeline (PostProcess.py --years 2022, 20 samples, glopart-v2):

| | Peak RSS |
| --- | --- |
| Before | 23.0 GB |
| After | 19.3 GB (-16%) |

The full-pipeline delta is smaller because the accumulated events_dict_postprocess baseline (~11 GB after the data sample) dominates. The fix removes the single largest transient spike, which is what trips OOM on 2024.

Correctness
90/90 template histograms bit-identical between before/after (np.allclose with equal_nan=True).

Things ruled out along the way
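Among the mitigations ruled out below, the gc.collect() + malloc_trim(0) hook between samples can be sketched as follows. This is a glibc-specific technique (malloc_trim asks the allocator to return freed arenas to the OS) and a no-op where the symbol is unavailable; the function name here is illustrative:

```python
import ctypes
import ctypes.util
import gc

def trim_between_samples():
    """Force a GC pass, then ask glibc to release freed heap back to the OS."""
    gc.collect()
    libc_path = ctypes.util.find_library("c")
    if not libc_path:
        return  # no shared libc found (e.g. statically linked interpreter)
    try:
        ctypes.CDLL(libc_path).malloc_trim(0)  # glibc extension
    except (OSError, AttributeError):
        pass  # non-glibc libc: malloc_trim is not exported
```

As the list notes, this kind of between-samples hook cannot help here: the spike is intra-sample, peaking inside a single sample's processing before any such hook runs.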
- jemalloc and MALLOC_TRIM_THRESHOLD_=0 both gave 22-23 GB
- gc.collect() + malloc_trim(0) between samples — spike is intra-sample
- del events_dict after last use — peak is during more_vars construction, while events_dict is still live

How to reproduce
Compare Maximum resident set size between the two.
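"Maximum resident set size" is the peak-RSS line printed by GNU /usr/bin/time -v. As an in-process alternative (a sketch, not necessarily how the numbers above were gathered), resource.getrusage exposes the same counter:

```python
import resource
import sys

def peak_rss_bytes():
    """Peak RSS of the current process, normalized to bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but bytes on macOS
    return rss if sys.platform == "darwin" else rss * 1024

print(f"peak RSS: {peak_rss_bytes() / 1e9:.2f} GB")
```

Printing this at the end of each run gives the same before/after comparison without wrapping the command in GNU time.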