Reduce peak memory in PostProcess.py PDF weight extraction #334

Open
dprim7 wants to merge 4 commits into LPC-HH:main from dprim7:reduce-peak-memory

Conversation

dprim7 (Collaborator) commented on Apr 7, 2026

Problem

load_process_run3_samples() consumes 17+ GB of transient RSS while extracting PDF weight columns for signal samples, contributing to OOM crashes during template production.

Cause

The PDF weights are pulled out one column at a time:

{f"pdf_weights_{i}": events_dict["pdf_weights"][i].to_numpy() for i in range(n_pdf_weights)}

For a 232K-event sample with 101 PDF columns, the per-column `df[i].to_numpy()` calls pile up enough pandas intermediates to push RSS to 19 GB before the GC catches up. The actual data is only ~188 MB.

Fix

Materialize the DataFrame once, then slice columns out of the numpy array:

```python
pdf_np = events_dict["pdf_weights"].to_numpy()
{f"pdf_weights_{i}": pdf_np[:, i].copy() for i in range(n_pdf_weights)}
```

The same pattern is applied to `scale_weights`. +4 / -7 lines.
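
For illustration, here is a minimal, self-contained sketch of the two patterns on synthetic data (the sizes match the benchmark; the DataFrame construction is a stand-in, not the repo's loader code):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for events_dict["pdf_weights"]: 232K events x 101 columns (~188 MB).
n_events, n_pdf_weights = 232_000, 101
pdf_df = pd.DataFrame(np.random.default_rng(0).standard_normal((n_events, n_pdf_weights)))

# Before: one pandas column access + conversion per weight index.
# Each pdf_df[i].to_numpy() goes through a Series intermediate, and those
# intermediates accumulate faster than the GC reclaims them.
before = {f"pdf_weights_{i}": pdf_df[i].to_numpy() for i in range(n_pdf_weights)}

# After: materialize the frame once, then slice the 2-D ndarray.
# .copy() detaches each column from the parent array so each can be
# freed independently once pdf_np goes out of scope.
pdf_np = pdf_df.to_numpy()
after = {f"pdf_weights_{i}": pdf_np[:, i].copy() for i in range(n_pdf_weights)}

# Both spellings produce the same values.
assert all(np.array_equal(before[k], after[k]) for k in before)
```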

Benchmarks

Isolated, single sample (vbfhh4b-k2v0, 232K events, 101 PDF columns):

|        | Peak RSS |
| ------ | -------- |
| before | 19.0 GB  |
| after  | 2.2 GB   |

Full pipeline (PostProcess.py --years 2022, 20 samples, glopart-v2):

|        | Peak RSS |
| ------ | -------- |
| before | 23.0 GB  |
| after  | 19.3 GB  |

The full-pipeline delta is smaller because the accumulated `events_dict_postprocess` baseline (~11 GB after the data sample) dominates. The fix removes the single largest transient spike, which is what trips the OOM on the 2024 dataset.

Correctness

All 90/90 template histograms are bit-identical between the before and after runs (checked with np.allclose, equal_nan=True).
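
For reference, a sketch of this kind of check, assuming the two runs dump their templates as pickled dicts of numpy arrays (the file names and layout here are hypothetical, not the repo's actual output format):

```python
import pickle
import numpy as np

# Hypothetical paths: {name: np.ndarray} template dicts saved by the
# before/after pipeline runs.
with open("templates_before.pkl", "rb") as f:
    before = pickle.load(f)
with open("templates_after.pkl", "rb") as f:
    after = pickle.load(f)

assert before.keys() == after.keys()
for name in before:
    # equal_nan=True so NaN bins compare equal, as in the check above.
    assert np.allclose(before[name], after[name], equal_nan=True), name
print(f"{len(before)}/{len(before)} histograms match")
```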

Things ruled out along the way

- glibc fragmentation — jemalloc and `MALLOC_TRIM_THRESHOLD_=0` both gave 22-23 GB
- `gc.collect()` + `malloc_trim(0)` between samples — the spike is intra-sample
- `del events_dict` after last use — the peak occurs during `more_vars` construction, while `events_dict` is still live

How to reproduce

```bash
PYTHONPATH=src /usr/bin/time -v micromamba run -n hh4b python bench_spike_isolate.py
PYTHONPATH=src /usr/bin/time -v micromamba run -n hh4b python bench_fix_validation.py
```

Compare `Maximum resident set size` between the two runs.
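
The same peak-RSS number can also be read in-process on Linux via the standard library (a sketch, not part of this PR; note ru_maxrss is in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_gb() -> float:
    """Peak RSS of the current process in GB (Linux: ru_maxrss is in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

# e.g. print at the end of either benchmark script:
print(f"peak RSS: {peak_rss_gb():.1f} GB")
```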

dprim7 and others added 4 commits April 2, 2026 17:06
…t contract

- calculate_trigger_weights: data returns ones, MC/signal apply SF with uncertainty
- calculate_txbb_weights: per-jet SF multiplication, data/signal/single-jet paths
- discretize_var: default/custom bins, clipping, integer output
- Output contract: define required columns per sample type (data/signal/ttbar/bg),
  verify event list columns are a subset, cross-check consistency

These tests guard against regressions when refactoring load_process_run3_samples
for memory optimization (early events_dict deletion).
Calling per-column df[i].to_numpy() 101 times in a loop creates 17+ GB
of transient allocations due to pandas column-access overhead. Bulk
df.to_numpy() followed by column slicing avoids this entirely.

Isolated benchmark (vbfhh4b-k2v0, 232K events, 101 PDF columns):
  Before: 19.0 GB peak RSS
  After:   2.2 GB peak RSS (-88%)

Full pipeline (2022, 20 samples):
  Before: 23.0 GB peak RSS
  After:  19.3 GB peak RSS (-16%)

Template output verified bit-identical (90/90 histograms).