Skip to content

feat(build-dsp): add per-processor profiling report#200

Open
gipert wants to merge 2 commits into
mainfrom
feat/processing-chain-profiling
Open

feat(build-dsp): add per-processor profiling report#200
gipert wants to merge 2 commits into
mainfrom
feat/processing-chain-profiling

Conversation

@gipert
Copy link
Copy Markdown
Member

@gipert gipert commented May 31, 2026

Summary

  • ProcessorManager.execute() now measures wall time (perf_counter) and RSS delta (psutil) around each processor call per block, accumulating total wall time, per-block max RSS delta, and per-block average RSS delta
  • ProcessingChain gains get_profile() returning the full stats dict; get_timing() delegates to it (backward compatible)
  • build_dsp logs an INFO-level profiling table after each table is processed, showing the top 15 processors by wall time with wall time (s + %), max RSS delta, and average RSS delta; remaining processors are collapsed into a single summary line
  • psutil added as a package dependency

Example output:

=== DSP profiling report (34 processors, 10000 entries) ===
 processor                                            wall_s       %    Δrss_max    Δrss_avg
────────────────────────────────────────────────────────────────────────────────────────────────────
 fft_convolve_wf(wf_blsub[:round(…)], cusp_k…)        1.234   25.0%   +12.30 MB    +0.41 MB
 fft_convolve_wf(wf_blsub[:round(…)], zac_ke…)        1.107   22.4%   +12.10 MB    +0.39 MB
 cusp_filter(db.cusp.sigma/wf_blsub.period, …)        0.781   15.8%    +8.40 MB    +0.28 MB
 ...
 [19 more processors]                                  0.512   10.4%         ...         ...
────────────────────────────────────────────────────────────────────────────────────────────────────
 total                                                  4.941  100.0%   +41.00 MB    +1.34 MB

Test plan

  • All existing tests pass (tests/test_processing_chain.py, tests/test_build_dsp.py)
  • Visually verify report appears in INFO log output when running build_dsp on a real file

After processing each table, log an INFO-level report showing wall time
and RSS delta (max and rolling average) per processor, sorted by wall
time descending. Top 15 processors are shown explicitly; any remainder
is collapsed into a single summary line.

Mechanics: ProcessorManager.execute() now snapshots RSS via psutil
before/after each block call and accumulates wall time with
perf_counter. ProcessingChain gains get_profile() returning the full
stats dict; get_timing() delegates to it for backward compatibility.
psutil added as a package dependency.
@gipert gipert requested a review from iguinn May 31, 2026 14:58
@gipert
Copy link
Copy Markdown
Member Author

gipert commented May 31, 2026

this would be very useful in the dataflow

@gipert gipert requested a review from ggmarshall May 31, 2026 14:58
@iguinn
Copy link
Copy Markdown
Collaborator

iguinn commented Jun 1, 2026

Hey Luigi, agreed that we want this kind of monitoring, but I don't think this is an accurate picture of the memory usage for processors. This is because most of the memory is allocated during setup. The memory buffers are allocated when the processor is added (these are just numpy arrays so you could probably just get the sizes directly; they also shouldn't be that big in the end) and the numba processors (I've never found a good way to figure out how to attribute this unfortunately). This will capture any added memory that isn't happening in place within buffers (I guess this is probably why the ffts are on top, since fft uses an additional buffer).

edit: I just realized I was misinterpretting things and the delta isn't maximum memory usage of the processor while it is running, but memory that is allocated and not immediately deallocated...Which could imply that there is some sort of a leak, but I'm curious if there isn't clean-up happening after the delta is calculated? Otherwise, adding this around add_processor would be very interesting

Similarly, the timing won't capture everything (particular IO), but I think it is probably capturing what is important (and the built in profilers do awful with numba functions).

Replace per-block RSS measurement in ProcessorManager.execute() with a
setup-time snapshot in build_processing_chain. The new measurement wraps
both the init_args factory call (where filter kernels are allocated) and
add_processor (where output buffers are created), giving a meaningful
per-processor memory footprint. Per-execution RSS was misleading since
buffers are pre-allocated and only FFT temporaries showed non-zero
deltas. The profile report column is renamed to setup_MB accordingly.
@gipert
Copy link
Copy Markdown
Member Author

gipert commented Jun 1, 2026

@iguinn does it make more sense now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants