feat(build-dsp): add per-processor profiling report#200
Conversation
After processing each table, log an INFO-level report showing wall time and RSS delta (max and rolling average) per processor, sorted by wall time descending. Top 15 processors are shown explicitly; any remainder is collapsed into a single summary line. Mechanics: ProcessorManager.execute() now snapshots RSS via psutil before/after each block call and accumulates wall time with perf_counter. ProcessingChain gains get_profile() returning the full stats dict; get_timing() delegates to it for backward compatibility. psutil added as a package dependency.
|
this would be very useful in the dataflow |
|
Hey Luigi, agreed that we want this kind of monitoring, but I don't think this is an accurate picture of the memory usage for processors. This is because most of the memory is allocated during setup. The memory buffers are allocated when the processor is added (these are just numpy arrays so you could probably just get the sizes directly; they also shouldn't be that big in the end) and the numba processors (I've never found a good way to figure out how to attribute this unfortunately). This will capture any added memory that isn't happening in place within buffers (I guess this is probably why the ffts are on top, since fft uses an additional buffer). edit: I just realized I was misinterpretting things and the delta isn't maximum memory usage of the processor while it is running, but memory that is allocated and not immediately deallocated...Which could imply that there is some sort of a leak, but I'm curious if there isn't clean-up happening after the delta is calculated? Otherwise, adding this around add_processor would be very interesting Similarly, the timing won't capture everything (particular IO), but I think it is probably capturing what is important (and the built in profilers do awful with numba functions). |
Replace per-block RSS measurement in ProcessorManager.execute() with a setup-time snapshot in build_processing_chain. The new measurement wraps both the init_args factory call (where filter kernels are allocated) and add_processor (where output buffers are created), giving a meaningful per-processor memory footprint. Per-execution RSS was misleading since buffers are pre-allocated and only FFT temporaries showed non-zero deltas. The profile report column is renamed to setup_MB accordingly.
|
@iguinn does it make more sense now? |
Summary
ProcessorManager.execute()now measures wall time (perf_counter) and RSS delta (psutil) around each processor call per block, accumulating total wall time, per-block max RSS delta, and per-block average RSS deltaProcessingChaingainsget_profile()returning the full stats dict;get_timing()delegates to it (backward compatible)build_dsplogs an INFO-level profiling table after each table is processed, showing the top 15 processors by wall time with wall time (s + %), max RSS delta, and average RSS delta; remaining processors are collapsed into a single summary linepsutiladded as a package dependencyExample output:
Test plan
tests/test_processing_chain.py,tests/test_build_dsp.py)build_dspon a real file