| title | Full-System Profiler — Overview |
|---|---|
A unified profiling suite that simultaneously collects GPU hardware counters, CPU utilization, memory usage, and disk I/O — both system-wide and per-process. All components are controlled by a single .pbtxt config file and produce time-aligned protobuf traces visualized on one plot.
┌─────────────────────────────────────────────────────────────────────┐
│  Your Application (links against libcupti_profiler.so)              │
│                                                                     │
│    ProfilerSuite suite;                                             │
│    suite.LoadConfig("config.pbtxt");                                │
│    suite.Configure();                                               │
│    suite.Start();                                                   │
│    // ... your CUDA workload ...                                    │
│    suite.Stop();                                                    │
└───────┬─────────────────────┬───────────────────────┬───────────────┘
        │                     │                       │
        ▼                     ▼                       ▼
┌───────────────┐    ┌─────────────────┐    ┌───────────────────┐
│  GpuProfiler  │    │  SystemProfiler │    │  DiskProfiler     │
│               │    │                 │    │                   │
│  CUPTI PM     │    │  /proc/stat     │    │  /proc/diskstats  │
│  Sampling     │    │  /proc/meminfo  │    │  /sys/block/*/    │
│  HW counters  │    │  /proc/[PID]/*  │    │  /proc/[PID]/io   │
└───────┬───────┘    └────────┬────────┘    └─────────┬─────────┘
        │                     │                       │
        ▼                     ▼                       ▼
 gpu_metrics.pb       system_metrics.pb        disk_metrics.pb
        │                     │                       │
        └────────────┬────────┴───────────────────────┘
                     ▼
          tools/visualize_all.py → full_profile.png

Each profiler runs independently with its own sampling frequency and flush interval. They write to separate .pb files using length-delimited protobuf streaming.
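
Length-delimited streaming usually means that every serialized message is preceded by its byte length, so a reader can split the file into records without knowing the message schema. The sketch below assumes the common protobuf convention of a base-128 varint length prefix; the exact framing this suite uses is not stated here, and `write_frame` and `read_frames` are hypothetical helper names, not part of the project's API.

```python
import io
from typing import BinaryIO, Iterator


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one length-prefixed record (hypothetical helper)."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw record payloads until a clean end of stream."""
    while True:
        length = 0
        shift = 0
        first_byte = True
        while True:
            chunk = stream.read(1)
            if not chunk:
                if first_byte:
                    return  # clean EOF between two records
                raise EOFError("stream ended inside a length prefix")
            first_byte = False
            length |= (chunk[0] & 0x7F) << shift
            if not chunk[0] & 0x80:
                break
            shift += 7
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("stream ended inside a record")
        yield payload
```

Each yielded payload would then be handed to the `ParseFromString` method of the matching generated message class; which class applies to which file is defined by the project's `.proto` files, not by this sketch.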
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

gpu {
  enabled: true
  device_index: 0
  sampling_interval_ns: 100000
  metrics: "sm__cycles_active.avg"
  metrics: "sm__cycles_elapsed.avg"
  flush_interval_ms: 10000
  output_file: "gpu_metrics.pb"
}
system {
  enabled: true
  sampling_interval_ms: 100
  pids: 0  # 0 = current process
  flush_interval_ms: 5000
  output_file: "system_metrics.pb"
}
disk {
  enabled: true
  sampling_interval_ms: 100
  devices: "nvme0n1"
  pids: 0
  flush_interval_ms: 5000
  output_file: "disk_metrics.pb"
}

Tip
PID 0 is a special sentinel — it is resolved to the current process PID (getpid()) at runtime, so you don't need to know it in advance.
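
The sentinel handling can be pictured as a small normalization step over the configured PID list. This is a minimal sketch, not the suite's implementation: `resolve_pids` is a hypothetical name, and dropping duplicates is an assumption made here so that listing both `0` and the real PID does not track the same process twice.

```python
import os
from typing import Iterable, List


def resolve_pids(pids: Iterable[int]) -> List[int]:
    """Replace the sentinel 0 with the current PID, keeping first-seen order."""
    resolved: List[int] = []
    for pid in pids:
        actual = os.getpid() if pid == 0 else pid
        if actual not in resolved:  # assumption: duplicates are collapsed
            resolved.append(actual)
    return resolved
```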
./build/examples/full_system_profiling -c configs/example.pbtxt

# Python protobuf bindings are generated by the cmake build into
# generated/proto/, so no manual `protoc` step is needed.
# Static PNG (driven by the session manifest the suite emits at Start())
python tools/visualize_all.py profiling_output/session_metadata.pb \
    -o full_profile.png
# Interactive Bokeh HTML with built-in HTTP server
python tools/visualize_interactive.py profiling_output/session_metadata.pb

Both visualizers consume a single session_metadata.pb and
auto-discover the per-probe .pb files from the manifest. Probes
that didn't run are simply omitted from the layout. See
docs/tools/README.md for the full flag set
(--smooth-window-s, --display-hz, --render-backend, --theme,
--panel-layout, etc.).
The config uses protobuf text format (.pbtxt). Lines starting with # are comments.
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable GPU profiling |
| `device_index` | int32 | 0 | CUDA device index |
| `sampling_interval_ns` | uint64 | 100000 | HW counter sampling period (ns). 100000 ns = 10 kHz |
| `hw_buffer_size` | uint64 | 536870912 | GPU ring buffer size (bytes). Default is 512 MiB |
| `max_samples` | uint64 | 50000 | Decode buffer capacity per cycle |
| `metrics` | string[] | (empty) | CUPTI metric names. Must fit in a single pass |
| `flush_interval_ms` | uint64 | 10000 | Periodic flush interval (ms). 0 = flush at end only |
| `output_file` | string | (empty) | Output .pb path |
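
To make the interval fields concrete, the arithmetic below converts a sampling period into a rate and counts how many samples fall into one flush interval. It is only unit conversion; the function names are hypothetical, and no claim is made about how these numbers interact with `max_samples` or `hw_buffer_size`.

```python
from typing import Optional


def sampling_rate_hz(sampling_interval_ns: int) -> float:
    """Convert a sampling period in nanoseconds to a rate in hertz."""
    if sampling_interval_ns <= 0:
        raise ValueError("sampling_interval_ns must be positive")
    return 1e9 / sampling_interval_ns


def samples_per_flush(sampling_interval_ns: int,
                      flush_interval_ms: int) -> Optional[int]:
    """Whole samples taken during one flush interval.

    Returns None when flush_interval_ms is 0, which the table above
    describes as "flush at end only" (no periodic flush).
    """
    if sampling_interval_ns <= 0:
        raise ValueError("sampling_interval_ns must be positive")
    if flush_interval_ms == 0:
        return None
    return (flush_interval_ms * 1_000_000) // sampling_interval_ns
```

With the defaults (100000 ns and 10000 ms), this gives a 10 kHz rate and 100000 samples between flushes.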

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable CPU + memory profiling |
| `sampling_interval_ms` | uint64 | 100 | Sampling period in milliseconds |
| `pids` | uint32[] | (empty) | PIDs for per-process tracking. 0 = self |
| `flush_interval_ms` | uint64 | 5000 | Periodic flush interval |
| `output_file` | string | (empty) | Output .pb path |
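
The architecture diagram lists `/proc/stat` as the CPU data source. A common way to turn its cumulative tick counters into a utilization figure is to difference two successive samples of the aggregate `cpu` line. The sketch below shows that general technique under the assumption that idle time is `idle + iowait`; it is not taken from the suite's source, and the function names are hypothetical.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # user, nice, system, idle, iowait, irq, softirq, steal.
    # guest and guest_nice are skipped: they are already counted in user/nice.
    ticks = [int(value) for value in fields[1:9]]
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    return idle, sum(ticks)


def cpu_utilization(prev_line: str, curr_line: str) -> float:
    """Fraction of non-idle time between two samples, in the range 0..1."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    total_delta = curr_total - prev_total
    if total_delta <= 0:
        return 0.0  # no ticks elapsed between the samples
    return 1.0 - (curr_idle - prev_idle) / total_delta
```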

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | false | Enable disk I/O profiling |
| `sampling_interval_ms` | uint64 | 100 | Sampling period in milliseconds |
| `devices` | string[] | (empty) | Block device names (e.g. "nvme0n1", "sda") |
| `pids` | uint32[] | (empty) | PIDs for per-process I/O. 0 = self |
| `flush_interval_ms` | uint64 | 5000 | Periodic flush interval |
| `output_file` | string | (empty) | Output .pb path |

Warning
Per-process disk I/O (/proc/[PID]/io) requires the profiler to run as the same user as the target process, or with CAP_SYS_PTRACE. If access is denied, a warning is printed and per-process disk data is skipped.
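
The fallback described in the warning can be illustrated with a reader that treats a permission failure as "no data" instead of an error. This is a sketch of the behavior, not the suite's code: `parse_proc_io` and `read_proc_io` are hypothetical names, and the warning text is illustrative.

```python
import sys
from typing import Dict, Optional


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse the 'key: value' lines of /proc/[PID]/io into a dict."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def read_proc_io(pid: int) -> Optional[Dict[str, int]]:
    """Return the I/O counters for pid, or None if they cannot be read."""
    try:
        with open(f"/proc/{pid}/io", "r", encoding="ascii") as handle:
            return parse_proc_io(handle.read())
    except PermissionError:
        # Different user and no CAP_SYS_PTRACE: warn and skip this PID.
        print(f"warning: cannot read /proc/{pid}/io; "
              "skipping per-process disk data", file=sys.stderr)
        return None
    except OSError:
        # Process exited, or /proc is not available on this platform.
        return None
```

Returning `None` lets the sampling loop keep collecting system-wide disk metrics while omitting the per-process series for that PID.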
The visualizers are descriptor-driven: panels are declared in
configs/visualizer_panels.pbtxt (override with --panel-layout).
Each panel binds an FQN glob to a subplot, and the renderer walks
the layout — there is no hand-coded row table. See
docs/tools/README.md for the per-tool flag
reference and docs/metric-model.md for the
underlying type system.
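
The idea of binding a name glob to a subplot can be sketched with shell-style pattern matching. Everything concrete below is an assumption for illustration: the series names, the `(title, glob)` layout shape, and the use of `fnmatch` semantics are hypothetical, and the real layout schema lives in `configs/visualizer_panels.pbtxt`.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence, Tuple


def assign_series(layout: Sequence[Tuple[str, str]],
                  series_names: Sequence[str]) -> Dict[str, List[str]]:
    """Map each panel title to the series whose name matches its glob.

    Panels with no matching series are left out, mirroring the
    auto-skip behavior described for the default layout.
    """
    panels: Dict[str, List[str]] = {}
    for title, pattern in layout:
        matched = [name for name in series_names if fnmatchcase(name, pattern)]
        if matched:
            panels[title] = matched
    return panels
```

Because the renderer only walks the layout, adding a panel in this model means adding one `(title, glob)` entry rather than editing plotting code.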
The default layout produces 12 panels (some auto-skipped when no
series matches): SM Utilization, Active Warps / Cycle, DRAM
Bandwidth, PCIe Bandwidth, NVLink Bandwidth, CPU Utilization, Per-PID
CPU, System Memory, Per-PID Resident Memory, Disk Bandwidth, Disk
Queue Depth, Per-PID I/O. Panels with aggregation: PANEL_AGGREGATION_INTEGRATE (PCIe / NVLink / Disk Bandwidth /
Per-PID I/O) add a companion cumulative-total panel directly below.
Region annotations (shaded spans) overlay every metric panel.
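
A cumulative-total panel is, in effect, the running integral of a rate series over time. The sketch below uses the trapezoidal rule, which is an assumption: the integration method behind `PANEL_AGGREGATION_INTEGRATE` is not specified here, and `cumulative_total` is a hypothetical name.

```python
from typing import List, Sequence


def cumulative_total(times_s: Sequence[float],
                     rates: Sequence[float]) -> List[float]:
    """Running trapezoidal integral of a rate series (e.g. bytes/s -> bytes)."""
    if len(times_s) != len(rates):
        raise ValueError("times_s and rates must have the same length")
    totals: List[float] = []
    running = 0.0
    for i in range(len(times_s)):
        if i > 0:
            dt = times_s[i] - times_s[i - 1]
            running += 0.5 * (rates[i] + rates[i - 1]) * dt
        totals.append(running)
    return totals
```

Using the actual timestamp differences, not a fixed step, keeps the total correct when samples are unevenly spaced.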
Usage: full_system_profiling [-c config.pbtxt]
  -c  Config file path (default: configs/example.pbtxt)

Usage: gemm_profiling [-d device] [-i interval_ns] [-o output.pb]
  -d  Device index (default: 0)
  -i  Sampling interval in nanoseconds (default: 100000)
  -o  Output protobuf file (default: gpu_metrics.pb)

All three visualizer tools take a single session_metadata.pb (or, for
visualize_single.py, a single gpu_metrics.pb) and auto-discover
the per-probe files from the manifest. See
docs/tools/README.md for the full flag set:
--catalog, --panel-layout, --smooth-window-s, --display-hz,
--render-backend, --theme, --host, --port, --live, etc.
python tools/visualize_all.py session_metadata.pb -o full_profile.png
python tools/visualize_interactive.py session_metadata.pb
python tools/visualize_single.py -i gpu_metrics.pb -o gpu.png
#include <cupti_profiler/gpu_profiler.h>
cupti_profiler::ProfilerConfig config;
config.outputFile = "gpu.pb";
config.metrics = { "sm__cycles_active.avg", "sm__cycles_elapsed.avg" };
cupti_profiler::GpuProfiler profiler;
profiler.Configure(config);
profiler.Start();
// ... your CUDA workload ...
profiler.Stop();

#include <cupti_profiler/profiler_suite.h>
cupti_profiler::ProfilerSuite suite;
suite.LoadConfig("my_config.pbtxt");
suite.Configure();
suite.Start();
// ... your workload ...
suite.Stop();

#include <unistd.h>  // getpid()
#include <cupti_profiler/system_profiler.h>
#include <cupti_profiler/disk_profiler.h>
cupti_profiler::SystemProfilerConfig sysCfg;
sysCfg.samplingIntervalMs = 50;
sysCfg.PIDs = { static_cast<uint32_t>(getpid()) };
sysCfg.outputFile = "system.pb";
cupti_profiler::SystemProfiler sysProfiler;
sysProfiler.Configure(sysCfg);
sysProfiler.Start();
// ... workload ...
sysProfiler.Stop();

- [[full-system-internals|Detailed internals documentation]]
- [[system-guide|GPU profiler system guide]]
- [[cupti-overhead-analysis|CUPTI overhead analysis]]