---
title: Full-System Profiler — Overview
tags:
  - profiling
  - gpu
  - cpu
  - memory
  - disk
  - documentation
---

Full-system profiler overview

A unified profiling suite that simultaneously collects GPU hardware counters, CPU utilization, memory usage, and disk I/O — both system-wide and per-process. All components are controlled by a single .pbtxt config file and produce time-aligned protobuf traces visualized on one plot.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  Your Application (links against libcupti_profiler.so)              │
│                                                                     │
│   ProfilerSuite suite;                                              │
│   suite.LoadConfig("config.pbtxt");                                 │
│   suite.Configure();                                                │
│   suite.Start();                                                    │
│   // ... your CUDA workload ...                                     │
│   suite.Stop();                                                     │
└───────┬─────────────────────┬───────────────────────┬───────────────┘
        │                     │                       │
        ▼                     ▼                       ▼
┌───────────────┐   ┌─────────────────┐   ┌───────────────────┐
│  GpuProfiler  │   │ SystemProfiler  │   │  DiskProfiler     │
│               │   │                 │   │                   │
│ CUPTI PM      │   │ /proc/stat      │   │ /proc/diskstats   │
│ Sampling      │   │ /proc/meminfo   │   │ /sys/block/*/     │
│ HW counters   │   │ /proc/[PID]/*   │   │ /proc/[PID]/io    │
└───────┬───────┘   └────────┬────────┘   └─────────┬─────────┘
        │                    │                       │
        ▼                    ▼                       ▼
  gpu_metrics.pb      system_metrics.pb       disk_metrics.pb
        │                    │                       │
        └────────────┬───────┴───────────────────────┘
                     ▼
          tools/visualize_all.py  →  full_profile.png

Each profiler runs independently with its own sampling frequency and flush interval. They write to separate .pb files using length-delimited protobuf streaming.
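
Length-delimited streaming generally means each serialized message is preceded by its byte length, so a reader can split the file into records without knowing the schema. The exact framing used by the profilers is not stated here; the sketch below assumes the common protobuf convention of a base-128 varint length prefix, and `read_varint` / `iter_delimited` are hypothetical helper names, not part of the tools.

```python
import io
from typing import BinaryIO, Iterator, Optional


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one base-128 varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended inside a varint length prefix")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of the varint
            return result
        shift += 7


def iter_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw bytes of each length-prefixed record in the stream."""
    while (size := read_varint(stream)) is not None:
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


# Two fake records of 3 and 300 bytes (300 encodes as the varint 0xAC 0x02).
demo = io.BytesIO(b"\x03abc" + b"\xac\x02" + b"x" * 300)
records = list(iter_delimited(demo))
```

With real traces, each yielded payload would be handed to `ParseFromString` on the generated message class for that probe. If the files turn out to use a different prefix (for example a fixed 4-byte length), only `read_varint` needs to change.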


Quick start

Build

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Create a config

gpu {
    enabled: true
    device_index: 0
    sampling_interval_ns: 100000
    metrics: "sm__cycles_active.avg"
    metrics: "sm__cycles_elapsed.avg"
    flush_interval_ms: 10000
    output_file: "gpu_metrics.pb"
}
system {
    enabled: true
    sampling_interval_ms: 100
    pids: 0                          # 0 = current process
    flush_interval_ms: 5000
    output_file: "system_metrics.pb"
}
disk {
    enabled: true
    sampling_interval_ms: 100
    devices: "nvme0n1"
    pids: 0
    flush_interval_ms: 5000
    output_file: "disk_metrics.pb"
}

Tip

PID 0 is a special sentinel — it is resolved to the current process PID (getpid()) at runtime, so you don't need to know it in advance.
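
The sentinel resolution described in the tip can be sketched as follows. This is an illustration of the idea, not the suite's code: `resolve_pids` is a hypothetical name, and dropping duplicate PIDs is an assumption added for the example.

```python
import os
from typing import List


def resolve_pids(configured: List[int]) -> List[int]:
    """Replace the 0 sentinel with this process's PID, keeping order."""
    resolved: List[int] = []
    for pid in configured:
        actual = os.getpid() if pid == 0 else pid
        if actual not in resolved:  # assumption: duplicates are dropped
            resolved.append(actual)
    return resolved


other = os.getpid() + 1  # an arbitrary PID that differs from our own
pids = resolve_pids([0, other, 0])
```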

Run

./build/examples/full_system_profiling -c configs/example.pbtxt

Visualize

# Python protobuf bindings are generated by the cmake build into
# generated/proto/ — no manual `protoc` step needed.

# Static PNG (driven by the session manifest the suite emits at Start())
python tools/visualize_all.py profiling_output/session_metadata.pb \
    -o full_profile.png

# Interactive Bokeh HTML with built-in HTTP server
python tools/visualize_interactive.py profiling_output/session_metadata.pb

Both visualizers consume a single session_metadata.pb and auto-discover the per-probe .pb files from the manifest. Probes that didn't run are simply omitted from the layout. See docs/tools/README.md for the full flag set (--smooth-window-s, --display-hz, --render-backend, --theme, --panel-layout, etc.).


Config file reference

The config uses protobuf text format (.pbtxt). A # starts a comment that runs to the end of the line, so a comment may stand on its own line or follow a field.

GPU section

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable GPU profiling |
| device_index | int32 | 0 | CUDA device index |
| sampling_interval_ns | uint64 | 100000 | HW counter sampling period (ns). 100000 ns = 10 kHz |
| hw_buffer_size | uint64 | 536870912 | GPU ring buffer size (bytes). Default is 512 MiB |
| max_samples | uint64 | 50000 | Decode buffer capacity per cycle |
| metrics | string[] | (empty) | CUPTI metric names. Must fit in a single pass |
| flush_interval_ms | uint64 | 10000 | Periodic flush interval (ms). 0 = flush at end only |
| output_file | string | (empty) | Output .pb path |
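
The relationship between the interval fields and the resulting rates is plain arithmetic, shown here as a worked example. The helper names are hypothetical, and the sample count is only the nominal number of samples taken between two flushes; it says nothing about how the profiler sizes its buffers.

```python
def interval_ns_to_hz(interval_ns: int) -> float:
    """Sampling frequency in Hz for a period given in nanoseconds."""
    return 1e9 / interval_ns


def samples_per_flush(interval_ns: int, flush_interval_ms: int) -> int:
    """Nominal number of samples taken between two periodic flushes."""
    return (flush_interval_ms * 1_000_000) // interval_ns


# GPU defaults from the table: 100000 ns period, 10000 ms flush interval.
gpu_hz = interval_ns_to_hz(100_000)
gpu_samples = samples_per_flush(100_000, 10_000)
```

With the defaults this gives 10 kHz and 100,000 samples per flush interval.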

System section (CPU + memory)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable CPU + memory profiling |
| sampling_interval_ms | uint64 | 100 | Sampling period in milliseconds |
| pids | uint32[] | (empty) | PIDs for per-process tracking. 0 = self |
| flush_interval_ms | uint64 | 5000 | Periodic flush interval (ms) |
| output_file | string | (empty) | Output .pb path |
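
System-wide CPU utilization is typically derived from two snapshots of the aggregate `cpu` line in /proc/stat, one per sampling tick. How SystemProfiler computes it internally is not documented here; the sketch below shows one common approach, with hypothetical function names, counting idle and iowait ticks as idle time.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = [int(x) for x in line.split()[1:]]
    # Order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields[:8])  # guest time is already counted in user/nice
    return idle, total


def cpu_utilization(prev_line: str, curr_line: str) -> float:
    """Percent of non-idle time between two snapshots of the 'cpu' line."""
    idle0, total0 = parse_cpu_line(prev_line)
    idle1, total1 = parse_cpu_line(curr_line)
    d_total = total1 - total0
    if d_total <= 0:
        return 0.0  # no ticks elapsed between the snapshots
    return 100.0 * (1.0 - (idle1 - idle0) / d_total)


# Synthetic snapshots: 200 ticks elapsed, 100 of them idle.
before = "cpu  100 0 100 800 0 0 0 0 0 0"
after = "cpu  150 0 150 900 0 0 0 0 0 0"
util = cpu_utilization(before, after)
```

On a live system the two inputs would be the first line of /proc/stat read sampling_interval_ms apart.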

Disk section

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable disk I/O profiling |
| sampling_interval_ms | uint64 | 100 | Sampling period in milliseconds |
| devices | string[] | (empty) | Block device names (e.g. "nvme0n1", "sda") |
| pids | uint32[] | (empty) | PIDs for per-process I/O. 0 = self |
| flush_interval_ms | uint64 | 5000 | Periodic flush interval (ms) |
| output_file | string | (empty) | Output .pb path |
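
Per-device bandwidth can be derived from the sector counters in /proc/diskstats, which are reported in 512-byte units regardless of the device's physical sector size. The following is a sketch of that calculation under the assumption that DiskProfiler works from counter deltas; the function names are hypothetical.

```python
from typing import Tuple

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text: str, device: str) -> Tuple[int, int]:
    """Return (sectors_read, sectors_written) for one block device."""
    for line in text.splitlines():
        parts = line.split()
        # Columns: major minor name reads merged sectors_read ms_reading
        #          writes merged sectors_written ...
        if len(parts) >= 10 and parts[2] == device:
            return int(parts[5]), int(parts[9])
    raise KeyError(device)


def bandwidth_bytes_per_s(
    prev: Tuple[int, int], curr: Tuple[int, int], dt_s: float
) -> Tuple[float, float]:
    """Read and write bandwidth between two samples taken dt_s apart."""
    read_bw = (curr[0] - prev[0]) * SECTOR_BYTES / dt_s
    write_bw = (curr[1] - prev[1]) * SECTOR_BYTES / dt_s
    return read_bw, write_bw


# Synthetic samples taken 0.5 s apart.
t0 = "259 0 nvme0n1 1000 0 2048 50 500 0 4096 30 0 80 80"
t1 = "259 0 nvme0n1 1010 0 4096 55 510 0 5120 35 0 90 90"
bw = bandwidth_bytes_per_s(
    parse_diskstats(t0, "nvme0n1"), parse_diskstats(t1, "nvme0n1"), 0.5
)
```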

Warning

Per-process disk I/O (/proc/[PID]/io) requires the profiler to run as the same user as the target process, or with CAP_SYS_PTRACE. If access is denied, a warning is printed and per-process disk data is skipped.
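
The warn-and-skip behavior described above can be sketched as follows. This is an illustration rather than the profiler's implementation; `parse_proc_io` and `read_proc_io` are hypothetical names, and the warning text is made up for the example.

```python
import sys
from typing import Dict, Optional


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse the 'key: value' lines found in /proc/[PID]/io."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def read_proc_io(pid: int) -> Optional[Dict[str, int]]:
    """Return the I/O counters for pid, or None if they are unavailable."""
    try:
        with open(f"/proc/{pid}/io") as f:
            return parse_proc_io(f.read())
    except PermissionError:
        # Different user and no CAP_SYS_PTRACE: warn once and skip.
        print(f"warning: cannot read /proc/{pid}/io; "
              "skipping per-process disk data", file=sys.stderr)
        return None
    except (FileNotFoundError, ProcessLookupError):
        return None  # process exited, or /proc is not available


sample = "rchar: 10\nwchar: 20\nread_bytes: 4096\nwrite_bytes: 0\n"
counters = parse_proc_io(sample)
```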


Visualization output

The visualizers are descriptor-driven: panels are declared in configs/visualizer_panels.pbtxt (override with --panel-layout). Each panel binds an FQN glob to a subplot, and the renderer walks the layout — there is no hand-coded row table. See docs/tools/README.md for the per-tool flag reference and docs/metric-model.md for the underlying type system.
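
To make the glob binding concrete, here is a minimal sketch of matching series names against per-panel patterns. The panel titles come from the default layout, but the FQN strings and glob patterns are invented for illustration, and the real renderer may use different matching rules than Python's `fnmatch`.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical layout: (panel title, FQN glob). Not the real panel file.
LAYOUT = [
    ("CPU Utilization", "system.cpu.*"),
    ("NVLink Bandwidth", "gpu.nvlink.*"),
    ("Disk Bandwidth", "disk.*.bandwidth.*"),
]


def bind_series(
    layout: List[Tuple[str, str]], series_names: List[str]
) -> List[Tuple[str, List[str]]]:
    """Walk the layout in order and attach the series matching each glob."""
    panels = []
    for title, pattern in layout:
        matched = sorted(n for n in series_names if fnmatchcase(n, pattern))
        if matched:  # panels with no matching series are skipped
            panels.append((title, matched))
    return panels


series = ["system.cpu.total", "disk.nvme0n1.bandwidth.read", "gpu.sm.active"]
panels = bind_series(LAYOUT, series)
```

In this example the NVLink panel is dropped because no series matches its pattern, which mirrors the auto-skip behavior of the default layout.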

The default layout produces 12 panels (some are auto-skipped when no series matches): SM Utilization, Active Warps / Cycle, DRAM Bandwidth, PCIe Bandwidth, NVLink Bandwidth, CPU Utilization, Per-PID CPU, System Memory, Per-PID Resident Memory, Disk Bandwidth, Disk Queue Depth, Per-PID I/O. Panels declared with aggregation: PANEL_AGGREGATION_INTEGRATE (PCIe Bandwidth, NVLink Bandwidth, Disk Bandwidth, and Per-PID I/O) add a companion cumulative-total panel directly below. Region annotations (shaded spans) overlay every metric panel.
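
A cumulative-total panel is, in effect, the time integral of a rate series. The sketch below uses the trapezoidal rule as one reasonable choice; the integration method the visualizers actually use is not specified here, and `cumulative_integral` is a hypothetical name.

```python
from typing import List, Sequence


def cumulative_integral(
    times_s: Sequence[float], rates: Sequence[float]
) -> List[float]:
    """Running trapezoidal integral of a rate series (e.g. bytes/s -> bytes)."""
    if not times_s:
        return []
    totals = [0.0]
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        totals.append(totals[-1] + 0.5 * (rates[i] + rates[i - 1]) * dt)
    return totals


# Bandwidth samples at t = 0, 1, 2 s, in bytes per second.
totals = cumulative_integral([0.0, 1.0, 2.0], [10.0, 10.0, 30.0])
```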


CLI reference

full_system_profiling

Usage: full_system_profiling [-c config.pbtxt]
  -c  Config file path (default: configs/example.pbtxt)

gemm_profiling (GPU-only, legacy)

Usage: gemm_profiling [-d device] [-i interval_ns] [-o output.pb]
  -d  Device index (default: 0)
  -i  Sampling interval in nanoseconds (default: 100000)
  -o  Output protobuf file (default: gpu_metrics.pb)

visualize_all.py / visualize_interactive.py / visualize_single.py

visualize_all.py and visualize_interactive.py take a single session_metadata.pb and auto-discover the per-probe files from the manifest; visualize_single.py instead takes a single gpu_metrics.pb via -i. See docs/tools/README.md for the full flag set: --catalog, --panel-layout, --smooth-window-s, --display-hz, --render-backend, --theme, --host, --port, --live, etc.

python tools/visualize_all.py session_metadata.pb -o full_profile.png
python tools/visualize_interactive.py session_metadata.pb
python tools/visualize_single.py -i gpu_metrics.pb -o gpu.png

Programmatic usage

Minimal integration (GPU only)

#include <cupti_profiler/gpu_profiler.h>

cupti_profiler::ProfilerConfig config;
config.outputFile = "gpu.pb";
config.metrics = { "sm__cycles_active.avg", "sm__cycles_elapsed.avg" };

cupti_profiler::GpuProfiler profiler;
profiler.Configure(config);
profiler.Start();
// ... your CUDA workload ...
profiler.Stop();

Full suite with config file

#include <cupti_profiler/profiler_suite.h>

cupti_profiler::ProfilerSuite suite;
suite.LoadConfig("my_config.pbtxt");
suite.Configure();
suite.Start();
// ... your workload ...
suite.Stop();

Individual profilers without config file

#include <unistd.h>   // getpid()
#include <cstdint>    // uint32_t
#include <cupti_profiler/system_profiler.h>
#include <cupti_profiler/disk_profiler.h>

cupti_profiler::SystemProfilerConfig sysCfg;
sysCfg.samplingIntervalMs = 50;
sysCfg.PIDs = { static_cast<uint32_t>(getpid()) };
sysCfg.outputFile = "system.pb";

cupti_profiler::SystemProfiler sysProfiler;
sysProfiler.Configure(sysCfg);
sysProfiler.Start();
// ... workload ...
sysProfiler.Stop();

References

  • [[full-system-internals|Detailed internals documentation]]
  • [[system-guide|GPU profiler system guide]]
  • [[cupti-overhead-analysis|CUPTI overhead analysis]]