| title | Full-System Profiler — Implementation Internals | |||||
|---|---|---|---|---|---|---|
Detailed implementation documentation covering /proc parsing, threading model, protobuf streaming format, timestamp alignment, permission requirements, and design decisions.
lib/
├── include/cupti_profiler/
│   ├── gpu_profiler.h             Public API: GPU PM Sampling (existing)
│   ├── system_profiler.h          Public API: CPU + memory
│   ├── disk_profiler.h            Public API: disk I/O
│   └── profiler_suite.h           Public API: orchestrator
└── src/
    ├── gpu_profiler.cpp           GpuProfiler::Impl (existing)
    ├── proc_readers.h/cpp         /proc/stat, /proc/meminfo, /proc/[PID]/*
    ├── disk_readers.h/cpp         /proc/diskstats, /sys/block/*, /proc/[PID]/io
    ├── system_profiler.cpp        SystemProfiler::Impl + sample thread
    ├── system_flush_thread.h/cpp  Flush for SystemMetricsTrace
    ├── disk_profiler.cpp          DiskProfiler::Impl + sample thread
    ├── disk_flush_thread.h/cpp    Flush for DiskMetricsTrace
    └── profiler_suite.cpp         Config loading + lifecycle orchestration
Source: First line of /proc/stat.
cpu user nice system idle iowait irq softirq steal guest guest_nice
All values are cumulative tick counts in units of 1/CLK_TCK seconds (CLK_TCK is typically 100, so one tick is 10 ms). To compute utilization between two samples:
total = user + nice + system + idle + iowait + irq + softirq + steal
busy = total - idle - iowait
CPU utilization % = delta(busy) / delta(total) × 100
User % = delta(user + nice) / delta(total) × 100
System % = delta(system) / delta(total) × 100
IOWait % = delta(iowait) / delta(total) × 100
Note
guest and guest_nice values are already included in user and nice respectively. They must not be added again.
Implementation: ReadCPUStat() in lib/src/proc_readers.cpp reads the first line and parses its first 8 fields (user through steal) into a CPUStatSnapshot. Delta computation happens in SystemProfiler::Impl's sample thread.
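As a rough illustration of the delta formulas above, the following C++ sketch parses the aggregate `cpu` line and computes total utilization between two snapshots. `CpuSnapshot`, `ParseCpuLine`, and `CpuUtilizationPct` are hypothetical names chosen for this example; the actual `CPUStatSnapshot` layout and `ReadCPUStat()` signature are not shown in this document.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical snapshot type holding the first 8 fields of the "cpu" line.
struct CpuSnapshot {
    uint64_t user = 0, nice = 0, system = 0, idle = 0;
    uint64_t iowait = 0, irq = 0, softirq = 0, steal = 0;

    uint64_t Total() const {
        return user + nice + system + idle + iowait + irq + softirq + steal;
    }
    uint64_t Busy() const { return Total() - idle - iowait; }
};

// Parses the aggregate "cpu ..." line. guest and guest_nice are left unread
// because they are already included in user and nice.
inline bool ParseCpuLine(const std::string& line, CpuSnapshot& out) {
    std::istringstream in(line);
    std::string label;
    in >> label;
    if (label != "cpu") return false;  // rejects per-core lines such as "cpu0"
    in >> out.user >> out.nice >> out.system >> out.idle
       >> out.iowait >> out.irq >> out.softirq >> out.steal;
    return !in.fail();
}

// Utilization in percent between two snapshots; 0 when no ticks elapsed.
inline double CpuUtilizationPct(const CpuSnapshot& prev, const CpuSnapshot& cur) {
    const uint64_t dTotal = cur.Total() - prev.Total();
    if (dTotal == 0) return 0.0;
    const uint64_t dBusy = cur.Busy() - prev.Busy();
    return 100.0 * static_cast<double>(dBusy) / static_cast<double>(dTotal);
}
```

User %, System %, and IOWait % follow the same pattern, with the numerator swapped for the corresponding field delta.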
Source: /proc/[PID]/task/[TID]/schedstat, one file per thread, holding three integer fields on one line
(kernel docs: scheduler/sched-stats.rst):
1: sum_exec_runtime — nanoseconds the thread has spent on a CPU
2: run_delay — nanoseconds spent waiting in the runqueue
(gated by `kernel.sched_schedstats` sysctl)
3: pcount — number of times scheduled onto a CPU
The profiler reads only field 1. The other two are intentionally
ignored — run_delay requires a sysctl that defaults to off on
Linux 4.6+, and pcount isn't a metric we surface.
Why per-thread, not /proc/[PID]/schedstat: the TGID-level
inode reports the thread-group leader's task_struct only — not
aggregated across the thread group. (Contrast /proc/[PID]/stat,
where the kernel's do_task_stat(..., whole=1) aggregates
utime+stime across the group.) Reading only the leader would
under-report any multi-threaded workload — capped at 100% of one
core no matter how many cores the process actually consumes — so
we walk /proc/[PID]/task/ and sum each thread's
sum_exec_runtime.
Per-process CPU %:
Active % = Σ_{t ∈ threads} (cur[t] − prev[t]) / dt_ns × 100
where cur[t] is this tick's sum_exec_runtime for TID t, and prev[t] = 0 if t is new.
dt_ns is the actual wall-clock elapsed between the previous
sample tick and this one — not the nominal sample period.
sleep_for + per-PID /proc reads + scheduler jitter make the
real period strictly ≥ nominal; using the nominal value as the
denominator would systematically inflate the result (a tick that
takes 15 ms with the sampler configured at 100 Hz would report a
fully-busy thread as 150% instead of 100%).
Thread churn handling: TIDs visible only in cur are new threads
— their sum_exec_runtime is attributed in full to this window.
TIDs visible only in prev exited mid-window; their last partial
slice (from previous tick to exit) is discarded, which keeps the
per-PID baseline bounded with no per-thread bookkeeping beyond
the live task/ directory.
Important
Per-process CPU % can exceed 100% on multi-core systems (e.g., a
process using 4 cores reports ~400%). The panel peak_expr
caps the y-axis at ncpus × 100.
Note
sum_exec_runtime is always tracked by the kernel scheduler
regardless of the kernel.sched_schedstats sysctl, so per-PID
CPU works out-of-the-box on every mainstream distro. This is the
reason we switched from /proc/[PID]/stat's utime/stime (10 ms
CLK_TCK quantization → 0/100/200% staircase at 100 Hz sampling)
to schedstat's nanosecond-precise field 1. The cost is that we
no longer split per-PID time into user / kernel / iowait —
sum_exec_runtime is total on-CPU time only.
Implementation: ReadPIDSchedStatPerThread() in
lib/src/proc_readers.cpp; per-PID per-thread baselines + actual
dt live on SystemProfiler::Impl::prevPID.
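The per-thread summation and the thread-churn rules can be sketched as below. This is a minimal illustration, assuming each sample is held as a TID-to-nanoseconds map; `ThreadRuntimes` and `ProcessCpuPct` are hypothetical names, not the types behind `prevPID`. The guard against a counter going backwards is a defensive assumption of this sketch and is not described in the source.

```cpp
#include <cstdint>
#include <unordered_map>

// TID -> cumulative sum_exec_runtime in nanoseconds (schedstat field 1).
using ThreadRuntimes = std::unordered_map<int, uint64_t>;

// Per-process CPU % over one sampling window.
// dtNs must be the measured wall-clock time between the two samples,
// not the nominal sample period.
inline double ProcessCpuPct(const ThreadRuntimes& prev,
                            const ThreadRuntimes& cur,
                            uint64_t dtNs) {
    if (dtNs == 0) return 0.0;
    uint64_t activeNs = 0;
    for (const auto& [tid, runtime] : cur) {
        const auto it = prev.find(tid);
        // New thread: baseline is 0, so its whole runtime counts.
        const uint64_t base = (it != prev.end()) ? it->second : 0;
        // Assumption: skip a counter that went backwards (e.g. a reused TID).
        if (runtime >= base) activeNs += runtime - base;
    }
    // Threads present only in prev have exited; their last slice is dropped.
    return 100.0 * static_cast<double>(activeNs) / static_cast<double>(dtNs);
}
```

With three threads that each accumulated 5 ms inside a 10 ms window (one of them newly created), the function yields 150%.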
Source: /proc/meminfo, key-value pairs in kB.
MemTotal: 131072000 kB
MemFree: 12345678 kB
MemAvailable: 98765432 kB
Buffers: 1234567 kB
Cached: 45678901 kB
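Used memory (the classic free calculation):
Used = MemTotal - MemFree - Buffers - Cached
Recent procps-ng releases compute used differently (for example by also counting SReclaimable as cache, or as MemTotal - MemAvailable), so this value can differ from current free output.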
MemAvailable (kernel 3.14+) is the best estimate of memory available to applications without swapping.
Implementation: ReadMemInfo() in lib/src/proc_readers.cpp. Values are converted from kB to bytes (× 1024) when stored in protobuf.
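A minimal sketch of the parsing and the Used calculation, assuming the file contents are already in a string. `ParseMemInfo` and `UsedBytes` are hypothetical helpers for illustration; the real `ReadMemInfo()` may be structured differently.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parses "Key:   value kB" lines into a key -> bytes map.
// Lines without a kB unit (e.g. HugePages_Total) are skipped.
inline std::map<std::string, uint64_t> ParseMemInfo(const std::string& text) {
    std::map<std::string, uint64_t> bytes;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key, unit;
        uint64_t value = 0;
        if (!(fields >> key >> value)) continue;
        fields >> unit;
        if (key.empty() || key.back() != ':' || unit != "kB") continue;
        key.pop_back();              // drop the trailing ':'
        bytes[key] = value * 1024;   // kB -> bytes
    }
    return bytes;
}

// Used = MemTotal - MemFree - Buffers - Cached, in bytes.
inline uint64_t UsedBytes(const std::map<std::string, uint64_t>& m) {
    auto get = [&m](const char* key) -> uint64_t {
        const auto it = m.find(key);
        return it != m.end() ? it->second : 0;
    };
    const uint64_t total = get("MemTotal");
    const uint64_t rest = get("MemFree") + get("Buffers") + get("Cached");
    return total > rest ? total - rest : 0;  // clamp instead of wrapping
}
```

For the sample values shown above, this gives 71812854 kB used, stored as 71812854 × 1024 bytes.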
Source: /proc/[PID]/statm, 7 space-separated integers in pages (multiply by sysconf(_SC_PAGESIZE), typically 4096 bytes).
Fields: size resident shared text lib data dt
[0] size = VMS (total virtual memory)
[1] resident = RSS (resident set size)
[2] shared = shared pages
Implementation: ReadPIDStatm() in lib/src/proc_readers.cpp.
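The page-to-byte conversion might look like the sketch below. `ProcessMemory` and `ParseStatm` are hypothetical names; the page size is passed in as a parameter here so the function stays testable, whereas production code would typically obtain it from `sysconf(_SC_PAGESIZE)`.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

struct ProcessMemory {
    uint64_t vmsBytes = 0;     // field 0: size
    uint64_t rssBytes = 0;     // field 1: resident
    uint64_t sharedBytes = 0;  // field 2: shared
};

// Parses the first three fields of a statm line and scales pages to bytes.
inline bool ParseStatm(const std::string& line, uint64_t pageSize,
                       ProcessMemory& out) {
    std::istringstream in(line);
    uint64_t size = 0, resident = 0, shared = 0;
    if (!(in >> size >> resident >> shared)) return false;
    out.vmsBytes = size * pageSize;
    out.rssBytes = resident * pageSize;
    out.sharedBytes = shared * pageSize;
    return true;
}
```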
Source: /proc/diskstats, one line per block device.
major minor name rd_ios rd_merges rd_sectors rd_ticks wr_ios wr_merges wr_sectors wr_ticks ios_inflight io_ticks weighted_io_ticks
Fields of interest (0-indexed after name):

| Index | Field | Type | Notes |
|---|---|---|---|
| 2 | `rd_sectors` | cumulative | Sectors read (× 512 = bytes) |
| 6 | `wr_sectors` | cumulative | Sectors written (× 512 = bytes) |
| 8 | `ios_inflight` | instantaneous | Currently in-flight IOs |
Throughput:
Read MB/s = delta(rd_sectors) × 512 / dt_seconds / 1e6
Write MB/s = delta(wr_sectors) × 512 / dt_seconds / 1e6
Note
Sectors are always 512 bytes regardless of the disk's physical sector size. This is a kernel convention.
Implementation: ReadDiskStats() in lib/src/disk_readers.cpp. Filters by the device list from config.
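The field indexing and throughput formula can be sketched as follows. `DiskCounters`, `ParseDiskStatsLine`, and `MegabytesPerSecond` are hypothetical names for illustration only and do not reflect the actual `ReadDiskStats()` interface.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

struct DiskCounters {
    std::string name;
    uint64_t rdSectors = 0;  // index 2 after the device name
    uint64_t wrSectors = 0;  // index 6
    uint64_t inflight = 0;   // index 8 (instantaneous)
};

// Parses one /proc/diskstats line; trailing fields beyond index 8 are ignored.
inline bool ParseDiskStatsLine(const std::string& line, DiskCounters& out) {
    std::istringstream in(line);
    unsigned majorNum = 0, minorNum = 0;
    uint64_t f[9] = {};  // the first nine counters after the device name
    if (!(in >> majorNum >> minorNum >> out.name)) return false;
    for (auto& v : f) {
        if (!(in >> v)) return false;
    }
    out.rdSectors = f[2];
    out.wrSectors = f[6];
    out.inflight = f[8];
    return true;
}

// delta(sectors) × 512 / dt_seconds / 1e6, with 0 for an invalid window.
inline double MegabytesPerSecond(uint64_t prevSectors, uint64_t curSectors,
                                 double dtSeconds) {
    if (dtSeconds <= 0.0 || curSectors < prevSectors) return 0.0;
    return static_cast<double>(curSectors - prevSectors) * 512.0
           / dtSeconds / 1e6;
}
```

A delta of 1953125 sectors over one second corresponds to exactly 10^9 bytes, which the function reports as 1000 MB/s.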
Source: /sys/block/<dev>/inflight, a single line with two integers.
<read_inflight> <write_inflight>
This gives the instantaneous number of in-flight read and write requests, which is the queue depth at the moment of sampling.
Implementation: ReadDiskInflight() in lib/src/disk_readers.cpp.
Source: /proc/[PID]/io, key-value pairs.
rchar: 12345678 ← logical reads (includes page cache)
wchar: 87654321 ← logical writes
read_bytes: 4096000 ← physical reads (actual storage I/O)
write_bytes: 2048000 ← physical writes
We use read_bytes and write_bytes (physical I/O) rather than rchar/wchar (which include page cache hits).
Warning
Reading /proc/[PID]/io requires the same UID as the target process or CAP_SYS_PTRACE. If access is denied, the profiler logs a warning once per PID and skips per-process disk data rather than crashing.
Implementation: ReadPIDIO() in lib/src/disk_readers.cpp. Returns accessible = false on EACCES.
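One way the graceful-failure behavior could be expressed is sketched below. `PidIo`, `ParsePidIo`, and `ReadPidIoSketch` are hypothetical names; this sketch treats any open or read failure (not only EACCES) as inaccessible, which is a simplification of what the source describes.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

struct PidIo {
    bool accessible = false;   // false when the counters could not be read
    uint64_t readBytes = 0;    // read_bytes: physical reads
    uint64_t writeBytes = 0;   // write_bytes: physical writes
};

// Extracts the two physical-I/O counters from the file's text.
inline PidIo ParsePidIo(const std::string& text) {
    PidIo io;
    std::istringstream in(text);
    std::string key;
    uint64_t value = 0;
    bool sawRead = false, sawWrite = false;
    while (in >> key >> value) {
        if (key == "read_bytes:") { io.readBytes = value; sawRead = true; }
        else if (key == "write_bytes:") { io.writeBytes = value; sawWrite = true; }
    }
    io.accessible = sawRead && sawWrite;
    return io;
}

// A failed open or an empty read yields accessible = false, so the caller
// can warn once and skip this PID instead of aborting.
inline PidIo ReadPidIoSketch(int pid) {
    std::ifstream file("/proc/" + std::to_string(pid) + "/io");
    if (!file) return PidIo{};
    std::ostringstream text;
    text << file.rdbuf();
    return ParsePidIo(text.str());
}
```

Matching on the exact keys `read_bytes:` and `write_bytes:` keeps `rchar`, `wchar`, and `cancelled_write_bytes` out of the result.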
Each profiler follows the same 2-thread pattern:
   ┌───────────────────┐
   │ Profiler::Start() │
   └────┬──────────┬───┘
        │          │
        ▼          ▼
┌──────────┐  ┌───────────┐
│ Sample   │  │ Flush     │
│ Thread   │  │ Thread    │
│          │  │           │
│ Reads    │  │ Drains    │
│ /proc at │  │ samples   │
│ interval │  │ at flush  │
│          │  │ interval  │
│ Computes │  │           │
│ deltas   │  │ Writes    │
│          │  │ length-   │
│ Pushes   │  │ delimited │
│ to batch │  │ protobuf  │
│ (mutex)  │  │ to file   │
└──────────┘  └───────────┘
The GPU profiler has the same conceptual structure but uses CUPTI-specific APIs:
- Decode thread (equivalent to sample thread): calls `cuptiPmSamplingDecodeData()` every 5 ms, evaluates metrics via `cuptiProfilerHostEvaluateToGpuValues()`
- Flush thread: drains evaluated `SamplerRange` samples, writes length-delimited `GpuMetricsTrace`

Shared state and synchronization:
- Sample batch: `std::vector` of protobuf sample messages, protected by `std::mutex batchMutex`
- Output file: `std::ofstream` protected by `std::mutex outMutex`
- Stop signals: `std::atomic<bool>` per thread (`stopSample`, `stopFlush`)

Shutdown sequence:
- `Stop()` sets `stopSample = true`, joins sample thread
- Sets `stopFlush = true`, joins flush thread
- Drains any remaining samples from the batch
- Writes final length-delimited message (with regions for GPU)
- Closes output file
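The sample/flush split and the shutdown order described above can be illustrated with a miniature stand-in. `MiniProfiler` is entirely hypothetical: integers take the place of protobuf samples and a vector takes the place of the output file, so this shows only the synchronization pattern, not the real profiler classes.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class MiniProfiler {
public:
    void Start() {
        sampleThread_ = std::thread([this] { SampleLoop(); });
        flushThread_ = std::thread([this] { FlushLoop(); });
    }

    // Must be called before destruction; both threads are joined here.
    void Stop() {
        stopSample_ = true;
        sampleThread_.join();  // no new samples after this point
        stopFlush_ = true;
        flushThread_.join();
        Drain();               // final flush of whatever is left
    }

    // Both accessors are only meaningful after Stop() has returned.
    std::size_t Produced() const { return produced_; }
    std::size_t Flushed() const { return flushed_.size(); }

private:
    void SampleLoop() {
        while (!stopSample_) {
            {
                std::lock_guard<std::mutex> lock(batchMutex_);
                batch_.push_back(static_cast<int>(produced_));
                ++produced_;
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    void FlushLoop() {
        while (!stopFlush_) {
            Drain();
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
    }

    void Drain() {
        std::vector<int> local;
        {
            std::lock_guard<std::mutex> lock(batchMutex_);
            local.swap(batch_);  // keep the critical section short
        }
        std::lock_guard<std::mutex> lock(outMutex_);
        flushed_.insert(flushed_.end(), local.begin(), local.end());
    }

    std::atomic<bool> stopSample_{false};
    std::atomic<bool> stopFlush_{false};
    std::mutex batchMutex_;
    std::mutex outMutex_;
    std::vector<int> batch_;
    std::vector<int> flushed_;
    std::size_t produced_ = 0;
    std::thread sampleThread_;
    std::thread flushThread_;
};
```

Swapping the batch out under the lock means the sample thread is blocked only for the duration of a pointer swap, never for the duration of a write.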
All three profilers use the same length-delimited streaming format:
File layout:
[varint: msg_size][serialized TraceMessage]
[varint: msg_size][serialized TraceMessage]
...
[varint: msg_size][serialized TraceMessage] ← final (may contain regions)
- Varint encoding: standard protobuf variable-length integer (1–5 bytes for uint32)
- Each message is self-contained: includes metadata (hostname, interval, tracked PIDs/devices) plus a batch of samples
- Crash safety: if the process dies, all previously flushed messages are intact. Only the in-progress batch is lost.
The Python visualization reads these with a manual varint decoder, then merges all messages into a single trace by concatenating sample arrays.
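The framing itself is small enough to show in full. The sketch below expresses the same length-delimited layout in C++; it is not the Python reader mentioned above, and `AppendFrame` and `ReadFrames` are hypothetical helper names operating on opaque byte strings rather than real trace messages.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends one [varint length][payload] frame to a byte buffer.
inline void AppendFrame(std::string& out, const std::string& payload) {
    uint32_t n = static_cast<uint32_t>(payload.size());
    while (n >= 0x80) {
        out.push_back(static_cast<char>((n & 0x7F) | 0x80));  // more bytes follow
        n >>= 7;
    }
    out.push_back(static_cast<char>(n));
    out.append(payload);
}

// Splits a buffer back into payloads. A truncated trailing frame is dropped,
// which mirrors the crash-safety property: earlier frames remain readable.
inline std::vector<std::string> ReadFrames(const std::string& data) {
    std::vector<std::string> frames;
    std::size_t pos = 0;
    while (pos < data.size()) {
        uint32_t len = 0;
        int shift = 0;
        bool complete = false;
        while (pos < data.size() && shift <= 28) {  // at most 5 varint bytes
            const uint8_t byte = static_cast<uint8_t>(data[pos++]);
            len |= static_cast<uint32_t>(byte & 0x7F) << shift;
            shift += 7;
            if ((byte & 0x80) == 0) { complete = true; break; }
        }
        if (!complete || data.size() - pos < len) break;
        frames.emplace_back(data, pos, len);
        pos += len;
    }
    return frames;
}
```

A 300-byte payload is prefixed with the two bytes 0xAC 0x02, since 300 = 0x2C + (2 << 7).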
| Profiler | Clock source | Resolution |
|---|---|---|
| GPU | `cuptiGetTimestamp()` (CUPTI internal clock) | Nanoseconds |
| System | `std::chrono::steady_clock` | Nanoseconds |
| Disk | `std::chrono::steady_clock` | Nanoseconds |
GPU and CPU/Disk use different clock domains. The visualization aligns them by normalizing each trace to "time from first sample" — each trace's first timestamp becomes t=0. Since ProfilerSuite::Start() starts all profilers within microseconds of each other, this provides adequate alignment for the millisecond-scale phenomena being measured.
Tip
For tighter alignment in future work, ProfilerSuite::Start() could record both steady_clock and cuptiGetTimestamp() at the same moment and embed the offset in each trace.
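Both alignment strategies reduce to simple arithmetic, sketched here under the assumption that timestamps are plain nanosecond integers. `NormalizeToFirstSample` and `GpuToSteadyNs` are hypothetical names; the second function illustrates the offset idea from the tip and does not describe existing behavior.

```cpp
#include <cstdint>
#include <vector>

// Rebases a trace so that its first sample becomes t = 0.
inline std::vector<int64_t> NormalizeToFirstSample(
        const std::vector<uint64_t>& timestampsNs) {
    std::vector<int64_t> out;
    out.reserve(timestampsNs.size());
    if (timestampsNs.empty()) return out;
    const uint64_t t0 = timestampsNs.front();
    for (uint64_t t : timestampsNs) {
        out.push_back(static_cast<int64_t>(t - t0));
    }
    return out;
}

// Maps a GPU-clock timestamp into the steady_clock domain, given one pair of
// readings (gpuAtStartNs, steadyAtStartNs) captured at nearly the same moment.
inline int64_t GpuToSteadyNs(uint64_t gpuNs, uint64_t gpuAtStartNs,
                             int64_t steadyAtStartNs) {
    return steadyAtStartNs + static_cast<int64_t>(gpuNs - gpuAtStartNs);
}
```

After normalization, samples from different clock domains can be plotted on one axis; the residual error is whatever delay separated the first samples of each trace.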
The config is a protobuf text format file parsed via google::protobuf::TextFormat::ParseFromString(). This is included in libprotobuf which is already a dependency — no new libraries needed.
Features:
- `#` line comments
- Human-readable field names matching the `.proto` schema
- Type checking at parse time

The `ProfilerSuite::LoadConfig()` method:
- Reads the file into a string
- Parses into `ProfilerSuiteConfig` protobuf message
- Converts proto fields to C++ config structs (`ProfilerConfig`, `SystemProfilerConfig`, `DiskProfilerConfig`)
- Resolves PID `0` → `getpid()` for both system and disk profilers
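The last step, resolving the PID `0` sentinel, could be as small as the sketch below. `ResolvePids` is a hypothetical helper that assumes the tracked PIDs are available as a plain vector of integers; the real config structs are not shown in this document.

```cpp
#include <unistd.h>
#include <vector>

// Replaces every 0 sentinel with the calling process's own PID.
inline std::vector<int> ResolvePids(std::vector<int> pids) {
    const int self = static_cast<int>(getpid());
    for (int& pid : pids) {
        if (pid == 0) pid = self;
    }
    return pids;
}
```

Resolving at load time rather than in the config file is what lets the same config be reused across runs.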
| Resource | Required permission | Fallback |
|---|---|---|
| `/proc/stat` | World-readable | Always works |
| `/proc/meminfo` | World-readable | Always works |
| `/proc/[PID]/stat` | World-readable | Always works |
| `/proc/[PID]/statm` | World-readable | Always works |
| `/proc/diskstats` | World-readable | Always works |
| `/sys/block/*/inflight` | World-readable | Always works |
| `/proc/[PID]/io` | Same UID or `CAP_SYS_PTRACE` | Warns once, skips |
| CUPTI PM Sampling | GPU access + compute capability ≥ 7.5 | Fails at `Configure()` |

| Decision | Rationale |
|---|---|
| Separate `.pb` files per component | Different sampling rates produce different trace sizes. Independent files allow partial collection (GPU-only, system-only, etc.) |
| `.pbtxt` config via `TextFormat::Parse` | Zero new C++ dependencies. Human-readable. Supports comments. Type-checked at parse time. |
| PID `0` sentinel resolved at runtime | User doesn't need to know their PID. Config files are reusable across runs. |
| `steady_clock` for CPU/Disk timestamps | Monotonic (no NTP jumps). Nanosecond resolution. Standard C++17. |
| `/proc` files opened and closed each read | Standard practice for /proc virtual filesystem. No stale file descriptors. Negligible overhead at 10–100 Hz. |
| Concrete flush threads per trace type | The three protobuf message types have different field structures. Concrete implementations are clearer than templates in a shared library. |
| Graceful `EACCES` for `/proc/[PID]/io` | Warns once per PID, skips per-process disk data. Avoids crashing when profiling other users' processes. |
| All profilers in one `libcupti_profiler.so` | Single library simplifies linking. CPU/Disk code has no CUDA runtime calls but co-locating is harmless. |
| Sample thread + flush thread per profiler | Mirrors GPU's decode + flush pattern. Decouples high-frequency collection from lower-frequency serialization. |
| Pimpl on all public classes | Public headers have zero internal/CUDA/CUPTI/protobuf includes. Users compile with any C++17 compiler. |
System CPU:

| Metric | Unit | Source |
|---|---|---|
| `total_utilization_pct` | % | delta(busy) / delta(total) × 100 from `/proc/stat` |
| `user_pct` | % | delta(user+nice) / delta(total) × 100 |
| `system_pct` | % | delta(system) / delta(total) × 100 |
| `iowait_pct` | % | delta(iowait) / delta(total) × 100 |

Per-process CPU:

| Metric | Unit | Source |
|---|---|---|
| `cpu_pct` | % of one core | Σ_t delta(sum_exec_runtime_ns[t]) / actual_dt_ns × 100, summed across every TID under `/proc/[PID]/task/*/schedstat` field 1 |

System memory:

| Metric | Unit | Source |
|---|---|---|
| `total_bytes` | bytes | MemTotal from `/proc/meminfo` |
| `used_bytes` | bytes | MemTotal - MemFree - Buffers - Cached |
| `available_bytes` | bytes | MemAvailable |
| `buffers_bytes` | bytes | Buffers |
| `cached_bytes` | bytes | Cached |

Per-process memory:

| Metric | Unit | Source |
|---|---|---|
| `rss_bytes` | bytes | Field 1 × PAGE_SIZE from `/proc/[PID]/statm` |
| `vms_bytes` | bytes | Field 0 × PAGE_SIZE |
| `shared_bytes` | bytes | Field 2 × PAGE_SIZE |

Disk (per device):

| Metric | Unit | Source |
|---|---|---|
| `read_bytes_per_sec` | bytes/s | delta(sectors_read) × 512 / dt from `/proc/diskstats` |
| `write_bytes_per_sec` | bytes/s | delta(sectors_written) × 512 / dt |
| `read_queue_depth` | count | Field 0 from `/sys/block/<dev>/inflight` |
| `write_queue_depth` | count | Field 1 from `/sys/block/<dev>/inflight` |

Per-process disk I/O:

| Metric | Unit | Source |
|---|---|---|
| `read_bytes_per_sec` | bytes/s | delta(read_bytes) / dt from `/proc/[PID]/io` |
| `write_bytes_per_sec` | bytes/s | delta(write_bytes) / dt |
- [[full-system-overview|Overview and quick-start guide]]
- [[system-guide|GPU profiler system guide]]
- Linux /proc/stat documentation
- Linux I/O statistics (iostats.rst)
- CUPTI PM Sampling API