This tutorial walks through collecting profiler data and generating the IKP Explorer -- a Compiler-Explorer-style interactive dashboard that consolidates every profiling signal (timing, instructions, memory, stalls, hardware counters) into a single HTML page.
The running example throughout is a shared-memory tiled GEMM
(examples/gemm/tiled_gemm.cu) with five named regions:
| Region ID | Name | What it covers |
|---|---|---|
| 1 | total | Entire tile computation envelope |
| 2 | load_A | Global -> shared: load A tile |
| 3 | load_B | Global -> shared: load B tile |
| 4 | compute | Shared -> registers: multiply-accumulate |
| 5 | store | Registers -> global: write-back C |
All commands assume you are working from the intra_kernel_profiler/
directory root.
For the impatient -- build everything, run all profilers on the GEMM example,
and open the Explorer. Requires NVBIT_PATH to be set (see Prerequisites).
```bash
# 1. Build trace examples (CMake)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# 2. Build CUPTI + NVBit tools
make -C tools/cupti_region_profiler -j
make -C tools/nvbit_region_profiler NVBIT_PATH=$NVBIT_PATH ARCH=90a -j
# 3. Run everything end-to-end (all profiler modes + post-processing)
bash scripts/run_all_examples.sh --out=_demo_out --nvbit-path=$NVBIT_PATH
# 4. Generate the Explorer
python3 scripts/generate_explorer.py \
  --demo-dir _demo_out \
  --source examples/gemm/tiled_gemm.cu \
  --output explorer.html
# 5. Open it
python3 -m http.server 8080 -d _demo_out
# Then visit: http://localhost:8080/explorer.html
```

The Explorer uses Monaco Editor (loaded from CDN), so it must be served over
HTTP -- opening the .html file directly via file:// will not work.
If you want to understand what each step produces and how it maps to Explorer tabs, read on.
| Dependency | Required for | Notes |
|---|---|---|
| NVIDIA GPU + CUDA driver | Everything | |
| CUDA Toolkit (nvcc, CUPTI) | Everything | >= 11.0, tested on 12.x |
| CMake >= 3.20 | Trace examples | pip install cmake if needed |
| NVBit 1.7+ | tools/nvbit_region_profiler | Architecture-specific download |
| Python 3 >= 3.8 | Explorer + analysis scripts | Standard library only for core scripts |
| NumPy + Matplotlib | generate_gallery.py (optional) | Only for publication-quality static charts |
See docs/install.md for detailed installation
instructions for each dependency.
There are three separate build targets: the CMake trace examples, the CUPTI injection libraries, and the NVBit region profiler tool.
Builds four example binaries, including ikp_gemm_demo (our GEMM):
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Produces:

- build/ikp_gemm_demo -- tiled GEMM with trace instrumentation
- build/ikp_trace_record, build/ikp_trace_block_filter, build/ikp_trace_sampled_loop
Builds four .so libraries that are loaded via CUDA_INJECTION64_PATH:
```bash
make -C tools/cupti_region_profiler -j
```

Produces:

- tools/cupti_region_profiler/ikp_cupti_pcsamp.so -- PC sampling (stall reasons)
- tools/cupti_region_profiler/ikp_cupti_sassmetrics.so -- SASS metrics (5 profiles)
- tools/cupti_region_profiler/ikp_cupti_instrexec.so -- instruction execution counts
- tools/cupti_region_profiler/ikp_cupti_pmsamp.so -- PM sampling
Requires NVBIT_PATH pointing to the NVBit root (must contain core/libnvbit.a):
```bash
make -C tools/nvbit_region_profiler \
  NVBIT_PATH=$NVBIT_PATH \
  ARCH=90a \
  -j
```

Produces: tools/nvbit_region_profiler/region_profiler.so
NVBit requires a separate binary built with -rdc=true and
-DIKP_ENABLE_NVBIT_MARKERS. The markers are no-ops in the CMake build,
so we compile a second binary manually:
```bash
cd examples/gemm
nvcc -O3 -std=c++17 -arch=sm_90a -lineinfo -rdc=true \
  -DIKP_ENABLE_NVBIT_MARKERS \
  -I ../../include \
  tiled_gemm.cu ../../src/nvbit_marker_device.cu \
  -o tiled_gemm_nvbit
cd ../..
```

Produces: examples/gemm/tiled_gemm_nvbit
What this does: Records nanosecond-resolution per-warp begin/end
timestamps for each profiling region using globaltimer. Zero runtime
cost on warps outside the block filter.
Adds to Explorer: Trace tab (timing distributions, percentiles, per-block/per-warp heatmap).
```bash
mkdir -p _demo_out/trace
./build/ikp_gemm_demo \
  --m=1024 --n=1024 --k=1024 \
  --out=_demo_out/trace/gemm_trace.json
```

| File | Description |
|---|---|
| _demo_out/trace/gemm_trace.json | Chrome Trace JSON -- open in Perfetto |
| _demo_out/trace/gemm_trace_summary.json | Per-region statistics: count, mean, p50/p95/p99, histograms |
The summary JSON is what the Explorer's Trace tab reads. It shows:
- Per-region duration distributions (histograms + percentile tables)
- Coefficient of variation (CV) to flag inconsistent regions
- Per-block-per-warp breakdown (shows load imbalance across CTAs)
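
The statistics listed above can be reproduced from raw durations. The following is a minimal sketch, assuming a plain list of per-warp durations in nanoseconds for one region. The function names and the nearest-rank percentile definition are illustrative choices and may differ from what the trace tooling actually computes.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_durations(durations_ns):
    """Return count, mean, p50/p95/p99, and coefficient of variation."""
    values = sorted(durations_ns)
    mean = statistics.fmean(values)
    # Population standard deviation; CV = stddev / mean (0 if mean is 0).
    stddev = statistics.pstdev(values)
    return {
        "count": len(values),
        "mean": mean,
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "cv": stddev / mean if mean else 0.0,
    }


# Hypothetical durations for one region: a single slow warp inflates the CV.
print(summarize_durations([100, 102, 98, 101, 99, 400]))
```

A region whose CV is much larger than its neighbours' is a candidate for a closer look in the per-block/per-warp breakdown.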
Open the trace in Perfetto to confirm your instrumentation recorded all five regions:
https://ui.perfetto.dev # drag-and-drop gemm_trace.json
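
The same check can be scripted. This sketch assumes the trace follows the Chrome Trace Event layout (either a bare event array or an object with a `traceEvents` array) and that each event's `name` field equals the region name; the second point is an assumption about this tool's output, so adjust the lookup if the names are encoded differently.

```python
import json

EXPECTED_REGIONS = {"total", "load_A", "load_B", "compute", "store"}


def missing_regions(trace, expected=EXPECTED_REGIONS):
    """Return the expected region names that never appear as an event name."""
    # Chrome Trace JSON is either a bare event array or an object
    # holding the array under "traceEvents".
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    seen = {ev.get("name") for ev in events if isinstance(ev, dict)}
    return sorted(set(expected) - seen)


# Tiny hand-written trace standing in for gemm_trace.json (made-up values).
sample = json.loads(
    '{"traceEvents": ['
    '{"name": "total", "ph": "X", "ts": 0, "dur": 9},'
    '{"name": "compute", "ph": "X", "ts": 2, "dur": 5}]}'
)
print(missing_regions(sample))
```

For a real run, load the file with `json.load(open("_demo_out/trace/gemm_trace.json"))` and expect an empty list.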
What this does: NVBit intercepts the kernel at the SASS level. The
region_profiler.so tool uses device-call markers (IKP_NVBIT_BEGIN /
IKP_NVBIT_END) to maintain a per-warp region stack and attribute every
executed instruction to a named region.
Adds to Explorer: Overview (region summaries, 40+ metrics), Regions tab (per-region detail with derived metrics), Execution tab (pipeline attribution, basic-block hotspots, branch analysis), Memory tab (access patterns, locality, reuse distance).
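
To make the region-stack idea concrete, here is a small host-side simulation for a single warp. The event format is invented for illustration, and crediting each instruction only to the innermost open region is an assumption; the real tool may also credit enclosing regions such as total.

```python
from collections import Counter

OUTSIDE = 0  # region ID used in this sketch for code outside any marker pair


def attribute_instructions(events):
    """Attribute each instruction to the innermost open region of one warp.

    `events` is a sequence of ("begin", region_id), ("end", region_id),
    or ("inst", pc) tuples in program order (a made-up format).
    """
    stack = []
    per_region = Counter()
    for kind, value in events:
        if kind == "begin":
            stack.append(value)
        elif kind == "end":
            # Tolerate unbalanced markers by popping only on a match.
            if stack and stack[-1] == value:
                stack.pop()
        elif kind == "inst":
            per_region[stack[-1] if stack else OUTSIDE] += 1
    return dict(per_region)


# Region 2 (load_A) nested inside region 1 (total), then one trailing instruction.
demo_events = [
    ("begin", 1), ("inst", 0x00),
    ("begin", 2), ("inst", 0x10), ("inst", 0x20), ("end", 2),
    ("inst", 0x30), ("end", 1),
    ("inst", 0x40),
]
print(attribute_instructions(demo_events))
```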
All NVBit runs use the same pattern:
```bash
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=<mode> \
IKP_NVBIT_TRACE_PATH=<output_dir> \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

We run five modes. Each adds data to different Explorer panels.
The foundation. Maps every SASS PC offset to its dominant region and computes per-region instruction counts, global memory operations, sector counts, and cache line estimates.
```bash
mkdir -p _demo_out/nvbit/pcmap
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=pcmap \
IKP_NVBIT_TRACE_PATH=_demo_out/nvbit/pcmap \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

Output:

- pc2region_*.json -- PC offset -> region ID mapping (used by CUPTI join)
- region_stats_*.json -- per-region instruction counts, memory stats
- sass_all_*.sass -- full SASS listing with region annotations
- summary_*.txt -- human-readable summary
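
The "dominant region" of a PC can be thought of as the region under which that PC executed most often. A minimal sketch of that reduction follows; the input format and the tie-breaking rule are assumptions made here, not a description of the tool's internals.

```python
from collections import Counter, defaultdict


def dominant_region_per_pc(samples):
    """Map each PC offset to the region that executed it most often.

    `samples` is an iterable of (pc_offset, region_id) pairs, one per
    executed instruction (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for pc, region in samples:
        counts[pc][region] += 1
    # Ties are broken toward the smaller region ID to stay deterministic.
    return {
        pc: min(regions, key=lambda r: (-regions[r], r))
        for pc, regions in counts.items()
    }


demo_samples = [(0x10, 2), (0x10, 2), (0x10, 1), (0x20, 4), (0x30, 3), (0x30, 5)]
print(dominant_region_per_pc(demo_samples))
```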
Adds per-warp, per-lane memory address traces and complete instruction class breakdown.
```bash
mkdir -p _demo_out/nvbit/all
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=all \
IKP_NVBIT_TRACE_CAP=4096 \
IKP_NVBIT_TRACE_PATH=_demo_out/nvbit/all \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

Output adds: mem_trace_*.jsonl -- per-warp, per-lane memory addresses.
The Explorer's Memory tab uses this for access pattern visualization,
per-PC locality analysis, and cache line reuse distance histograms.
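
Two of the memory metrics mentioned here are easy to illustrate on raw addresses. The sketch below assumes 32-byte sectors and 128-byte cache lines and takes plain Python lists of byte addresses; it does not read the mem_trace_*.jsonl schema, which is not documented in this tutorial.

```python
SECTOR_BYTES = 32   # assumed sector size
LINE_BYTES = 128    # assumed cache-line size


def sectors_touched(lane_addresses):
    """Number of distinct sectors touched by one warp-level access."""
    return len({addr // SECTOR_BYTES for addr in lane_addresses})


def reuse_distances(addresses):
    """Reuse distance per access: the number of distinct cache lines touched
    since the previous access to the same line, or None for a first touch.

    Quadratic in the worst case, which is fine for a short capped trace.
    """
    last_seen = {}  # cache line -> index of its most recent access
    lines = [addr // LINE_BYTES for addr in addresses]
    out = []
    for i, line in enumerate(lines):
        if line in last_seen:
            out.append(len(set(lines[last_seen[line] + 1:i])))
        else:
            out.append(None)
        last_seen[line] = i
    return out


coalesced = [4 * lane for lane in range(32)]    # consecutive 4-byte words
strided = [128 * lane for lane in range(32)]    # one cache line per lane
print(sectors_touched(coalesced), sectors_touched(strided))
print(reuse_distances([0, 128, 256, 0, 128, 128]))
```

A fully coalesced 4-byte access touches 4 sectors per warp instruction, while the strided pattern touches 32, which is the kind of gap the sectors-per-instruction view is meant to expose.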
Attributes each instruction to one of 16 hardware pipelines (fp32, fp16, fp64, int, tensor, ld, st, sfu, etc.).
```bash
mkdir -p _demo_out/nvbit/inst_pipe
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=pcmap \
IKP_NVBIT_ENABLE_INST_PIPE=1 \
IKP_NVBIT_TRACE_PATH=_demo_out/nvbit/inst_pipe \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

Output adds: inst_pipe field in region_stats_*.json with per-pipeline
counts. The Explorer's Execution tab renders this as a stacked bar chart
showing which functional units each region exercises.
Identifies the most-executed basic blocks and analyzes branch taken/fallthrough rates.
```bash
mkdir -p _demo_out/nvbit/bb_hot
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=pcmap \
IKP_NVBIT_ENABLE_BB_HOT=1 \
IKP_NVBIT_ENABLE_BRANCH_SITES=1 \
IKP_NVBIT_TRACE_PATH=_demo_out/nvbit/bb_hot \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

Output adds: hotspots_*.json with per-basic-block execution counts
and per-branch-site taken/fallthrough analysis. The Explorer uses this in the
Execution tab to highlight the hottest code paths and show branch divergence.
Dumps nvdisasm output with source line mappings, register metadata, and barrier annotations. This gives the Explorer's SASS panel richer data than the default NVBit-quality SASS listing.
```bash
mkdir -p _demo_out/nvbit/nvdisasm
IKP_NVBIT_ENABLE=1 \
IKP_NVBIT_KERNEL_REGEX=tiled_gemm_kernel \
IKP_NVBIT_MODE=pcmap \
IKP_NVBIT_DUMP_NVDISASM_SASS=1 \
IKP_NVBIT_DUMP_SASS_META=1 \
IKP_NVBIT_DUMP_SASS_LINEINFO=1 \
IKP_NVBIT_KEEP_CUBIN=1 \
IKP_NVBIT_TRACE_PATH=_demo_out/nvbit/nvdisasm \
LD_PRELOAD=tools/nvbit_region_profiler/region_profiler.so \
./examples/gemm/tiled_gemm_nvbit --iters=1
```

Output adds: High-quality sass_all_*.sass with //## File "...", line N
comments, cubin_*.cubin for external analysis. The Explorer uses these
for source-SASS-PTX cross-linking in the three-panel code view.
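
The `//## File "...", line N` comments are what make the SASS side of the cross-linking possible. As a rough sketch of how such a listing could be consumed, the function below pairs each instruction line with the most recent line-info comment. The sample listing is hand-written for illustration and is not real output from this tool.

```python
import re

LINEINFO_RE = re.compile(r'//## File "([^"]*)", line (\d+)')


def sass_to_source(sass_text):
    """Pair each non-comment SASS line with the latest (file, line) marker."""
    current = None
    pairs = []
    for raw in sass_text.splitlines():
        text = raw.strip()
        match = LINEINFO_RE.search(text)
        if match:
            current = (match.group(1), int(match.group(2)))
        elif text and not text.startswith("//"):
            pairs.append((current, text))
    return pairs


# Illustrative listing only; instruction text is made up.
sample_sass = '''
//## File "tiled_gemm.cu", line 42
        /*0000*/ LDG.E R0, [R2] ;
        /*0010*/ STS [R4], R0 ;
//## File "tiled_gemm.cu", line 57
        /*0020*/ FFMA R6, R0, R1, R6 ;
'''
for location, instruction in sass_to_source(sample_sass):
    print(location, instruction)
```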
What this does: CUPTI collects real hardware counter values at per-PC granularity via injection libraries. These complement NVBit's instruction-level view with actual performance counter data from the GPU.
Adds to Explorer: Stalls tab (PC-sampling stall reasons per region), Line tab (per-source-line CUPTI metrics), Overview (cross-validation of NVBit x CUPTI region attribution).
CUPTI works on the CMake-built binary (no NVBit markers needed).
The binary must be compiled with -lineinfo for source mapping; the CMake
build already ensures this for ikp_gemm_demo.
All CUPTI runs follow the pattern:
```bash
CUDA_INJECTION64_PATH=tools/cupti_region_profiler/<collector>.so \
<env vars> \
./build/ikp_gemm_demo --m=1024 --n=1024 --k=1024 --iters=20
```

Each profile collects a different set of hardware counters. The GPU can only collect a limited number of counters per pass, so we run five separate profiles:
```bash
mkdir -p _demo_out/cupti
for profile in core divergence memory instruction_mix branch; do
  CUDA_INJECTION64_PATH=tools/cupti_region_profiler/ikp_cupti_sassmetrics.so \
  IKP_CUPTI_SASS_OUT=_demo_out/cupti/sassmetrics_${profile}.json \
  IKP_CUPTI_SASS_PROFILE=${profile} \
  IKP_CUPTI_SASS_LAZY_PATCHING=1 \
  IKP_CUPTI_SASS_ENABLE_SOURCE=1 \
  ./build/ikp_gemm_demo --m=1024 --n=1024 --k=1024 --iters=20
done
```

Output: sassmetrics_core.json, sassmetrics_divergence.json,
sassmetrics_memory.json, sassmetrics_instruction_mix.json,
sassmetrics_branch.json.
Each contains per-PC hardware counter values. The Explorer aggregates these across profiles and maps them to source lines (in the Line tab) and to NVBit regions (in the Regions tab) using the pc2region join.
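
Conceptually, the pc2region join is a group-by: every per-PC counter value is added to the region that owns that PC. The sketch below shows the idea on in-memory dictionaries. The dictionary shapes and metric names are assumptions for illustration, since the JSON schemas are not spelled out in this tutorial.

```python
from collections import defaultdict


def join_metrics_by_region(pc2region, per_pc_metrics, outside=0):
    """Sum per-PC counter values into per-region totals.

    pc2region:      {pc_offset: region_id}
    per_pc_metrics: {pc_offset: {metric_name: value}}
    Both shapes are assumptions made for this sketch.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for pc, metrics in per_pc_metrics.items():
        region = pc2region.get(pc, outside)  # unmapped PCs count as outside
        for name, value in metrics.items():
            totals[region][name] += value
    return {region: dict(m) for region, m in totals.items()}


demo_pc2region = {0: 2, 16: 2, 32: 4}
demo_metrics = {
    0: {"inst": 10},
    16: {"inst": 5, "sectors": 8},
    32: {"inst": 7},
    48: {"inst": 1},   # PC with no region mapping
}
print(join_metrics_by_region(demo_pc2region, demo_metrics))
```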
Statistically samples the warp scheduler to determine what each warp is stalled on at each PC.
```bash
CUDA_INJECTION64_PATH=tools/cupti_region_profiler/ikp_cupti_pcsamp.so \
IKP_CUPTI_PCSAMP_OUT=_demo_out/cupti/pcsampling_raw.json \
IKP_CUPTI_PCSAMP_COLLECTION_MODE=serialized \
IKP_CUPTI_PCSAMP_KERNEL_REGEX=tiled_gemm_kernel \
IKP_CUPTI_PCSAMP_PERIOD=5 \
IKP_CUPTI_PCSAMP_MAX_PCS=10000 \
IKP_CUPTI_PCSAMP_VERBOSE=1 \
./build/ikp_gemm_demo --m=1024 --n=1024 --k=1024 --iters=20
```

Output: pcsampling_raw.json -- per-PC stall reason samples.
The Explorer's Stalls tab visualizes this as a stacked bar chart showing the distribution of stall reasons (memory dependency, scoreboard, barrier, etc.) for the overall kernel and per region.
Note: PC sampling requires unrestricted profiling permissions. On managed HPC clusters it may return empty results. SASS metrics collection generally works even on restricted nodes.
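
The per-region stall distribution described above amounts to grouping samples by region and normalizing. A minimal sketch, assuming samples arrive as (PC offset, stall reason, sample count) tuples and reusing a PC -> region dictionary; the tuple layout and the stall-reason names are placeholders, not the actual pcsampling_raw.json schema.

```python
from collections import Counter, defaultdict


def stall_distribution(samples, pc2region, outside=0):
    """Return {region: {stall_reason: fraction}} from PC-sampling records.

    samples: iterable of (pc_offset, stall_reason, sample_count) tuples.
    """
    counts = defaultdict(Counter)
    for pc, reason, n in samples:
        counts[pc2region.get(pc, outside)][reason] += n
    result = {}
    for region, reasons in counts.items():
        total = sum(reasons.values())
        result[region] = (
            {r: c / total for r, c in reasons.items()} if total else {}
        )
    return result


def dominant_stall(distribution):
    """Stall reason with the largest share in one region's distribution."""
    return max(distribution, key=distribution.get)


demo_stalls = [
    (0, "long_scoreboard", 30),
    (0, "barrier", 10),
    (16, "barrier", 60),
]
dist = stall_distribution(demo_stalls, {0: 2, 16: 4})
print(dist, dominant_stall(dist[2]))
```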
Counts the number of threads executing each PC, including predication info.
```bash
CUDA_INJECTION64_PATH=tools/cupti_region_profiler/ikp_cupti_instrexec.so \
IKP_CUPTI_INSTREXEC_OUT=_demo_out/cupti/instrexec_raw.json \
IKP_CUPTI_INSTREXEC_KERNEL_REGEX=tiled_gemm_kernel \
IKP_CUPTI_INSTREXEC_MAX_RECORDS=0 \
./build/ikp_gemm_demo --m=1024 --n=1024 --k=1024 --iters=20
```

Output: instrexec_raw.json -- per-PC thread execution counts and
predication breakdown. The Explorer uses this for cross-validation with
NVBit instruction counts and to compute thread-level occupancy efficiency.
With all profiling data in _demo_out/, generate the single-page Explorer:
```bash
python3 scripts/generate_explorer.py \
  --demo-dir _demo_out \
  --source examples/gemm/tiled_gemm.cu \
  --output explorer.html
```

The Explorer is a self-contained HTML file that embeds all profiling data as JSON. It uses Monaco Editor (CDN), ECharts, and Split.js.
Monaco requires HTTP, so serve the output directory:
```bash
python3 -m http.server 8080 -d _demo_out
# visit http://localhost:8080/explorer.html
```

Or use the built-in --serve flag:
```bash
python3 scripts/generate_explorer.py \
  --demo-dir _demo_out \
  --source examples/gemm/tiled_gemm.cu \
  --output explorer.html \
  --serve
# Starts server on http://localhost:8080/explorer.html
```

The Explorer has a three-panel code view (CUDA Source, PTX, SASS) on the left and a tabbed metrics panel on the right with seven tabs:
| Tab | What it shows | Data sources |
|---|---|---|
| Overview | Kernel-level summary: total instructions, memory ops, pipeline utilization, region count, CUPTI profile coverage | NVBit region_stats, CUPTI SASS metrics |
| Line | Per-source-line metrics -- click any line in the source panel to see its CUPTI hardware counters, region attribution, and per-PC breakdown | CUPTI SASS metrics (with -lineinfo) |
| Regions | Per-region detail cards with 40+ metrics: instruction counts, memory stats, derived ratios (coalescing efficiency, wasted bandwidth), CUPTI per-region aggregation | NVBit region_stats + CUPTI pc2region join |
| Execution | Pipeline attribution (stacked bar by region), basic-block hotspot table, branch site analysis (taken/fallthrough rates) | NVBit inst_pipe, bb_hot, branch_sites |
| Memory | Per-region memory access patterns, per-PC locality analysis, cache line reuse distance, sectors-per-instruction histogram | NVBit mem_trace, region_stats |
| Stalls | PC-sampling stall reason distribution (overall + per-region), dominant stall identification | CUPTI pcsampling |
| Trace | Timing distributions (histograms + percentile tables), per-block/per-warp heatmap, CV analysis | Intra-kernel trace summary |
The source, PTX, and SASS panels are cross-linked: clicking a source line
highlights the corresponding PTX .loc range and SASS instructions. Region
colors are consistent across all panels.
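
The source-to-PTX half of this cross-linking relies on PTX `.loc` directives, which carry a file index, a source line, and a column. As a rough illustration, the sketch below groups PTX lines by the source line of the preceding `.loc`. It ignores the file index, so it is only meaningful for single-file kernels, and the PTX snippet is hand-written rather than taken from this project.

```python
import re
from collections import defaultdict

LOC_RE = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)")


def ptx_lines_by_source_line(ptx_text):
    """Map source line -> 1-based PTX line numbers emitted under its .loc."""
    mapping = defaultdict(list)
    current = None
    for number, text in enumerate(ptx_text.splitlines(), start=1):
        match = LOC_RE.match(text)
        if match:
            current = int(match.group(2))  # group 1 is the file index
            continue
        stripped = text.strip()
        if current is not None and stripped and not stripped.startswith("//"):
            mapping[current].append(number)
    return dict(mapping)


# Illustrative PTX fragment (made up for this example).
sample_ptx = """\
.loc 1 42 5
ld.global.f32 %f1, [%rd1];
st.shared.f32 [%r1], %f1;
.loc 1 57 9
fma.rn.f32 %f4, %f2, %f3, %f4;
"""
print(ptx_lines_by_source_line(sample_ptx))
```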
The Explorer is the primary destination, but several other visualization scripts exist for specific use cases.
Generates 19 static PNG charts suitable for papers and presentations.
Requires numpy and matplotlib.
```bash
python3 scripts/generate_gallery.py \
  --demo-dir _demo_out \
  --out-dir _demo_out/gallery
```

Annotates the CUDA source file with per-line SASS metrics and region attribution, producing a standalone HTML file:
```bash
python3 scripts/annotate_source.py \
  --sass _demo_out/cupti/sassmetrics_core.json \
         _demo_out/cupti/sassmetrics_divergence.json \
  --pc2region _demo_out/nvbit/pcmap/pc2region_*.json \
  --source examples/gemm/tiled_gemm.cu \
  --labels "0:outside,1:total,2:load_A,3:load_B,4:compute,5:store" \
  --html _demo_out/annotated_source.html
```

The Chrome Trace JSON from Step 1 can be opened directly in Perfetto for a timeline view:
https://ui.perfetto.dev # drag-and-drop _demo_out/trace/gemm_trace.json
scripts/run_all_examples.sh runs all build + profiling + post-processing
steps in one shot. It accepts the same conventions used throughout this
tutorial:
```bash
bash scripts/run_all_examples.sh \
  --out=_demo_out \
  --nvbit-path=$NVBIT_PATH \
  --arch=90a \
  --sm=sm_90a
```

This produces:

- _demo_out/trace/ -- Chrome Trace JSON + summary statistics
- _demo_out/cupti/ -- PC sampling, SASS metrics (5 profiles), instrexec
- _demo_out/nvbit/ -- 6 NVBit modes (pcmap, all, inst_pipe, bb_hot, nvdisasm, ptx)
- _demo_out/join/ -- NVBit + CUPTI join analysis
- _demo_out/gallery/ -- 19 auto-generated charts (matplotlib)
- _demo_out/explorer.html -- the Explorer
- _demo_out/dashboard.html -- interactive Plotly dashboard
- _demo_out/report.html -- Plotly HTML report
- _demo_out/annotated_source.html -- annotated source
| Variable | Default | Description |
|---|---|---|
| IKP_NVBIT_ENABLE | 0 | Master enable |
| IKP_NVBIT_MODE | pcmap | pcmap, instmix, memtrace, all |
| IKP_NVBIT_KERNEL_REGEX | .* | Kernel name filter (regex) |
| IKP_NVBIT_TRACE_PATH | . | Output directory |
| IKP_NVBIT_TRACE_CAP | 0 (unlimited) | Max mem_trace records |
| IKP_NVBIT_ENABLE_INST_PIPE | 0 | Per-pipeline instruction counts |
| IKP_NVBIT_ENABLE_BB_HOT | 0 | Basic-block hotspot analysis |
| IKP_NVBIT_ENABLE_BRANCH_SITES | 0 | Per-branch taken/fallthrough analysis |
| IKP_NVBIT_DUMP_SASS | 1 | Dump NVBit SASS listing |
| IKP_NVBIT_DUMP_SASS_BY_REGION | 1 | Per-region SASS slices |
| IKP_NVBIT_DUMP_NVDISASM_SASS | 0 | High-quality nvdisasm output |
| IKP_NVBIT_DUMP_SASS_META | 0 | SASS metadata (register usage, barriers) |
| IKP_NVBIT_DUMP_SASS_LINEINFO | 0 | Source file:line in SASS comments |
| IKP_NVBIT_DUMP_PTX | 0 | PTX listing dump |
| IKP_NVBIT_DUMP_PTX_BY_REGION | 0 | Per-region PTX slices |
| IKP_NVBIT_KEEP_CUBIN | 0 | Keep extracted cubin files |
| Variable | Default | Description |
|---|---|---|
| IKP_CUPTI_PCSAMP_OUT | pcsampling.json | Output file |
| IKP_CUPTI_PCSAMP_COLLECTION_MODE | serialized | Collection mode |
| IKP_CUPTI_PCSAMP_KERNEL_REGEX | .* | Kernel name filter |
| IKP_CUPTI_PCSAMP_PERIOD | 5 | Sampling period |
| IKP_CUPTI_PCSAMP_MAX_PCS | 10000 | Max PC buffer records |
| IKP_CUPTI_PCSAMP_MAX_RECORDS | 0 | Max total records (0=unlimited) |
| IKP_CUPTI_PCSAMP_VERBOSE | 0 | Verbosity level |
| Variable | Default | Description |
|---|---|---|
| IKP_CUPTI_SASS_OUT | sassmetrics.json | Output file |
| IKP_CUPTI_SASS_PROFILE | core | Profile: core, divergence, memory, instruction_mix, branch |
| IKP_CUPTI_SASS_LAZY_PATCHING | 0 | Use lazy patching (reduces overhead) |
| IKP_CUPTI_SASS_ENABLE_SOURCE | 0 | Include source file:line (requires -lineinfo) |
| IKP_CUPTI_SASS_LIST | 0 | List available metrics and exit |
| IKP_CUPTI_SASS_LIST_OUT | (stdout) | Output file for metric list |
| Variable | Default | Description |
|---|---|---|
| IKP_CUPTI_INSTREXEC_OUT | instrexec.json | Output file |
| IKP_CUPTI_INSTREXEC_KERNEL_REGEX | .* | Kernel name filter |
| IKP_CUPTI_INSTREXEC_MAX_RECORDS | 0 | Max records (0=unlimited) |
The generate_explorer.py script expects profiling outputs organized
in the following directory structure. This is the layout produced by
run_all_examples.sh and the manual steps in this tutorial.
```
_demo_out/
  trace/
    gemm_trace.json                     # Chrome Trace JSON (Step 1)
    gemm_trace_summary.json             # Per-region statistics (Step 1)
  cupti/
    sassmetrics_core.json               # SASS metrics: core profile (Step 3a)
    sassmetrics_divergence.json         # divergence profile
    sassmetrics_memory.json             # memory profile
    sassmetrics_instruction_mix.json    # instruction_mix profile
    sassmetrics_branch.json             # branch profile
    sassmetrics_source.json             # (optional) with source mapping
    pcsampling_raw.json                 # PC sampling stall reasons (Step 3b)
    instrexec_raw.json                  # Instruction execution (Step 3c)
  nvbit/
    pcmap/
      pc2region_<kernel>_0.json         # PC -> region mapping (Step 2a)
      region_stats_<kernel>_0.json      # Per-region stats
      sass_all_<kernel>_0.sass          # SASS listing
      summary_<kernel>_0.txt            # Human-readable summary
    all/
      mem_trace_<kernel>_0.jsonl        # Memory address traces (Step 2b)
      pc2region_<kernel>_0.json
      region_stats_<kernel>_0.json
      locality_analysis.json            # (post-processing)
    inst_pipe/
      region_stats_<kernel>_0.json      # Has inst_pipe field (Step 2c)
      pc2region_<kernel>_0.json
    bb_hot/
      hotspots_<kernel>_0.json          # BB hotspots + branches (Step 2d)
      region_stats_<kernel>_0.json
      pc2region_<kernel>_0.json
    nvdisasm/
      sass_all_<kernel>_0.sass          # nvdisasm-quality SASS (Step 2e)
      cubin_<kernel>_0.cubin            # Extracted cubin
  explorer.html                         # The Explorer (Step 4)
```
The <kernel> placeholder expands to the mangled kernel name, e.g.:
tiled_gemm_kernel_float_const___float_const___float___int__int__int__int__intra_kernel_profiler__trace__GlobalBuffer_
The Explorer scans nvbit/*/pc2region_*.json and nvbit/*/region_stats_*.json
using glob patterns, so the exact subdirectory names and kernel name mangling
do not matter -- it finds them automatically.
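
The discovery step described above can be approximated with `pathlib` globbing. This is a sketch of the idea only; the function name is made up and the real generate_explorer.py may organize its search differently.

```python
from pathlib import Path


def find_nvbit_outputs(demo_dir):
    """Collect pc2region and region_stats files from every nvbit/ subdirectory."""
    root = Path(demo_dir) / "nvbit"
    return {
        "pc2region": sorted(root.glob("*/pc2region_*.json")),
        "region_stats": sorted(root.glob("*/region_stats_*.json")),
    }


# Prints empty lists if _demo_out/nvbit does not exist yet.
for kind, paths in find_nvbit_outputs("_demo_out").items():
    print(kind, [str(p) for p in paths])
```

Because the pattern matches any single subdirectory level, a run written to a differently named folder under nvbit/ would be picked up the same way.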