
[AMD][DRAFT] PC Sampling, wave stall reasonings #10020

Draft
ZelboK wants to merge 13 commits into triton-lang:main from ZelboK:feat/pc_sampling

@ZelboK ZelboK commented Apr 13, 2026

Not ready for review. Branches off from unmerged PR 9704.

Your Name and others added 13 commits March 12, 2026 18:47
Replace the deprecated roctracer-based profiling backend with a new
implementation built on rocprofiler-sdk, using late-start via
rocprofiler_force_configure so no LD_PRELOAD or tool-library preloading
is required.

Key changes:

- Add RocprofSDKProfiler with a two-context architecture:
  * codeObjectContext (always active): lightweight callback for
    kernel_id -> name registration as code objects are loaded.
  * profilingContext (on-demand): HIP runtime API callback tracing
    and buffer-based kernel dispatch tracing, started in doStart()
    and stopped in doStop() to match Proton's start/stop idiom.

- Eagerly call force_configure at  time on AMD
  so interception hooks are installed before any HSA queues are created.
  Both contexts are registered at this point, causing the SDK to install
  queue hooks. Only the lightweight codeObjectContext is activated
  immediately.

- Rewrite _select_backend() to infer the backend from the registered
  backends dict rather than calling get_current_target(), which would
  trigger HIP runtime init before force_configure can run.

- Wire up ROCTx marker tracing via libroctx64's native callback API
  (roctxRegisterTracerCallback) since rocprofiler-sdk's marker service
  requires its replacement roctx library, unavailable with late-start.

- Add RocprofApi dispatch layer (ExternLibRocprofiler) for runtime
  dlopen/dlsym of librocprofiler-sdk.so, with optional path override
  via TRITON_ROCPROFILER_SDK_LIB_PATH.

- Update CMake to discover rocprofiler-sdk headers and plumb
  ROCPROFILER_SDK_INCLUDE_DIR into the build.
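
The `_select_backend()` change above can be illustrated with a minimal sketch. This is not the code from the PR: `select_backend`, the `registered_backends` dict shape, and the driver-to-profiler name mapping are all assumptions made for illustration. The point is only that the choice is made from data already in the registry, without touching the HIP runtime.

```python
def select_backend(registered_backends):
    """Infer a profiler backend name from the registered driver backends.

    `registered_backends` is a hypothetical stand-in for the backends dict:
    keys are driver names, values are ignored here.
    """
    # Assumed mapping from driver backend to profiler backend name.
    profiler_for = {"nvidia": "cupti", "amd": "rocprofiler"}

    # Only inspect dict keys; nothing here initializes a GPU runtime.
    candidates = [profiler_for[name] for name in registered_backends
                  if name in profiler_for]
    if len(candidates) != 1:
        raise ValueError(
            "cannot infer a unique profiling backend from "
            f"{sorted(registered_backends)}"
        )
    return candidates[0]
```

With a registry containing only an `"amd"` entry this returns `"rocprofiler"`; an empty or ambiguous registry raises instead of guessing.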
…), getKernelName fix using a shared lock instead of two lock acquisitions, simplified no-correlation path, missing capture counting API. Changes to see if NVIDIA CI runner works
…easons

Implement stochastic PC sampling via rocprofiler-sdk, fix a process-
abort-on-exit caused by dual ROCm library loading, and replace the
NVIDIA-approximated stall reason mapping with proper AMD-native names.

PC sampling:
- Wire up rocprofiler_configure_pc_sampling_service with stochastic
  method and configurable interval (PROTON_PC_SAMPLING_INTERVAL env).
- Add pcSamplingBufferCallback to accumulate per-kernel samples and
  flush them into Proton's Data/Metric pipeline.
- Expose pcsampling mode for the rocprofiler backend in Python.
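
A rough sketch of the two pieces described in the list above: reading the interval from `PROTON_PC_SAMPLING_INTERVAL` and accumulating per-kernel samples until they are flushed. The real implementation lives in C++ against rocprofiler-sdk; everything here other than the environment variable name is hypothetical, including `DEFAULT_INTERVAL` (the PR does not state a default), the `(kernel_id, stall_reason)` record shape, and the class and method names.

```python
import os
from collections import defaultdict

# Hypothetical fallback; the PR does not say what the default interval is.
DEFAULT_INTERVAL = 1048576


def sampling_interval(env=None):
    """Read PROTON_PC_SAMPLING_INTERVAL, falling back to the assumed default."""
    env = os.environ if env is None else env
    raw = env.get("PROTON_PC_SAMPLING_INTERVAL")
    if raw is None or raw.strip() == "":
        return DEFAULT_INTERVAL
    value = int(raw)
    if value <= 0:
        raise ValueError("PROTON_PC_SAMPLING_INTERVAL must be positive")
    return value


class SampleAccumulator:
    """Counts samples per kernel and stall reason between flushes."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def on_buffer(self, records):
        # records: iterable of (kernel_id, stall_reason) pairs (assumed shape).
        for kernel_id, reason in records:
            self._counts[kernel_id][reason] += 1

    def flush(self):
        # Hand the accumulated counts to the caller and start over.
        out = {k: dict(v) for k, v in self._counts.items()}
        self._counts.clear()
        return out
```

Keeping the counts keyed by kernel first means a flush can emit one metric record per kernel, which is the shape the buffer callback is described as producing.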

Library dispatch fix:
- Replace RTLD_NOLOAD versioned probes in Dispatch::init with
  dl_iterate_phdr-based lookup (findLoadedLib) so the pip-installed
  ROCm libraries are reused instead of pulling in a second copy from
  /opt/rocm-*/lib/.
- Pre-populate ExternLibRocprofiler::lib from the pip installation
  directory before forceConfigure, preventing SONAME deduplication
  from silently substituting the system SDK.
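
To illustrate the idea behind `findLoadedLib`, here is a Python analogue. It does not call `dl_iterate_phdr`; it swaps in a scan of `/proc/self/maps`, which on Linux lists the same already-mapped shared objects. The function name, the prefix-matching rule, and the example paths are assumptions for illustration only.

```python
import os


def find_loaded_lib(soname_prefix, maps_text=None):
    """Return the path of an already-mapped shared object, or None.

    Linux-only stand-in for a dl_iterate_phdr walk: each line of
    /proc/self/maps is "address perms offset dev inode [pathname]".
    """
    if maps_text is None:
        with open("/proc/self/maps") as f:
            maps_text = f.read()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if os.path.basename(path).startswith(soname_prefix):
            return path
    return None
```

A caller could pass the returned path to `ctypes.CDLL(path)` so the exact file that is already mapped gets reused, rather than letting the loader resolve a bare library name to a different copy.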

AMD stall reasons:
- Add 9 AMD-specific PCSamplingMetricKind entries derived from
  rocprofiler-sdk pc_sampling.h (e.g. waitcnt, alu_dependency,
  arbiter_win_ex_stall) so output uses accurate hardware names
  instead of force-fitting into NVIDIA stall columns.
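
As a small sketch of what AMD-native stall names buy in the output, the snippet below tallies samples under their own names instead of mapping them onto another vendor's columns. Only the three names quoted above are included; the remaining entries are not listed in this description and are deliberately left out. `AmdStallReason` and `stall_histogram` are hypothetical names, not the PR's `PCSamplingMetricKind`.

```python
from collections import Counter
from enum import Enum


class AmdStallReason(Enum):
    # Only the names quoted in the commit message; the full set has 9 entries.
    WAITCNT = "waitcnt"
    ALU_DEPENDENCY = "alu_dependency"
    ARBITER_WIN_EX_STALL = "arbiter_win_ex_stall"


def stall_histogram(samples):
    """Count samples per stall reason, keyed by the AMD-native name."""
    counts = Counter()
    for reason in samples:
        # Enum lookup rejects names outside the known set with ValueError.
        counts[AmdStallReason(reason).value] += 1
    return dict(counts)
```

Rejecting unknown names, rather than folding them into a nearest match, keeps the histogram honest about what the hardware reported.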