[AMD][DRAFT] PC Sampling, wave stall reasonings #10020
Draft
ZelboK wants to merge 13 commits into triton-lang:main
Replace the deprecated roctracer-based profiling backend with a new
implementation built on rocprofiler-sdk, using late-start via
rocprofiler_force_configure so no LD_PRELOAD or tool-library preloading
is required.
Key changes:
- Add RocprofSDKProfiler with a two-context architecture:
  * codeObjectContext (always active): lightweight callback for
    kernel_id -> name registration as code objects are loaded.
  * profilingContext (on-demand): HIP runtime API callback tracing
    and buffer-based kernel dispatch tracing, started in doStart()
    and stopped in doStop() to match Proton's start/stop idiom.
- Eagerly call force_configure on AMD so that interception hooks are
  installed before any HSA queues are created. Both contexts are
  registered at this point, causing the SDK to install queue hooks.
  Only the lightweight codeObjectContext is activated immediately.
- Rewrite _select_backend() to infer the backend from the registered
  backends dict rather than calling get_current_target(), which would
  trigger HIP runtime init before force_configure can run.
- Wire up ROCTx marker tracing via libroctx64's native callback API
  (roctxRegisterTracerCallback), since rocprofiler-sdk's marker service
  requires its replacement roctx library, which is unavailable with
  late-start.
- Add RocprofApi dispatch layer (ExternLibRocprofiler) for runtime
  dlopen/dlsym of librocprofiler-sdk.so, with an optional path override
  via TRITON_ROCPROFILER_SDK_LIB_PATH.
- Update CMake to discover rocprofiler-sdk headers and plumb
  ROCPROFILER_SDK_INCLUDE_DIR into the build.
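For reviewers, the _select_backend() change can be pictured roughly as below. This is a minimal sketch, not the code in this PR: the function name `select_backend`, the dict keys "nvidia" and "amd", and the key-to-profiler mapping are assumptions made for illustration. The point is only that the choice is made from the keys of an already-populated dict, so no driver or HIP runtime call happens before force_configure.

```python
# Hypothetical mapping from registered Triton backend names to Proton
# profiler backends. The keys and values here are assumptions.
_PROFILER_FOR_BACKEND = {"nvidia": "cupti", "amd": "rocprofiler"}


def select_backend(registered_backends):
    """Infer the profiler backend from a dict of registered backends.

    Only the dict keys are inspected; the values are never touched, so
    nothing here can initialize a GPU runtime as a side effect.
    """
    candidates = sorted(
        {
            _PROFILER_FOR_BACKEND[name]
            for name in registered_backends
            if name in _PROFILER_FOR_BACKEND
        }
    )
    if len(candidates) != 1:
        # Zero or several matches: refuse to guess and ask the caller
        # to name a backend explicitly.
        raise ValueError(
            f"cannot infer profiler backend from {sorted(registered_backends)}"
        )
    return candidates[0]
```

With a single registered backend, e.g. `select_backend({"amd": object()})`, the sketch returns "rocprofiler". When both are registered it raises rather than picking one, which is a design choice of this sketch and may differ from what the PR does.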
…), getKernelName fix using a shared lock instead of two lock acquisitions, simplified no-correlation path, missing capture counting API. Changes to see if the NVIDIA CI runner works.
…easons
Implement stochastic PC sampling via rocprofiler-sdk, fix a
process-abort-on-exit caused by dual ROCm library loading, and replace
the NVIDIA-approximated stall reason mapping with proper AMD-native
names.
PC sampling:
- Wire up rocprofiler_configure_pc_sampling_service with stochastic
  method and configurable interval (PROTON_PC_SAMPLING_INTERVAL env).
- Add pcSamplingBufferCallback to accumulate per-kernel samples and
  flush them into Proton's Data/Metric pipeline.
- Expose pcsampling mode for the rocprofiler backend in Python.
Library dispatch fix:
- Replace RTLD_NOLOAD versioned probes in Dispatch::init with
  dl_iterate_phdr-based lookup (findLoadedLib) so the pip-installed
  ROCm libraries are reused instead of pulling in a second copy from
  /opt/rocm-*/lib/.
- Pre-populate ExternLibRocprofiler::lib from the pip installation
  directory before forceConfigure, preventing SONAME deduplication
  from silently substituting the system SDK.
AMD stall reasons:
- Add 9 AMD-specific PCSamplingMetricKind entries derived from
  rocprofiler-sdk pc_sampling.h (e.g. waitcnt, alu_dependency,
  arbiter_win_ex_stall) so output uses accurate hardware names instead
  of force-fitting into NVIDIA stall columns.
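To make the accumulate-then-flush flow easier to follow, here is a rough Python model of it. It is a sketch under assumptions, not the C++ in this PR: `SampleAccumulator`, `sampling_interval`, the `(kernel_name, stall_reason)` record shape, and the `DEFAULT_INTERVAL` value are all hypothetical. Only the PROTON_PC_SAMPLING_INTERVAL variable and the stall reason names come from the commit message.

```python
import os
from collections import Counter, defaultdict

# Hypothetical fallback; the real default is not stated in this PR text.
DEFAULT_INTERVAL = 1 << 20


def sampling_interval(env=None):
    """Read the sampling interval from PROTON_PC_SAMPLING_INTERVAL."""
    env = os.environ if env is None else env
    raw = env.get("PROTON_PC_SAMPLING_INTERVAL")
    if raw is None:
        return DEFAULT_INTERVAL
    value = int(raw)
    if value <= 0:
        raise ValueError("PROTON_PC_SAMPLING_INTERVAL must be positive")
    return value


class SampleAccumulator:
    """Count PC samples per kernel and per stall reason until flushed."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def on_buffer(self, records):
        # Each record is an assumed (kernel_name, stall_reason) pair,
        # standing in for what a buffer callback would deliver.
        for kernel_name, stall_reason in records:
            self._counts[kernel_name][stall_reason] += 1

    def flush(self):
        # Hand back a plain-dict snapshot and reset, so every sample is
        # reported exactly once.
        snapshot = {k: dict(c) for k, c in self._counts.items()}
        self._counts.clear()
        return snapshot
```

Keeping AMD stall reasons as their own keys (waitcnt, alu_dependency, and so on) mirrors the intent of the new PCSamplingMetricKind entries: the counts are reported under the hardware's own names rather than folded into NVIDIA columns.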
Not ready for review. Branches off from unmerged PR 9704.