[PGO][HIP] Decouple device profile drain via HSA introspection#2714

Draft
lfmeadow wants to merge 1 commit into amd-staging from device-pgo-introspection-drain

Conversation

@lfmeadow

Summary

Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device profile
drain (added in llvm#177665) with an HSA-introspection drain that, at process
exit, walks every loaded device code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, copies its
counters/data/names back to the host, and writes an arch-prefixed .profraw
via __llvm_write_custom_profile.

Host and device profile drains become fully independent. The drain runs
from an atexit handler registered in a library constructor, so device
counters are collected whether or not the host translation units were
instrumented, and without any host-side per-TU shadow, CUID matching, or
module-load interception. This fixes the cases the old 1‑1 host↔device model
could not handle: separate device-only modules (e.g. runtime hipModuleLoad),
an uninstrumented host, and multi-GPU.
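The constructor-plus-atexit pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `drainDevicesOnce`, `registerDrain`, `DrainDone`, and `DrainCount` are hypothetical names, and the drain body is a stand-in counter rather than the real HSA walk.

```cpp
#include <atomic>
#include <cstdlib>

// Hypothetical names throughout; the PR's real entry point is drainDevices().
static std::atomic<bool> DrainDone{false};
static int DrainCount = 0; // illustration only: how often the body ran

// Idempotent drain: the first caller does the work, later callers return
// immediately. This lets both the host-write hook and the atexit handler
// call it safely.
static void drainDevicesOnce() {
  if (DrainDone.exchange(true))
    return;
  ++DrainCount; // stand-in for: walk code objects, copy sections, write file
}

// Runs when the containing library or executable is loaded, before main().
// __attribute__((constructor)) is a GCC/Clang extension.
__attribute__((constructor)) static void registerDrain() {
  std::atexit(drainDevicesOnce);
}
```

Because registration happens at load time and does not depend on any host instrumentation, the handler fires at normal process exit even when no host translation unit was built with PGO. A crash that bypasses atexit skips it, which matches the limitation noted under Changes.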

Everything lives in-tree in libclang_rt.profile; there is no out-of-tree
library or LD_PRELOAD shim.

Changes

  • compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp — full rewrite to
    the HSA-introspection drain: dlopen'd HSA/HIP via the interception helpers,
    GPU-agent enumeration, hsa_ven_amd_loader_query_segment_descriptors to get
    (agent, executable) pairs, executable-symbol walk for the canonical
    __llvm_profile_sections, bounds dedup, idempotent drainDevices() reached
    by both the existing weak host-write hook and a constructor-registered
    atexit, and collision-free target names for multi-module/non-RDC. Legacy
    __llvm_profile_offload_register_* symbols kept as no-ops for ABI
    compatibility. The file is guarded host-only so the GPU profile-runtime build
    compiles it to an empty TU. atexit-only; no fatal-signal handler (a
    crash before atexit loses device counters — documented limitation).
  • clang/lib/CodeGen/CGCUDANV.cpp — delete the entire offload-profiling
    machinery (OffloadProfShadow, emitOffloadProfilingSections, the non-RDC
    __hipRegisterVar + register-shadow block, the RDC offloading-entry + per-CUID
    ctor block). Clang now emits nothing PGO-specific for HIP.
  • clang/lib/Driver/ToolChains/Gnu.cpp — for HIP host links built with PGO,
    force-link the drain object via -u__llvm_profile_hip_collect_device_data
    (it is otherwise unreferenced now that the host emits no shadow).
  • compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake — add amdgcn /
    nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime (now
    the sole source of __llvm_profile_sections) is actually built for GPU
    targets. filter_available_targets keeps host builds unaffected.
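The bounds dedup and collision-free naming mentioned for InstrProfilingPlatformROCm.cpp could look roughly like the sketch below. Everything here is an assumption made for illustration: the `ProfBounds` layout, the choice of the three section start addresses as the dedup key, and the `arch.index` naming scheme are hypothetical and are not taken from the PR.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

// Hypothetical host-side copy of one __llvm_profile_sections bounds table:
// device addresses delimiting the counters, data, and names sections.
struct ProfBounds {
  std::uint64_t CountersBegin, CountersEnd;
  std::uint64_t DataBegin, DataEnd;
  std::uint64_t NamesBegin, NamesEnd;
};

// Remembers which tables were already drained. Assumed key: the three
// section start addresses, so the same table reached through several
// (agent, executable) pairs is drained only once.
class BoundsDedup {
  std::set<std::tuple<std::uint64_t, std::uint64_t, std::uint64_t>> Seen;

public:
  // Returns true the first time a table is seen, false on repeats.
  bool insertIfNew(const ProfBounds &B) {
    return Seen.emplace(B.CountersBegin, B.DataBegin, B.NamesBegin).second;
  }
};

// Hypothetical naming scheme: arch prefix plus the index of the unique
// table, so several modules on one arch do not overwrite each other.
std::string targetName(const std::string &Arch, unsigned Index) {
  return Arch + "." + std::to_string(Index);
}
```

A drain loop under these assumptions would call `insertIfNew` for each table found during the symbol walk and copy and write only when it returns true, passing `targetName(Arch, N)` for the N-th unique table.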

Tests

Tier A (hardware-free, run in CI):

  • clang/test/CodeGenHIP/offload-pgo-sections.hip — asserts the per-CUID
    struct / host shadow / register-shadow ctor are no longer emitted, while
    device counter instrumentation still is.
  • clang/test/Driver/hip-profile-device-drain.hip: asserts that
    -u__llvm_profile_hip_collect_device_data is added for HIP+PGO links and
    only then.
  • compiler-rt/test/profile/instrprof-offload-abi-compat.c — objects
    referencing the legacy __llvm_profile_offload_register_* symbols still link
    and run against the new runtime.

Tier B (REQUIRES: amdgpu; also adds the lit .hip suffix and an amdgpu feature gate):

  • compiler-rt/test/profile/AMDGPU/device-basic.hip — RDC + non-RDC; host and
    device .profraw produced, merged profile contains the device kernel, and
    llvm-cov reports device coverage.
  • device-no-kernel.hip — instrumented HIP program that launches no kernel:
    host .profraw produced, device drain is a clean no-op (no crash, no spurious
    file).
  • device-symbols.hip: __llvm_profile_sections is present (PROTECTED, in
    .dynsym) in the device ELF for RDC + non-RDC.

Test plan

  • Build clang + lld + host libclang_rt.profile + amdgcn device profile runtime.
  • Device ELF carries __llvm_profile_sections (PROTECTED/dynsym) for RDC and non-RDC.
  • End-to-end on gfx90a (MI210): host + device .profraw, merge, llvm-cov device coverage.
  • Multi-agent walk enumerates all 4 agents and dedups to one device drain.
  • No-kernel HIP program drains cleanly with no spurious device file.
  • Legacy ABI no-op symbols link and run.
  • New clang Driver / CodeGenHIP tests pass via llvm-lit.
  • RCCL all_reduce_perf smoke (validated previously out-of-tree with the same algorithm; not re-run here).

Made with Cursor

Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.

Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).

  * compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
    to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
    hsa_ven_amd_loader segment descriptors, executable symbol walk,
    bounds dedup, idempotent drainDevices + atexit, collision-free
    target names). Legacy __llvm_profile_offload_register_* symbols kept
    as no-ops for ABI compatibility. Guarded host-only for GPU builds.

  * clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
    machinery (OffloadProfShadow, emitOffloadProfilingSections, both
    shadow-registration sites). Clang emits nothing PGO-specific for HIP.

  * clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
    PGO, force-link the drain object via
    -u__llvm_profile_hip_collect_device_data (it is otherwise
    unreferenced now that the host emits no shadow).

  * compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
    nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
    (the sole source of __llvm_profile_sections after this change) is
    built for GPU targets.

Tests:
  * clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
    per-CUID struct / shadow / registration are NOT emitted.
  * clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
    added for HIP+PGO and only then.
  * compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
    no-op symbols still link and run.
  * compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
    device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu
    feature gate.
Co-authored-by: Cursor <cursoragent@cursor.com>
@lfmeadow lfmeadow marked this pull request as draft May 29, 2026 02:43