Skip to content

[CK DSL] CK DSL provider proof of concept for hipDNN#7916

Draft
DarylHawkinsAMD wants to merge 19 commits into
users/vanantha/ck-dsl-prototypefrom
users/dahawkin/ck-dsl-provider
Draft

[CK DSL] CK DSL provider proof of concept for hipDNN#7916
DarylHawkinsAMD wants to merge 19 commits into
users/vanantha/ck-dsl-prototypefrom
users/dahawkin/ck-dsl-provider

Conversation

@DarylHawkinsAMD
Copy link
Copy Markdown
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@DarylHawkinsAMD DarylHawkinsAMD changed the title [CK DSL] POC CK DSL provider for hipDNN [CK DSL] CK DSL provider proof of concept for hipDNN May 30, 2026
DarylHawkinsAMD and others added 19 commits May 30, 2026 18:55
Plan v0.8 for a new hipDNN engine plugin that exposes CK DSL-produced
kernels via runtime JIT compilation. Captures decided architecture
(embedded pybind11 interpreter, C++ adapter layer, in-memory JIT cache),
the M1 milestone (implicit-GEMM conv with hipEvent timing and
CpuFpReferenceConvolution verification), the implementation stream
decomposition, and the resolved decision log so the next session can
pick up from prep step P-1 without re-litigating context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Creates dnn-providers/ck-dsl-provider/ as a buildable-but-empty hipDNN
engine plugin: CkDslConvImplicitGemmEngine reports no applicable graphs
and the .so links cleanly through the SDK's EnginePluginImpl.inl macros.
Wires the provider into the rocm-libraries superbuild as a new component
and preset.

Also updates plan v0.8 -> v0.9 to rename the single M1 engine after its
op so the eventual per-spec engine fan-out (M5) is additive rather than
a refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Links the plugin .so to libpython3 + pybind11 and initializes a
per-process scoped_interpreter from CkDslContainer's ctor (the
SharedContainerManager's per-process hook). The interpreter is heap-
allocated and intentionally never finalized so sibling plugins that
share CPython are unaffected at plugin unload.

CMake pins discovery to the system Python (Python3_ROOT_DIR=/usr,
FIND_STRATEGY=LOCATION) to avoid uv-managed Pythons in ~/.local that
fail at runtime with a codec-of-the-filesystem-encoding error.

Adds three gtest cases (InitializesOnce, CanImportStdlib,
SurvivesGilReentry) under a new ck_dsl_provider_unit_tests target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge owns the imported ck_dsl_provider.compile_service
module and is constructed by CkDslContainer right after the embedded
interpreter is up. sys.path is prepended (idempotently) with two
CMake-baked absolute paths: the provider-local python/ directory and
projects/composablekernel/python/, so both ck_dsl_provider and ck_dsl
import cleanly without any packaging metadata or PYTHONPATH env. The
generated ckdsl_provider_paths.h carries those literals.

PythonError translates pybind11::error_already_set into a
HipdnnPluginException, preserving the Python exception type and message.
Bridge methods destroy the error object inside the GIL scope so the
Python objects it owns release cleanly.

For I-3 the bridge only exposes a noop_smoke() roundtrip that imports
ck_dsl and returns its __file__; the real compile() entry point is the
I-7 milestone. Three new gtest cases cover the happy path, error
translation, and sys.path idempotence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge holds a py::module_ member. The defaulted dtor let
that member decref its PyObject* on whatever thread tore the container
down — at process exit that thread does not hold the GIL, which is
undefined behaviour and asserts in CPython debug builds. Custom dtor
acquires the GIL and clears the module before letting the member's
default destructor run on an empty object.

Also documents the sys.path-idempotence and bridge-thread-safety
assumptions inline (reviewer minor notes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carries forward across a push + re-clone on another node:

- STATUS.md         resume-point handoff: HEAD, plan progress, next
                    step (I-4), gfx950 hardware constraint, gotchas.
- prep_findings/    synthesis of the §6.1 P-1..P-7 prep results that
                    everything from I-1 onward references.
- pybind11_rtld_local_spike/  reproducible source of the P-3 spike
                    that proved pybind11 works inside a hipDNN plugin
                    .so loaded RTLD_LOCAL.

WIP/.gitignore drops *.log and */build/ so subsequent runs do not
re-dirty the tree.

All of this can be deleted post-M1 once the durable docs live in
their permanent homes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the embedded compile service to a real HSACO load + launch on
gfx950: compile_service.compile_smoke() builds an FP16 elementwise-copy
kernel via ck_dsl, the C++ CompileServiceBridge translates the Python
dict into a KernelArtifact, and HipModule + LaunchAbi load and launch
it through hipModuleLoadData / hipModuleLaunchKernel.

The new runtime/ tree (KernelArtifact, LaunchAbi, HipModule) follows
the P-1 design memo: schema-driven arg packing replaces launcher.cpp's
hardcoded per-kind layout, an explicit ldsBytes field closes the
launcher's dynamic-LDS gap, and HipModule's RAII wrapper unloads the
module on dtor and on a failed hipModuleGetFunction.

HipModuleSmokeTest gates on hipGetDeviceCount() > 0 so the host-only
CI lane stays green; on a gfx950 node both cases pass and the
elementwise copy correctly writes FP16 1.0 (0x3C00) to its output.
SignatureHash -> shared_ptr<HipModule> map, mutex-guarded, with a
getOrLoad(key, loader) that invokes the loader at most once per key
for the cache's lifetime. M1 has no eviction policy and no negative
cache; both are M2+ work per plan §3.4.

The loader runs under the cache mutex so concurrent misses on the
same key wait for the first compile rather than racing it. For M1
the provider is effectively single-threaded per handle so this lock
is uncontended; a per-key shared_future scheme is the natural M2+
upgrade once concurrent compiles on distinct keys become a real
workload.

JitCacheSmoke validates the miss-then-hit shape end-to-end against a
real compile via compileSmoke: the loader runs exactly once across
two calls, both calls return the same HipModule, and the hit comes
back in <10 ms while the miss takes hundreds of ms. Two host-only
tests cover the empty-loader guard and the empty-cache miss case.
Full suite: 12/12 green.
Pure-C++ ConvImplicitGemmSpec mirrors the ck_dsl ImplicitGemmConvSpec
dataclass (13 ConvProblem fields + 22 spec fields). Constexpr defaults
match the bake-off values from PREP_FINDINGS P-5, not the dataclass
defaults: tile_k=64 (not 128), warp_tile_m/n=32 (not 16), warp_tile_k=16
(not 32), epilogue="cshuffle" (not "default"). lds_layout is omitted;
the Python dataclass re-derives it from async_dma / lds_k_pad / tile_k.

ConvImplicitGemmAdapter::buildSpec walks a single ConvolutionFwdAttributes
+ its tensor map and populates the ~15 graph-derived fields. Validates
2-D-only, FP16-only, symmetric padding, CROSS_CORRELATION, 4-D dims,
matching X/W channel dim, and Y N/K consistency. All scalars narrowed
from int64 to int32 with a range check (the DSL signature is i32).

ConvImplicitGemmPayload::convImplicitGemmSpecToPayload translates the
spec into the py::dict that I-7's compile_service.compile will consume
via **payload splat. The dict is field-for-field with the dataclass so
a divergence (extra/missing field) fails loudly at the dataclass ctor.

Six host-only adapter unit tests cover the bake-off shape happy path
plus the rejection paths (asymmetric padding, 3-D conv, true
convolution mode, missing tensor, non-HALF dtype). Two payload tests
exercise the GIL-aware translation and the round-trip-through-Python
divergence canary that splats the C++ payload into the real
ImplicitGemmConvSpec dataclass and reads back its derived props.
Full suite: 20/20 green.
Wires the end-to-end miss/hit pipeline: hipDNN graph -> applicability
check -> adapter -> signature -> JitCache -> bridge.compile -> Python
compile_service -> HSACO -> HipModule. Execute() is still a stub
(I-8 wires the launch).

New surfaces:

* `compile_service.compile(op_kind, payload)` dispatches by op_kind;
  for "conv_implicit_gemm" it splats the C++ payload through
  `ImplicitGemmConvSpec(problem=ConvProblem(**...), **rest)`, builds
  the kernel, returns the same dict shape as compile_smoke with the
  6-arg conv kernel ABI (3 ptrs + 3 i32 bytes) and the canonical
  grid `(num_pid_n, num_pid_m, 1)` / block `(block_size, 1, 1)`.

* `CompileServiceBridge::compile(opKind, payload)` factors the
  dict->KernelArtifact translation out of compileSmoke and exposes the
  production entry point.

* `GraphSignature::computeForConvFwd` is an FNV-1a 64-bit hash over
  op_kind + dtypes + tensor dims + conv attrs + the provider/DSL git
  SHA (via CK_DSL_PROVIDER_VERSION_STRING). DSL changes invalidate
  every prior key by construction.

* `HipModule` now retains the launch metadata (grid, block, ldsBytes,
  argSchema, kind) from the artifact at ctor time so the plan layer
  doesn't have to thread the artifact alongside the module.

* `ConvImplicitGemmPlanBuilder` owns the per-op JitCache. isApplicable
  runs the adapter as the validator; buildPlan derives the signature,
  hits the cache, on miss compiles via the bridge with the payload
  from `convImplicitGemmSpecToPayload`, and stores a
  `ConvImplicitGemmPlan` (stub execute() until I-8) on the context.

* `CkDslConvImplicitGemmEngine` now delegates all five IEngine
  virtuals to the plan builder. CkDslContainer replaces the static
  factory with a member createEngine that captures the bridge.

Tests: 13 new cases (33 total, all green). 4 host-only plan-builder
cases (isApplicable true/false, workspace, knobs), 2 GPU-gated
(BuildPlanCachesOnSecondCall verifies the second buildPlan reuses the
HipModule and completes in <50ms vs ~6.5s for the first compile; the
launch metadata cross-checks plan §4: grid=(1,392,1), block=(256,1,1),
6-arg schema; PlanExecuteIsStub asserts the I-8 sentinel), 7
GraphSignature host-only (determinism + per-axis sensitivity to
op_kind, shape, stride, padding, dtype, plus missing-tensor reject).
Replaces the I-7 stub with the real launch path:

* findDeviceBuffer scans the deviceBuffers array for X/W/Y by uid
  (matching miopen-provider's pattern). Missing uid throws with the
  searched uid and array size in the message.
* LaunchAbi::pack builds the 6-arg kernel buffer (3 device pointers
  + 3 i32 buffer-rsrc byte bounds) against the artifact's argSchema
  carried on the HipModule.
* HipModule::launch fires hipModuleLaunchKernel via the handle's
  stream, using grid/block/ldsBytes captured from the artifact at
  load time.

PlanBuilder computes xBytes/wBytes/yBytes from the spec's ConvProblem
geometry at build time (N*H*W*C*2, K*R*S*C*2, N*Ho*Wo*K*2 for FP16)
and embeds them in the plan -- the graph isn't in scope at execute()
and the sizes are static per signature. Guards against the i32
overflow case (the kernel signature is i32 for A_bytes/B_bytes/D_bytes).

Tests: PlanExecuteIsStub -> PlanExecuteLaunches (GPU-gated). Allocates
real 3.2MB X/Y + 73KB W FP16 buffers, sets W = 0xab as sentinel for
unwritten output, runs execute(), syncs, verifies zero input + zero
weight -> zero output (i.e. the kernel actually ran). Plus a new
ExecuteRejectsMissingDeviceBuffer host-only case (with a GPU skip
because the buildPlan step still needs to compile + load the module
into a real HIP module). Full suite: 34/34 green.
Self-contained warmup-and-iterate timing utility per PREP_FINDINGS
P-7. Defaults: 5 warmup (matches launcher.cpp:559) + 50 timed (half
of launcher.cpp's 100, keeps the integration test under ~1s of kernel
time while keeping the median stable).

Protocol:
* Per-iter event pairs (one start + one stop per timed iter). All
  records issued on the supplied stream so HIP serialises them.
* Single hipEventSynchronize on the final stop event drains all
  prior pairs at once -- no per-iter hipDeviceSynchronize.
* Min + median microseconds reported. Median uses partial nth_element
  (one pass for odd N, two for even N) instead of a full sort.
* TFLOPS = flops / median_seconds / 1e12 when flops > 0; zero
  otherwise so the smoke kernel (no defined arithmetic intensity)
  doesn't produce a spurious value.
* Logging-only -- no perf-target assertions, matching the plan Q9
  resolution that M1 logs absolute TFLOPS and defers comparison
  baselines to M2.

API: PerfMeasurement::measure(launchFn, flops, stream) takes any
callable (template, dispatched through std::function so the HIP
machinery stays in the .cpp). PerfResult carries warmup/timed iter
counts so the log line is self-describing. log() emits
[CkDslPerf] tag=... warmup=N iters=M min_us=X.X median_us=Y.Y tflops=Z.Z
through HIPDNN_PLUGIN_LOG_INFO.

Tests (8 new, 42 total green): 6 host-only stats cases (odd-N
median, even-N median, single sample, empty rejection, zero-timed-
iters rejection, P-7 defaults) plus 2 GPU-gated end-to-end cases
that drive the helper against the compileSmoke kernel and verify
flops==0 -> tflops==0 and the flops formula round-trip.
End-to-end JIT-path integration test for the M1 capstone shape
(N=8, 56x56x64 -> 64, 3x3, stride 1, pad 1, FP16, NHWC):

1. Builds a single-op conv-fwd FB graph via the test SDK helper.
2. Allocates Tensor<half> for X/W/Y with logical NCHW dims + NHWC
   physical strides via TensorLayout::NHWC (and KCRS strides for W).
3. Seeds X/W with deterministic random values in [-0.1, 0.1] so the
   K_gemm=576 accumulator stays in the numerically-friendly part of
   FP16.
4. Drives the JIT path through ConvImplicitGemmPlanBuilder (engine
   -> adapter -> signature -> bridge -> compile_service -> HSACO ->
   HipModule), executes the resulting plan via the same UID-keyed
   device-buffer interface the SDK uses at runtime.
5. Runs CpuFpReferenceConvolution::fprop on the same host-side
   tensors and asserts element-wise allclose at 5e-2 abs tolerance
   over the full ~1.6M-element NHWC output. The tolerance is loose
   relative to the analytical bound (~2.4e-4 for K=576 fp16
   accumulations on uniform [-0.1, 0.1] inputs) so the test isn't
   brittle to minor codegen reshufflings.
6. PerfMeasurement(default 5/50) over plan.execute(), logs via
   HIPDNN_PLUGIN_LOG_INFO in the [CkDslPerf] format from P-7.
   Verified on this gfx950 (MI350-series): 131 TFLOPS median for
   the bake-off shape with the bake-off knobs (vs the example's 248
   on MI300X; M2 autotuning will close the gap).

**Adaptation from plan §1:** the test bypasses the hipDNN frontend
API and the backend's .so-loading plugin path. Both surfaces are
architecturally additive on top of what this test already proves --
the plan-builder + plan-execute path is the exact same code the
backend would call after dlopen. Folding in the frontend Graph
API + plugin loader is M1.5 / I-11 work.

Full suite: 43/43 green, ~19s wall (15s for this test on a cold
comgr).
Moves IntegrationGpuCkDslConvFp16 into a dedicated integration_tests/
subdir with its own gtest binary + main, and calls
finalize_test_targets("ck-dsl-provider") to register the prefixed
ctest targets the plan §6.2 I-11 advertises:

* ninja ck-dsl-provider-unit-check       (label "unit_test", ~10 s)
* ninja ck-dsl-provider-integration-check (label "integration_test", ~9 s)
* ninja ck-dsl-provider-check            (both lanes)

Plus the -verbose variants per the Tests.cmake pattern.

Tests split:
* tests/                                  42 cases, all host-only or
                                          GPU-gated, none requiring
                                          the bake-off shape.
* integration_tests/                      1 case (BakeOffConv); the
                                          end-to-end JIT-path
                                          capstone from I-10.

pre-commit run over the full provider tree (every .cpp / .hpp / .py
/ CMakeLists.txt / .cmake / .json / .md / .in under
dnn-providers/ck-dsl-provider/) -- all hooks pass: trailing
whitespace, end-of-file, large-file, black, clang-format, cmake-lint.

WIP/STATUS.md updated to reflect M1 completion, including the
measured 131 TFLOPS on gfx950 (MI350-series) for the bake-off shape
and the M1.5 / M2 deferral list (frontend Graph API + .so loader
integration, autotuning, second op, on-disk cache).
Add an Architecture section with two Mermaid diagrams: a layered
component view and an end-to-end request->compile->launch sequence.
Both verified to render with mermaid-cli 11.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert IntegrationGpuCkDslConvFp16 from a single hardcoded
BakeOffConv TEST_F into a parameterized TEST_P over a ConvCase
shape list. The original bake-off shape is preserved as the
BakeOff case.

The bake-off shape is fully tile-aligned (M=N*Ho*Wo, GEMM-N=K,
GEMM-K=C*R*S all multiples of the kernel's 64-wide tile), leaving
partial-tile boundary handling unverified. The shape set now adds
tile-aligned variants (stride 2, 1x1, C/K=128, non-square R!=S,
dilation 2) and partial-tile probes (partial GEMM-N, GEMM-K,
GEMM-M, and all three at once).

All 10 shapes pass on gfx950 with worst abs diff 6e-5..2.4e-4
against the CPU reference (5e-2 tolerance).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Non-functional cleanups in ck-dsl-provider; no change to the JIT or
launch behaviour.

Graph signature:
- Remove the legacy FlatBuffer-walking GraphSignature::computeForConvFwd.
  The cache key now derives solely from the adapter-built
  ConvImplicitGemmSpec via computeForSpec, keeping the adapter the single
  FlatBuffer reader and removing a parallel read path that could drift.
- Rewrite GraphSignatureTest around computeForSpec: per-field and
  codegen-knob sensitivity, position-aliasing, optional-knob presence,
  and an adapter-built-spec round-trip.
- Clarify the SignatureHash doc and spell out the on-disk-cache (M3)
  precondition: dtype/arch/layout must join the key before entries can
  be persisted.

Naming:
- Rename badParam -> throwBadParam in ConvImplicitGemmAdapter to reflect
  that it is [[noreturn]].

Dead code (zero consumers, confirmed against the provider, tests, and
the plugin ABI):
- EmbeddedInterpreter::importCheck -- I-3 bring-up scaffolding, unused
  since the integration test landed.
- HipModule::kind() and its write-only _kind member.
- The unused HipModule::launch(std::vector, ...) overload.
- CkDslConvImplicitGemmEngine::planBuilderForTesting() -- a test seam
  with no test.

Kept deliberately as forward-compat per dsl_docs/hipdnn_provider/plan.md:
KernelArtifact::isa (M3 disk cache) and the I64/F32/F16 launch-ABI
surface (M2/M5 ops).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Design plan" section to the ck-dsl-provider README pointing at
the CK DSL hipDNN Provider plan as the source of truth for milestone
scope, the runtime embedded-Python architecture, and resolved design
questions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DarylHawkinsAMD DarylHawkinsAMD force-pushed the users/dahawkin/ck-dsl-provider branch from a43ac83 to 353c954 Compare May 30, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant