[CK DSL] CK DSL provider proof of concept for hipDNN (ALMIOPEN-1985)#7916
Draft
DarylHawkinsAMD wants to merge 27 commits into
Draft
[CK DSL] CK DSL provider proof of concept for hipDNN (ALMIOPEN-1985)#7916DarylHawkinsAMD wants to merge 27 commits into
DarylHawkinsAMD wants to merge 27 commits into
Conversation
a43ac83 to
353c954
Compare
Plan v0.8 for a new hipDNN engine plugin that exposes CK DSL-produced kernels via runtime JIT compilation. Captures decided architecture (embedded pybind11 interpreter, C++ adapter layer, in-memory JIT cache), the M1 milestone (implicit-GEMM conv with hipEvent timing and CpuFpReferenceConvolution verification), the implementation stream decomposition, and the resolved decision log so the next session can pick up from prep step P-1 without re-litigating context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Creates dnn-providers/ck-dsl-provider/ as a buildable-but-empty hipDNN engine plugin: CkDslConvImplicitGemmEngine reports no applicable graphs and the .so links cleanly through the SDK's EnginePluginImpl.inl macros. Wires the provider into the rocm-libraries superbuild as a new component and preset. Also updates plan v0.8 -> v0.9 to rename the single M1 engine after its op so the eventual per-spec engine fan-out (M5) is additive rather than a refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Links the plugin .so to libpython3 + pybind11 and initializes a per-process scoped_interpreter from CkDslContainer's ctor (the SharedContainerManager's per-process hook). The interpreter is heap- allocated and intentionally never finalized so sibling plugins that share CPython are unaffected at plugin unload. CMake pins discovery to the system Python (Python3_ROOT_DIR=/usr, FIND_STRATEGY=LOCATION) to avoid uv-managed Pythons in ~/.local that fail at runtime with a codec-of-the-filesystem-encoding error. Adds three gtest cases (InitializesOnce, CanImportStdlib, SurvivesGilReentry) under a new ck_dsl_provider_unit_tests target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge owns the imported ck_dsl_provider.compile_service module and is constructed by CkDslContainer right after the embedded interpreter is up. sys.path is prepended (idempotently) with two CMake-baked absolute paths: the provider-local python/ directory and projects/composablekernel/python/, so both ck_dsl_provider and ck_dsl import cleanly without any packaging metadata or PYTHONPATH env. The generated ckdsl_provider_paths.h carries those literals. PythonError translates pybind11::error_already_set into a HipdnnPluginException, preserving the Python exception type and message. Bridge methods destroy the error object inside the GIL scope so the Python objects it owns release cleanly. For I-3 the bridge only exposes a noop_smoke() roundtrip that imports ck_dsl and returns its __file__; the real compile() entry point is the I-7 milestone. Three new gtest cases cover the happy path, error translation, and sys.path idempotence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge holds a py::module_ member. The defaulted dtor let that member decref its PyObject* on whatever thread tore the container down — at process exit that thread does not hold the GIL, which is undefined behaviour and asserts in CPython debug builds. Custom dtor acquires the GIL and clears the module before letting the member's default destructor run on an empty object. Also documents the sys.path-idempotence and bridge-thread-safety assumptions inline (reviewer minor notes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carries forward across a push + re-clone on another node:
- STATUS.md resume-point handoff: HEAD, plan progress, next
step (I-4), gfx950 hardware constraint, gotchas.
- prep_findings/ synthesis of the §6.1 P-1..P-7 prep results that
everything from I-1 onward references.
- pybind11_rtld_local_spike/ reproducible source of the P-3 spike
that proved pybind11 works inside a hipDNN plugin
.so loaded RTLD_LOCAL.
WIP/.gitignore drops *.log and */build/ so subsequent runs do not
re-dirty the tree.
All of this can be deleted post-M1 once the durable docs live in
their permanent homes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the embedded compile service to a real HSACO load + launch on gfx950: compile_service.compile_smoke() builds an FP16 elementwise-copy kernel via ck_dsl, the C++ CompileServiceBridge translates the Python dict into a KernelArtifact, and HipModule + LaunchAbi load and launch it through hipModuleLoadData / hipModuleLaunchKernel. The new runtime/ tree (KernelArtifact, LaunchAbi, HipModule) follows the P-1 design memo: schema-driven arg packing replaces launcher.cpp's hardcoded per-kind layout, an explicit ldsBytes field closes the launcher's dynamic-LDS gap, and HipModule's RAII wrapper unloads the module on dtor and on a failed hipModuleGetFunction. HipModuleSmokeTest gates on hipGetDeviceCount() > 0 so the host-only CI lane stays green; on a gfx950 node both cases pass and the elementwise copy correctly writes FP16 1.0 (0x3C00) to its output.
SignatureHash -> shared_ptr<HipModule> map, mutex-guarded, with a getOrLoad(key, loader) that invokes the loader at most once per key for the cache's lifetime. M1 has no eviction policy and no negative cache; both are M2+ work per plan §3.4. The loader runs under the cache mutex so concurrent misses on the same key wait for the first compile rather than racing it. For M1 the provider is effectively single-threaded per handle so this lock is uncontended; a per-key shared_future scheme is the natural M2+ upgrade once concurrent compiles on distinct keys become a real workload. JitCacheSmoke validates the miss-then-hit shape end-to-end against a real compile via compileSmoke: the loader runs exactly once across two calls, both calls return the same HipModule, and the hit comes back in <10 ms while the miss takes hundreds of ms. Two host-only tests cover the empty-loader guard and the empty-cache miss case. Full suite: 12/12 green.
Pure-C++ ConvImplicitGemmSpec mirrors the ck_dsl ImplicitGemmConvSpec dataclass (13 ConvProblem fields + 22 spec fields). Constexpr defaults match the bake-off values from PREP_FINDINGS P-5, not the dataclass defaults: tile_k=64 (not 128), warp_tile_m/n=32 (not 16), warp_tile_k=16 (not 32), epilogue="cshuffle" (not "default"). lds_layout is omitted; the Python dataclass re-derives it from async_dma / lds_k_pad / tile_k. ConvImplicitGemmAdapter::buildSpec walks a single ConvolutionFwdAttributes + its tensor map and populates the ~15 graph-derived fields. Validates 2-D-only, FP16-only, symmetric padding, CROSS_CORRELATION, 4-D dims, matching X/W channel dim, and Y N/K consistency. All scalars narrowed from int64 to int32 with a range check (the DSL signature is i32). ConvImplicitGemmPayload::convImplicitGemmSpecToPayload translates the spec into the py::dict that I-7's compile_service.compile will consume via **payload splat. The dict is field-for-field with the dataclass so a divergence (extra/missing field) fails loudly at the dataclass ctor. Six host-only adapter unit tests cover the bake-off shape happy path plus the rejection paths (asymmetric padding, 3-D conv, true convolution mode, missing tensor, non-HALF dtype). Two payload tests exercise the GIL-aware translation and the round-trip-through-Python divergence canary that splats the C++ payload into the real ImplicitGemmConvSpec dataclass and reads back its derived props. Full suite: 20/20 green.
Wires the end-to-end miss/hit pipeline: hipDNN graph -> applicability check -> adapter -> signature -> JitCache -> bridge.compile -> Python compile_service -> HSACO -> HipModule. Execute() is still a stub (I-8 wires the launch). New surfaces: * `compile_service.compile(op_kind, payload)` dispatches by op_kind; for "conv_implicit_gemm" it splats the C++ payload through `ImplicitGemmConvSpec(problem=ConvProblem(**...), **rest)`, builds the kernel, returns the same dict shape as compile_smoke with the 6-arg conv kernel ABI (3 ptrs + 3 i32 bytes) and the canonical grid `(num_pid_n, num_pid_m, 1)` / block `(block_size, 1, 1)`. * `CompileServiceBridge::compile(opKind, payload)` factors the dict->KernelArtifact translation out of compileSmoke and exposes the production entry point. * `GraphSignature::computeForConvFwd` is an FNV-1a 64-bit hash over op_kind + dtypes + tensor dims + conv attrs + the provider/DSL git SHA (via CK_DSL_PROVIDER_VERSION_STRING). DSL changes invalidate every prior key by construction. * `HipModule` now retains the launch metadata (grid, block, ldsBytes, argSchema, kind) from the artifact at ctor time so the plan layer doesn't have to thread the artifact alongside the module. * `ConvImplicitGemmPlanBuilder` owns the per-op JitCache. isApplicable runs the adapter as the validator; buildPlan derives the signature, hits the cache, on miss compiles via the bridge with the payload from `convImplicitGemmSpecToPayload`, and stores a `ConvImplicitGemmPlan` (stub execute() until I-8) on the context. * `CkDslConvImplicitGemmEngine` now delegates all five IEngine virtuals to the plan builder. CkDslContainer replaces the static factory with a member createEngine that captures the bridge. Tests: 13 new cases (33 total, all green). 4 host-only plan-builder cases (isApplicable true/false, workspace, knobs), 2 GPU-gated (BuildPlanCachesOnSecondCall verifies the second buildPlan reuses the HipModule and completes in <50ms vs ~6.5s for the first compile; the launch metadata cross-checks plan §4: grid=(1,392,1), block=(256,1,1), 6-arg schema; PlanExecuteIsStub asserts the I-8 sentinel), 7 GraphSignature host-only (determinism + per-axis sensitivity to op_kind, shape, stride, padding, dtype, plus missing-tensor reject).
Replaces the I-7 stub with the real launch path: * findDeviceBuffer scans the deviceBuffers array for X/W/Y by uid (matching miopen-provider's pattern). Missing uid throws with the searched uid and array size in the message. * LaunchAbi::pack builds the 6-arg kernel buffer (3 device pointers + 3 i32 buffer-rsrc byte bounds) against the artifact's argSchema carried on the HipModule. * HipModule::launch fires hipModuleLaunchKernel via the handle's stream, using grid/block/ldsBytes captured from the artifact at load time. PlanBuilder computes xBytes/wBytes/yBytes from the spec's ConvProblem geometry at build time (N*H*W*C*2, K*R*S*C*2, N*Ho*Wo*K*2 for FP16) and embeds them in the plan -- the graph isn't in scope at execute() and the sizes are static per signature. Guards against the i32 overflow case (the kernel signature is i32 for A_bytes/B_bytes/D_bytes). Tests: PlanExecuteIsStub -> PlanExecuteLaunches (GPU-gated). Allocates real 3.2MB X/Y + 73KB W FP16 buffers, sets W = 0xab as sentinel for unwritten output, runs execute(), syncs, verifies zero input + zero weight -> zero output (i.e. the kernel actually ran). Plus a new ExecuteRejectsMissingDeviceBuffer host-only case (with a GPU skip because the buildPlan step still needs to compile + load the module into a real HIP module). Full suite: 34/34 green.
Self-contained warmup-and-iterate timing utility per PREP_FINDINGS P-7. Defaults: 5 warmup (matches launcher.cpp:559) + 50 timed (half of launcher.cpp's 100, keeps the integration test under ~1s of kernel time while keeping the median stable). Protocol: * Per-iter event pairs (one start + one stop per timed iter). All records issued on the supplied stream so HIP serialises them. * Single hipEventSynchronize on the final stop event drains all prior pairs at once -- no per-iter hipDeviceSynchronize. * Min + median microseconds reported. Median uses partial nth_element (one pass for odd N, two for even N) instead of a full sort. * TFLOPS = flops / median_seconds / 1e12 when flops > 0; zero otherwise so the smoke kernel (no defined arithmetic intensity) doesn't produce a spurious value. * Logging-only -- no perf-target assertions, matching the plan Q9 resolution that M1 logs absolute TFLOPS and defers comparison baselines to M2. API: PerfMeasurement::measure(launchFn, flops, stream) takes any callable (template, dispatched through std::function so the HIP machinery stays in the .cpp). PerfResult carries warmup/timed iter counts so the log line is self-describing. log() emits [CkDslPerf] tag=... warmup=N iters=M min_us=X.X median_us=Y.Y tflops=Z.Z through HIPDNN_PLUGIN_LOG_INFO. Tests (8 new, 42 total green): 6 host-only stats cases (odd-N median, even-N median, single sample, empty rejection, zero-timed- iters rejection, P-7 defaults) plus 2 GPU-gated end-to-end cases that drive the helper against the compileSmoke kernel and verify flops==0 -> tflops==0 and the flops formula round-trip.
End-to-end JIT-path integration test for the M1 capstone shape (N=8, 56x56x64 -> 64, 3x3, stride 1, pad 1, FP16, NHWC): 1. Builds a single-op conv-fwd FB graph via the test SDK helper. 2. Allocates Tensor<half> for X/W/Y with logical NCHW dims + NHWC physical strides via TensorLayout::NHWC (and KCRS strides for W). 3. Seeds X/W with deterministic random values in [-0.1, 0.1] so the K_gemm=576 accumulator stays in the numerically-friendly part of FP16. 4. Drives the JIT path through ConvImplicitGemmPlanBuilder (engine -> adapter -> signature -> bridge -> compile_service -> HSACO -> HipModule), executes the resulting plan via the same UID-keyed device-buffer interface the SDK uses at runtime. 5. Runs CpuFpReferenceConvolution::fprop on the same host-side tensors and asserts element-wise allclose at 5e-2 abs tolerance over the full ~1.6M-element NHWC output. The tolerance is loose relative to the analytical bound (~2.4e-4 for K=576 fp16 accumulations on uniform [-0.1, 0.1] inputs) so the test isn't brittle to minor codegen reshufflings. 6. PerfMeasurement(default 5/50) over plan.execute(), logs via HIPDNN_PLUGIN_LOG_INFO in the [CkDslPerf] format from P-7. Verified on this gfx950 (MI350-series): 131 TFLOPS median for the bake-off shape with the bake-off knobs (vs the example's 248 on MI300X; M2 autotuning will close the gap). **Adaptation from plan §1:** the test bypasses the hipDNN frontend API and the backend's .so-loading plugin path. Both surfaces are architecturally additive on top of what this test already proves -- the plan-builder + plan-execute path is the exact same code the backend would call after dlopen. Folding in the frontend Graph API + plugin loader is M1.5 / I-11 work. Full suite: 43/43 green, ~19s wall (15s for this test on a cold comgr).
Moves IntegrationGpuCkDslConvFp16 into a dedicated integration_tests/
subdir with its own gtest binary + main, and calls
finalize_test_targets("ck-dsl-provider") to register the prefixed
ctest targets the plan §6.2 I-11 advertises:
* ninja ck-dsl-provider-unit-check (label "unit_test", ~10 s)
* ninja ck-dsl-provider-integration-check (label "integration_test", ~9 s)
* ninja ck-dsl-provider-check (both lanes)
Plus the -verbose variants per the Tests.cmake pattern.
Tests split:
* tests/ 42 cases, all host-only or
GPU-gated, none requiring
the bake-off shape.
* integration_tests/ 1 case (BakeOffConv); the
end-to-end JIT-path
capstone from I-10.
pre-commit run over the full provider tree (every .cpp / .hpp / .py
/ CMakeLists.txt / .cmake / .json / .md / .in under
dnn-providers/ck-dsl-provider/) -- all hooks pass: trailing
whitespace, end-of-file, large-file, black, clang-format, cmake-lint.
WIP/STATUS.md updated to reflect M1 completion, including the
measured 131 TFLOPS on gfx950 (MI350-series) for the bake-off shape
and the M1.5 / M2 deferral list (frontend Graph API + .so loader
integration, autotuning, second op, on-disk cache).
Add an Architecture section with two Mermaid diagrams: a layered component view and an end-to-end request->compile->launch sequence. Both verified to render with mermaid-cli 11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert IntegrationGpuCkDslConvFp16 from a single hardcoded BakeOffConv TEST_F into a parameterized TEST_P over a ConvCase shape list. The original bake-off shape is preserved as the BakeOff case. The bake-off shape is fully tile-aligned (M=N*Ho*Wo, GEMM-N=K, GEMM-K=C*R*S all multiples of the kernel's 64-wide tile), leaving partial-tile boundary handling unverified. The shape set now adds tile-aligned variants (stride 2, 1x1, C/K=128, non-square R!=S, dilation 2) and partial-tile probes (partial GEMM-N, GEMM-K, GEMM-M, and all three at once). All 10 shapes pass on gfx950 with worst abs diff 6e-5..2.4e-4 against the CPU reference (5e-2 tolerance). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Non-functional cleanups in ck-dsl-provider; no change to the JIT or launch behaviour. Graph signature: - Remove the legacy FlatBuffer-walking GraphSignature::computeForConvFwd. The cache key now derives solely from the adapter-built ConvImplicitGemmSpec via computeForSpec, keeping the adapter the single FlatBuffer reader and removing a parallel read path that could drift. - Rewrite GraphSignatureTest around computeForSpec: per-field and codegen-knob sensitivity, position-aliasing, optional-knob presence, and an adapter-built-spec round-trip. - Clarify the SignatureHash doc and spell out the on-disk-cache (M3) precondition: dtype/arch/layout must join the key before entries can be persisted. Naming: - Rename badParam -> throwBadParam in ConvImplicitGemmAdapter to reflect that it is [[noreturn]]. Dead code (zero consumers, confirmed against the provider, tests, and the plugin ABI): - EmbeddedInterpreter::importCheck -- I-3 bring-up scaffolding, unused since the integration test landed. - HipModule::kind() and its write-only _kind member. - The unused HipModule::launch(std::vector, ...) overload. - CkDslConvImplicitGemmEngine::planBuilderForTesting() -- a test seam with no test. Kept deliberately as forward-compat per dsl_docs/hipdnn_provider/plan.md: KernelArtifact::isa (M3 disk cache) and the I64/F32/F16 launch-ABI surface (M2/M5 ops). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Design plan" section to the ck-dsl-provider README pointing at the CK DSL hipDNN Provider plan as the source of truth for milestone scope, the runtime embedded-Python architecture, and resolved design questions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… missing arch
Make the implicit-GEMM conv path arch-aware so a kernel is built and
validated for the device it will actually launch on, and refuse to
proceed when the arch cannot be determined.
- Add a DeviceArch detection module: resolve the device's bare gfx token
from the stream (falling back to the current device). Returns nullopt
only when no HIP device is visible (host-only); throws
DeviceArchDetectionError when a device IS present but its arch can't be
read (gcnArchName empty, property/ordinal query fails) -- guessing a
default would silently miscompile for the wrong target.
- Thread arch as an orthogonal compile target (not a spec field):
* GraphSignature::computeForSpec folds arch into the cache key so a
gfx950 module never aliases a gfx942 build.
* CompileServiceBridge gains an arch-aware isApplicable (consults the
DSL's is_valid_spec without compiling) and compile(op, payload,
arch); compile_service.py threads arch into build_implicit_gemm_conv
and validation.
- Fail closed instead of fail open on a missing arch:
* isApplicable declines (returns false) when no device is visible, and
logs+declines on a detection fault; it no longer accepts on the
structural verdict alone.
* buildPlan throws instead of defaulting to gfx950.
- Stop using exceptions for the normal "not for us" applicability path:
add a non-throwing tryBuildSpec seam so isApplicable composes plain
booleans, with a scoped backstop only for the genuinely exceptional
arch/bridge faults. buildPlan keeps the throwing extraction, where a
failure post-applicability is exceptional.
- Tests: arch-aware applicability + cross-arch compile coverage; device
arch threaded as a separate argument throughout.
Host-only and cross-compile unit tests pass; GPU/gfx950-gated tests skip
cleanly on non-gfx950 hardware.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…doc cleanup Follow-up refinements to the conv arch-threading work (dec459d), addressing review feedback: - DeviceArch: memoize the ordinal -> bare gfx token resolution behind a mutex-guarded process-wide cache so hipGetDeviceProperties runs at most once per device on the multi-threaded plan-finding path. Extract the feature-suffix strip into a public stripArchFeatureSuffix() so the pure string behaviour is unit-testable without a device, and warn when a stream's device can't be resolved and detection falls back to the current default device (observable in multi-GPU processes). - Add DeviceArchTest covering stripArchFeatureSuffix. - ConvImplicitGemmSpec: document the three dataclass fields intentionally omitted from the C++ mirror (lds_layout, k0_k1_split, groups) and how their defaults are reconstructed Python-side; tidy the knob-defaults note. Adapter: drop the stale "20 spec knobs" count. - ConvImplicitGemmPlanBuilder: document that tryBuildSpec's non-throwing guarantee is scoped to HipdnnPluginException, with the isApplicable backstop catching anything else. - compile_service: add return-type annotations (TYPE_CHECKING import for the forward-referenced spec type) and expand the is_applicable docstring on its cost profile. - ConvImplicitGemmPlanBuilderTest: drop redundant pipeline/epilogue assignments in makeExampleSpecForArch that already match the defaults. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The provider's validation and conv compile paths are already arch-aware (is_valid_spec / build_implicit_gemm_conv / compile_kernel all take a target arch and cross-compile gfx942/gfx950/gfx1151). Two things were still pinned to gfx950 by configuration rather than by any DSL limit: the smoke compile, and the production-default conv adapter knobs. Only the former can be lifted today, so: - compile_smoke now takes an arch and threads it to compile_kernel; the bridge's compileSmoke(arch) mirrors compile(). The smoke kernel is a plain FP16 copy (no MFMA atom), so it builds for every supported arch. Fix a latent broken import surfaced once these tests actually run: ck_dsl.instances.elementwise -> ck_dsl.instances.common.elementwise. - Add CK_DSL_PROVIDER_SKIP_IF_UNSUPPORTED_ARCH plus a single-source ckDslIsSupportedArch predicate (gfx942/gfx950/gfx1151) in TestUtils, and move the smoke suite (HipModuleSmoke, JitCacheSmoke, PerfMeasurementGpu) and ExecutesExampleShapeOnPresentDevice onto it -- they now run on any supported device instead of skipping off gfx950. - Add a device-free CompileSmokeHost test that compiles the smoke kernel for all three arches (comgr cross-compiles without the matching GPU). - Keep CK_DSL_PROVIDER_SKIP_IF_NOT_GFX950 only where it is still correct: the conv plan-builder tests that build the gfx950-tuned production default; rewrite its rationale (and the stale "gfx950-only" comments in compile_service, CompileServiceBridge, KernelArtifact) to say so -- this is an adapter-default limitation pending M2 autotuning, not a DSL ISA limit. Verified on gfx942: unit suite 56 passed, 3 skipped (exactly the gfx950-gated conv plan-builder tests), 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Motivation This fixes an issue with the labels where PRs would get a failure if no labels or we get a label thats not in our supported list. ## Technical Details - Removed label checking because the paths filter does pretty much the exact same thing. (cherry picked from commit f3cd76b)
2b7db7f to
0013b1a
Compare
…x942, gfx950, gfx1151)
Drops the WIP/ scratch directory (session status notes, prep findings, and the pybind11 rtld_local spike) that were unintentionally tracked on the provider branch and surfaced in the PR diff. These are local development artifacts, not part of the provider deliverable.
Provision a pinned, self-contained CPython from astral's python-build-standalone (PBS) for the embedded interpreter instead of depending on the host system Python. This removes the fragile /usr pin and the `pip install --user pybind11` prerequisite, and is the first step toward a relocatable, self-contained installable plugin. - cmake/CkDslPython.cmake (new): ck_dsl_provider_provision_python() honors -DCKDSL_PYTHON_DIST_DIR (air-gap / pre-stage) else FetchContents the pinned PBS install_only tarball with sha256 verification, then steers find_package(Python3) at the provisioned prefix. Asset/sha are pinned per platform (Linux x86_64 verified; aarch64 + Windows wired). - CMakeLists.txt: replace the /usr Python pin and the pip pybind11 probe with the provisioning include; pybind11 falls back to a pinned FetchContent when no CMake config is found. - EmbeddedInterpreter: pin PyConfig.home + executable to the bundled prefix so the bundled stdlib loads deterministically under the existing isolated config; executable path is platform-aware. - Bake the bundled prefix as kCkDslPythonHome in the generated paths header. - Tests: add provenance (stdlib resolves from the bundled prefix) and ctypes->comgr load checks. PBS statically links the runtime C-extension tail (_ctypes/_hashlib/ _bz2/_lzma) into libpython, so the hardened plugin loads them through the single libpython dependency with no separate-extension symbol issues. Verified on Linux/gfx1151: configure + build green, unit 62/62, integration 10/10 conv (real compile via PBS + comgr on-device). Windows: provisioning validation and the interpreter executable path are platform-aware but NOT yet verified on Windows; the bundled-DLL runtime search (no $ORIGIN on Windows) and install staging are still to come. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a proof-of-concept hipDNN engine plugin (
ck-dsl-provider) that compiles and runs kernels produced by the Composable Kernel Python DSL (ck_dsl). The plugin embeds a CPython interpreter, drives a Python compile service from C++ to produce HSACO plus launch metadata, caches results per graph signature, and ships one engine serving forward 2D implicit-GEMM convolution at FP16/NHWC. See the CK DSL hipDNN Provider plan.Risk Assessment
Low risk. This is a new, opt-in, self-contained provider under
dnn-providers/ck-dsl-provider/; it adds no behavior to existing code paths, public APIs, or default build/dispatch. The blast radius is limited to the new plugin and its own check targets, with PR CI pending.Testing Summary
Testing Checklist
ck_dsl_provider_tests- Status: Passedck_dsl_provider_integration_tests- ASICs: gfx950 - Status: Not runTechnical Changes
ck-dsl-providerhipDNN plugin with container/handle/context lifetime management, an engine + adapter + plan request pipeline, and a runtime layer (HipModule,KernelArtifact,LaunchAbi,JitCache).EmbeddedInterpreter+CompileServiceBridge) that invokesck_dsl_provider.compile_service, buildsck_dslspecs from a typed payload, and returns HSACO bytes with launch metadata.HipModule.