Adding dispatcher architecture #3300

vidyasagar-amd · 2025-11-26T05:00:56Z

Proposed changes

This PR introduces new dispatching infrastructure for CK Tile, building on prior Tile Engine work. It lets users isolate and run specific GEMM kernels (universal, preshuffle, and multi-D variants) through C++ and Python APIs. It also adds unified code-generation tools, GPU-architecture-based kernel filtering, a kernel registry, and a set of examples showing how to integrate with other frameworks (C++/Python), together with basic unit and integration tests.
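To make the moving parts concrete, here is a toy Python model of a kernel registry with architecture-based filtering and a selection step. Every name in it (`KernelEntry`, `Registry`, `select`) and the "largest tile that fits" heuristic are hypothetical and are not the API added by this PR; the sketch only illustrates the idea of filtering registered kernels by GPU target before choosing one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KernelEntry:
    """Hypothetical record describing one generated kernel."""
    name: str
    dtype: str
    tile: tuple        # (tile_m, tile_n, tile_k)
    archs: frozenset   # GPU targets the kernel was built for


@dataclass
class Registry:
    """Hypothetical registry: stores kernels and filters them by arch."""
    kernels: list = field(default_factory=list)

    def register(self, entry):
        self.kernels.append(entry)

    def for_arch(self, arch):
        return [k for k in self.kernels if arch in k.archs]


def select(registry, arch, dtype, m, n):
    """Assumed heuristic: pick the largest tile that fits the problem."""
    candidates = [
        k for k in registry.for_arch(arch)
        if k.dtype == dtype and k.tile[0] <= m and k.tile[1] <= n
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda k: k.tile[0] * k.tile[1])


registry = Registry()
registry.register(KernelEntry(
    "gemm_fp16_128x128x32", "fp16", (128, 128, 32),
    frozenset({"gfx942", "gfx950"})))
registry.register(KernelEntry(
    "gemm_fp16_256x256x64", "fp16", (256, 256, 64),
    frozenset({"gfx942"})))

chosen = select(registry, "gfx950", "fp16", 1024, 1024)
print(chosen.name if chosen else "no kernel")
```

Keeping the arch filter inside the registry means the selection heuristic never sees kernels that cannot run on the target.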

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

spolifroni-amd

The readmes etc look fine. I can't tell if this is supposed to be used internally for those contributing to the project, or for anyone using the library. If it's for anyone, then a changelog entry is needed to talk about arch_specs.json. Otherwise this is fine.

afagaj

It's worth adding a CHANGELOG.md entry to announce this change.

Copilot

Pull request overview

Introduce CK Tile Dispatcher architecture with new C++ and Python APIs, codegen tooling, kernel registry, and comprehensive GEMM/Conv examples, plus basic validation and benchmarking.

Add GEMM and Convolution examples in Python and C++ demonstrating registry/dispatcher usage, validation, and benchmarking.
Provide shared Python utilities for convolution (conv_utils.py) and codegen assets (requirements, scripts).
Expand documentation (READMEs) for quick start and example overviews.

Reviewed changes

Copilot reviewed 74 out of 160 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
dispatcher/examples/gemm/python/README.md	Adds quick start and Python GEMM examples overview and usage.
dispatcher/examples/gemm/python/01_basic_gemm.py	Basic GEMM example showing manual workflow (config, codegen, registry, dispatch).
dispatcher/examples/gemm/python/02_batch_gemm.py	Demonstrates running batches of GEMM problems with padding support.
dispatcher/examples/gemm/python/03_benchmark.py	Adds benchmarking script with warmup and TFLOPS reporting.
dispatcher/examples/gemm/python/04_validation.py	Validates GPU GEMM outputs against NumPy references with tolerances.
dispatcher/examples/gemm/python/05_numpy_integration.py	NumPy integration wrapper (GPUMatmul) and small demos.
dispatcher/examples/gemm/python/06_json_export.py	Exports registry/kernels metadata to JSON (and consumes C++ JSON if present).
dispatcher/examples/gemm/python/07_preshuffle.py	Preshuffle pipeline example with larger tiles and intrawave scheduler.
dispatcher/examples/gemm/python/08_multi_d.py	Multi-D fused GEMM example with CPU simulation and GPU base op timing.
dispatcher/examples/gemm/python/09_multi_registry.py	Multiple registries for compute/memory/latency optimized workloads.
dispatcher/examples/gemm/cpp/README.md	Adds C++ GEMM examples quick start and overview of example set.
dispatcher/examples/gemm/cpp/01_basic_gemm.cpp	Declarative kernel set example and simple GEMM run/verify.
dispatcher/examples/gemm/cpp/02_multi_size.cpp	Runs multiple problem sizes and reports TFLOPS.
dispatcher/examples/gemm/cpp/03_benchmark.cpp	Benchmark runner with stats (min/median/mean).
dispatcher/examples/gemm/cpp/04_validation.cpp	CPU reference and validation against GPU results.
dispatcher/examples/gemm/cpp/05_heuristics.cpp	Custom heuristic-based kernel selection demonstration.
dispatcher/examples/gemm/cpp/06_json_export.cpp	Registry JSON export and kernel set declarations.
dispatcher/examples/gemm/cpp/07_preshuffle.cpp	Preshuffle GEMM example with verification.
dispatcher/examples/gemm/cpp/08_multi_d.cpp	Multi-D fused concept demo using standard GEMM run.
dispatcher/examples/gemm/cpp/09_multi_registry.cpp	Multiple registries and dispatchers plus summary pattern.
dispatcher/examples/conv/python/conv_utils.py	Core Python utilities for conv signature/algorithm/arch, codegen, runners, and validation.
dispatcher/examples/conv/python/README.md	Adds Python Conv examples quick start and detailed guide.
dispatcher/examples/conv/python/02_conv2d_fwd.py	2D forward conv example with optional CPU verification and GPU run.
dispatcher/examples/conv/python/03_conv3d_fwd.py	3D forward conv example with CPU reference and GPU run.
dispatcher/examples/conv/python/04_conv2d_bwd_data.py	2D backward data conv with CPU reference and optional GPU execution.
dispatcher/examples/conv/python/05_conv2d_bwd_weight.py	2D backward weight conv with CPU reference and optional GPU execution.
dispatcher/examples/conv/python/06_benchmark.py	Conv benchmarking across small problems; optional CPU reference.
dispatcher/examples/conv/python/07_validation.py	Conv validation suite vs CPU with detailed analysis.
dispatcher/examples/conv/python/08_json_export.py	Export conv registry to JSON with basic stats and examples.
dispatcher/examples/conv/python/09_multi_registry.py	Multiple registries for conv workloads with selection heuristics.
dispatcher/examples/conv/python/10_conv3d_forward.py	3D forward conv GPU runner with timing and TFLOPS.
dispatcher/examples/conv/python/11_bwd_data.py	Backward data API demo with runners (note: codegen in progress).
dispatcher/examples/conv/python/12_bwd_weight.py	Backward weight API demo with dedicated runner (separate lib).
dispatcher/codegen/requirements.txt	Adds Python deps for codegen and optional tooling.
dispatcher/codegen/minimal_test_config.json	Minimal tile/trait config for test kernel generation.
dispatcher/codegen/generate_test_kernels.sh	Shell script to generate a minimal set of CK Tile kernels.


dispatcher/examples/gemm/python/05_numpy_integration.py

dispatcher/examples/conv/python/conv_utils.py

dispatcher/examples/conv/python/10_conv3d_forward.py

dispatcher/examples/conv/python/06_benchmark.py

dispatcher/examples/gemm/python/README.md

dispatcher/examples/gemm/cpp/README.md

dispatcher/examples/conv/python/README.md

dispatcher/examples/conv/python/conv_utils.py

tenpercent · 2025-12-01T20:13:56Z

Thanks! A few notes while I'm just starting to look -

all cmake flags seem to be necessary, otherwise it quietly completes the build without building the examples, which is a bit surprising
maybe use Ninja instead of GNU Make if we want to encourage that
There is a fair number of legit warnings when you build the C++ example. Consider adding -Werror to the default clang flags
make sure the binaries and python scripts correctly process --help flag and output useful info
make conv example consistent with gemm
for the functionality you don't want to be broken by changes in CK APIs, add building and testing to the CI
at least all instructions in READMEs need to be manually verified
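For the `--help` point on the Python side, `argparse` generates the help text from the declared options, so a script that declares its flags this way gets useful output for free. A minimal sketch; the program name and the options (`--arch`, `--dtype`, `--size`) are assumptions for illustration, not the flags the example scripts actually accept.

```python
import argparse


def build_parser():
    """Declare the options once; argparse derives --help from them."""
    parser = argparse.ArgumentParser(
        prog="basic_gemm_example",
        description="Run a basic GEMM through the dispatcher (illustrative).",
    )
    parser.add_argument("--arch", default="gfx942",
                        help="GPU target, e.g. gfx942 or gfx950")
    parser.add_argument("--dtype", default="fp16", choices=["fp16", "bf16"],
                        help="element type of the inputs")
    parser.add_argument("--size", type=int, default=1024,
                        help="problem size used for M, N and K")
    return parser


# An explicit argument list keeps the demo independent of sys.argv.
args = build_parser().parse_args(["--arch", "gfx950", "--size", "512"])
print(f"arch={args.arch} dtype={args.dtype} size={args.size}")

# The same text is printed when the script is run with --help.
help_text = build_parser().format_help()
```

In a real script, calling `parse_args()` with no arguments reads `sys.argv`, and `--help` prints the generated text and exits with status 0.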

dispatcher/README.md

tenpercent · 2025-12-01T20:25:21Z

dispatcher/README.md

+| CMake | 3.16+ | `cmake --version` |
+| Python | 3.8+ | `python3 --version` |
+| NumPy | Any | `pip show numpy` |
+| hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` |


I think some time around ROCm 6.4 we were encouraged to switch from hipcc to the clang bundled with the ROCm distribution

dispatcher/README.md

tenpercent · 2025-12-01T20:26:23Z

dispatcher/README.md

+- **gfx942** - MI300X, MI300A (Instinct MI300 series) ← Recommended
+- **gfx950** - MI350 series
+- **gfx90a** - MI200 series (MI250, MI250X)
+- **gfx1201** - RDNA4 series


make sure this is consistent with the examples. I think currently the printed messages are about gfx11

added gfx11

tenpercent · 2025-12-01T20:27:21Z

dispatcher/README.md

+
+```bash
+# Install NumPy (required for Python examples)
+pip install numpy


one more good practice to encourage - use uv venv for creating virtual environments and uv pip for the packages


tenpercent · 2025-12-03T22:22:37Z

dispatcher/examples/gemm/cpp/01_basic_gemm.cpp

+    float actual   = static_cast<float>(c_host[0]);
+    bool passed    = std::abs(actual - expected) < 1.0f;
+
+    std::cout << "  C[0,0] = " << actual << " (expected " << expected << ")\n";


when doing the correctness checks, let's check all elements, not just one? also since you're initializing the tensor with 1's I think the comparison can be exact in this case
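A whole-matrix check along these lines could look as follows. This is a pure-Python sketch of the idea, not the example's code. With A and B filled with ones, every element of C equals K, and integers up to 2048 are exactly representable in FP16, so for such K the comparison needs no tolerance. The helper names `find_mismatches` and `reference_gemm` are hypothetical.

```python
def find_mismatches(c, expected, limit=5):
    """Return up to `limit` (row, col, value) triples where c != expected."""
    mismatches = []
    for i, row in enumerate(c):
        for j, value in enumerate(row):
            if value != expected:
                mismatches.append((i, j, value))
                if len(mismatches) == limit:
                    return mismatches
    return mismatches


def reference_gemm(a, b):
    """Naive row-major GEMM, standing in here for the GPU result."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


M, N, K = 4, 3, 8
a = [[1.0] * K for _ in range(M)]
b = [[1.0] * N for _ in range(K)]
c = reference_gemm(a, b)

bad = find_mismatches(c, float(K))
print("passed" if not bad else f"failed, first mismatches: {bad}")
```

Reporting the first few mismatching indices, rather than a single pass/fail flag, makes it easier to spot patterns such as a wrong tile boundary.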

tenpercent · 2025-12-03T22:26:07Z

dispatcher/examples/gemm/cpp/08_multi_d.cpp

+    multi_d,
+    .add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 1), // 1 D tensor
+         Algorithm().tile(128, 128, 32))
+        .add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 2), // 2 D tensors


with multi-d, the algorithm must know the D tensors' layouts I believe

https://github.com/ROCm/composable_kernel/blob/develop/example/ck_tile/19_gemm_multi_d/gemm_multi_d_fp16.cpp#L113

Fixed, please check if it is consistent now


tenpercent · 2025-12-03T22:34:42Z

dispatcher/examples/gemm/cpp/09_multi_registry.cpp

+        c_dev.copy_to_host(c_host.data());
+        float expected = static_cast<float>(test.K);
+        // Use 1% relative tolerance for FP16 accumulation over K elements
+        if(std::abs(static_cast<float>(c_host[0]) - expected) > (0.01f * expected + 1.0f))


also for the correctness check it would be nice to set atol/rtol and not hardcode them
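One way to avoid hardcoded tolerances is to pass them as parameters with per-dtype defaults. The sketch below is pure Python and uses the same `atol + rtol * |expected|` form as the snippet above, which has `rtol = 0.01` and `atol = 1.0` baked in. The `DEFAULT_TOLERANCES` table and its values are illustrative assumptions, not tuned numbers from the PR.

```python
# Illustrative defaults only; real values should follow from the data type
# and accumulation depth of the kernel under test.
DEFAULT_TOLERANCES = {
    "fp16": {"rtol": 1e-2, "atol": 1e-2},
    "fp32": {"rtol": 1e-5, "atol": 1e-6},
}


def is_close(actual, expected, rtol, atol):
    """True when |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual, expected, dtype="fp16", rtol=None, atol=None):
    """Compare two flat sequences element-wise with overridable tolerances."""
    defaults = DEFAULT_TOLERANCES[dtype]
    rtol = defaults["rtol"] if rtol is None else rtol
    atol = defaults["atol"] if atol is None else atol
    if len(actual) != len(expected):
        return False
    return all(is_close(x, y, rtol, atol) for x, y in zip(actual, expected))


expected = [512.0] * 4
actual = [512.0, 512.5, 511.0, 512.0]

print(all_close(actual, expected))                      # fp16 defaults
print(all_close(actual, expected, rtol=0.0, atol=0.0))  # exact comparison
```

Setting both tolerances to zero turns the same helper into the exact comparison suggested for the all-ones inputs.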

dispatcher/include/ck_tile/dispatcher/kernel_decl.hpp

tenpercent · 2025-12-03T22:48:41Z

dispatcher/include/ck_tile/dispatcher/kernel_decl.hpp

+
+    bool has(const std::string& name) const { return sets_.find(name) != sets_.end(); }
+
+    std::vector<std::string> names() const { return order_; }


returns a deep copy if that matters

tenpercent · 2025-12-03T23:02:27Z

dispatcher/cmake/DeclarativeKernels.cmake

+    endif()
+
+    # Generate source
+    file(WRITE ${OUTPUT_FILE} "// Auto-generated kernel: ${KERNEL_NAME}


should we do the generation in a python script instead? minimizing the cmake logic to invoking the python scripts
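A sketch of what that split might look like: a small Python script writes the kernel source, and CMake only invokes it (for example through `add_custom_command`). The template body is a placeholder, since the snippet above shows only the first line of the generated file, and the function name and command-line flags are hypothetical.

```python
import argparse
import tempfile
from pathlib import Path
from string import Template

# Placeholder template: only the header line comes from the CMake snippet;
# the second line stands in for whatever the real generator emits.
KERNEL_TEMPLATE = Template(
    "// Auto-generated kernel: $kernel_name\n"
    "// (kernel instantiation would follow here)\n"
)


def write_kernel_source(kernel_name, output_file):
    """Render the template and write it, creating parent directories."""
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(KERNEL_TEMPLATE.substitute(kernel_name=kernel_name))
    return output_path


def main(argv):
    """Entry point that CMake would call with --kernel-name and --output."""
    parser = argparse.ArgumentParser(description="Generate one kernel source.")
    parser.add_argument("--kernel-name", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return write_kernel_source(args.kernel_name, args.output)


# Demo: generate into a temporary directory instead of the build tree.
with tempfile.TemporaryDirectory() as tmp:
    generated = main(["--kernel-name", "gemm_fp16_rcr",
                      "--output", f"{tmp}/gen/gemm_fp16_rcr.cpp"])
    first_line = generated.read_text().splitlines()[0]
print(first_line)
```

With this layout the CMake side shrinks to declaring the output file and its dependency on the script, and the generator can be unit-tested without configuring a build.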

tenpercent · 2025-12-03T23:14:16Z

dispatcher/bindings/ctypes/gemm_ctypes_lib.cpp

+using Priority = ck_tile::dispatcher::Registry::Priority;
+
+// Global dispatcher (initialized once)
+static Dispatcher* g_dispatcher = nullptr;


maybe use a smart pointer (shared)


tenpercent · 2025-12-04T04:03:47Z

dispatcher/include/ck_tile/dispatcher/utils.hpp

+/**
+ * @brief GPU timing using HIP events
+ */
+class GpuTimer


this should operate on the stream the kernel is getting launched on

afagaj

This is major new functionality, so it warrants adding a CHANGELOG.md entry.

e.g. something like
Introduce CK-Tile Dispatcher - a unified kernel dispatch system for AMD GPUs with C++ and Python frontends.

vidyasagar-amd requested review from a team, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, ddembeckAMD, geyyer, illsilin, poyenc, qianfengz, shumway and tenpercent as code owners November 26, 2025 05:00

spolifroni-amd previously approved these changes Nov 26, 2025

afagaj reviewed Nov 26, 2025

vidyasagar-amd dismissed spolifroni-amd’s stale review via 34c5579 November 28, 2025 19:16

vidyasagar-amd requested a review from Copilot December 1, 2025 17:46

Copilot AI reviewed Dec 1, 2025

tenpercent reviewed Dec 1, 2025

dispatcher/README.md Outdated

tenpercent reviewed Dec 1, 2025

dispatcher/README.md Outdated

tenpercent reviewed Dec 1, 2025

dispatcher/README.md Outdated

tenpercent reviewed Dec 1, 2025

dispatcher/README.md Outdated

vidyasagar-amd added 13 commits December 3, 2025 21:43

Dispatcher cleanup and updates.

7c8bdc0

Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme

Fixes to python paths

59d2240

Cleaning up code

443352b

Improving dispatcher support for different arch

d674647

Fixing typos

Fix formatting errors

620fcd2

Cleaning up examples

e6b3043

Improving codegeneration

daa93bf

Improving and fixing C++ examples

3042946

Adding conv functionality (fwd,bwd,bwdw) and examples.

5377447

Fixes based on feedback.

a838b25

Further fixes based on feedback.

3fca468

Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.

05704bd

Another round of improvements based on feedback.

3c7d547

vidyasagar-amd force-pushed the builder-dispatch-tile-gemm branch from 6d3b286 to 3c7d547 Compare December 3, 2025 21:45

tenpercent reviewed Dec 3, 2025

dispatcher/examples/gemm/cpp/08_multi_d.cpp Outdated

tenpercent reviewed Dec 3, 2025

dispatcher/include/ck_tile/dispatcher/kernel_decl.hpp Outdated

tenpercent reviewed Dec 3, 2025

dispatcher/include/ck_tile/dispatcher/kernel_decl.hpp Outdated

Trimming out unnecessary code.

22f3538

tenpercent reviewed Dec 3, 2025


vidyasagar-amd added 2 commits December 3, 2025 23:56

Fixing the multi-D implementation.

4f48456

Using gpu verification for gemms and fixing convolutions tflops calculation.

9930283

tenpercent reviewed Dec 4, 2025


Fix counter usage issue and arch filtering per ops.

1366a26

afagaj reviewed Dec 4, 2025


Adding changelog and other fixes.

152193e

