@vidyasagar-amd
Contributor

Proposed changes

This PR introduces new dispatching infrastructure for CK Tile, building on prior Tile Engine work, that lets users isolate and run specific kernels through C++ and Python APIs for GEMMs (universal, preshuffle, and multi-D variants). It also adds unified code-generation tools, GPU-architecture-based kernel filtering, a kernel registry handling mechanism, and a set of examples showing how to integrate with other frameworks (C++/Python), together with basic unit and integration tests.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

spolifroni-amd previously approved these changes Nov 26, 2025
Contributor

@spolifroni-amd spolifroni-amd left a comment

The readmes etc look fine. I can't tell if this is supposed to be used internally for those contributing to the project, or for anyone using the library. If it's for anyone, then a changelog entry is needed to talk about arch_specs.json. Otherwise this is fine.

Collaborator

@afagaj afagaj left a comment

It's worth adding a CHANGELOG.md entry to announce this change.

Contributor

Copilot AI left a comment

Pull request overview

Introduce CK Tile Dispatcher architecture with new C++ and Python APIs, codegen tooling, kernel registry, and comprehensive GEMM/Conv examples, plus basic validation and benchmarking.

  • Add GEMM and Convolution examples in Python and C++ demonstrating registry/dispatcher usage, validation, and benchmarking.
  • Provide shared Python utilities for convolution (conv_utils.py) and codegen assets (requirements, scripts).
  • Expand documentation (READMEs) for quick start and example overviews.

Reviewed changes

Copilot reviewed 74 out of 160 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `dispatcher/examples/gemm/python/README.md` | Adds quick start and Python GEMM examples overview and usage. |
| `dispatcher/examples/gemm/python/01_basic_gemm.py` | Basic GEMM example showing manual workflow (config, codegen, registry, dispatch). |
| `dispatcher/examples/gemm/python/02_batch_gemm.py` | Demonstrates running batches of GEMM problems with padding support. |
| `dispatcher/examples/gemm/python/03_benchmark.py` | Adds benchmarking script with warmup and TFLOPS reporting. |
| `dispatcher/examples/gemm/python/04_validation.py` | Validates GPU GEMM outputs against NumPy references with tolerances. |
| `dispatcher/examples/gemm/python/05_numpy_integration.py` | NumPy integration wrapper (GPUMatmul) and small demos. |
| `dispatcher/examples/gemm/python/06_json_export.py` | Exports registry/kernels metadata to JSON (and consumes C++ JSON if present). |
| `dispatcher/examples/gemm/python/07_preshuffle.py` | Preshuffle pipeline example with larger tiles and intrawave scheduler. |
| `dispatcher/examples/gemm/python/08_multi_d.py` | Multi-D fused GEMM example with CPU simulation and GPU base op timing. |
| `dispatcher/examples/gemm/python/09_multi_registry.py` | Multiple registries for compute/memory/latency optimized workloads. |
| `dispatcher/examples/gemm/cpp/README.md` | Adds C++ GEMM examples quick start and overview of example set. |
| `dispatcher/examples/gemm/cpp/01_basic_gemm.cpp` | Declarative kernel set example and simple GEMM run/verify. |
| `dispatcher/examples/gemm/cpp/02_multi_size.cpp` | Runs multiple problem sizes and reports TFLOPS. |
| `dispatcher/examples/gemm/cpp/03_benchmark.cpp` | Benchmark runner with stats (min/median/mean). |
| `dispatcher/examples/gemm/cpp/04_validation.cpp` | CPU reference and validation against GPU results. |
| `dispatcher/examples/gemm/cpp/05_heuristics.cpp` | Custom heuristic-based kernel selection demonstration. |
| `dispatcher/examples/gemm/cpp/06_json_export.cpp` | Registry JSON export and kernel set declarations. |
| `dispatcher/examples/gemm/cpp/07_preshuffle.cpp` | Preshuffle GEMM example with verification. |
| `dispatcher/examples/gemm/cpp/08_multi_d.cpp` | Multi-D fused concept demo using standard GEMM run. |
| `dispatcher/examples/gemm/cpp/09_multi_registry.cpp` | Multiple registries and dispatchers plus summary pattern. |
| `dispatcher/examples/conv/python/conv_utils.py` | Core Python utilities for conv signature/algorithm/arch, codegen, runners, and validation. |
| `dispatcher/examples/conv/python/README.md` | Adds Python Conv examples quick start and detailed guide. |
| `dispatcher/examples/conv/python/02_conv2d_fwd.py` | 2D forward conv example with optional CPU verification and GPU run. |
| `dispatcher/examples/conv/python/03_conv3d_fwd.py` | 3D forward conv example with CPU reference and GPU run. |
| `dispatcher/examples/conv/python/04_conv2d_bwd_data.py` | 2D backward data conv with CPU reference and optional GPU execution. |
| `dispatcher/examples/conv/python/05_conv2d_bwd_weight.py` | 2D backward weight conv with CPU reference and optional GPU execution. |
| `dispatcher/examples/conv/python/06_benchmark.py` | Conv benchmarking across small problems; optional CPU reference. |
| `dispatcher/examples/conv/python/07_validation.py` | Conv validation suite vs CPU with detailed analysis. |
| `dispatcher/examples/conv/python/08_json_export.py` | Export conv registry to JSON with basic stats and examples. |
| `dispatcher/examples/conv/python/09_multi_registry.py` | Multiple registries for conv workloads with selection heuristics. |
| `dispatcher/examples/conv/python/10_conv3d_forward.py` | 3D forward conv GPU runner with timing and TFLOPS. |
| `dispatcher/examples/conv/python/11_bwd_data.py` | Backward data API demo with runners (note: codegen in progress). |
| `dispatcher/examples/conv/python/12_bwd_weight.py` | Backward weight API demo with dedicated runner (separate lib). |
| `dispatcher/codegen/requirements.txt` | Adds Python deps for codegen and optional tooling. |
| `dispatcher/codegen/minimal_test_config.json` | Minimal tile/trait config for test kernel generation. |
| `dispatcher/codegen/generate_test_kernels.sh` | Shell script to generate a minimal set of CK Tile kernels. |

@tenpercent
Contributor

tenpercent commented Dec 1, 2025

Thanks! A few notes while I'm just starting to look -

  • all cmake flags seem to be necessary, otherwise it quietly completes the build without building the examples, which is a bit surprising
  • maybe use Ninja instead of GNU Make if we want to encourage that
  • There is a fair number of legit warnings when you build the C++ example. Consider adding -Werror to the default clang flags
  • make sure the binaries and python scripts correctly process --help flag and output useful info
  • make conv example consistent with gemm
  • for the functionality you don't want to be broken by changes in CK APIs, add building and testing to the CI
  • at least all instructions in READMEs need to be manually verified

| CMake | 3.16+ | `cmake --version` |
| Python | 3.8+ | `python3 --version` |
| NumPy | Any | `pip show numpy` |
| hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` |
Contributor

I think some time around ROCm 6.4 we were encouraged to switch from hipcc to the clang bundled with the ROCm distribution

- **gfx942** - MI300X, MI300A (Instinct MI300 series) ← Recommended
- **gfx950** - MI350 series
- **gfx90a** - MI200 series (MI250, MI250X)
- **gfx1201** - RDNA4 series
Contributor

make sure this is consistent with the examples. I think currently the printed messages are about gfx11

Contributor Author

added gfx11


```bash
# Install NumPy (required for Python examples)
pip install numpy
```
Contributor

one more good practice to encourage - use uv venv for creating virtual environments and uv pip for the packages

Contributor Author

added note

@vidyasagar-amd vidyasagar-amd force-pushed the builder-dispatch-tile-gemm branch from 6d3b286 to 3c7d547 Compare December 3, 2025 21:45
float actual = static_cast<float>(c_host[0]);
bool passed = std::abs(actual - expected) < 1.0f;

std::cout << " C[0,0] = " << actual << " (expected " << expected << ")\n";
Contributor

when doing the correctness checks, let's check all elements, not just one? also since you're initializing the tensor with 1's I think the comparison can be exact in this case

Contributor Author

fixed
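
For illustration, the suggestion above could look roughly like the following. This is a minimal sketch, not the code from the PR: `count_mismatches` is a hypothetical helper, and it assumes the output has already been copied to the host and converted to `float`. With A and B filled with ones, every element of C should equal K, and integers up to 2048 are exactly representable in FP16, so an exact comparison is reasonable for K in that range.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts how many elements of the host result differ from `expected`.
// The comparison is exact; a NaN element never equals `expected`,
// so it is counted as a mismatch.
std::size_t count_mismatches(const std::vector<float>& c_host, float expected)
{
    return static_cast<std::size_t>(
        std::count_if(c_host.begin(), c_host.end(), [expected](float v) {
            return v != expected;
        }));
}
```

A caller would then treat the run as passed only when `count_mismatches(c_host, static_cast<float>(K)) == 0`, which covers the whole output rather than just `C[0,0]`.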

multi_d,
.add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 1), // 1 D tensor
Algorithm().tile(128, 128, 32))
.add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 2), // 2 D tensors
Contributor

with multi-d, the algorithm must know the D tensors' layouts I believe

Contributor Author

Fixed, please check if it is consistent now

c_dev.copy_to_host(c_host.data());
float expected = static_cast<float>(test.K);
// Use 1% relative tolerance for FP16 accumulation over K elements
if(std::abs(static_cast<float>(c_host[0]) - expected) > (0.01f * expected + 1.0f))
Contributor

also for the correctness check it would be nice to set atol/rtol and not hardcode them

Contributor Author

fixed
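
A sketch of what configurable tolerances might look like, using the same `|actual - expected| <= atol + rtol * |expected|` convention as `numpy.isclose`. The `Tolerance` struct and the `is_close` and `all_close` helpers are hypothetical names, not part of the PR, and the buffer is assumed to be already converted to `float` on the host.

```cpp
#include <cmath>
#include <vector>

// Hypothetical tolerance bundle so callers can tune the check per data type.
struct Tolerance
{
    float atol; // absolute tolerance
    float rtol; // relative tolerance, scaled by |expected|
};

// True when |actual - expected| <= atol + rtol * |expected|.
// NaN inputs compare false, so they are reported as failures.
bool is_close(float actual, float expected, Tolerance tol)
{
    return std::abs(actual - expected) <= tol.atol + tol.rtol * std::abs(expected);
}

// Checks every element of the host buffer against a single expected value.
bool all_close(const std::vector<float>& c_host, float expected, Tolerance tol)
{
    for(float v : c_host)
    {
        if(!is_close(v, expected, tol))
            return false;
    }
    return true;
}
```

Calling `all_close(c_host, expected, Tolerance{1.0f, 0.01f})` would reproduce the hardcoded `0.01f * expected + 1.0f` bound from the snippet above, while letting each test pick tighter or looser values.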


bool has(const std::string& name) const { return sets_.find(name) != sets_.end(); }

std::vector<std::string> names() const { return order_; }
Contributor

returns a deep copy if that matters

Contributor Author

fixed
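
One way to avoid the copy, sketched under assumptions: `KernelSetRegistry`, its `add` method, and the `int` payload are hypothetical stand-ins; only `has`, `names`, `sets_`, and `order_` appear in the snippet above. Returning a `const` reference skips copying the vector and its strings. The reference stays valid as long as the registry object lives, but iterators and element references obtained through it can be invalidated by a later insertion.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry keeping insertion order alongside a lookup map.
class KernelSetRegistry
{
public:
    void add(const std::string& name, int kernel_set)
    {
        // Record the name only on first insertion so order_ has no duplicates.
        if(sets_.emplace(name, kernel_set).second)
            order_.push_back(name);
    }

    bool has(const std::string& name) const { return sets_.find(name) != sets_.end(); }

    // Returns a reference instead of copying the vector and all its strings.
    const std::vector<std::string>& names() const { return order_; }

private:
    std::unordered_map<std::string, int> sets_;
    std::vector<std::string> order_;
};
```

Callers that need an independent snapshot can still write `auto copy = registry.names();`, so the copy becomes opt-in rather than the default.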

endif()

# Generate source
file(WRITE ${OUTPUT_FILE} "// Auto-generated kernel: ${KERNEL_NAME}
Contributor

should we do the generation in a python script instead? minimizing the cmake logic to invoking the python scripts

using Priority = ck_tile::dispatcher::Registry::Priority;

// Global dispatcher (initialized once)
static Dispatcher* g_dispatcher = nullptr;
Contributor

maybe use a smart pointer (shared)

Contributor Author

fixed
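
A sketch of the shared-pointer variant, assuming a default-constructible `Dispatcher`. The stub type and the `get_dispatcher` accessor below are hypothetical and only stand in for whatever the real initialization does. A function-local static is initialized exactly once, and that initialization is thread-safe since C++11.

```cpp
#include <memory>

// Stand-in for the real dispatcher type; hypothetical, for illustration only.
struct Dispatcher
{
    int kernels_registered = 0;
};

// Lazily creates the dispatcher on first use and shares ownership with callers,
// replacing the raw `static Dispatcher* g_dispatcher = nullptr;` global.
std::shared_ptr<Dispatcher> get_dispatcher()
{
    static const std::shared_ptr<Dispatcher> instance = std::make_shared<Dispatcher>();
    return instance;
}
```

Every caller receives a handle to the same object, and the dispatcher is destroyed automatically once the static and all outstanding handles are gone, so no manual `delete` or null check is needed.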

/**
* @brief GPU timing using HIP events
*/
class GpuTimer
Contributor

this should operate on the stream the kernel is getting launched on

Contributor Author

fixed

Collaborator

@afagaj afagaj left a comment

This is major new functionality, so it warrants adding a CHANGELOG.md entry.

e.g. something like
Introduce CK-Tile Dispatcher - a unified kernel dispatch system for AMD GPUs with C++ and Python frontends.

6 participants