Adding dispatcher architecture #3300
base: develop
spolifroni-amd
left a comment
The readmes etc look fine. I can't tell if this is supposed to be used internally for those contributing to the project, or for anyone using the library. If it's for anyone, then a changelog entry is needed to talk about arch_specs.json. Otherwise this is fine.
afagaj
left a comment
It's worth adding a CHANGELOG.md entry to announce this change.
Pull request overview
Introduce CK Tile Dispatcher architecture with new C++ and Python APIs, codegen tooling, kernel registry, and comprehensive GEMM/Conv examples, plus basic validation and benchmarking.
- Add GEMM and Convolution examples in Python and C++ demonstrating registry/dispatcher usage, validation, and benchmarking.
- Provide shared Python utilities for convolution (conv_utils.py) and codegen assets (requirements, scripts).
- Expand documentation (READMEs) for quick start and example overviews.
Reviewed changes
Copilot reviewed 74 out of 160 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| dispatcher/examples/gemm/python/README.md | Adds quick start and Python GEMM examples overview and usage. |
| dispatcher/examples/gemm/python/01_basic_gemm.py | Basic GEMM example showing manual workflow (config, codegen, registry, dispatch). |
| dispatcher/examples/gemm/python/02_batch_gemm.py | Demonstrates running batches of GEMM problems with padding support. |
| dispatcher/examples/gemm/python/03_benchmark.py | Adds benchmarking script with warmup and TFLOPS reporting. |
| dispatcher/examples/gemm/python/04_validation.py | Validates GPU GEMM outputs against NumPy references with tolerances. |
| dispatcher/examples/gemm/python/05_numpy_integration.py | NumPy integration wrapper (GPUMatmul) and small demos. |
| dispatcher/examples/gemm/python/06_json_export.py | Exports registry/kernels metadata to JSON (and consumes C++ JSON if present). |
| dispatcher/examples/gemm/python/07_preshuffle.py | Preshuffle pipeline example with larger tiles and intrawave scheduler. |
| dispatcher/examples/gemm/python/08_multi_d.py | Multi-D fused GEMM example with CPU simulation and GPU base op timing. |
| dispatcher/examples/gemm/python/09_multi_registry.py | Multiple registries for compute/memory/latency optimized workloads. |
| dispatcher/examples/gemm/cpp/README.md | Adds C++ GEMM examples quick start and overview of example set. |
| dispatcher/examples/gemm/cpp/01_basic_gemm.cpp | Declarative kernel set example and simple GEMM run/verify. |
| dispatcher/examples/gemm/cpp/02_multi_size.cpp | Runs multiple problem sizes and reports TFLOPS. |
| dispatcher/examples/gemm/cpp/03_benchmark.cpp | Benchmark runner with stats (min/median/mean). |
| dispatcher/examples/gemm/cpp/04_validation.cpp | CPU reference and validation against GPU results. |
| dispatcher/examples/gemm/cpp/05_heuristics.cpp | Custom heuristic-based kernel selection demonstration. |
| dispatcher/examples/gemm/cpp/06_json_export.cpp | Registry JSON export and kernel set declarations. |
| dispatcher/examples/gemm/cpp/07_preshuffle.cpp | Preshuffle GEMM example with verification. |
| dispatcher/examples/gemm/cpp/08_multi_d.cpp | Multi-D fused concept demo using standard GEMM run. |
| dispatcher/examples/gemm/cpp/09_multi_registry.cpp | Multiple registries and dispatchers plus summary pattern. |
| dispatcher/examples/conv/python/conv_utils.py | Core Python utilities for conv signature/algorithm/arch, codegen, runners, and validation. |
| dispatcher/examples/conv/python/README.md | Adds Python Conv examples quick start and detailed guide. |
| dispatcher/examples/conv/python/02_conv2d_fwd.py | 2D forward conv example with optional CPU verification and GPU run. |
| dispatcher/examples/conv/python/03_conv3d_fwd.py | 3D forward conv example with CPU reference and GPU run. |
| dispatcher/examples/conv/python/04_conv2d_bwd_data.py | 2D backward data conv with CPU reference and optional GPU execution. |
| dispatcher/examples/conv/python/05_conv2d_bwd_weight.py | 2D backward weight conv with CPU reference and optional GPU execution. |
| dispatcher/examples/conv/python/06_benchmark.py | Conv benchmarking across small problems; optional CPU reference. |
| dispatcher/examples/conv/python/07_validation.py | Conv validation suite vs CPU with detailed analysis. |
| dispatcher/examples/conv/python/08_json_export.py | Export conv registry to JSON with basic stats and examples. |
| dispatcher/examples/conv/python/09_multi_registry.py | Multiple registries for conv workloads with selection heuristics. |
| dispatcher/examples/conv/python/10_conv3d_forward.py | 3D forward conv GPU runner with timing and TFLOPS. |
| dispatcher/examples/conv/python/11_bwd_data.py | Backward data API demo with runners (note: codegen in progress). |
| dispatcher/examples/conv/python/12_bwd_weight.py | Backward weight API demo with dedicated runner (separate lib). |
| dispatcher/codegen/requirements.txt | Adds Python deps for codegen and optional tooling. |
| dispatcher/codegen/minimal_test_config.json | Minimal tile/trait config for test kernel generation. |
| dispatcher/codegen/generate_test_kernels.sh | Shell script to generate a minimal set of CK Tile kernels. |
Thanks! A few notes while I'm just starting to look -
| CMake | 3.16+ | `cmake --version` |
| Python | 3.8+ | `python3 --version` |
| NumPy | Any | `pip show numpy` |
| hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` |
I think some time around ROCm 6.4 we were encouraged to switch from hipcc to the clang bundled with the ROCm distribution.
dispatcher/README.md
Outdated
- **gfx942** - MI300X, MI300A (Instinct MI300 series) ← Recommended
- **gfx950** - MI350 series
- **gfx90a** - MI200 series (MI250, MI250X)
- **gfx1201** - RDNA4 series
make sure this is consistent with the examples. I think currently the printed messages are about gfx11
added gfx11
```bash
# Install NumPy (required for Python examples)
pip install numpy
```
one more good practice to encourage - use uv venv for creating virtual environments and uv pip for the packages
added note
Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme
Fixing typos
6d3b286 to
3c7d547
float actual = static_cast<float>(c_host[0]);
bool passed = std::abs(actual - expected) < 1.0f;

std::cout << " C[0,0] = " << actual << " (expected " << expected << ")\n";
when doing the correctness checks, let's check all elements, not just one? also since you're initializing the tensor with 1's I think the comparison can be exact in this case
fixed
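One way to act on this suggestion is sketched below. It is not the code from the PR: `first_mismatch` is a hypothetical helper, and the exact comparison assumes both inputs are filled with ones so that every output element equals K (FP16 stores integers up to 2048 exactly, so the assumption should hold for moderate K).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): scans every element of the host
// copy of C and returns the index of the first one that is not exactly equal
// to `expected`, or -1 if all elements match.
template <typename T>
std::ptrdiff_t first_mismatch(const std::vector<T>& c_host, float expected)
{
    for(std::size_t i = 0; i < c_host.size(); ++i)
    {
        // Exact comparison is intentional: with all-ones inputs the result
        // is the integer K, so no rounding tolerance should be needed.
        if(static_cast<float>(c_host[i]) != expected)
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}
```

A caller could report the failing index instead of only `C[0,0]`, which makes a partially wrong tile easier to spot.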
multi_d,
.add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 1), // 1 D tensor
Algorithm().tile(128, 128, 32))
.add(Signature().dtype("fp16").layout("rcr").elementwise("MultiDAdd", 2), // 2 D tensors
with multi-d, the algorithm must know the D tensors' layouts I believe
Fixed, please check if it is consistent now
c_dev.copy_to_host(c_host.data());
float expected = static_cast<float>(test.K);
// Use 1% relative tolerance for FP16 accumulation over K elements
if(std::abs(static_cast<float>(c_host[0]) - expected) > (0.01f * expected + 1.0f))
also for the correctness check it would be nice to set atol/rtol and not hardcode them
fixed
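As an illustration of configurable tolerances, here is a minimal sketch. The `Tolerance` struct, its default values, and the helper names are assumptions made for this example, not the PR's actual fix. The test itself follows the common `|actual - expected| <= atol + rtol * |expected|` convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tolerance bundle; names and defaults are illustrative only.
struct Tolerance
{
    float rtol = 1e-2f; // relative tolerance
    float atol = 1e-2f; // absolute tolerance
};

// True when |actual - expected| <= atol + rtol * |expected|.
inline bool is_close(float actual, float expected, const Tolerance& tol)
{
    return std::abs(actual - expected) <= tol.atol + tol.rtol * std::abs(expected);
}

// Counts the elements that fall outside the tolerance. A size mismatch is
// reported as if every element of the longer vector had failed.
inline std::size_t count_mismatches(const std::vector<float>& actual,
                                    const std::vector<float>& expected,
                                    const Tolerance& tol)
{
    if(actual.size() != expected.size())
        return std::max(actual.size(), expected.size());

    std::size_t bad = 0;
    for(std::size_t i = 0; i < actual.size(); ++i)
    {
        if(!is_close(actual[i], expected[i], tol))
            ++bad;
    }
    return bad;
}
```

Passing the tolerances as a parameter lets each example pick values that suit its data type and K, rather than repeating literals such as `0.01f` and `1.0f` at every check.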
bool has(const std::string& name) const { return sets_.find(name) != sets_.end(); }

std::vector<std::string> names() const { return order_; }
returns a deep copy if that matters
fixed
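For context, returning `std::vector<std::string>` by value copies every string on each call. One possible alternative is to return a const reference, sketched below. `NameIndex` is a hypothetical stand-in that models only the members visible in the diff, and the `int` payload is a placeholder. This is not necessarily how the PR resolved the comment.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the class under review.
class NameIndex
{
public:
    // Registers a name once and remembers the insertion order.
    void add(const std::string& name)
    {
        if(sets_.emplace(name, 0).second)
            order_.push_back(name);
    }

    bool has(const std::string& name) const { return sets_.find(name) != sets_.end(); }

    // Returns a reference to the internal vector, so no strings are copied.
    // The reference stays valid only while this object is alive, and its
    // contents change if add() is called afterwards.
    const std::vector<std::string>& names() const { return order_; }

private:
    std::unordered_map<std::string, int> sets_; // payload type is a placeholder
    std::vector<std::string> order_;
};
```

The trade-off is lifetime: a caller that needs a snapshot independent of later modifications should still take a copy explicitly.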
endif()

# Generate source
file(WRITE ${OUTPUT_FILE} "// Auto-generated kernel: ${KERNEL_NAME}
should we do the generation in a python script instead? minimizing the cmake logic to invoking the python scripts
using Priority = ck_tile::dispatcher::Registry::Priority;

// Global dispatcher (initialized once)
static Dispatcher* g_dispatcher = nullptr;
maybe use a smart pointer (shared)
fixed
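A sketch of what a shared-pointer version could look like follows. The `Dispatcher` struct here is an empty placeholder for the real class and `global_dispatcher` is a hypothetical accessor. The PR's actual change may differ.

```cpp
#include <cassert>
#include <memory>

// Placeholder for the real dispatcher type; assumed default-constructible
// for the purposes of this sketch.
struct Dispatcher
{
};

// Function-local static: constructed on first use, and since C++11 that
// initialization is thread-safe. The shared_ptr releases the object at
// program exit, so no manual delete is required.
inline const std::shared_ptr<Dispatcher>& global_dispatcher()
{
    static const std::shared_ptr<Dispatcher> instance = std::make_shared<Dispatcher>();
    return instance;
}
```

Callers that need to keep the dispatcher alive beyond a single call can copy the returned `std::shared_ptr`, while short-lived uses can work through the reference.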
/**
 * @brief GPU timing using HIP events
 */
class GpuTimer
this should operate on the stream the kernel is getting launched on
fixed
afagaj
left a comment
This is major new functionality, so it warrants adding a CHANGELOG.md entry.
e.g. something like
Introduce CK-Tile Dispatcher - a unified kernel dispatch system for AMD GPUs with C++ and Python frontends.
Proposed changes
This PR introduces new dispatching infrastructure for CK Tile, building on prior Tile Engine work, allowing users to isolate and run specific kernels using C++ and Python APIs for GEMMs (universal, preshuffle and multi-D variants). It also adds unified code-generation tools, GPU architecture based kernel filtering, a kernel registry handling mechanism, and a set of examples on how to integrate with other frameworks (C++/Python) together with basic unit and integration tests.
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- clang-format on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered.