diff --git a/docs/conceptual/ck_tile/CK-tile-index.rst b/docs/conceptual/ck_tile/CK-tile-index.rst new file mode 100644 index 0000000000..e18cb24f80 --- /dev/null +++ b/docs/conceptual/ck_tile/CK-tile-index.rst @@ -0,0 +1,33 @@ +.. _ck_tile_index: + +************************ +CK Tile Index +************************ + +CK Tile documentation structure: + +.. toctree:: + :maxdepth: 2 + + introduction_motivation + buffer_views + tensor_views + tile_distribution + coordinate_systems + terminology + adaptors + transforms + descriptors + tile_window + load_store_traits + space_filling_curve + static_distributed_tensor + convolution_example + coordinate_movement + lds_index_swapping + swizzling_example + tensor_coordinates + sweep_tile + encoding_internals + thread_mapping + hardware/index diff --git a/docs/conceptual/ck_tile/MERMAID_DIAGRAMS.md b/docs/conceptual/ck_tile/MERMAID_DIAGRAMS.md new file mode 100644 index 0000000000..5e8679dbd2 --- /dev/null +++ b/docs/conceptual/ck_tile/MERMAID_DIAGRAMS.md @@ -0,0 +1,156 @@ +# Mermaid Diagram Management + +This document explains how to manage mermaid diagrams in the CK Tile documentation. + +## Overview + +All mermaid diagrams in the CK Tile documentation have been converted to SVG files for better rendering compatibility. The original mermaid source code is preserved as commented blocks in the RST files, allowing easy updates when needed. + +## Directory Structure + +- `docs/conceptual/ck_tile/diagrams/` - Contains all SVG diagram files +- `docs/conceptual/ck_tile/convert_mermaid_to_svg.py` - Initial conversion script (one-time use) +- `docs/conceptual/ck_tile/update_diagrams.py` - Helper script to regenerate diagrams from comments + +## Diagram Format in RST Files + +Each diagram follows this format: + +```rst +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + A --> B + B --> C + +.. image:: diagrams/diagram_name.svg + :alt: Diagram + :align: center +``` + +The commented mermaid block won't appear in the rendered documentation but serves as the source for regenerating the SVG. + +## Updating Diagrams + +### When to Update + +You need to regenerate SVG files when: +- Modifying the mermaid source in a commented block +- Adding new diagrams +- Updating diagram styling + +### How to Update + +1. **Edit the commented mermaid source** in the RST file +2. **Run the update script**: + ```bash + # Update all diagrams + python docs/conceptual/ck_tile/update_diagrams.py + + # Update diagrams in a specific file + python docs/conceptual/ck_tile/update_diagrams.py transforms.rst + + # Force regenerate all diagrams (even if SVGs exist) + python docs/conceptual/ck_tile/update_diagrams.py --force + ``` + +### Prerequisites + +The update script requires [mermaid-cli](https://github.com/mermaid-js/mermaid-cli): + +```bash +npm install -g @mermaid-js/mermaid-cli +``` + +## Adding New Diagrams + +To add a new mermaid diagram: + +1. **Create the commented block** in your RST file: + ```rst + .. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + A --> B + ``` + +2. **Add the image reference** immediately after: + ```rst + .. image:: diagrams/my_new_diagram.svg + :alt: My New Diagram + :align: center + ``` + +3. 
**Generate the SVG**: + ```bash + python docs/conceptual/ck_tile/update_diagrams.py your_file.rst + ``` + +## Current Diagrams + +The following RST files contain mermaid diagrams (40 total): + +- `adaptors.rst` (2 diagrams) +- `convolution_example.rst` (1 diagram) +- `coordinate_movement.rst` (1 diagram) +- `descriptors.rst` (2 diagrams) +- `encoding_internals.rst` (2 diagrams) +- `lds_index_swapping.rst` (3 diagrams) +- `load_store_traits.rst` (2 diagrams) +- `space_filling_curve.rst` (1 diagram) +- `static_distributed_tensor.rst` (1 diagram) +- `sweep_tile.rst` (4 diagrams) +- `tensor_coordinates.rst` (2 diagrams) +- `thread_mapping.rst` (2 diagrams) +- `tile_window.rst` (5 diagrams) +- `transforms.rst` (12 diagrams) + +## Troubleshooting + +### SVG not generated + +- Check that mermaid-cli is installed: `mmdc --version` +- Verify the mermaid syntax is valid +- Look for error messages in the script output + +### Diagram not updating + +- Use `--force` flag to regenerate: `python docs/update_diagrams.py --force` +- Check that the image reference matches the generated filename + +### Pattern not matching + +If the update script can't find your commented diagram: +- Ensure proper indentation (3 spaces for comment block content) +- Verify the `.. mermaid::` directive is commented +- Check that the image reference immediately follows the comment block + +## Script Details + +### update_diagrams.py + +This script: +1. Scans RST files for commented mermaid blocks +2. Extracts the mermaid source code +3. Converts to SVG using `mmdc` +4. Saves to the diagrams directory + +**Usage:** +- `python docs/conceptual/ck_tile/update_diagrams.py` - Check all files, update missing SVGs +- `python docs/conceptual/ck_tile/update_diagrams.py --force` - Regenerate all SVGs +- `python docs/conceptual/ck_tile/update_diagrams.py ` - Update specific file + +### convert_mermaid_to_svg.py + +This was the initial conversion script. It: +1. Found all active `.. mermaid::` directives +2. Converted them to SVGs +3. Replaced directives with commented source + image references + +This script was used once for the initial conversion and typically doesn't need to be run again. diff --git a/docs/conceptual/ck_tile/adaptors.rst b/docs/conceptual/ck_tile/adaptors.rst new file mode 100644 index 0000000000..9e8907ab10 --- /dev/null +++ b/docs/conceptual/ck_tile/adaptors.rst @@ -0,0 +1,391 @@ +.. _ck_tile_adaptors: + +Tensor Adaptors - Chaining Transformations +========================================== + +Overview +-------- + +While individual :ref:`transforms ` are effective, TensorAdaptors enable the chaining of multiple transforms together to create complex coordinate transformations. Adaptors can be thought of as transformation pipelines that can reshape, reorder, and restructure tensors in advanced ways. + +TensorAdaptors serve as the bridge between individual transforms and the high-level tensor operations used in applications. They provide a composable abstraction that allows developers to build complex data access patterns from simple building blocks. + +TensorAdaptor Basics +-------------------- + +A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations `, managing the flow of coordinates through multiple transform stages: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Adaptor Composition" + subgraph "Single Transform" + direction TB + I1["Input Coords
[0,1,2]"] + T1["Transform
(e.g., Transpose)"] + O1["Output Coords
[2,0,1]"] + I1 --> T1 --> O1 + end + + subgraph "Chained Transforms" + direction TB + I2["Input
2D"] + T2A["Transform A
(e.g., Merge)"] + M2["Intermediate
1D"] + T2B["Transform B
(e.g., Pad)"] + O2["Output
1D Padded"] + I2 --> T2A --> M2 --> T2B --> O2 + end + end + + style T1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style T2A fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style T2B fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + + +.. image:: diagrams/adaptors_1.svg + :alt: Diagram + :align: center + +.. image:: diagrams/adaptors_1.svg + :alt: Diagram + :align: center +Core Components + +~~~~~~~~~~~~~~~ + +Each TensorAdaptor contains: + +- **transforms**: List of individual :ref:`transforms ` to apply +- **lower_dimension_hidden_idss**: Mappings between transform stages +- **upper_dimension_hidden_idss**: Hidden dimension mappings for internal stages +- **bottom_dimension_hidden_ids**: Input dimension identifiers +- **top_dimension_hidden_ids**: Output dimension identifiers + +The most important method of a TensorAdaptor is ``calculate_bottom_index``, which calculates the lower index from the upper index by applying transforms in reverse order. + +Transpose Adaptor: Dimension Reordering +--------------------------------------- + +The transpose adaptor reorders tensor dimensions according to a permutation pattern. This operation forms the basis for many tensor manipulations in GPU kernels. + +.. code-block:: cpp + + // Create transpose adaptor: [0, 1, 2] → [2, 0, 1] + auto transpose_adaptor = make_identity_tensor_adaptor<3>(); // Start with identity + + // Apply transpose using transform_tensor_adaptor + auto transposed_desc = transform_tensor_descriptor( + original_desc, + make_tuple(make_pass_through_transform(original_desc.get_length(2)), + make_pass_through_transform(original_desc.get_length(0)), + make_pass_through_transform(original_desc.get_length(1))), + make_tuple(sequence<2>{}, sequence<0>{}, sequence<1>{}), // old dims + make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}) // new dims + ); + + // Alternative: Direct coordinate transformation + multi_index<3> top_coord{0, 1, 2}; + // After transpose [2, 0, 1]: coord becomes [2, 0, 1] + +Single-Stage Adaptors: Custom Transform Chains +---------------------------------------------- + +Custom adaptors can be created by specifying which transforms to use and how they connect. This provides fine-grained control over the transformation pipeline: + +.. code-block:: cpp + + // Create a descriptor that merges 2x3 dimensions into single dimension + auto base_desc = make_naive_tensor_descriptor_packed(make_tuple(2, 3)); + + // Apply merge transform + auto merged_desc = transform_tensor_descriptor( + base_desc, + make_tuple(make_merge_transform(make_tuple(2, 3))), + make_tuple(sequence<0, 1>{}), // merge dims 0,1 + make_tuple(sequence<0>{}) // to single dim 0 + ); + + // The adaptor is embedded in the :ref:`descriptor ` + // To use it: + multi_index<1> top_coord{5}; // 1D coordinate + // This internally calculates: row = 5/3 = 1, col = 5%3 = 2 + +Chaining Adaptors: Building Complex Transformations +--------------------------------------------------- + +The real power of adaptors comes from chaining multiple transformations together to create advanced data access patterns: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Adaptor Chaining Flow" + subgraph "Adaptor 1" + A1I["Bottom Dims
[0,1]"] + A1T["Transform:
Merge[2,3]"] + A1O["Top Dims
[0]"] + end + + subgraph "Adaptor 2" + A2I["Bottom Dims
[0]"] + A2T["Transform:
Unmerge[2,3]"] + A2O["Top Dims
[0,1]"] + end + + subgraph "Chained Result" + CI["Input 2D
Bottom[0,1]"] + CO["Output 2D
Top[0,1]"] + end + end + + A1I --> A1T + A1T --> A1O + A1O --> A2I + A2I --> A2T + A2T --> A2O + + CI --> A1I + A2O --> CO + + style A1T fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style A2T fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style CI fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style CO fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + +.. image:: diagrams/adaptors_2.svg + :alt: Diagram + :align: center + +.. image:: diagrams/adaptors_2.svg + :alt: Diagram + :align: center + +.. code-block:: cpp + + // Start with a 2D descriptor + auto desc1 = make_naive_tensor_descriptor_packed(make_tuple(2, 3)); + + // First transformation: merge 2D to 1D + auto merged_desc = transform_tensor_descriptor( + desc1, + make_tuple(make_merge_transform(make_tuple(2, 3))), + make_tuple(sequence<0, 1>{}), // merge dims 0,1 + make_tuple(sequence<0>{}) // to dim 0 + ); + + // Second transformation: unmerge 1D back to 2D + auto final_desc = transform_tensor_descriptor( + merged_desc, + make_tuple(make_unmerge_transform(make_tuple(2, 3))), + make_tuple(sequence<0>{}), // from dim 0 + make_tuple(sequence<0, 1>{}) // to dims 0,1 + ); + + // The chained transformation is embedded in final_desc + // Result should be identity transformation + +Transform Addition: Extending Existing Adaptors +----------------------------------------------- + +Existing adaptors can be extended with new transforms using ``transform_tensor_adaptor``. This pattern is useful for adding padding or other modifications to existing transformation pipelines: + +.. code-block:: cpp + + // Start with transposed descriptor + auto base_desc = make_naive_tensor_descriptor( + make_tuple(3, 4), + make_tuple(1, 3) // transposed strides + ); + + // Add padding to both dimensions + auto padded_desc = transform_tensor_descriptor( + base_desc, + make_tuple(make_pad_transform(3, 1, 1), // pad dim 0: 3 → 5 + make_pad_transform(4, 0, 0)), // keep dim 1: 4 → 4 + make_tuple(sequence<0>{}, sequence<1>{}), // input dims + make_tuple(sequence<0>{}, sequence<1>{}) // output dims (keep 2D) + ); + + // Access pattern + multi_index<2> padded_coord{1, 2}; // In padded space + // Internally calculates: unpadded = [1-1, 2] = [0, 2] + // Then applies transpose strides + +Advanced Patterns +----------------- + +Complex Nested Transforms +~~~~~~~~~~~~~~~~~~~~~~~~~ + +CK Tile supports complex nested transform patterns that enable advanced data layouts: + +.. code-block:: cpp + + // Example: 4D tensor with complex transformations + // Shape: [A, B, C, D] with various transforms + + // 1. Create base descriptor + auto base_desc = make_naive_tensor_descriptor_packed( + make_tuple(A, B, C, D) + ); + + // 2. Apply multiple transformations + // First: merge first 3 dimensions + auto step1_desc = transform_tensor_descriptor( + base_desc, + make_tuple(make_merge_transform(make_tuple(A, B, C)), + make_pass_through_transform(D)), + make_tuple(sequence<0, 1, 2>{}, sequence<3>{}), // input mapping + make_tuple(sequence<0>{}, sequence<1>{}) // output: 2D + ); + + // 3. 
Then unmerge back but with different grouping + auto step2_desc = transform_tensor_descriptor( + step1_desc, + make_tuple(make_unmerge_transform(make_tuple(A*B, C)), + make_pass_through_transform(D)), + make_tuple(sequence<0>{}, sequence<1>{}), // from 2D + make_tuple(sequence<0, 1>{}, sequence<2>{}) // to 3D + ); + + // The adaptor chain is embedded in the descriptors + // CK optimizes these at compile time + +GPU Memory Layout Example +~~~~~~~~~~~~~~~~~~~~~~~~~ + +A practical example showing how adaptors create efficient :ref:`GPU memory access patterns `: + +.. code-block:: cpp + + // Create descriptor for thread block tile: 64x64 + // With 8x8 vector loads per thread + constexpr auto BlockM = 64; + constexpr auto BlockN = 64; + constexpr auto VectorM = 8; + constexpr auto VectorN = 8; + + // Thread arrangement: 8x8 threads + constexpr auto ThreadM = BlockM / VectorM; // 8 + constexpr auto ThreadN = BlockN / VectorN; // 8 + + // Create block descriptor with proper layout + auto block_desc = transform_tensor_descriptor( + make_naive_tensor_descriptor_packed( + make_tuple(number{}, number{}) + ), + make_tuple( + make_unmerge_transform(make_tuple( + number{}, number{} + )), + make_unmerge_transform(make_tuple( + number{}, number{} + )) + ), + make_tuple(sequence<0>{}, sequence<1>{}), // from 2D + make_tuple(sequence<0, 2>{}, sequence<1, 3>{}) // to 4D: [TM,TN,VM,VN] + ); + + // This creates the layout: + // - Dimension 0,1: Thread indices + // - Dimension 2,3: Vector indices within thread + // Enables coalesced memory access on GPU + // See :ref:`ck_tile_thread_mapping` for thread mapping details + +Common Transform Chains +----------------------- + +CK Tile provides several common transform chain patterns used throughout GPU kernels: + +**Padding for Convolution** + +.. code-block:: cpp + + auto padded = transform_tensor_descriptor( + input, + make_tuple(make_pad_transform(H, pad_h, pad_h), + make_pad_transform(W, pad_w, pad_w)), + make_tuple(sequence<0>{}, sequence<1>{}), + make_tuple(sequence<0>{}, sequence<1>{}) + ); + +**Dimension Merging for GEMM** + +.. code-block:: cpp + + auto merged = transform_tensor_descriptor( + input, + make_tuple(make_merge_transform(make_tuple(M, K))), + make_tuple(sequence<0, 1>{}), + make_tuple(sequence<0>{}) + ); + +For complete GEMM optimization strategies, see :ref:`ck_tile_gemm_optimization`. + +**Broadcasting for Elementwise Operations** + +.. 
code-block:: cpp + + auto broadcast = transform_tensor_descriptor( + scalar, + make_tuple(make_replicate_transform(make_tuple(M, N))), + make_tuple(sequence<>{}), + make_tuple(sequence<0, 1>{}) + ); + +Key Concepts Summary +-------------------- + +TensorAdaptors are the coordination layer that makes complex tensor operations possible: + +- **Identity Adaptor**: Starting point for building transformations +- **Transpose Adaptor**: Dimension reordering with permutation patterns +- **Single-Stage Adaptors**: Custom transform chains with precise control +- **Chained Adaptors**: Complex multi-stage transformation pipelines +- **Transform Addition**: Extending existing adaptors with new transforms + +Core concepts to remember: + +- **Bottom/Top Dimensions**: Input and output coordinate spaces +- **Hidden Dimensions**: Internal coordinate mappings between transforms +- **Transform Chains**: Sequential application of multiple transforms +- **Coordinate Transformation**: Bidirectional mapping between coordinate spaces +- **Nested Transforms**: Complex multi-level transformation hierarchies + +Key C++ Patterns in Composable Kernel +-------------------------------------- + +1. **Descriptor-Based Adaptors**: In CK, adaptors are typically embedded within :ref:`tensor descriptors ` rather than created separately +2. **Compile-Time Optimization**: All transformations are resolved at compile time for zero overhead +3. **Type Safety**: Template metaprogramming ensures coordinate transformations are type-safe +4. **GPU Optimization**: Transform chains are designed for efficient GPU memory access patterns. See :ref:`ck_tile_lds_bank_conflicts` for LDS optimization. + +TensorAdaptors bridge the gap between low-level transforms and high-level tensor operations, providing the flexibility to create advanced data layouts and access patterns that are essential for efficient GPU computing. They build upon the foundation of :ref:`BufferViews ` and :ref:`TensorViews ` to provide complex transformation capabilities. + +Next Steps +---------- + +- :ref:`ck_tile_descriptors` - How adaptors combine with element space to form complete tensor descriptors +- :ref:`ck_tile_transforms` - Individual transform types and their properties +- :ref:`ck_tile_tile_window` - How adaptors enable efficient data loading patterns +- :ref:`ck_tile_space_filling_curve` - Advanced coordinate mapping techniques for cache optimization +- :ref:`ck_tile_static_distributed_tensor` - How adaptors help manage distributed tensor storage diff --git a/docs/conceptual/ck_tile/buffer_views.rst b/docs/conceptual/ck_tile/buffer_views.rst new file mode 100644 index 0000000000..14b8309504 --- /dev/null +++ b/docs/conceptual/ck_tile/buffer_views.rst @@ -0,0 +1,443 @@ +.. meta:: + :description: Composable Kernel CK Tile buffer views + :keywords: composable kernel, CK, CK Tile, ROCm, API, buffer view, raw memory + +.. _ck_tile_buffer_views: + +CK Tile buffer view +======================= + +Buffer view is an abstraction that provides structured access to memory. The ``buffer_view`` class is exposed in ``include/ck_tile/core/tensor/buffer_view.hpp``. + +Buffer view serves as the foundation for :ref:`ck_tile_tensor_views`. BufferView handles memory addressing and type safety, while TensorView builds upon this to add multi-dimensional coordinates (shape and strides). 
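The division of responsibility can be sketched as follows. This is a minimal illustration, assuming the ``make_buffer_view`` and ``get`` calls described later on this page; the hard-coded shape ``(6, 4)`` and row-major strides ``(4, 1)`` stand in for the coordinate bookkeeping that a tensor view normally provides.

.. code-block:: cpp

   // Hypothetical illustration: the 2D-to-linear mapping below is the part a
   // tensor view adds on top of a buffer view's typed, bounds-aware access.
   __device__ float read_2d_element(const float* data, index_t row, index_t col)
   {
       // Buffer view: typed access to a linear range of 6 * 4 global-memory elements
       auto buf = make_buffer_view<address_space_enum::global>(data, 6 * 4);

       // Tensor view responsibility (inlined here): shape (6, 4), strides (4, 1)
       const index_t offset = row * 4 + col;

       // Scalar get (i, linear_offset, is_valid_element); the <float> template
       // argument for the element type is assumed here
       auto element = buf.template get<float>(0, offset, true);
       return element.get(0);
   }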
+ + +Buffer view provides the following advantages: + +* A unified interface across global, shared, and register memory +* Address spaces encoded in types, taking advantage of compile-time type checking +* Configurable handling of invalid values, out-of-bounds operations, and conditional access patterns +* Atomic operations for parallel algorithms +* AMD GPU-specific optimizations +* Automatic application of appropriate memory ordering constraints and cache control directives based on the target address space and operation type + + +[TO DO: do we want to say more about these items? There wasn't a lot of detail in the original text, so I put them in a list for now] + + + +Address Space Usage Patterns +---------------------------- + +[TO DO: explain in words what the diagram shows] +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart TB + subgraph CF ["Compute Flow"] + direction LR + GM1["Global Memory
Input Data"] --> LDS["LDS
Tile Cache"] + LDS --> VGPR["VGPR
Working Set"] + VGPR --> Compute["Compute
Operations"] + Compute --> VGPR + VGPR --> LDS2["LDS
Reduction"] + LDS2 --> GM2["Global Memory
Output Data"] + end + + subgraph UP ["Usage Pattern"] + direction LR + P1["1. Load tile from Global → LDS"] + P2["2. Load working set LDS → VGPR"] + P3["3. Compute in VGPR"] + P4["4. Store results VGPR → LDS"] + P5["5. Reduce in LDS"] + P6["6. Write final LDS → Global"] + + P1 --> P2 --> P3 --> P4 --> P5 --> P6 + end + + CF ~~~ UP + + style GM1 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style LDS fill:#fed7aa,stroke:#f59e0b,stroke-width:2px + style VGPR fill:#d1fae5,stroke:#10b981,stroke-width:2px + style Compute fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + + +.. image:: diagrams/buffer_views_1.svg + :alt: Diagram + :align: center + + +Basic Creation +~~~~~~~~~~~~~~ + +[TO DO: remove "modern C++ template metaprogramming" and "zero-overhead abstraction"] + +[TO DO: might want to move the implementation details to a separate section under "reference"] + + +.. code-block:: cpp + + #include + #include + + // Create buffer view in C++ + __device__ void example_buffer_creation() + { + // Static array in global memory + float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + constexpr index_t buffer_size = 8; + + // Create buffer view for global memory + // Template parameters: + auto buffer_view = make_buffer_view( + data, // pointer to data + buffer_size // number of elements + ); + + + // Implementation detail: The actual C++ template is: + // template + // struct buffer_view + + // Alternative: Create with explicit type + using buffer_t = buffer_view; + buffer_t explicit_buffer{data, number{}}; + + // Access properties at compile time + constexpr auto size = buffer_view.get_buffer_size(); + constexpr auto space = buffer_view.get_address_space(); + + // The buffer_view type encodes: + // - Data type (float) + // - Address space (global memory) + // - Size (known at compile time for optimization) + static_assert(size == 8, "Buffer size should be 8"); + static_assert(space == address_space_enum::global, "Should be global memory"); + } + +[TO DO: add details and remove unnecessary comments; the "implementation detail" comment can be moved out and either placed outside and explained further, or just removed, depending on what we want to do] + +[TO DO: might want to put this implementation detail in the reference section] + +Buffer view uses two modes, zero value mode and custom value mode, that can prevent serialization during bounds checking. + +Zero value mode returns zero without branching when an access falls outside the valid buffer range. This is useful in convolutions where out-of-bounds accesses correspond to zero-padding. + +Custom value mode returns a custom value without branching when an access falls outside the valid buffer range. Custom value mode accommodates algorithms that require specific values for boundary conditions. + +[TO DO: there were two examples of custom value mode that I removed. I removed them because unlike for zero value mode where the example was convolution, the example was vague in custom value. Is there a more specific example of where custom value would be used?] + +.. 
code-block:: cpp + + // Basic buffer view creation with automatic zero for invalid elements + void basic_creation_example() { + // Create data array + constexpr size_t buffer_size = 8; + float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + + // Create global memory buffer view + auto buffer_view = make_buffer_view(data, buffer_size); + } + + // Custom invalid value mode + void custom_invalid_value_example() { + constexpr size_t buffer_size = 8; + float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + float custom_invalid = 13.0f; + + // Create buffer view with custom invalid value + auto buffer_view = make_buffer_view( + data, buffer_size, custom_invalid); + } + + +When ``InvalidElementUseNumericalZeroValue`` is set to true, the system uses zero value mode for out of bounds checking. When ``InvalidElementUseNumericalZeroValue`` is set to false, custom value mode is used. Zero value mode is used by default. + +.. note:: + + Zero or custom invalid value is only returned for complete invalid values or out of bound access, for example when the first address of the vector is invalid. Partial out of bounds access during vector reads will not return useful results. + +.. code-block:: cpp + + // Create data array + constexpr size_t buffer_size = 8; + float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + float custom_invalid = 13.0f; + + // Create global memory buffer view with zero invalid value mode (default) + auto buffer_view = make_buffer_view(data, buffer_size, custom_invalid); + + // Invalid element access with is_valid_element=false + // Returns custom_invalid due to custom invalid value mode + auto invalid_value = buffer_view.template get(0, 0, false); + printf("Invalid element: %.1f\n", invalid_value.get(0)); + + // Out of bounds access - AMD buffer addressing handles bounds checking + // Will return custom_invalid when accessing beyond buffer_size + auto oob_value = buffer_view.template get(0, 100, true); + printf("Out of bounds: %.1f\n", oob_value.get(0)); + + + + + +Get Operations +-------------- + +[TO DO: might want to put this implementation detail in the reference section] + +The signature for the ``buffer_view`` ``get()`` takes four parameters: + +``i``: the primary offset into the buffer expressed in terms of elements of type T rather than raw bytes. + +``linear_offset``: [TO DO: what is this?] + +``is_valid_element``: [TO DO: what is this?] + +[TO DO: the last param, that's the out of bounds handling, yes? +.. code:: cpp + + get(index_t i, + index_t linear_offset, + bool is_valid_element, + bool_constant = {}) + + +[TO DO: need some context around the code] + +[TO DO: code chunks need to have detail and explanation so that the reader can see what they're trying to demonstrate.] + + +.. 
code-block:: cpp + + // Create buffer view + float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + auto buffer_view = make_buffer_view(data, 8); + + // Simple get - compile-time bounds checking when possible + auto value_buf = buffer_view.template get(0,1,true); //get the buffer from the buffer view + float value = value_buf.get(0); //get the value from the buffer + + // Get with valid flag - branchless conditional access + bool valid_flag = false; + value_buf = buffer_view.template get(0,1,valid_flag); + value = value_buf.get(0); + // Returns 0 valid_flag is false + + // vectorized get + using float2 = ext_vector_t; + auto vector_buf = buffer_view.template get(0, 0, true); + // Loads 2 floats in a single instruction + float val1 = vector_buf.get(0); + float val2 = vector_buf.get(1); + } + +``ext_vector_t`` enables compile-time selection of optimal load and store instructions that can transfer multiple data elements in a single memory transaction. + +[TO DO: what is it actually doing? When does one use scalars vs vectors? Is it application specific or are there ] + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Scalar Access (4 instructions)" + S1["Load float[0]"] --> R1["Register 1"] + S2["Load float[1]"] --> R2["Register 2"] + S3["Load float[2]"] --> R3["Register 3"] + S4["Load float[3]"] --> R4["Register 4"] + end + + subgraph "Vectorized Access (1 instruction)" + V1["Load float4[0]"] --> VR["Vector Register
(4 floats)"] + end + + subgraph "Performance Impact" + Perf["4x fewer instructions
Better memory bandwidth
Reduced latency"] + end + + R1 & R2 & R3 & R4 --> Perf + VR --> Perf + + style S1 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style S2 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style S3 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style S4 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style V1 fill:#d1fae5,stroke:#10b981,stroke-width:2px + style Perf fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + + + + + + +.. image:: diagrams/buffer_views_2.svg + :alt: Diagram + :align: center + +Understanding BufferView Indexing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +[TO DO: an explanation of the diagram is needed] + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "Input Parameters" + Offset["Offset
(e.g., 5)"] + ValidFlag["Valid Flag
(optional)"] + end + + subgraph "Processing" + BoundsCheck{{"Bounds Check
offset < buffer_size?"}} + FlagCheck{{"Flag Check
valid_flag == True?"}} + Access["Access Memory
buffer[offset]"] + end + + subgraph "Output" + ValidResult["Valid Result
Return value"] + Invalid["Invalid Result
Return 0 or default"] + end + + Offset --> BoundsCheck + ValidFlag --> FlagCheck + + BoundsCheck -->|Yes| FlagCheck + BoundsCheck -->|No| Invalid + + FlagCheck -->|Yes| Access + FlagCheck -->|No| Invalid + + Access --> ValidResult + + style Offset fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style ValidFlag fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style ValidResult fill:#d1fae5,stroke:#10b981,stroke-width:2px + style Invalid fill:#fee2e2,stroke:#ef4444,stroke-width:2px + + + + + + +.. image:: diagrams/buffer_views_3.svg + :alt: Diagram + :align: center + + + +Update Operations +----------------- + +Update operations modify the buffer content. The ``set()`` method writes a value to a specific location. + +.. code-block:: cpp + + void scalar_set_operations_example() { + + // Create data array + constexpr size_t buffer_size = 8; + float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f}; + + // Create global memory buffer view + auto buffer_view = make_buffer_view(data, buffer_size); + + // Basic set: set(i, linear_offset, is_valid_element, value) + // Sets element at position i + linear_offset = 0 + 2 = 2 + buffer_view.template set(0, 2, true, 99.0f); + + // Invalid write with is_valid_element=false (ignored) + buffer_view.template set(0, 3, false, 777.0f); + + // Out of bounds write - handled safely by AMD buffer addressing + buffer_view.template set(0, 100, true, 555.0f); + + // Vector set + using float2 = ext_vector_t; + float2 pair_values{100.0f, 200.0f}; + buffer_view.template set(0, 5, true, pair_values); + } + +Atomic Operations +----------------- + +[TO DO: this needs information] + +Atomic vs Non-Atomic Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Non-Atomic Operation (Race Condition)" + NA1["Thread 1: Read value (10)"] --> NA2["Thread 1: Add 5 (15)"] + NA3["Thread 2: Read value (10)"] --> NA4["Thread 2: Add 3 (13)"] + NA2 --> NA5["Thread 1: Write 15"] + NA4 --> NA6["Thread 2: Write 13"] + NA5 & NA6 --> NA7["Final value: 13 ❌
(Lost update from Thread 1)"] + end + + subgraph "Atomic Operation (Thread-Safe)" + A1["Thread 1: atomic_add(5)"] --> A2["Hardware ensures
serialization"] + A3["Thread 2: atomic_add(3)"] --> A2 + A2 --> A4["Final value: 18 ✓
(Both updates applied)"] + end + + style NA7 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style A4 fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + + +.. image:: diagrams/buffer_views_4.svg + :alt: Diagram + :align: center + +C++ Atomic Operations +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + __device__ void example_atomic_operations() + { + // Shared memory for workgroup-level reductions + __shared__ float shared_sum[256]; + auto shared_buffer_view = make_buffer_view( + shared_sum, 256 + ); + + // Initialize shared memory + if (threadIdx.x < 256) { + shared_buffer_view.template set(threadIdx.x, 0.0f, true); + } + __syncthreads(); + + // Each thread atomically adds to shared memory + auto my_value = static_cast(threadIdx.x); + shared_buffer_view.template update(0,0,true,my_value); + + // Atomic max for finding maximum value + shared_buffer_view.template update(0,1,true,my_value); + + __syncthreads(); + } diff --git a/docs/conceptual/ck_tile/cache_flushing_benchmarking.rst b/docs/conceptual/ck_tile/cache_flushing_benchmarking.rst new file mode 100644 index 0000000000..2866ba0c9f --- /dev/null +++ b/docs/conceptual/ck_tile/cache_flushing_benchmarking.rst @@ -0,0 +1,390 @@ +=================================== +Cache Flushing for GPU Benchmarking +=================================== + +Overview +======== + +When benchmarking GPU kernels, accurate performance measurements require understanding and controlling cache behavior. Running a kernel multiple times with the same input data can lead to artificially fast results due to **cache hits**, where data and instructions are served from fast GPU cache rather than slow High Bandwidth Memory (HBM). + +Composable Kernel provides two complementary mechanisms to ensure realistic "cold cache" performance measurements: + +1. **Instruction Cache Flushing** - Invalidates cached GPU instructions +2. **Rotating Memory Buffers** - Cycles through multiple data buffer copies at different memory addresses + +This document explains how these mechanisms work and how to use them in benchmarks. + +The Problem: Hot vs. Cold Cache +================================ + +GPU Memory Hierarchy +-------------------- + +GPUs have a multi-level cache hierarchy: + +.. code-block:: text + + Fast → Slow, Small → Large + + ┌─────────────────┐ + │ Register File │ ~1 cycle + ├─────────────────┤ + │ L1 I-Cache │ ~4 cycles ← Instruction cache + ├─────────────────┤ + │ L1 Data Cache │ ~4 cycles ← Data cache + ├─────────────────┤ + │ L2 Cache │ ~50 cycles + ├─────────────────┤ + │ HBM (VRAM) │ ~400 cycles + └─────────────────┘ + +Cache Behavior Without Flushing +-------------------------------- + +When running a kernel repeatedly without cache management: + +.. code-block:: text + + Run 1: [Cache MISS] → Fetch from HBM → 400 cycles → 5.2ms + Run 2: [Cache HIT!] → Read from L1/L2 → 4 cycles → 3.8ms ← Artificially fast! + Run 3: [Cache HIT!] → Read from L1/L2 → 4 cycles → 3.8ms + ... + Average: 4.1ms (misleading - not representative of real-world performance) + +This leads to: + +- ✗ Inflated performance numbers +- ✗ Inconsistent timing between first and subsequent runs +- ✗ Unfair comparisons between different kernels +- ✗ Misleading optimization decisions + +Solution 1: Instruction Cache Flushing +======================================= + +What is Instruction Cache? +--------------------------- + +The **instruction cache (I-cache)** is a small, fast memory on each GPU compute unit that stores recently executed machine code instructions. 
When a thread needs to execute an instruction: + +1. The **Program Counter (PC)** holds the instruction's memory address +2. The GPU checks if that address exists in the I-cache +3. **Cache HIT**: Instruction read instantly from I-cache (~4 cycles) +4. **Cache MISS**: Instruction fetched from HBM (~400 cycles), then cached + +How It Works +------------ + +The GPU uses **address-based caching**: when you launch the same kernel multiple times, the kernel code resides at the same memory address, allowing the I-cache to serve cached instructions. + +.. code-block:: text + + First Kernel Run: + PC = 0x7F8A0000 → I-Cache lookup → MISS → Fetch from HBM → Cache it + + Second Kernel Run (without flush): + PC = 0x7F8A0000 → I-Cache lookup → HIT! → Read from cache (fast!) + + Second Kernel Run (with flush): + PC = 0x7F8A0000 → I-Cache lookup → MISS → Fetch from HBM again + +The ``flush_icache()`` Function +-------------------------------- + +Located in ``include/ck_tile/host/flush_icache.hpp``: + +.. code-block:: cpp + + namespace ck_tile { + // GPU kernel to invalidate instruction cache for accurate benchmarking. + static __global__ void flush_cache() + { + asm __volatile__("s_icache_inv \n\t" // Invalidate I-cache + "s_nop 0 \n\t" // Wait cycles (16 NOPs) + "s_nop 0 \n\t" + // ... 14 more NOPs + "s_nop 0 \n\t" :: + :); + } + } + +**Key Components:** + +- ``s_icache_inv``: AMD GPU instruction that invalidates the L1 instruction cache on the current compute unit +- ``s_nop 0`` (×16): No-operation instructions (NOPs) that create a 16-cycle delay to ensure cache invalidation completes before the kernel exits + +**Why 16 NOPs?** + +The ``s_icache_inv`` instruction is **asynchronous**: it initiates cache invalidation but doesn't wait for completion. Without the NOPs, the kernel might exit before the flush finishes, leading to race conditions and incomplete cache invalidation. + +Launching the Flush Kernel +--------------------------- + +From ``include/ck_tile/host/rotating_buffers.hpp``: + +.. code-block:: cpp + + inline void flush_icache() + { + hipDeviceProp_t deviceProps; + HIP_CHECK_ERROR(hipGetDeviceProperties(&deviceProps, 0)); + + // Over-provision blocks to ensure all CUs execute the flush instruction. + // With imperfect scheduling, launching exactly 1 block per CU doesn't guarantee coverage. + // 60x over-provisioning provides statistical certainty that every CU gets at least one block. + constexpr int32_t blocks_per_cu = 60; + int32_t gpu_block3 = deviceProps.multiProcessorCount * blocks_per_cu; + + ck_tile::flush_cache<<>>(); + HIP_CHECK_ERROR(hipGetLastError()); + } + +**Why 60× Over-provisioning?** + +The I-cache is **per-compute-unit** (CU). To flush all CUs, we must ensure every CU executes at least one instance of ``s_icache_inv``. + +- Launching exactly 1 block per CU doesn't guarantee 1:1 mapping due to GPU scheduler behavior +- Launching 60 blocks per CU provides statistical certainty that every CU receives work +- For a 120-CU GPU: 120 × 60 = 7,200 blocks × 64 threads = 460,800 total threads + +This ensures comprehensive instruction cache flushing across all compute units. + +Solution 2: Rotating Memory Buffers +==================================== + +What is Data Cache? +------------------- + +While I-cache stores instructions, **data cache** (L1 data, L2) stores matrix data (inputs A, B and output C). When a kernel reads the same matrix repeatedly, the data is served from cache rather than HBM. 
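Before looking at ``RotatingMemWrapper`` itself, here is a hedged host-side sketch (not part of the CK Tile sources) of a sanity check worth doing: rotation only forces data-cache misses if the combined footprint of all buffer copies exceeds the L2 capacity, so that each copy's cache lines have been evicted by the time it is reused. ``hipDeviceProp_t::l2CacheSize`` is the standard HIP device property used below; ``rotation_exceeds_l2`` is a hypothetical helper name.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstddef>
   #include <cstdio>

   // Hypothetical helper: returns true when rotating_count copies of A and B
   // together exceed the device's L2 cache, so each timed iteration is likely
   // to read matrix data from HBM rather than from cache.
   inline bool rotation_exceeds_l2(std::size_t size_a, std::size_t size_b,
                                   std::size_t rotating_count)
   {
       hipDeviceProp_t props;
       (void)hipGetDeviceProperties(&props, 0);

       const std::size_t working_set = rotating_count * (size_a + size_b);
       if(working_set <= static_cast<std::size_t>(props.l2CacheSize))
       {
           std::printf("Warning: rotating working set (%zu bytes) fits in L2 (%d bytes); "
                       "consider a larger rotating_count for cold-cache timing.\n",
                       working_set, props.l2CacheSize);
           return false;
       }
       return true;
   }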
+ +The RotatingMemWrapper Struct +------------------------------ + +Located in ``include/ck_tile/host/rotating_buffers.hpp``: + +.. code-block:: cpp + + template + struct RotatingMemWrapper + { + RotatingMemWrapper(const void* a_ptr_, + const void* b_ptr_, + std::size_t rotating_count_, + std::size_t size_a_, + std::size_t size_b_); + + void Next(); // Rotate to next buffer copy + ~RotatingMemWrapper() noexcept; // Cleanup + }; + +**Purpose**: Prevents data cache reuse by cycling through multiple copies of input matrices at different memory addresses. + +How It Works +------------ + +**Constructor: Create Buffer Copies** + +.. code-block:: cpp + + RotatingMemWrapper(a_ptr, b_ptr, rotating_count=3, size_a, size_b) + { + // Store original buffer pointers as first entry + p_a_grids.push_back(a_ptr); + p_b_grids.push_back(b_ptr); + + // Create (rotating_count - 1) additional copies at different memory addresses + for(size_t i = 1; i < rotating_count; i++) + { + void* pADeviceBuf; + hipMalloc(&pADeviceBuf, size_a); + hipMemcpy(pADeviceBuf, p_a_grids[0], size_a, hipMemcpyDeviceToDevice); + p_a_grids.push_back(pADeviceBuf); + + // Same for B matrix... + } + } + +Result: + +.. code-block:: text + + GPU Memory: + ┌─────────────────────────┐ + │ Matrix A (original) │ Address: 0x1000 + │ Matrix A (copy 1) │ Address: 0x2000 + │ Matrix A (copy 2) │ Address: 0x3000 + │ Matrix B (original) │ Address: 0x4000 + │ Matrix B (copy 1) │ Address: 0x5000 + │ Matrix B (copy 2) │ Address: 0x6000 + └─────────────────────────┘ + +**Next(): Rotate to Next Buffer** + +.. code-block:: cpp + + void Next() + { + if(rotating_count > 1) + { + std::size_t idx = iter++ % rotating_count; // Cycle: 0,1,2,0,1,2,... + a_ptr = p_a_grids[idx]; + b_ptr = p_b_grids[idx]; + } + } + +Usage in benchmarking loop: + +.. code-block:: text + + Iteration 1: Next() → Use buffers at 0x1000, 0x4000 → Kernel reads → Cache miss + Iteration 2: Next() → Use buffers at 0x2000, 0x5000 → Kernel reads → Cache miss + Iteration 3: Next() → Use buffers at 0x3000, 0x6000 → Kernel reads → Cache miss + Iteration 4: Next() → Use buffers at 0x1000, 0x4000 → Kernel reads → Cache miss + ... + +By the time the buffers cycle back to the first copy, the cache has likely evicted the old data. + +**Destructor: Cleanup** + +.. code-block:: cpp + + ~RotatingMemWrapper() noexcept + { + // Restore original buffer pointers + a_ptr = p_a_grids[0]; + b_ptr = p_b_grids[0]; + + // Free extra buffer copies (index 0 is original, don't free it) + for(size_t i = 1; i < rotating_count; i++) + { + hipFree(p_a_grids[i]); + hipFree(p_b_grids[i]); + } + } + +Using Cache Flushing in Practice +================================= + +Command Line Argument +--------------------- + +The ``flush_cache`` command-line argument controls whether cache flushing is enabled: + +.. code-block:: bash + + # Enable cache flushing (cold cache benchmarking) + ./gemm_example --flush_cache=1 --rotating_count=3 + + # Disable cache flushing (hot cache benchmarking) + ./gemm_example --flush_cache=0 + +In ``run_gemm_quant_example.inc``: + +.. 
code-block:: cpp + + bool flush_cache = arg_parser.get_bool("flush_cache"); + int rotating_count = arg_parser.get_int("rotating_count"); + + // Pass to stream_config + ck_tile::stream_config{ + nullptr, // stream + true, // time_kernel + 1, // log_level + n_warmup, // cold_niters (warmup iterations) + n_repeat, // nrepeat (timed iterations) + true, // is_gpu_timer + flush_cache, // flush_cache_ ← Controls cache flushing + rotating_count // rotating_count_ ← Number of buffer copies + } + +Integration with Timing Loop +----------------------------- + +The ``launch_kernel_time_mask`` function integrates both mechanisms: + +.. code-block:: cpp + + // From include/ck_tile/host/kernel_launch.hpp + template + float launch_kernel_time_mask(const stream_config& s, + PreprocessFunc preprocess, + Callables&&... callables) + { + // Timing loop (simplified) + for(int i = 0; i < s.nrepeat_; i++) + { + preprocess(); // 1. Flush I-cache + rotate buffers + callables_func(); // 2. Launch kernel + } + + return average_time; + } + +Complete Example +---------------- + +From ``example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc``: + +.. code-block:: cpp + + // Setup rotating memory wrapper + RotatingMemWrapper rotating_mem( + a_ptr, b_ptr, rotating_count, size_a, size_b); + + // Define preprocessing: flush I-cache + rotate buffers + auto preprocess = [&]() { + if(flush_cache) { + flush_icache(); // Invalidate instruction cache + rotating_mem.Next(); // Switch to next buffer copy + } + }; + + // Define kernel launch + auto kernel_launch = [&]() { + gemm_kernel<<>>(a_ptr, b_ptr, c_ptr, M, N, K); + }; + + // Benchmark with cache control + float avg_time = launch_kernel_time_mask( + stream_config, // Config with flush_cache and rotating_count + preprocess, // Flush + rotate before each iteration + kernel_launch // Kernel to benchmark + ); + +Execution Flow +-------------- + +With ``flush_cache=true`` and ``rotating_count=3``, ``nrepeat=100``: + +.. code-block:: text + + Warmup Phase (n_warmup iterations): + - Run kernel without timing + - Prime GPU, warm up scheduler + + Timed Phase (100 iterations): + Iteration 1: flush_icache() → rotating_mem.Next() → Use buffer copy 0 → kernel() → Measure + Iteration 2: flush_icache() → rotating_mem.Next() → Use buffer copy 1 → kernel() → Measure + Iteration 3: flush_icache() → rotating_mem.Next() → Use buffer copy 2 → kernel() → Measure + Iteration 4: flush_icache() → rotating_mem.Next() → Use buffer copy 0 → kernel() → Measure + ... + Iteration 100: flush_icache() → rotating_mem.Next() → Use buffer copy 1 → kernel() → Measure + + Return: Average time per iteration (excluding preprocess overhead) + +References +========== + +Related Files +------------- + +- ``include/ck_tile/host/flush_icache.hpp`` - I-cache flush kernel implementation +- ``include/ck_tile/host/rotating_buffers.hpp`` - RotatingMemWrapper implementation +- ``include/ck_tile/host/kernel_launch.hpp`` - Timing loop integration + +Conclusion +========== + +Accurate GPU kernel benchmarking requires careful control of cache behavior. The combination of **instruction cache flushing** (``flush_icache``) and **rotating memory buffers** (``RotatingMemWrapper``) ensures realistic "cold cache" performance measurements that represent real-world application behavior. + +By understanding and utilizing these mechanisms through the ``flush_cache`` command-line argument, you can obtain trustworthy performance data for optimization decisions and fair kernel comparisons. 
+ diff --git a/docs/conceptual/ck_tile/convert_mermaid_to_svg.py b/docs/conceptual/ck_tile/convert_mermaid_to_svg.py new file mode 100644 index 0000000000..1d62405e53 --- /dev/null +++ b/docs/conceptual/ck_tile/convert_mermaid_to_svg.py @@ -0,0 +1,224 @@ +#!/usr/bin/env python3 +""" +Script to convert all mermaid diagrams in CK Tile docs to SVGs. +This script: +1. Finds all mermaid blocks in RST files +2. Converts them to SVG using mmdc +3. Updates RST files to use SVG images with commented mermaid source +""" + +import os +import re +import subprocess +import tempfile +from pathlib import Path + +# Configuration +DOCS_DIR = Path(__file__).parent +DIAGRAMS_DIR = DOCS_DIR / "diagrams" +RST_FILES = [ + "convolution_example.rst", + "encoding_internals.rst", + "lds_index_swapping.rst", + "space_filling_curve.rst", + "sweep_tile.rst", + "tensor_coordinates.rst", + "thread_mapping.rst", + "static_distributed_tensor.rst", + "load_store_traits.rst", + "tile_window.rst", + "transforms.rst", + "descriptors.rst", + "coordinate_movement.rst", + "adaptors.rst", + "introduction_motivation.rst", + "buffer_views.rst", + "tensor_views.rst", + "coordinate_systems.rst", + "tile_distribution.rst", +] + +# Pattern to find mermaid blocks (can be indented with 3 spaces for commented blocks) +MERMAID_PATTERN = re.compile( + r"^(?: )?\.\. mermaid::\s*\n((?:(?:\n| .*))*)", re.MULTILINE +) + + +def extract_mermaid_content(block): + """Extract the actual mermaid code from the block, removing RST indentation.""" + lines = block.split("\n") + # Remove the leading spaces (RST indentation) + content_lines = [] + for line in lines: + if line.startswith(" "): + content_lines.append(line[3:]) # Remove 3 spaces + elif line.strip() == "": + content_lines.append("") + return "\n".join(content_lines).strip() + + +def generate_diagram_name(file_path, diagram_index, total_in_file): + """Generate a descriptive name for the diagram.""" + base_name = file_path.stem + if total_in_file == 1: + return f"{base_name}.svg" + else: + return f"{base_name}_{diagram_index + 1}.svg" + + +def convert_mermaid_to_svg(mermaid_code, output_path): + """Convert mermaid code to SVG using mmdc.""" + # Create a temporary file for the mermaid code + with tempfile.NamedTemporaryFile( + mode="w", suffix=".mmd", delete=False, encoding="utf-8" + ) as tmp: + tmp.write(mermaid_code) + tmp_path = tmp.name + + try: + # Run mmdc to convert to SVG (use shell=True on Windows for .cmd files) + subprocess.run( + [ + "mmdc", + "-i", + tmp_path, + "-o", + str(output_path), + "-t", + "neutral", + "-b", + "transparent", + ], + capture_output=True, + text=True, + check=True, + shell=True, # Required for Windows .cmd files + ) + print(f" ✓ Generated: {output_path.name}") + return True + except subprocess.CalledProcessError as e: + print(f" ✗ Error converting diagram: {e.stderr}") + return False + finally: + # Clean up temp file + os.unlink(tmp_path) + + +def update_rst_file(file_path, diagrams_info): + """Update RST file to replace mermaid blocks with commented source + image reference.""" + with open(file_path, "r", encoding="utf-8") as f: + content = f.read() + + # Sort diagrams by position (reverse order to maintain positions) + diagrams_info.sort(key=lambda x: x["position"], reverse=True) + + for info in diagrams_info: + # Find the mermaid block + match = info["match"] + start_pos = match.start() + end_pos = match.end() + + # Create the replacement text + mermaid_block = match.group(0) + + # Create commented mermaid block + commented_lines = [ + ".. 
", + " Original mermaid diagram (edit here, then run update_diagrams.py)", + " ", + ] + for line in mermaid_block.split("\n"): + commented_lines.append(f" {line}") + + # Add image reference + svg_rel_path = f"diagrams/{info['svg_name']}" + image_block = [ + "", + f".. image:: {svg_rel_path}", + " :alt: Diagram", + " :align: center", + "", + ] + + replacement = "\n".join(commented_lines + image_block) + + # Replace in content + content = content[:start_pos] + replacement + content[end_pos:] + + # Write back + with open(file_path, "w", encoding="utf-8") as f: + f.write(content) + + print(f" ✓ Updated: {file_path.name}") + + +def process_file(file_path): + """Process a single RST file.""" + print(f"\nProcessing {file_path.name}...") + + with open(file_path, "r", encoding="utf-8") as f: + content = f.read() + + # Find all mermaid blocks + matches = list(MERMAID_PATTERN.finditer(content)) + + if not matches: + print(" No mermaid diagrams found.") + return + + print(f" Found {len(matches)} diagram(s)") + + diagrams_info = [] + + # Process each mermaid block + for idx, match in enumerate(matches): + mermaid_content = extract_mermaid_content(match.group(1)) + svg_name = generate_diagram_name(file_path, idx, len(matches)) + svg_path = DIAGRAMS_DIR / svg_name + + # Convert to SVG + if convert_mermaid_to_svg(mermaid_content, svg_path): + diagrams_info.append( + {"match": match, "svg_name": svg_name, "position": match.start()} + ) + + # Update the RST file + if diagrams_info: + update_rst_file(file_path, diagrams_info) + + +def main(): + """Main function.""" + print("CK Tile Mermaid to SVG Converter") + print("=" * 50) + + # Verify mmdc is available + try: + subprocess.run( + ["mmdc", "--version"], capture_output=True, check=True, shell=True + ) + except (subprocess.CalledProcessError, FileNotFoundError): + print("Error: mermaid-cli (mmdc) not found. Please install it:") + print(" npm install -g @mermaid-js/mermaid-cli") + return 1 + + # Ensure diagrams directory exists + DIAGRAMS_DIR.mkdir(parents=True, exist_ok=True) + + # Process each file + for rst_file in RST_FILES: + file_path = DOCS_DIR / rst_file + if file_path.exists(): + process_file(file_path) + else: + print(f"\n⚠ Warning: {rst_file} not found") + + print("\n" + "=" * 50) + print("✓ Conversion complete!") + print(f"SVG files saved to: {DIAGRAMS_DIR}") + + return 0 + + +if __name__ == "__main__": + exit(main()) diff --git a/docs/conceptual/ck_tile/convert_raw_html_to_commented.py b/docs/conceptual/ck_tile/convert_raw_html_to_commented.py new file mode 100644 index 0000000000..e90bf9def0 --- /dev/null +++ b/docs/conceptual/ck_tile/convert_raw_html_to_commented.py @@ -0,0 +1,84 @@ +#!/usr/bin/env python3 +"""Convert raw HTML mermaid blocks to commented format for SVG conversion.""" + +import os +import re + + +def convert_raw_html_to_commented(content): + """Convert raw HTML mermaid blocks to commented mermaid format.""" + + # Pattern to match raw HTML mermaid blocks + pattern = r'\.\. raw:: html\n\n
<div class="mermaid"[^>]*>\n(.*?)\n</div>
' + + def replace_block(match): + mermaid_code = match.group(1) + # The mermaid code in HTML has 3-space indentation, keep it + # but add 3 more spaces for .. mermaid:: indentation + mermaid_lines = mermaid_code.split("\n") + properly_indented = [] + for line in mermaid_lines: + if line.strip(): # Non-empty line + # Line already has 3 spaces from HTML, add 3 more for mermaid block + properly_indented.append(" " + line) + else: + properly_indented.append("") + + indented_code = "\n".join(properly_indented) + + # Create commented format matching the expected pattern + commented = f""".. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + +{indented_code} + + +""" + return commented + + return re.sub(pattern, replace_block, content, flags=re.DOTALL) + + +def main(): + """Process files with raw HTML mermaid blocks.""" + + files_to_convert = [ + "introduction_motivation.rst", + "buffer_views.rst", + "tensor_views.rst", + "coordinate_systems.rst", + "tile_distribution.rst", + ] + + converted_files = [] + + for filename in files_to_convert: + if not os.path.exists(filename): + print(f"Skipping {filename} - not found") + continue + + with open(filename, "r", encoding="utf-8") as f: + original = f.read() + + converted = convert_raw_html_to_commented(original) + + if converted != original: + with open(filename, "w", encoding="utf-8") as f: + f.write(converted) + + blocks_converted = original.count(".. raw:: html") + converted_files.append((filename, blocks_converted)) + print(f"✓ Converted {filename}: {blocks_converted} blocks") + else: + print(f" {filename}: no raw HTML blocks found") + + print("\n=== CONVERSION COMPLETE ===") + print(f"Files converted: {len(converted_files)}") + print(f"Total blocks: {sum(c for _, c in converted_files)}") + print("\nNext: Run convert_mermaid_to_svg.py to generate SVG files") + + +if __name__ == "__main__": + main() diff --git a/docs/conceptual/ck_tile/convolution_example.rst b/docs/conceptual/ck_tile/convolution_example.rst new file mode 100644 index 0000000000..a981ae04da --- /dev/null +++ b/docs/conceptual/ck_tile/convolution_example.rst @@ -0,0 +1,567 @@ +.. meta:: + :description: CK Tile convolution implementation example + :keywords: CK Tile, convolution, im2col, tensor descriptors, GPU optimization + +.. _ck_tile_convolution_example: + +***************************************** +Convolution Implementation with CK Tile +***************************************** + +Overview +======== + +This section covers how CK Tile's :ref:`tensor descriptor ` system enables efficient convolution implementations on GPUs. Convolution operations are fundamental in deep learning, and understanding their optimization reveals how high-performance libraries achieve their efficiency. This section progresses from a naive implementation to an optimized approach using tensor descriptors, showing how they enable efficient memory access patterns for GPU acceleration. + +The key insight is that convolution can be transformed from a complex nested loop operation into a highly parallel matrix multiplication through the image to column (im2col) transformation. CK Tile's tensor descriptors provide the perfect abstraction for implementing this transformation efficiently without data duplication. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Convolution Process" + I["Input Image
6×6"] + K["Kernel
3×3"] + SW["Sliding Window
Extract 3×3 patches"] + DP["Dot Product
Element-wise multiply & sum"] + O["Output
4×4"] + end + + subgraph "Im2col Optimization" + W["Windows Matrix
16×9
(all patches)"] + KF["Kernel Flattened
9×1"] + MM["Matrix Multiply
W @ K"] + OF["Output Flattened
16×1"] + end + + I --> SW + K --> DP + SW --> DP + DP --> O + + SW --> W + K --> KF + W --> MM + KF --> MM + MM --> OF + OF --> O + + style I fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style O fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style MM fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + + +.. image:: diagrams/convolution_example.svg + :alt: Diagram + :align: center + +.. image:: diagrams/convolution_example.svg + :alt: Diagram + :align: center + +Understanding Sliding Windows +============================= + +Before diving into convolution, it's crucial to understand how sliding windows work. In convolution, overlapping patches need to be extracted from the input image. Traditional approaches would copy these patches, but CK Tile uses :ref:`tensor descriptors ` to create efficient :ref:`views ` without data duplication. + +Simple Tiling Example +--------------------- + +Non-overlapping tiles: + +.. code-block:: cpp + + // Create a 6x6 matrix tiled into 2x2 blocks + template + struct SimpleTiling { + static constexpr index_t kMatrixSize = 6; + static constexpr index_t kTileSize = 2; + static constexpr index_t kNumTiles = kMatrixSize / kTileSize; + + // Original matrix: shape=(6, 6), strides=(6, 1) + // Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1) + // See :ref:`ck_tile_descriptors` for descriptor details + using TileDescriptor = TensorDescriptor< + Sequence, + Sequence<12, 2, 6, 1> + >; + + __device__ void demonstrate() { + // To move to next tile row: skip 2 matrix rows = 6 × 2 = 12 + // To move to next tile col: skip 2 matrix cols = 1 × 2 = 2 + // Within tile: use original strides (6, 1) + } + }; + +The key insight is understanding **strides**, the number of elements to skip to move to the next element in each dimension. For non-overlapping tiles, we skip by ``tile_size`` in the outer dimensions. + +Overlapping Windows for Convolution +------------------------------------ + +For convolution, overlapping windows that slide by one element are needed: + +.. code-block:: cpp + + // Extract 3x3 overlapping windows from a 6x6 image + template + struct ConvolutionWindows { + static constexpr index_t H = 6; // Image height + static constexpr index_t W = 6; // Image width + static constexpr index_t K = 3; // Kernel size + static constexpr index_t OutH = H - K + 1; // Output height = 4 + static constexpr index_t OutW = W - K + 1; // Output width = 4 + + // Windows descriptor: shape=(4, 4, 3, 3), strides=(6, 1, 6, 1) + using WindowDescriptor = TensorDescriptor< + Sequence, + Sequence // Key: stride by 1 for overlap! + >; + + __device__ DataType extract_window(const DataType* image, + index_t out_i, index_t out_j, + index_t k_i, index_t k_j) { + WindowDescriptor desc; + index_t offset = desc.calculate_offset({out_i, out_j, k_i, k_j}); + return image[offset]; + } + }; + +The stride pattern ``[W, 1, W, 1]`` creates sliding windows: + +- Moving one step in output row: jump ``W`` elements (one image row) +- Moving one step in output col: jump ``1`` element (one image column) +- Within each window: same strides to access the 3×3 patch + +Naive Convolution Implementation +================================ + +A straightforward implementation for reference: + +.. 
code-block:: cpp + + template + __global__ void naive_convolution_kernel( + const DataType* __restrict__ input, + const DataType* __restrict__ kernel, + DataType* __restrict__ output, + index_t H, index_t W, index_t K) + { + index_t out_h = H - K + 1; + index_t out_w = W - K + 1; + + // Each thread computes one output element + index_t out_i = blockIdx.y * blockDim.y + threadIdx.y; + index_t out_j = blockIdx.x * blockDim.x + threadIdx.x; + + if (out_i < out_h && out_j < out_w) { + DataType sum = 0; + + // Extract window and apply convolution + for (index_t ki = 0; ki < K; ++ki) { + for (index_t kj = 0; kj < K; ++kj) { + index_t in_i = out_i + ki; + index_t in_j = out_j + kj; + sum += input[in_i * W + in_j] * kernel[ki * K + kj]; + } + } + + output[out_i * out_w + out_j] = sum; + } + } + +This implementation directly follows the mathematical definition but has poor memory access patterns and limited parallelism within each output computation. + +Window Extraction with Tensor Descriptors +========================================= + +CK Tile's tensor descriptors provide an clean way to extract convolution windows: + +.. code-block:: cpp + + template + struct ConvolutionWindowExtractor { + static constexpr index_t OutH = H - K + 1; + static constexpr index_t OutW = W - K + 1; + + // Create tensor descriptor for all windows + using WindowsDescriptor = TensorDescriptor< + Sequence, + Sequence + >; + + __device__ void extract_all_windows( + const DataType* input, + DataType* windows_buffer) + { + WindowsDescriptor desc; + + // Extract all windows in parallel + index_t tid = threadIdx.x + blockIdx.x * blockDim.x; + index_t total_elements = OutH * OutW * K * K; + + for (index_t i = tid; i < total_elements; i += gridDim.x * blockDim.x) { + // Convert linear index to 4D coordinates + index_t tmp = i; + index_t kj = tmp % K; tmp /= K; + index_t ki = tmp % K; tmp /= K; + index_t out_j = tmp % OutW; tmp /= OutW; + index_t out_i = tmp; + + // Calculate source offset using descriptor + index_t src_offset = desc.calculate_offset({out_i, out_j, ki, kj}); + windows_buffer[i] = input[src_offset]; + } + } + }; + +The tensor descriptor automatically handles the complex indexing required for overlapping windows, making the code cleaner and less error-prone. + +Im2col Transformation +===================== + +The im2col transformation converts the 4D windows tensor into a 2D matrix suitable for matrix multiplication. This is where CK Tile's :ref:`transformation system ` shines: + +.. 
code-block:: cpp + + template + struct Im2colTransformer { + static constexpr index_t NumWindows = OutH * OutW; + static constexpr index_t PatchSize = K * K; + + // Step 1: Create 4D windows descriptor + using WindowsDescriptor = TensorDescriptor< + Sequence, + Sequence + >; + + // Step 2: Apply merge transforms to create 2D im2col layout + // See :ref:`ck_tile_transforms` for transform operations + using Im2colDescriptor = decltype( + transform_tensor_descriptor( + WindowsDescriptor{}, + make_tuple( + make_merge_transform(Sequence{}), // Merge spatial dims + make_merge_transform(Sequence{}) // Merge kernel dims + ), + Sequence<0, 1>{}, // Merge dimensions 0,1 + Sequence<2, 3>{} // Merge dimensions 2,3 + ) + ); + + __device__ void create_im2col_matrix( + const DataType* input, + DataType* im2col_matrix) + { + Im2colDescriptor desc; + + // Each thread handles multiple elements + index_t tid = threadIdx.x + blockIdx.x * blockDim.x; + index_t total_elements = NumWindows * PatchSize; + + for (index_t i = tid; i < total_elements; i += gridDim.x * blockDim.x) { + index_t window_idx = i / PatchSize; + index_t patch_idx = i % PatchSize; + + // Calculate source offset using merged descriptor + index_t src_offset = desc.calculate_offset({window_idx, patch_idx}); + im2col_matrix[i] = input[src_offset]; + } + } + }; + +The transformation pipeline: +1. Start with 4D tensor ``[OutH, OutW, K, K]`` +2. Merge spatial dimensions: ``[OutH, OutW] → NumWindows`` +3. Merge kernel dimensions: ``[K, K] → PatchSize`` +4. Result: 2D matrix ``[NumWindows, PatchSize]`` + +Optimized Convolution Kernel +============================ + +Combining all components into an optimized convolution implementation: + +.. code-block:: cpp + + template + __global__ void optimized_convolution_kernel( + const DataType* __restrict__ input, + const DataType* __restrict__ kernel, + DataType* __restrict__ output, + index_t H, index_t W, index_t K) + { + constexpr index_t WarpSize = 32; + const index_t OutH = H - K + 1; + const index_t OutW = W - K + 1; + const index_t NumWindows = OutH * OutW; + const index_t PatchSize = K * K; + + // Create im2col descriptor for this image size + using Im2colDesc = TensorDescriptor< + Sequence, + DynamicStrides // Computed based on H, W, K + >; + + // Tile distribution for matrix multiplication + // See :ref:`ck_tile_tile_distribution` for details + using ATileDist = TileDistribution< + Sequence, + Sequence + >; + using BTileDist = TileDistribution< + Sequence, + Sequence<1, BlockN> + >; + using CTileDist = TileDistribution< + Sequence, + Sequence + >; + + // Thread-local accumulator + // See :ref:`ck_tile_static_distributed_tensor` + StaticDistributedTensor c_accumulator; + + // Initialize accumulator + #pragma unroll + for (index_t i = 0; i < c_accumulator.size(); ++i) { + c_accumulator[i] = 0; + } + + // Main GEMM loop over K dimension + for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) { + // Create tile windows for im2col matrix and kernel + // See :ref:`ck_tile_tile_window` for window operations + auto a_window = make_tile_window( + input, Im2colDesc{H, W, K}, + {blockIdx.y * TileM, k_tile} + ); + + auto b_window = make_tile_window( + kernel, TensorDescriptor>{}, + {k_tile, 0} + ); + + // Load tiles - see :ref:`ck_tile_load_store_traits` for optimization + auto a_tile = a_window.load(); + auto b_tile = b_window.load(); + + // Synchronize after loads + __syncthreads(); + + // Local matrix multiplication + #pragma unroll + for (index_t m = 0; m < TileM/BlockM; ++m) { + #pragma unroll + 
for (index_t n = 0; n < TileN/BlockN; ++n) { + #pragma unroll + for (index_t k = 0; k < TileK; ++k) { + c_accumulator.at(m, n) += + a_tile.at(m, k) * b_tile.at(k, n); + } + } + } + } + + // Store results back to global memory + auto c_window = make_tile_window( + output, TensorDescriptor>{OutW, 1}, + {blockIdx.y * TileM, blockIdx.x * TileN} + ); + c_window.store(c_accumulator); + } + +Multi-Channel Convolution +========================= + +Real-world convolutions involve multiple input and output channels. CK Tile handles this cleanly: + +.. code-block:: cpp + + template + struct MultiChannelConvolution { + static constexpr index_t OutH = H - K + 1; + static constexpr index_t OutW = W - K + 1; + static constexpr index_t NumWindows = OutH * OutW; + static constexpr index_t PatchSize = K * K * CIn; + + // 5D windows descriptor [OutH, OutW, K, K, CIn] + using Windows5D = TensorDescriptor< + Sequence, + Sequence + >; + + // Im2col: [NumWindows, PatchSize] + using Im2colDesc = decltype( + transform_tensor_descriptor( + Windows5D{}, + make_tuple( + make_merge_transform(Sequence{}), + make_merge_transform(Sequence{}) + ), + Sequence<0, 1>{}, + Sequence<2, 3, 4>{} + ) + ); + + // Filter layout: [K*K*CIn, COut] + using FilterDesc = TensorDescriptor< + Sequence, + Sequence + >; + + __device__ void compute( + const DataType* input, // [H, W, CIn] + const DataType* filters, // [K, K, CIn, COut] + DataType* output) // [OutH, OutW, COut] + { + // The convolution becomes a matrix multiplication: + // [NumWindows, PatchSize] @ [PatchSize, COut] = [NumWindows, COut] + // Then reshape to [OutH, OutW, COut] + } + }; + +The multi-channel extension naturally follows from the single-channel case: + +- Input: ``[H, W, CIn]`` +- Filters: ``[K, K, CIn, COut]`` +- Im2col matrix: ``[NumWindows, K×K×CIn]`` +- Output: ``[OutH, OutW, COut]`` + +Performance Optimizations +========================= + +CK Tile enables several optimizations for convolution: + +**1. Memory Coalescing** + +.. code-block:: cpp + + // Coalesced access pattern for im2col + template + __device__ void load_im2col_vectorized( + const float* input, + float* im2col_tile, + const Im2colDescriptor& desc) + { + using VectorType = vector_type_t; + + // Load multiple elements per thread + index_t tid = threadIdx.x; + index_t stride = blockDim.x; + + for (index_t i = tid; i < NumElements; i += stride * VectorSize) { + VectorType vec = *reinterpret_cast(&input[i]); + *reinterpret_cast(&im2col_tile[i]) = vec; + } + } + +**2. Shared Memory Tiling** + +.. code-block:: cpp + + // Use shared memory for frequently accessed data + __shared__ float smem_a[TileM][TileK]; + __shared__ float smem_b[TileK][TileN]; + + // Collaborative loading with proper bank conflict avoidance + // See :ref:`ck_tile_lds_bank_conflicts` for optimization + auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) { + #pragma unroll + for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) { + #pragma unroll + for (index_t j = threadIdx.x; j < TileK; j += blockDim.x) { + smem[i][j] = window.at(i, j); + } + } + }; + +**3. Register Blocking** + +.. 
code-block:: cpp + + // Each thread computes multiple output elements + template + struct RegisterBlock { + float c_reg[RegM][RegN]; + + __device__ void compute(const float* a_smem, const float* b_smem) { + #pragma unroll + for (index_t k = 0; k < TileK; ++k) { + #pragma unroll + for (index_t m = 0; m < RegM; ++m) { + #pragma unroll + for (index_t n = 0; n < RegN; ++n) { + c_reg[m][n] += a_smem[m] * b_smem[n]; + } + } + } + } + }; + +Performance Characteristics +=========================== + +The tensor descriptor approach provides optimal performance characteristics: + +.. list-table:: Method Comparison + :header-rows: 1 + :widths: 25 20 20 20 15 + + * - Method + - Memory Usage + - Parallelization + - GPU Efficiency + - Flexibility + * - Naive loops + - Low + - Poor + - Poor + - High + * - Direct im2col copy + - High + - Excellent + - Good + - Medium + * - Tensor descriptors + - Medium + - Excellent + - Excellent + - High + * - CK Tile optimized + - Low + - Excellent + - Excellent + - High + +Key advantages of the CK Tile approach: + +1. **Zero-copy views**: Tensor descriptors create logical views without data duplication +2. **Compile-time optimization**: All indexing calculations resolve at compile time +3. **Hardware-aware**: Automatic alignment and vectorization based on :ref:`architecture ` +4. **Composability**: Complex access patterns built from simple :ref:`transformations ` +5. **Performance portability**: Same code optimizes differently for different GPUs + +Summary +======= + +This example demonstrates how CK Tile transforms convolution from a memory-bound operation with poor parallelism into a compute-bound operation that utilizes GPU resources. The key insights are: + +- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides +- **Im2col transformation** converts convolution to matrix multiplication without data copies +- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`) +- **Multi-channel support** extends naturally through higher-dimensional descriptors +- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques) + +The tensor descriptor system provides a unified framework for these transformations, enabling automatic generation of efficient kernels for various convolution configurations and hardware architectures. This approach forms the foundation for production deep learning frameworks' convolution implementations. diff --git a/docs/conceptual/ck_tile/coordinate_movement.rst b/docs/conceptual/ck_tile/coordinate_movement.rst new file mode 100644 index 0000000000..73633afa88 --- /dev/null +++ b/docs/conceptual/ck_tile/coordinate_movement.rst @@ -0,0 +1,532 @@ +.. meta:: + :description: CK Tile advanced coordinate operations documentation + :keywords: CK Tile, coordinate movement, tensor coordinates, GPU programming + +.. _ck_tile_coordinate_movement: + +**************************** +Advanced Coordinate Movement +**************************** + +Overview +======== + +Advanced coordinate operations form the bridge between mathematical transformations and practical tensor manipulation in CK Tile. These operations enable efficient navigation through complex tensor layouts without recalculating entire transformation chains. Understanding coordinate movement is essential for implementing high-performance GPU kernels that traverse multi-dimensional data structures. 
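
As a quick preview, the sketch below performs the same walk shown in the movement example diagram later in this overview: a coordinate is created once and then stepped across a 4×4 row-major tensor, with each move updating the cached offset incrementally rather than re-evaluating the whole transformation chain. This is a minimal sketch built from the helpers introduced in the rest of this section; signatures are simplified for illustration.

.. code-block:: cpp

    using namespace ck_tile;

    // 4x4 row-major tensor: strides (4, 1), so offset = row * 4 + col
    auto desc = make_naive_tensor_descriptor_packed(make_tuple(4, 4));

    // Start at [1, 1] -> cached offset 5
    auto coord = make_tensor_coordinate(desc, make_multi_index(1, 1));

    // Step one column: [1, 2] -> offset 6
    move_tensor_coordinate(desc, coord, make_multi_index(0, 1));

    // Step one row: [2, 2] -> offset 10
    move_tensor_coordinate(desc, coord, make_multi_index(1, 0));

    // Diagonal step: [3, 3] -> offset 15
    move_tensor_coordinate(desc, coord, make_multi_index(1, 1));

    // Only each step's contribution is added to the cached offset
    const index_t offset = coord.get_offset(); // 15
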
+ +The coordinate movement system provides two key abstractions: TensorCoordinate for descriptor-aware navigation and TensorAdaptorCoordinate for tracking positions through transformation chains. Together with movement functions, they enable advanced access patterns while maintaining optimal performance through incremental updates rather than full recalculation. + +For the mathematical foundations of coordinate systems, see :ref:`ck_tile_coordinate_systems`. For simpler coordinate concepts, see :ref:`ck_tile_tensor_coordinates`. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Coordinate Movement System" + TC["TensorCoordinate
Position + Descriptor Context"] + TAC["TensorAdaptorCoordinate
Position + Transform Context"] + MC["move_coordinate()
Efficient Navigation"] + end + + subgraph "Movement Example" + S["Start: [1,1]
Offset: 5"] + M1["Move [0,1]
→ [1,2]
Offset: 6"] + M2["Move [1,0]
→ [2,2]
Offset: 10"] + M3["Move [1,1]
→ [3,3]
Offset: 15"] + end + + TC --> MC + TAC --> MC + + S --> M1 + M1 --> M2 + M2 --> M3 + + style TC fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style TAC fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style MC fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_movement.svg + :alt: Diagram + :align: center + +.. image:: diagrams/coordinate_movement.svg + :alt: Diagram + :align: center + +TensorCoordinate: Descriptor-Aware Navigation +============================================= + +TensorCoordinate combines a multi-dimensional position with descriptor context to provide efficient offset calculation and validation. It caches transformation results to avoid redundant computations during navigation. This builds on the :ref:`ck_tile_descriptors` concepts for tensor specifications. + +Basic Structure +--------------- + +.. code-block:: cpp + + template + class TensorCoordinate { + private: + MultiIndex top_index_; // Position in top dimensions + MultiIndex hidden_index_; // Cached transformation results + index_t offset_; // Cached linear offset + + public: + // Create coordinate from descriptor and position + __host__ __device__ TensorCoordinate( + const TensorDescriptor& desc, + const MultiIndex& top_index) + { + top_index_ = top_index; + // Apply descriptor transforms to compute hidden indices + hidden_index_ = desc.calculate_bottom_index(top_index); + offset_ = desc.calculate_offset(top_index); + } + + // Access methods + __host__ __device__ const MultiIndex& get_index() const { + return top_index_; + } + + __host__ __device__ index_t get_offset() const { + return offset_; + } + + __host__ __device__ index_t ndim_hidden() const { + return hidden_index_.size(); + } + }; + +Creating and Using TensorCoordinate +----------------------------------- + +.. code-block:: cpp + + // Example: Navigate a 4x3 matrix with custom strides + template + __device__ void demonstrate_tensor_coordinate() { + // Create descriptor for 4x3 matrix, row-major layout + using Desc = TensorDescriptor< + Sequence<4, 3>, // Shape + Sequence<3, 1> // Strides + >; + Desc desc; + + // Create coordinate at position [2, 1] + auto coord = make_tensor_coordinate(desc, make_multi_index(2, 1)); + + // Access coordinate information + auto position = coord.get_index(); // [2, 1] + auto offset = coord.get_offset(); // 2*3 + 1 = 7 + auto hidden_dims = coord.ndim_hidden(); // 0 (no hidden dims) + + // Use offset for memory access + DataType* tensor_data = ...; + DataType value = tensor_data[offset]; + } + +Key Benefits +------------ + +1. **Context Preservation**: The coordinate maintains descriptor context for validation +2. **Cached Calculations**: Transformation results are cached for efficiency +3. **Type Safety**: Compile-time checking ensures coordinate-descriptor compatibility +4. **Zero Overhead**: All operations resolve at compile time when possible + + +TensorAdaptorCoordinate: Transform-Aware Tracking +================================================== + +TensorAdaptorCoordinate extends the concept to track coordinates through transformation chains, maintaining both input (top) and output (bottom) positions. This leverages :ref:`ck_tile_adaptors` and :ref:`ck_tile_transforms` for complex coordinate mappings. + +Structure and Implementation +---------------------------- + +.. 
code-block:: cpp + + template + class TensorAdaptorCoordinate { + private: + MultiIndex top_index_; // Input position + MultiIndex bottom_index_; // Output after transformations + MultiIndex hidden_index_; // Intermediate results + + public: + // Create from adaptor and position + __host__ __device__ TensorAdaptorCoordinate( + const TensorAdaptor& adaptor, + const MultiIndex& top_index) + { + top_index_ = top_index; + // Apply adaptor transforms + bottom_index_ = adaptor.calculate_bottom_index(top_index); + // Cache intermediate results + hidden_index_ = adaptor.get_hidden_index(top_index); + } + + // Access transformed coordinates + __host__ __device__ const MultiIndex& get_top_index() const { + return top_index_; + } + + __host__ __device__ const MultiIndex& get_bottom_index() const { + return bottom_index_; + } + }; + +Tracking Through Transformations +-------------------------------- + +.. code-block:: cpp + + // Example: Track coordinates through transpose + template + __device__ void demonstrate_adaptor_coordinate() { + // Create transpose adaptor (swap dimensions) + auto adaptor = make_transpose_adaptor<2>(Sequence<1, 0>{}); + + // Create coordinate at [2, 3] + auto coord = make_tensor_adaptor_coordinate( + adaptor, + make_multi_index(2, 3) + ); + + // Track transformation + auto input_pos = coord.get_top_index(); // [2, 3] + auto output_pos = coord.get_bottom_index(); // [3, 2] (swapped) + + // Use for complex access patterns + DataType* src_data = ...; + DataType* dst_data = ...; + + // Read from transposed position + index_t src_offset = calculate_offset(output_pos); + DataType value = src_data[src_offset]; + } + +Efficient Coordinate Movement +============================= + +The ``move_tensor_coordinate`` function provides efficient navigation by updating coordinates incrementally rather than recreating them. + +Basic Movement Operations +------------------------- + +.. code-block:: cpp + + // Move tensor coordinate through descriptor + template + __host__ __device__ void move_tensor_coordinate( + const TensorDescriptor& desc, + TensorCoordinate& coord, + const MultiIndex& step) + { + // Update top index + coord.top_index_ += step; + + // Incrementally update cached values + // Only recalculate affected transformations + if (transformation_affects_movement(desc, step)) { + coord.hidden_index_ = desc.calculate_bottom_index(coord.top_index_); + coord.offset_ = desc.calculate_offset(coord.top_index_); + } else { + // Fast path: simple offset update + coord.offset_ += calculate_step_offset(desc, step); + } + } + +Practical Movement Patterns +--------------------------- + +.. 
code-block:: cpp + + // Example: Efficient matrix traversal + template + __global__ void matrix_traversal_kernel( + const DataType* input, + DataType* output, + index_t rows, index_t cols) + { + // Create descriptor for matrix + using Desc = TensorDescriptor; + Desc desc(make_tuple(rows, cols), make_tuple(cols, 1)); + + // Start at thread's assigned position + index_t start_row = blockIdx.y * blockDim.y + threadIdx.y; + index_t start_col = blockIdx.x * blockDim.x + threadIdx.x; + + auto coord = make_tensor_coordinate( + desc, + make_multi_index(start_row, start_col) + ); + + // Row-wise traversal pattern + for (index_t i = 0; i < 4; ++i) { + if (coord.get_index()[0] < rows) { + // Process current position + output[coord.get_offset()] = + process_value(input[coord.get_offset()]); + + // Move to next column + move_tensor_coordinate(desc, coord, make_multi_index(0, 1)); + + // Wrap to next row if needed + if (coord.get_index()[1] >= cols) { + move_tensor_coordinate( + desc, coord, + make_multi_index(1, -cols) + ); + } + } + } + } + +Movement Through Adaptors +------------------------- + +.. code-block:: cpp + + // Move through adaptor transformations + template + __host__ __device__ MultiIndex move_tensor_adaptor_coordinate( + const TensorAdaptor& adaptor, + TensorAdaptorCoordinate& coord, + const MultiIndex& step) + { + // Update top index + MultiIndex old_top = coord.top_index_; + coord.top_index_ += step; + + // Calculate new bottom index + MultiIndex old_bottom = coord.bottom_index_; + coord.bottom_index_ = adaptor.calculate_bottom_index(coord.top_index_); + + // Return the change in bottom coordinates + return coord.bottom_index_ - old_bottom; + } + +Advanced Movement Patterns +========================== + +Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts: + +Tiled Access Pattern +-------------------- + +.. code-block:: cpp + + template + __device__ void tiled_movement_pattern( + const float* input, + float* output, + index_t M, index_t N) + { + // Descriptor for full matrix + using MatrixDesc = TensorDescriptor< + DynamicSequence, + DynamicSequence + >; + MatrixDesc desc(make_tuple(M, N), make_tuple(N, 1)); + + // Start at tile corner + index_t tile_row = blockIdx.y * TileM; + index_t tile_col = blockIdx.x * TileN; + + auto coord = make_tensor_coordinate( + desc, + make_multi_index(tile_row, tile_col) + ); + + // Process tile with efficient movement + #pragma unroll + for (index_t i = 0; i < TileM; ++i) { + #pragma unroll + for (index_t j = 0; j < TileN; ++j) { + if (i == 0 && j == 0) { + // First element - already positioned + } else if (j == 0) { + // New row - move down and back to start column + move_tensor_coordinate( + desc, coord, + make_multi_index(1, -(TileN-1)) + ); + } else { + // Same row - move right + move_tensor_coordinate( + desc, coord, + make_multi_index(0, 1) + ); + } + + // Process element + output[coord.get_offset()] = + compute_value(input[coord.get_offset()]); + } + } + } + +Space-Filling Curve Movement +---------------------------- + +For more details on space-filling curves and their benefits, see :ref:`ck_tile_space_filling_curve`. + +.. 
code-block:: cpp + + // Snake pattern for optimal cache usage + template + __device__ void snake_pattern_movement( + const float* input, + float* output, + index_t M, index_t N) + { + using Desc = TensorDescriptor; + Desc desc(make_tuple(M, N), make_tuple(N, 1)); + + auto coord = make_tensor_coordinate( + desc, + make_multi_index(threadIdx.y, threadIdx.x) + ); + + // Snake through block + for (index_t row = 0; row < BlockSize; ++row) { + for (index_t col = 0; col < BlockSize; ++col) { + // Process current position + process_element(input, output, coord.get_offset()); + + // Snake movement pattern + if (row % 2 == 0) { + // Even rows: move right + if (col < BlockSize - 1) { + move_tensor_coordinate( + desc, coord, make_multi_index(0, 1) + ); + } + } else { + // Odd rows: move left + if (col < BlockSize - 1) { + move_tensor_coordinate( + desc, coord, make_multi_index(0, -1) + ); + } + } + } + + // Move to next row + if (row < BlockSize - 1) { + move_tensor_coordinate( + desc, coord, make_multi_index(1, 0) + ); + } + } + } + +Performance Considerations +=================================== + +Efficient coordinate movement is critical for GPU performance. See :ref:`ck_tile_gpu_basics` for hardware details. + +**1. Incremental Updates** + +.. code-block:: cpp + + // Inefficient: recreate coordinate + for (index_t i = 0; i < N; ++i) { + auto coord = make_tensor_coordinate(desc, make_multi_index(i, j)); + process(data[coord.get_offset()]); + } + + // Efficient: incremental movement + auto coord = make_tensor_coordinate(desc, make_multi_index(0, j)); + for (index_t i = 0; i < N; ++i) { + process(data[coord.get_offset()]); + move_tensor_coordinate(desc, coord, make_multi_index(1, 0)); + } + +**2. Movement Caching** + +.. code-block:: cpp + + // Cache frequently used movements + template + struct MovementCache { + MultiIndex row_step = make_multi_index(1, 0); + MultiIndex col_step = make_multi_index(0, 1); + MultiIndex diag_step = make_multi_index(1, 1); + + __device__ void move_row(auto& coord) { + move_tensor_coordinate(Desc{}, coord, row_step); + } + }; + +**3. Vectorized Movement** + +.. code-block:: cpp + + // Move multiple coordinates simultaneously + template + __device__ void vectorized_movement( + TensorCoordinate coords[NumCoords], + const MultiIndex& step) + { + #pragma unroll + for (index_t i = 0; i < NumCoords; ++i) { + move_tensor_coordinate(Desc{}, coords[i], step); + } + } + +Integration with CK Tile Components +=================================== + +Coordinate movement integrates seamlessly with other CK Tile components: + +.. 
code-block:: cpp + + // Example: Tile window with coordinate movement + template + __device__ void process_tile_with_movement( + TileWindow& window, + index_t tile_size) + { + // Create coordinate for tile traversal + auto coord = window.get_tile_coordinate(); + + // Process tile elements with movement + for (index_t i = 0; i < tile_size; ++i) { + for (index_t j = 0; j < tile_size; ++j) { + // Load using coordinate + auto value = window.load_at(coord); + + // Process value + auto result = compute(value); + + // Store result + window.store_at(coord, result); + + // Move to next element + window.move_coordinate(coord, {0, 1}); + } + // Move to next row + window.move_coordinate(coord, {1, -tile_size}); + } + } + + +Advanced coordinate operations provide the foundation for efficient tensor navigation in CK Tile: + +- **TensorCoordinate**: Combines position with descriptor context for validated navigation +- **TensorAdaptorCoordinate**: Tracks coordinates through transformation chains +- **move_tensor_coordinate**: Enables efficient incremental updates without recalculation +- **Movement Patterns**: Support advanced access patterns like tiling and space-filling curves +- **Performance**: Incremental updates are orders of magnitude faster than coordinate recreation +- **Integration**: Seamlessly works with tile windows, distributions, and other CK Tile components + +These operations are essential for implementing high-performance GPU kernels that can navigate complex tensor layouts efficiently. By understanding and utilizing coordinate movement, kernels can be created that achieve optimal memory access patterns while maintaining code clarity and correctness. diff --git a/docs/conceptual/ck_tile/coordinate_systems.rst b/docs/conceptual/ck_tile/coordinate_systems.rst new file mode 100644 index 0000000000..13a9619010 --- /dev/null +++ b/docs/conceptual/ck_tile/coordinate_systems.rst @@ -0,0 +1,612 @@ +.. _ck_tile_coordinate_systems: + +Coordinate Systems - The Mathematical Foundation +================================================ + +Overview +-------- + +At the heart of the Composable Kernel framework lies a mathematical foundation based on coordinate transformations. This foundation enables the automatic generation of optimal memory access patterns while maintaining a clear separation between algorithmic intent and hardware implementation details. The coordinate system framework transforms the task of GPU work distribution into a series of well-defined mathematical transformations. + +These coordinate systems provide the mathematical machinery that maps abstract thread identities to concrete memory addresses, ensuring that every memory access is optimized for the underlying hardware. This systematic approach eliminates the error-prone manual calculations that plague traditional GPU programming while enabling optimizations that would be impractical to implement by hand. + +The Five Coordinate Spaces +-------------------------- + +The CK framework employs five interconnected coordinate spaces, each serving a specific purpose in the journey from thread identification to memory access. These spaces work together to solve the fundamental challenge of GPU programming: efficiently distributing work across thousands of parallel threads while maintaining optimal memory access patterns. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. 
mermaid:: + + graph TB + subgraph "Coordinate Spaces Overview" + P["P-space
Thread Identification
Which thread am I?"] + Y["Y-space
Logical Tile
Which element in my tile?"] + X["X-space
Physical Tensor
Where in the tensor?"] + R["R-space
Replication
Data sharing pattern"] + D["D-space
Linear Storage
Memory address"] + end + + subgraph "Transformations" + T1["P + Y → X
Thread + Element → Position"] + T2["X → D
Position → Address"] + end + + P --> T1 + Y --> T1 + T1 --> X + X --> T2 + T2 --> D + + R -.-> P + R -.-> Y + + style P fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style Y fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style R fill:#fce4ec,stroke:#c2185b,stroke-width:2px + style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_systems_1.svg + :alt: Diagram + :align: center + +The Challenge and Solution +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Consider a fundamental scenario: an 8×8 matrix and 4 GPU threads. Each thread needs to answer several critical questions: + +1. **Which thread am I?** (Thread identification) +2. **What work should I do?** (Work assignment) +3. **Where is my data in the tensor?** (Physical location) +4. **How do I share data with other threads?** (Cooperation) +5. **What's the memory address?** (Hardware access) + +The coordinate system framework provides a systematic solution through five specialized spaces that transform from logical concepts to physical reality. Each space captures a different aspect of the computation, and the transformations between them encode the distribution strategy. + +Thread Identification +------------------------------ + +Partition Space (P-space) represents the foundation of the coordinate system hierarchy. This space captures the identity of each processing element within the GPU's execution model, providing a structured way to identify threads across the complex hierarchy of warps, blocks, and grids. + +GPU Thread Hierarchy +~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "GPU Thread Hierarchy" + subgraph "Block" + subgraph "Warp 0" + T0["Thread 0
P=[0,0]"] + T1["Thread 1
P=[0,1]"] + T2["Thread 2
P=[0,2]"] + T31["..."] + T3["Thread 31
P=[0,31]"] + end + subgraph "Warp 1" + T32["Thread 32
P=[1,0]"] + T33["Thread 33
P=[1,1]"] + T34["..."] + T63["Thread 63
P=[1,31]"] + end + W2["Warp 2..."] + W7["Warp 7"] + end + end + + subgraph "P-space Mapping" + PM["P-coordinates = [warp_id, lane_id]
or
P-coordinates = [block_x, block_y, thread_x, thread_y]"] + end + + T0 --> PM + T32 --> PM + + style T0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style T32 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_systems_2.svg + :alt: Diagram + :align: center + +The structure of P-space directly reflects the :ref:`hardware organization ` of GPUs. Each thread receives a unique P-coordinate that encodes its position within the execution hierarchy. For simple distributions, P-space might be one-dimensional, containing only a thread ID. For complex hierarchical distributions, P-space can have multiple dimensions representing different levels of the GPU's thread organization. + +C++ Implementation +~~~~~~~~~~~~~~~~~~ + +**File**: ``include/ck_tile/core/container/multi_index.hpp`` + +.. code-block:: cpp + + #include + #include + + template + __device__ void example_p_space_calculation() + { + // Get P-coordinates from hardware thread IDs + const index_t thread_id = get_thread_local_1d_id(); + const index_t warp_id = get_warp_local_1d_id(); + const index_t lane_id = get_lane_id(); + + // Convert to multi-dimensional P-coordinates + auto p_coord_2d = make_multi_index(warp_id, lane_id); + + // Using tile distribution (preferred method) + constexpr auto tile_distribution = TileDistribution{}; + const auto p_coord = tile_distribution.calculate_p_coord(); + + // P-coordinates determine: + // 1. Work distribution - which data this thread processes + // 2. Memory coalescing - ensuring optimal access patterns + // 3. Thread cooperation - coordinating shared memory usage + } + +The P-space abstraction enables CK to handle different GPU architectures transparently. Whether running on GPUs with 32-thread warps or 64-thread wavefronts, the P-space coordinates provide a consistent interface while the underlying implementation adapts to the hardware. + +Logical Work Organization +---------------------------------- + +Yield Space (Y-space) represents the logical organization of work within each thread's assigned tile. While P-space identifies which thread is executing, Y-space defines what that thread does with its assigned work. This abstraction enables the expression of complex access patterns in a hardware-independent manner. + +Work Assignment Structure +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Thread's Tile (2x2 elements)" + Y00["Y=[0,0]
Element 0"] + Y01["Y=[0,1]
Element 1"] + Y10["Y=[1,0]
Element 2"] + Y11["Y=[1,1]
Element 3"] + end + + subgraph "Y-space Structure" + YS["Each thread processes
the same Y-space pattern
but at different X locations"] + end + + subgraph "Example: 4 Threads" + T0["Thread 0
P=[0,0]"] + T1["Thread 1
P=[0,1]"] + T2["Thread 2
P=[1,0]"] + T3["Thread 3
P=[1,1]"] + end + + Y00 --> YS + Y01 --> YS + Y10 --> YS + Y11 --> YS + + T0 --> YS + T1 --> YS + T2 --> YS + T3 --> YS + + style Y00 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style Y01 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style Y10 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style Y11 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_systems_3.svg + :alt: Diagram + :align: center + +The power of Y-space lies in its ability to express different iteration patterns without changing the underlying distribution logic. A thread might traverse its Y-space in row-major order for one algorithm, column-major for another, or even use :ref:`space-filling curves ` for optimal cache utilization. This flexibility enables algorithm-specific optimizations while maintaining a consistent framework. + +Hierarchical Y-Space +~~~~~~~~~~~~~~~~~~~~ + +For complex kernels, Y-space can have a hierarchical structure that mirrors the hierarchical nature of GPU architectures: + +.. code-block:: cpp + + // Hierarchical Y-space for complex kernels + template + __device__ void example_hierarchical_y_space() + { + constexpr auto tile_distribution = TileDistribution{}; + + // 4D Y-space: [repeat, warp, thread, vector] + constexpr auto y_hierarchical = make_tuple( + number<4>{}, // Repeat dimension + number<2>{}, // Warp dimension + number<8>{}, // Thread dimension + number<4>{} // Vector dimension + ); + + // Each dimension serves different purpose: + // - Repeat: Algorithm repetition (e.g., attention heads) + // - Warp: Inter-warp cooperation patterns + // - Thread: Per-thread work items + // - Vector: SIMD vectorization + + // Sweep through Y-space with compile-time unrolling + sweep_tile(distributed_tensor, [&](auto y_coord) { + // y_coord is compile-time multi_index + // All iterations unrolled at compile time + auto value = distributed_tensor(y_coord); + // Process value... + }); + } + +Physical Tensor Coordinates +------------------------------------ + +X-space represents the ground truth of data organization: the actual coordinates within the global tensor. This space directly corresponds to how users conceptualize their data: row and column indices for matrices, spatial coordinates for images, or multi-dimensional indices for general tensors. + +Memory Layout Mapping +~~~~~~~~~~~~~~~~~~~~~ + +The relationship between X-space and physical memory involves considerations of data layout, padding, and alignment: + +.. code-block:: cpp + + template + __device__ void example_x_space_operations() + { + constexpr auto tensor_desc = TensorDescriptor{}; + + // X-space properties + constexpr auto x_lengths = tensor_desc.get_lengths(); + constexpr auto x_strides = tensor_desc.get_strides(); + + // Direct X-coordinate specification + constexpr auto x_coord = make_multi_index(number<3>{}, number<4>{}); + + // Convert to linear offset + constexpr auto linear_offset = tensor_desc.calculate_offset(x_coord); + + // X-coordinates from P+Y transformation + const auto x_from_py = tile_dist.calculate_index(p_coord, y_coord); + + // Bounds checking + const bool valid = is_valid_x_coord(x_coord, x_lengths); + } + +The Core Transformation: P + Y → X +---------------------------------- + +The transformation from P and Y coordinates to X coordinates represents the heart of tile distribution. This transformation encodes the entire distribution strategy, determining how logical thread work maps to physical tensor locations. + +Transformation Pipeline +~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
+ Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Input" + P["P-coordinates
Thread identity
P=[1,0]"] + Y["Y-coordinates
Element in tile
Y=[0,1]"] + end + + subgraph "Transformation" + T["P + Y → X
Base position + Offset"] + end + + subgraph "Output" + X["X-coordinates
Tensor position
X=[2,1]"] + end + + subgraph "Example" + E["Thread P=[1,0] at base (2,0)
Element Y=[0,1] adds offset (0,1)
Result X=[2,1] in tensor"] + end + + P --> T + Y --> T + T --> X + X --> E + + style P fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style Y fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_systems_4.svg + :alt: Diagram + :align: center + +Mathematical Foundation +~~~~~~~~~~~~~~~~~~~~~~~ + +The P+Y→X transformation can be expressed mathematically as a composition of functions: + +.. math:: + + X = f(P, Y) = BasePosition(P) + LocalOffset(Y) + +Where: +- BasePosition(P) determines where in the tensor this thread's tile begins +- LocalOffset(Y) specifies the offset within the tile + +This transformation is highly configurable through the distribution encoding, enabling different strategies for different algorithms while maintaining the same mathematical framework. + +Replication and Cooperation +------------------------------------ + +Replication Space (R-space) introduces a mechanism for expressing data sharing and cooperation patterns between threads. Unlike the other coordinate spaces which map to unique data elements, R-space enables multiple processing elements to work on the same data, facilitating communication and reduction operations. + +Replication Patterns +~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + template + __device__ void example_r_space_operations() + { + constexpr auto tile_distribution = TileDistribution{}; + constexpr auto r_lengths = tile_distribution.get_r_lengths(); + + // Broadcasting with R-space + template + __device__ auto broadcast_across_r_space(DataType value) + { + const auto r_coord = tile_distribution.calculate_r_coord(); + __shared__ DataType shared_value; + + if (r_coord == make_multi_index(0, 0)) { + shared_value = value; // Source thread + } + __syncthreads(); + + return shared_value; // All threads get the value + } + + // Reduction across R-space + template + __device__ auto reduce_across_r_space(DataType local_value) + { + // Use hardware-accelerated reduction + return block_reduce_sum(local_value); + } + } + +R-space enables cooperation patterns that would be difficult to express otherwise. By providing a systematic way to identify which threads share data, it enables automatic generation of communication patterns. + +Memory Linearization +----------------------------- + +D-space represents the final transformation in the coordinate pipeline: converting multi-dimensional coordinates to linear memory addresses. This transformation incorporates all the low-level details of memory layout, including stride patterns, padding, and alignment requirements. + +Linearization Strategies +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "X-coordinates" + X["X = [2, 3]
2D Position"] + end + + subgraph "Layout Options" + RM["Row-Major
D = 2×width + 3"] + CM["Column-Major
D = 3×height + 2"] + BL["Blocked
Complex pattern"] + end + + subgraph "D-coordinate" + D["D = 11
Linear Address"] + end + + X --> RM + X --> CM + X --> BL + RM --> D + + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px + + + + + + +.. image:: diagrams/coordinate_systems_5.svg + :alt: Diagram + :align: center + +The linearization process must consider multiple factors: + +.. code-block:: cpp + + template + __device__ void example_d_space_linearization() + { + // Standard linearization + template + __device__ constexpr auto calculate_linear_offset(const XCoord& x_coord) + { + index_t offset = 0; + static_for<0, ndim, 1>{}([&](auto dim) { + offset += x_coord.at(dim) * strides.at(dim); + }); + return offset; + } + + // Specialized patterns for optimization + // Row-major: offset = x0 * N + x1 + // Column-major: offset = x1 * M + x0 + // Blocked: Complex pattern for cache efficiency + } + +Complete Pipeline Example +------------------------- + +The following is a complete example showing how all coordinate spaces work together: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Step 1: Thread Identification" + TID["Thread ID = 5"] + P["P-coordinates
P = [0, 5]
(warp 0, lane 5)"] + end + + subgraph "Step 2: Work Assignment" + Y["Y-coordinates
Y = [1, 0]
(element in tile)"] + end + + subgraph "Step 3: P+Y Transformation" + TRANS["P + Y → X
Thread position + Element offset"] + X["X-coordinates
X = [1, 5]
(tensor position)"] + end + + subgraph "Step 4: Linearization" + LIN["X → D
Row-major: D = x₀ × width + x₁"] + D["D-coordinate
D = 13
(memory address)"] + end + + subgraph "Step 5: Memory Access" + MEM["Hardware accesses
memory[13]"] + end + + TID --> P + P --> TRANS + Y --> TRANS + TRANS --> X + X --> LIN + LIN --> D + D --> MEM + + style P fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style Y fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:3px + style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px + style MEM fill:#ffebee,stroke:#c62828,stroke-width:3px + + + + +.. image:: diagrams/coordinate_systems_6.svg + :alt: Diagram + :align: center + +Real-World Example: Matrix Multiplication +----------------------------------------- + +:ref:`matrix multiplication ` demonstrates how coordinate systems work in practice/ + +.. code-block:: cpp + + template + __global__ void gemm_kernel_with_coordinates( + const AType* a_ptr, const BType* b_ptr, CType* c_ptr, + index_t M, index_t N, index_t K) + { + // Define distribution encoding + using Encoding = tile_distribution_encoding< + sequence<>, // R: no replication + tuple, // H for M dimension + sequence<4, 2, 8, 4>>, // H for N dimension + tuple, sequence<1, 2>>, // P mappings + tuple, sequence<2, 2>>, // P minor + sequence<1, 1, 2, 2>, // Y major + sequence<0, 3, 0, 3> // Y minor + >; + + constexpr auto distribution = make_static_tile_distribution(Encoding{}); + + // Step 1: Get P-coordinates (thread identity) + const auto p_coord = distribution.calculate_p_coord(); + + // Step 2: Iterate through Y-space (work assignment) + sweep_tile(c_tile, [&](auto y_coord) { + // Step 3: P+Y→X transformation + const auto x_coord = distribution.calculate_index(p_coord, y_coord); + + // Step 4: X→D transformation (handled by tensor view) + // Step 5: Actual computation at these coordinates + c_tile(y_coord) = compute_element(x_coord); + }); + } + +Performance Implications +------------------------ + +The coordinate system framework enables several critical optimizations: + +**Memory Coalescing**: By carefully structuring the P+Y→X transformation, consecutive threads access consecutive memory locations, achieving optimal memory bandwidth utilization. + +**Cache Efficiency**: The Y-space traversal order can be designed to maximize cache reuse, keeping frequently accessed data in fast memory. + +**Register Optimization**: The Y→D transformation enables optimal register allocation, minimizing register pressure while maximizing reuse. + +**Vectorization**: The coordinate transformations naturally align with vector operations, enabling efficient use of SIMD instructions. + +Summary +------- + +The coordinate system framework represents the mathematical foundation that enables CK's high performance and productivity benefits. Through the systematic transformation from thread identity (P-space) through logical work organization (Y-space) to physical tensor coordinates (X-space) and finally to linear memory addresses (D-space), this framework solves the fundamental challenges of GPU programming. + +Key insights from the coordinate system framework: + +**Separation of Concerns**: Each coordinate space captures a different aspect of the computation, enabling independent optimization of each aspect while maintaining a coherent whole. + +**Mathematical Rigor**: The transformations between coordinate spaces are well-defined mathematical functions, enabling formal analysis and verification of distribution strategies. + +**Hardware Abstraction**: The framework abstracts hardware details while enabling hardware-specific optimizations, achieving both portability and performance. 
+ +**Automatic Optimization**: By encoding distribution strategies as coordinate transformations, the framework enables automatic generation of optimal access patterns that would be impractical to implement manually. + +**Composability**: Different distribution strategies can be expressed by composing different transformations, enabling rapid experimentation and optimization. + +These coordinate systems provide the conceptual framework for reasoning about GPU computation and the practical tools for achieving optimal performance. As GPU architectures continue to evolve, this mathematical foundation ensures that CK programs can adapt and continue to achieve high performance. + +Next Steps +---------- + +With a solid understanding of the coordinate system framework, the next sections explore how these concepts are applied in practice. Return to :ref:`ck_tile_index` to see the structure of the complete CK Tile documentation. diff --git a/docs/conceptual/ck_tile/descriptors.rst b/docs/conceptual/ck_tile/descriptors.rst new file mode 100644 index 0000000000..3a52097d06 --- /dev/null +++ b/docs/conceptual/ck_tile/descriptors.rst @@ -0,0 +1,383 @@ +.. _ck_tile_descriptors: + +Tensor Descriptors - Complete Tensor Specifications +=================================================== + +Overview +-------- + +A TensorDescriptor is the complete blueprint for a tensor. It combines a shape, stride information, and a series of :ref:`transformations ` into a single object that defines exactly how a tensor's data is laid out in memory. This specification enables CK Tile to create complex tensor views without any data movement. + +In CK Tile, TensorDescriptors serve as the foundation for all tensor operations, providing: + +- **Memory Layout Specification**: How data is arranged in physical memory +- **Logical View Definition**: How the tensor appears to the programmer +- **Transformation Pipeline**: A series of :ref:`coordinate transformations ` +- **Zero-Copy Views**: Different logical representations of the same data, building on :ref:`BufferViews ` and :ref:`TensorViews ` + +Creating Basic Tensor Layouts +----------------------------- + +CK Tile provides several ways to create tensor descriptors for common memory layouts. + +Custom Strides +~~~~~~~~~~~~~~ + +The most fundamental way to define a tensor is with custom strides. This provides full control over how many elements to "jump" in memory to move to the next item along each dimension. This is particularly useful for creating padded layouts required by GPU algorithms. + +.. code-block:: cpp + + using namespace ck_tile; + + // Create a 3x4 tensor, but make each row take up 8 elements in memory + // (4 for data, 4 for padding) + constexpr auto M = 3; + constexpr auto N = 4; + constexpr auto RowStride = 8; // Padded stride + + auto descriptor = make_naive_tensor_descriptor( + make_tuple(M, N), // Shape: [3, 4] + make_tuple(RowStride, 1) // Strides: [8, 1] + ); + + // The total memory needed is 3 rows * 8 elements/row = 24 + constexpr auto element_space_size = M * RowStride; + + // Calculate offset of the element at [row=1, col=2] + multi_index<2> coord{1, 2}; + auto offset = descriptor.calculate_offset(coord); + // offset = 1*8 + 2*1 = 10 + +Packed Row-Major Layout +~~~~~~~~~~~~~~~~~~~~~~~~~ + +For most cases, a tightly packed, row-major layout is sufficient. The strides are calculated automatically, leaving no unused space between elements. + +.. 
code-block:: cpp + + using namespace ck_tile; + + // Create a packed 3x4 tensor + auto descriptor_packed = make_naive_tensor_descriptor_packed( + make_tuple(3, 4) + ); + + // Total memory is 3 * 4 = 12 elements + // Strides are automatically [4, 1] for row-major layout + + // Calculate offset of the element at [row=1, col=2] + multi_index<2> coord{1, 2}; + auto offset = descriptor_packed.calculate_offset(coord); + // offset = 1*4 + 2*1 = 6 + +Aligned Layout +~~~~~~~~~~~~~~ + +For GPU performance, memory layouts often need to be aligned. This function creates a row-major layout but ensures that each row's starting address is a multiple of a given alignment value, adding padding if necessary. + +.. code-block:: cpp + + using namespace ck_tile; + + // Create a 4x5 tensor with 8-element alignment + constexpr auto align = 8; // Align each row to 8-element boundary + + auto descriptor_aligned = make_naive_tensor_descriptor_aligned( + make_tuple(4, 5), + align + ); + + // Without alignment, size would be 4*5=20 + // With alignment, the row stride becomes 8 (smallest multiple of 8 >= 5) + // Total size = 4 rows * 8 elements/row = 32 + +The Pipeline Concept +-------------------- + +Every TensorDescriptor in CK Tile can be thought of as a **transformation pipeline**. The functions above create the *first stage* of this pipeline, defining the initial :ref:`transformation ` that takes a simple, one-dimensional block of memory and presents it as a logical, multi-dimensional tensor view. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Pipeline Stages" + S1["Stage 1
Base Layout
[M, N]"] + S2["Stage 2
Transform
Unmerge"] + S3["Stage 3
New View
[M1, M2, N]"] + S4["Stage N
Final View
[...]"] + end + + subgraph "Same Data" + D["Physical Memory
No data movement"] + end + + S1 --> S2 + S2 --> S3 + S3 --> S4 + + S1 -.-> D + S2 -.-> D + S3 -.-> D + S4 -.-> D + + style D fill:#ffebee,stroke:#d32f2f,stroke-width:2px + style S1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style S3 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + +.. image:: diagrams/descriptors_1.svg + :alt: Diagram + :align: center + +.. image:: diagrams/descriptors_1.svg + :alt: Diagram + :align: center + +The Initial Pipeline Stage +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A simple packed descriptor sets up a pipeline with a single transform: + +- **Input**: The raw, one-dimensional memory buffer (hidden dimension ID 0) +- **Output**: The logical dimensions that you interact with (hidden dimension IDs 1, 2, ...) + +This initial stage converts linear memory addresses into multi-dimensional coordinates. See :ref:`ck_tile_adaptors` for how transforms chain together. + +Advanced Layouts: Step-by-Step Transformation +--------------------------------------------- + +The ``transform_tensor_descriptor`` function adds new stages to an existing descriptor's pipeline using :ref:`transforms `. + +Transform a [2, 6] Tensor into a [2, 2, 3] View +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This example reinterprets a 2D tensor with shape [2, 6] as a 3D tensor with shape [2, 2, 3], without changing the underlying 12-element memory buffer. + +**Step 1: Define the Base Descriptor** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create the [2, 6] base descriptor + auto base_descriptor = make_naive_tensor_descriptor_packed( + make_tuple(2, 6) + ); + + // This creates an initial pipeline stage that: + // - Takes the raw buffer (hidden ID 0) as input + // - Produces two outputs (hidden IDs 1 and 2) + // - These outputs become logical dimensions 0 and 1 + +**Step 2: Define the New Transformation Stage** + +To get from [2, 6] to [2, 2, 3], we need: + +- **For logical dimension 0 (length 2)**: Preserve it with PassThroughTransform +- **For logical dimension 1 (length 6)**: Split it with UnmergeTransform([2, 3]) + +**Step 3: Apply Transformation** + +.. code-block:: cpp + + // Create the transformed descriptor + auto transformed_descriptor = transform_tensor_descriptor( + base_descriptor, + make_tuple( + make_pass_through_transform(2), // For dim 0 + make_unmerge_transform(make_tuple(2, 3)) // For dim 1 + ), + make_tuple(sequence<0>{}, sequence<1>{}), // Input mapping + make_tuple(sequence<0>{}, sequence<1, 2>{}) // Output mapping + ); + + // Result: A [2, 2, 3] view of the same data + +Analysis of the Final Pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Transform Pipeline" + T0["Transform 0
Base Unmerge
Input: [0]
Output: [1,2]"] + T1["Transform 1
PassThrough
Input: [1]
Output: [3]"] + T2["Transform 2
Unmerge
Input: [2]
Output: [4,5]"] + end + + subgraph "Hidden Dimensions" + H0["Hidden ID 0
Raw Buffer"] + H1["Hidden ID 1
Dim 0 (size 2)"] + H2["Hidden ID 2
Dim 1 (size 6)"] + H3["Hidden ID 3
Final Dim 0"] + H4["Hidden ID 4
Final Dim 1"] + H5["Hidden ID 5
Final Dim 2"] + end + + H0 --> T0 + T0 --> H1 + T0 --> H2 + H1 --> T1 + H2 --> T2 + T1 --> H3 + T2 --> H4 + T2 --> H5 + + style H0 fill:#ffebee,stroke:#d32f2f,stroke-width:2px + style H3 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style H4 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style H5 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + +.. image:: diagrams/descriptors_2.svg + :alt: Diagram + :align: center + +.. image:: diagrams/descriptors_2.svg + :alt: Diagram + :align: center + +The pipeline now has three stages: + +1. **Base UnmergeTransform**: Converts raw buffer to [2, 6] layout +2. **PassThroughTransform**: Preserves the first dimension +3. **UnmergeTransform**: Splits the second dimension into [2, 3] + +5D to 3D Block Transformation +----------------------------------------------------- + +These concepts are critical in :ref:`GPU programming `. This example transforms a 5D tensor representing a GPU thread block's workload into a simpler 3D view using MergeTransform. See :ref:`ck_tile_thread_mapping` for thread distribution details. + +.. code-block:: cpp + + using namespace ck_tile; + + // Define parameters (typical for a GPU block) + constexpr auto Block_M = 256; + constexpr auto NumWarps = 8; + constexpr auto WarpSize = 64; + constexpr auto KVector = 4; + constexpr auto wavesPerK = 2; + constexpr auto wavesPerM = NumWarps / wavesPerK; + constexpr auto NumIssues = Block_M / wavesPerM; + + // Create the base 5D descriptor + auto base_descriptor = make_naive_tensor_descriptor_packed( + make_tuple(NumIssues, wavesPerM, wavesPerK, WarpSize, KVector) + ); + + // Transform to 3D by merging dimensions + auto transformed_descriptor = transform_tensor_descriptor( + base_descriptor, + make_tuple( + make_pass_through_transform(NumIssues), + make_merge_transform(make_tuple(wavesPerM, wavesPerK)), + make_merge_transform(make_tuple(WarpSize, KVector)) + ), + make_tuple(sequence<0>{}, sequence<1, 2>{}, sequence<3, 4>{}), + make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}) + ); + + // Result: [NumIssues, wavesPerM*wavesPerK, WarpSize*KVector] + // This simplifies thread block management while preserving data layout + +Common Descriptor Patterns +-------------------------- + +Matrix Transposition +~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Create a transposed view of a matrix + auto transposed = transform_tensor_descriptor( + original_matrix, + make_tuple( + make_pass_through_transform(N), + make_pass_through_transform(M) + ), + make_tuple(sequence<1>{}, sequence<0>{}), // Swap dimensions + make_tuple(sequence<0>{}, sequence<1>{}) + ); + +Padding for Convolution +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + +// Add padding to spatial dimensions + auto padded = transform_tensor_descriptor( + input_tensor, + make_tuple( + make_pass_through_transform(N), // Batch + make_pass_through_transform(C), // Channel + make_pad_transform(H, pad_h, pad_h), // Height + make_pad_transform(W, pad_w, pad_w) // Width + ), + make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}), + make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}) + ); + +For a complete convolution example, see :ref:`ck_tile_convolution_example`. + +Tensor Slicing +~~~~~~~~~~~~~~ + +.. 
code-block:: cpp + + // Extract a sub-tensor + auto slice = transform_tensor_descriptor( + full_tensor, + make_tuple( + make_slice_transform(M, start_m, end_m), + make_slice_transform(N, start_n, end_n) + ), + make_tuple(sequence<0>{}, sequence<1>{}), + make_tuple(sequence<0>{}, sequence<1>{}) + ); + +Key Concepts Summary +-------------------- + +TensorDescriptors provide a key abstraction for tensor manipulation: + +- **Pipeline Architecture**: Each descriptor is a transformation pipeline +- **Zero-Copy Views**: All transformations are logical, no data movement +- **Composability**: Complex layouts built from simple transforms +- **GPU Optimization**: Designed for efficient GPU memory access patterns + +Important principles: + +1. **Always Handle All Dimensions**: When transforming, provide a transform for each input dimension +2. **Hidden Dimension IDs**: Track the flow of data through the pipeline +3. **Compile-Time Resolution**: All transformations resolved at compile time +4. **Type Safety**: Template metaprogramming ensures correctness + +Performance Considerations +-------------------------- + +When designing tensor descriptors for GPU kernels: + +1. **Memory Coalescing**: Ensure contiguous threads access contiguous memory +2. **Bank Conflicts**: Avoid patterns that cause :ref:`shared memory conflicts <ck_tile_lds_bank_conflicts>` +3. **Alignment**: Use aligned layouts for better memory throughput +4. **Padding**: Strategic padding can improve access patterns. See :ref:`ck_tile_lds_index_swapping` for advanced techniques. + +Next Steps +---------- + +- :ref:`ck_tile_tile_window` - Using descriptors for efficient data loading +- :ref:`ck_tile_tile_distribution` - How descriptors enable automatic work distribution +- :ref:`ck_tile_convolution_example` - Real-world application of complex descriptors +- :ref:`ck_tile_static_distributed_tensor` - Managing distributed tensors with descriptors +- :ref:`ck_tile_gemm_optimization` - GEMM kernels using descriptor transformations diff --git a/docs/conceptual/ck_tile/diagrams/adaptors_1.svg b/docs/conceptual/ck_tile/diagrams/adaptors_1.svg new file mode 100644 index 0000000000..e7ab20b093 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/adaptors_1.svg @@ -0,0 +1 @@ +

Adaptor Composition

Chained Transforms

Input
2D

Transform A
(e.g., Merge)

Intermediate
1D

Transform B
(e.g., Pad)

Output
1D Padded

Single Transform

Input Coords
[0,1,2]

Transform
(e.g., Transpose)

Output Coords
[2,0,1]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/adaptors_2.svg b/docs/conceptual/ck_tile/diagrams/adaptors_2.svg new file mode 100644 index 0000000000..417ff1b19c --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/adaptors_2.svg @@ -0,0 +1 @@ +

Adaptor Chaining Flow

Chained Result

Adaptor 2

Adaptor 1

Input 2D
Bottom[0,1]

Bottom Dims
[0,1]

Transform:
Merge[2,3]

Top Dims
[0]

Bottom Dims
[0]

Transform:
Unmerge[2,3]

Top Dims
[0,1]

Output 2D
Top[0,1]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/buffer_views_1.svg b/docs/conceptual/ck_tile/diagrams/buffer_views_1.svg new file mode 100644 index 0000000000..fb696c9e42 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/buffer_views_1.svg @@ -0,0 +1 @@ +

Usage Pattern

1. Load tile from Global → LDS
2. Load working set LDS → VGPR
3. Compute in VGPR
4. Store results VGPR → LDS
5. Reduce in LDS
6. Write final LDS → Global

Compute Flow

Global Memory
Input Data

LDS
Tile Cache

VGPR
Working Set

Compute
Operations

LDS
Reduction

Global Memory
Output Data

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/buffer_views_2.svg b/docs/conceptual/ck_tile/diagrams/buffer_views_2.svg new file mode 100644 index 0000000000..7a58311b33 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/buffer_views_2.svg @@ -0,0 +1 @@ +

Performance Impact

Vectorized Access (1 instruction)

Scalar Access (4 instructions)

Load float[0]

Register 1

Load float[1]

Register 2

Load float[2]

Register 3

Load float[3]

Register 4

Load float4[0]

Vector Register
(4 floats)

4x fewer instructions
Better memory bandwidth
Reduced latency

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/buffer_views_3.svg b/docs/conceptual/ck_tile/diagrams/buffer_views_3.svg new file mode 100644 index 0000000000..8e20da9fa0 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/buffer_views_3.svg @@ -0,0 +1 @@ +

Output

Processing

Input Parameters

Yes

No

Yes

No

Offset
(e.g., 5)

Valid Flag
(optional)

Bounds Check
offset < buffer_size?

Flag Check
valid_flag == True?

Access Memory
buffer[offset]

Valid Result
Return value

Invalid Result
Return 0 or default

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/buffer_views_4.svg b/docs/conceptual/ck_tile/diagrams/buffer_views_4.svg new file mode 100644 index 0000000000..f0b04d283b --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/buffer_views_4.svg @@ -0,0 +1 @@ +

Atomic Operation (Thread-Safe)

Thread 1: atomic_add(5)

Hardware ensures
serialization

Thread 2: atomic_add(3)

Final value: 18 ✓
(Both updates applied)

Non-Atomic Operation (Race Condition)

Thread 1: Read value (10)

Thread 1: Add 5 (15)

Thread 2: Read value (10)

Thread 2: Add 3 (13)

Thread 1: Write 15

Thread 2: Write 13

Final value: 13 ❌
(Lost update from Thread 1)

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/convolution_example.svg b/docs/conceptual/ck_tile/diagrams/convolution_example.svg new file mode 100644 index 0000000000..4a86641997 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/convolution_example.svg @@ -0,0 +1 @@ +

Im2col Optimization

Convolution Process

Input Image
6×6

Kernel
3×3

Sliding Window
Extract 3×3 patches

Dot Product
Element-wise multiply & sum

Output
4×4

Windows Matrix
16×9
(all patches)

Kernel Flattened
9×1

Matrix Multiply
W @ K

Output Flattened
16×1

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_movement.svg b/docs/conceptual/ck_tile/diagrams/coordinate_movement.svg new file mode 100644 index 0000000000..18f190d646 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_movement.svg @@ -0,0 +1 @@ +

Movement Example

Start: [1,1]
Offset: 5

Move [0,1]
→ [1,2]
Offset: 6

Move [1,0]
→ [2,2]
Offset: 10

Move [1,1]
→ [3,3]
Offset: 15

Coordinate Movement System

TensorCoordinate
Position + Descriptor Context

move_coordinate()
Efficient Navigation

TensorAdaptorCoordinate
Position + Transform Context

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_1.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_1.svg new file mode 100644 index 0000000000..8890aa2362 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_1.svg @@ -0,0 +1 @@ +

Transformations

Coordinate Spaces Overview

P-space
Thread Identification
Which thread am I?

Y-space
Logical Tile
Which element in my tile?

X-space
Physical Tensor
Where in the tensor?

R-space
Replication
Data sharing pattern

D-space
Linear Storage
Memory address

P + Y → X
Thread + Element → Position

X → D
Position → Address

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_2.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_2.svg new file mode 100644 index 0000000000..765318910a --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_2.svg @@ -0,0 +1 @@ +

P-space Mapping

GPU Thread Hierarchy

Block

Warp 1

Warp 0

Thread 0
P=[0,0]

Thread 1
P=[0,1]

Thread 2
P=[0,2]

...

Thread 31
P=[0,31]

Thread 32
P=[1,0]

Thread 33
P=[1,1]

...

Thread 63
P=[1,31]

Warp 2...

Warp 7

P-coordinates = [warp_id, lane_id]
or
P-coordinates = [block_x, block_y, thread_x, thread_y]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_3.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_3.svg new file mode 100644 index 0000000000..47846dfe4b --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_3.svg @@ -0,0 +1 @@ +

Example: 4 Threads

Y-space Structure

Thread's Tile (2x2 elements)

Y=[0,0]
Element 0

Y=[0,1]
Element 1

Y=[1,0]
Element 2

Y=[1,1]
Element 3

Each thread processes
the same Y-space pattern
but at different X locations

Thread 0
P=[0,0]

Thread 1
P=[0,1]

Thread 2
P=[1,0]

Thread 3
P=[1,1]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_4.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_4.svg new file mode 100644 index 0000000000..3a9f04c73d --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_4.svg @@ -0,0 +1 @@ +

Example

Output

Transformation

Input

P-coordinates
Thread identity
P=[1,0]

Y-coordinates
Element in tile
Y=[0,1]

P + Y → X
Base position + Offset

X-coordinates
Tensor position
X=[2,1]

Thread P=[1,0] at base (2,0)
Element Y=[0,1] adds offset (0,1)
Result X=[2,1] in tensor

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_5.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_5.svg new file mode 100644 index 0000000000..f91d8b39ef --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_5.svg @@ -0,0 +1 @@ +

D-coordinate

Layout Options

X-coordinates

X = [2, 3]
2D Position

Row-Major
D = 2×width + 3

Column-Major
D = 3×height + 2

Blocked
Complex pattern

D = 11
Linear Address

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/coordinate_systems_6.svg b/docs/conceptual/ck_tile/diagrams/coordinate_systems_6.svg new file mode 100644 index 0000000000..0e0275457a --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/coordinate_systems_6.svg @@ -0,0 +1 @@ +

Step 5: Memory Access

Step 4: Linearization

Step 3: P+Y Transformation

Step 2: Work Assignment

Step 1: Thread Identification

Thread ID = 5

P-coordinates
P = [0, 5]
(warp 0, lane 5)

Y-coordinates
Y = [1, 0]
(element in tile)

P + Y → X
Thread position + Element offset

X-coordinates
X = [1, 5]
(tensor position)

X → D
Row-major: D = x₀ × width + x₁

D-coordinate
D = 13
(memory address)

Hardware accesses
memory[13]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/descriptors_1.svg b/docs/conceptual/ck_tile/diagrams/descriptors_1.svg new file mode 100644 index 0000000000..a46b34e45d --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/descriptors_1.svg @@ -0,0 +1 @@ +

Same Data

Pipeline Stages

Stage 1
Base Layout
[M, N]

Stage 2
Transform
Unmerge

Stage 3
New View
[M1, M2, N]

Stage N
Final View
[...]

Physical Memory
No data movement

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/descriptors_2.svg b/docs/conceptual/ck_tile/diagrams/descriptors_2.svg new file mode 100644 index 0000000000..f9ebb053c0 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/descriptors_2.svg @@ -0,0 +1 @@ +

Hidden Dimensions

Transform Pipeline

Transform 0
Base Unmerge
Input: [0]
Output: [1,2]

Transform 1
PassThrough
Input: [1]
Output: [3]

Transform 2
Unmerge
Input: [2]
Output: [4,5]

Hidden ID 0
Raw Buffer

Hidden ID 1
Dim 0 (size 2)

Hidden ID 2
Dim 1 (size 6)

Hidden ID 3
Final Dim 0

Hidden ID 4
Final Dim 1

Hidden ID 5
Final Dim 2

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/encoding_internals_1.svg b/docs/conceptual/ck_tile/diagrams/encoding_internals_1.svg new file mode 100644 index 0000000000..41647a8c0e --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/encoding_internals_1.svg @@ -0,0 +1 @@ +

Transformation Chain

Generated Components

Encoding Components

R-space Lengths
Replication dimensions

H-space Lengths
Hierarchical decomposition
[[2,2],[2,2]]

P→RH Mappings
Thread to hierarchy
Major/Minor

Y→RH Mappings
Element to hierarchy
Major/Minor

ps_ys_to_xs_adaptor
Coordinate transformer

ys_to_d_descriptor
Memory linearizer

Encoding
Original specification

Replicate
Transform

Unmerge
Transform

Merge
Transform

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/encoding_internals_2.svg b/docs/conceptual/ck_tile/diagrams/encoding_internals_2.svg new file mode 100644 index 0000000000..4032376a6a --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/encoding_internals_2.svg @@ -0,0 +1 @@ +

Output

Transformation Pipeline

Input Coordinates

P-coordinates
[warp_id, lane_id]

Y-coordinates
[y0, y1, y2, y3]

Combine P+Y

Replicate
Transform
(if R-dims exist)

Unmerge
Transform
(break into H-dims)

Merge
Transform
(combine to X-dims)

X-coordinates
[x0, x1]
Tensor position

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/introduction_motivation_1.svg b/docs/conceptual/ck_tile/diagrams/introduction_motivation_1.svg new file mode 100644 index 0000000000..55253de744 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/introduction_motivation_1.svg @@ -0,0 +1 @@ +

Tile Distribution Pattern (Efficient)

Memory_TD

Threads_TD

Mem[0]

Thread 0

Mem[1]

Thread 1

Mem[2]

Mem[3]

Thread 2

Mem[4]

Mem[5]

Thread 3

Mem[6]

Mem[7]

Random Access Pattern (Inefficient)

Memory

Threads

Mem[0]

Thread 0

Mem[23]

Thread 1

Mem[7]

Thread 2

Mem[47]

Thread 3

Mem[15]

Mem[31]

Mem[39]

Mem[55]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/introduction_motivation_2.svg b/docs/conceptual/ck_tile/diagrams/introduction_motivation_2.svg new file mode 100644 index 0000000000..524b6b2d40 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/introduction_motivation_2.svg @@ -0,0 +1 @@ +

Transformations

Coordinate Spaces

P-space
Thread Position
(thread_x, thread_y,
warp_id, block_id)

Y-space
Local Data
(y0, y1, y2, y3)

X-space
Global Position
(x0, x1)

D-space
Memory Address
(linearized)

P + Y → X
Thread data mapping

X → D
Memory linearization

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/lds_index_swapping_1.svg b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_1.svg new file mode 100644 index 0000000000..26bf6ec7f5 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_1.svg @@ -0,0 +1 @@ +

Update K0 with XOR transformation

XOR Transform

3D LDS coordinate [K0, M, K1]

KPerBlock/KPack * MLdsLayer
K0

MPerBlock/MLdsLayer
M

KPack
K1

make_xor_transform

KPerBlock/KPack * MLdsLayer
K0'

MPerBlock/MLdsLayer
M

KPack
K1

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/lds_index_swapping_2.svg b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_2.svg new file mode 100644 index 0000000000..0b02bce106 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_2.svg @@ -0,0 +1 @@ +

4D intermediate transformation space

Unmerge into 2 components

3D LDS coordinate [K0', M, K1]

KPerBlock/KPack * MLdsLayer
K0'

MPerBlock/MLdsLayer
M

KPack
K1

make_unmerge_transform

MLdsLayer
L

MPerBlock/MLdsLayer
M

KPerBlock/KPack
K0''

KPack
K1

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/lds_index_swapping_3.svg b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_3.svg new file mode 100644 index 0000000000..378c0d35d0 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/lds_index_swapping_3.svg @@ -0,0 +1 @@ +

Transformed 2D coordinates [M', K']

Merge into 1 component

Merge into 1 component

4D LDS Coordinates [L, M, K0'', K1]

MLdsLayer
L

MPerBlock/MLdsLayer
M

KPerBlock/KPack
K0''

KPack
K1

make_merge_transform

make_merge_transform

MPerBlock
M'

KPerBlock
K'

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/load_store_traits_1.svg b/docs/conceptual/ck_tile/diagrams/load_store_traits_1.svg new file mode 100644 index 0000000000..51be25c0b2 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/load_store_traits_1.svg @@ -0,0 +1 @@ +

Yes

No

Analyze Distribution

Check Each Dimension

Calculate Stride

Stride == 1?

Candidate for Vectorization

Skip Dimension

Check Alignment

Check Vector Size

Score Dimension

Select Best Dimension

Configure Vector Access

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/load_store_traits_2.svg b/docs/conceptual/ck_tile/diagrams/load_store_traits_2.svg new file mode 100644 index 0000000000..48b6bff271 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/load_store_traits_2.svg @@ -0,0 +1 @@ +

Snake Pattern

0→1→2→3

7←6←5←4

Cache hit!

8→9→10→11

Linear Traversal

0→1→2→3

4→5→6→7

Cache miss

8→9→10→11

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/space_filling_curve.svg b/docs/conceptual/ck_tile/diagrams/space_filling_curve.svg new file mode 100644 index 0000000000..11b0ceda5b --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/space_filling_curve.svg @@ -0,0 +1 @@ +

Snake Pattern

Row 0: →

Row 1: ←

Row 2: →

Continue

Linear Pattern

Row 0: →

Jump back

Row 1: →

Row 2: →

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/static_distributed_tensor.svg b/docs/conceptual/ck_tile/diagrams/static_distributed_tensor.svg new file mode 100644 index 0000000000..6ce7e3c0c8 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/static_distributed_tensor.svg @@ -0,0 +1 @@ +

Global Tensor 64x64

Thread Block 16x16

Thread 0,0
Elements 0:3,0:3

Thread 0,1
Elements 0:3,4:7

Thread 1,0
Elements 4:7,0:3

...

Local Array
16 elements

Local Array
16 elements

Local Array
16 elements

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/sweep_tile_1.svg b/docs/conceptual/ck_tile/diagrams/sweep_tile_1.svg new file mode 100644 index 0000000000..4f145c81af --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/sweep_tile_1.svg @@ -0,0 +1 @@ +

Computation

Y-Sweep

X-Tile (Reused)

X data loaded once
Stays in registers

Y position 0

Y position 1

Y position 2

Y position N

Process(X, Y)

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/sweep_tile_2.svg b/docs/conceptual/ck_tile/diagrams/sweep_tile_2.svg new file mode 100644 index 0000000000..1ce4e41241 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/sweep_tile_2.svg @@ -0,0 +1 @@ +

Sweep Approach

Load X[0]

Process with
Y[0], Y[1], Y[2]

Load Y[0,1,2]

X loaded once!

Traditional Approach

Load X[0]

Process

Load Y[0]

Load X[0]

Process

Load Y[1]

Load X[0]

Process

Load Y[2]

X loaded 3 times!

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/sweep_tile_3.svg b/docs/conceptual/ck_tile/diagrams/sweep_tile_3.svg new file mode 100644 index 0000000000..10f419cede --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/sweep_tile_3.svg @@ -0,0 +1 @@ +

Use Cases

Sweep Performance Benefits

Zero runtime overhead
Compile-time unrolling

Perfect memory coalescing
Sequential access patterns

Automatic vectorization
Compiler optimizations

Register reuse
X data stays in VGPR

Matrix Multiplication
Reuse A columns

Convolution
Reuse filter weights

Reduction
Accumulate over Y

Broadcast
Apply X to all Y

High Performance

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/sweep_tile_4.svg b/docs/conceptual/ck_tile/diagrams/sweep_tile_4.svg new file mode 100644 index 0000000000..50530522cd --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/sweep_tile_4.svg @@ -0,0 +1 @@ +

Complete Workflow

TileDistribution
Define data layout

TileWindow
Create view

DistributedTensor
Load X data

SweepTile
Iterate Y positions

Results
Store outputs

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_coordinates_1.svg b/docs/conceptual/ck_tile/diagrams/tensor_coordinates_1.svg new file mode 100644 index 0000000000..ee4206f4a2 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_coordinates_1.svg @@ -0,0 +1 @@ +

Usage Context

MultiIndex Structure

MultiIndex
Container for N integers

Dimension 0

Dimension 1

Dimension 2

Dimension N-1

Transforms

Adaptors

Tensors

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_coordinates_2.svg b/docs/conceptual/ck_tile/diagrams/tensor_coordinates_2.svg new file mode 100644 index 0000000000..efada63f93 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_coordinates_2.svg @@ -0,0 +1 @@ +

Example: 3D Tensor Access

3D Tensor
shape=[4,5,6]

MultiIndex(3, [1,2,3])

Element at
position [1,2,3]

Coordinate Flow

User Input
[1, 2, 3]

MultiIndex
Storage

Transform
Processing

MultiIndex
Output

Tensor Access
element(coord)

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_views_1.svg b/docs/conceptual/ck_tile/diagrams/tensor_views_1.svg new file mode 100644 index 0000000000..41338c8902 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_views_1.svg @@ -0,0 +1 @@ +

Logical View

Tensor Layer

Access Layer

Memory Foundation

Flat Memory Array
0 1 2 3 4 5 6 7 8 9 10 11

BufferView
Linear Memory Access

TensorDescriptor
Shape & Stride Info

TensorView
Multi-dimensional Access

2D Matrix View
[3×4]
[[0,1,2,3]
[4,5,6,7]
[8,9,10,11]]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_views_2.svg b/docs/conceptual/ck_tile/diagrams/tensor_views_2.svg new file mode 100644 index 0000000000..f57636d293 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_views_2.svg @@ -0,0 +1 @@ +

Result

TensorView Processing

User Input

Valid

Coordinate
(1, 2)

Shape Check
row < 3?
col < 4?

Apply Strides
offset = 1×4 + 2×1

BufferView Access
buffer[6]

Value: 6

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_views_3.svg b/docs/conceptual/ck_tile/diagrams/tensor_views_3.svg new file mode 100644 index 0000000000..df13db0c0d --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_views_3.svg @@ -0,0 +1 @@ +

Custom Stride (Transposed View)

Memory: [0,1,2,3,4,5,6,7,8,9,10,11]
Shape: (4,3)
Strides: (1,4)

[[0, 4, 8]
[1, 5, 9]
[2, 6, 10]
[3, 7, 11]]

Column-Major Layout (Fortran-style)

Memory: [0,3,6,9,1,4,7,10,2,5,8,11]
Shape: (3,4)
Strides: (1,3)

[[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]]

Row-Major Layout (C-style)

Memory: [0,1,2,3,4,5,6,7,8,9,10,11]
Shape: (3,4)
Strides: (4,1)

[[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_views_4.svg b/docs/conceptual/ck_tile/diagrams/tensor_views_4.svg new file mode 100644 index 0000000000..8e521229cf --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_views_4.svg @@ -0,0 +1 @@ +

Optimization Strategies

Memory Access Patterns

Sequential Access
(Good cache usage)

Strided Access
(May cause cache misses)

Random Access
(Poor cache usage)

Use row-major for row iteration

Use col-major for column iteration

Minimize stride between accesses

Vectorize when possible

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tensor_views_5.svg b/docs/conceptual/ck_tile/diagrams/tensor_views_5.svg new file mode 100644 index 0000000000..2faec8d8d3 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tensor_views_5.svg @@ -0,0 +1 @@ +

Use Cases

TensorView

BufferView

Linear indexing only

buffer[5]

No shape information

Direct memory access

Multi-dimensional indexing

tensor(1, 2)

Shape-aware operations

Coordinate transformations

BufferView: Low-level memory ops

TensorView: Matrix/tensor algorithms

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/thread_mapping_1.svg b/docs/conceptual/ck_tile/diagrams/thread_mapping_1.svg new file mode 100644 index 0000000000..119f631829 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/thread_mapping_1.svg @@ -0,0 +1 @@ +

P-space Mapping

Thread Identification

GPU Device

Thread Block

Warp 0

Warp 1

Thread 32
lane_id=0

Thread 33
lane_id=1

...

Thread 63
lane_id=31

Thread 0
lane_id=0

Thread 1
lane_id=1

...

Thread 31
lane_id=31

Warp 2

...

Warp 7

Thread ID = blockIdx.x * blockDim.x + threadIdx.x

Warp ID = threadIdx.x / 32

Lane ID = threadIdx.x % 32

P-coordinates
NDimP=1: [thread_id]
NDimP=2: [warp_id, lane_id]

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/thread_mapping_2.svg b/docs/conceptual/ck_tile/diagrams/thread_mapping_2.svg new file mode 100644 index 0000000000..f523de1c8c --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/thread_mapping_2.svg @@ -0,0 +1 @@ +

Thread to Data Mapping

Memory Access

Data Tiles

Thread Grid

Coalesced Access
Adjacent threads → Adjacent memory

Thread[0,0]
Warp 0

Data[0:4, 0:4]
16 elements

Thread[0,1]
Warp 0

Data[0:4, 4:8]
16 elements

Thread[1,0]
Warp 1

Data[4:8, 0:4]
16 elements

Thread[1,1]
Warp 1

Data[4:8, 4:8]
16 elements

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_1.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_1.svg new file mode 100644 index 0000000000..19e7140013 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_1.svg @@ -0,0 +1 @@ +

GPU Execution

Coordinate Spaces

Logical View

Tensor
Multi-dimensional data

TileDistribution
Work assignment

TileWindow
Data view

X: Physical tensor coords

Y: Tile pattern coords

P: Processing element coords

R: Replication coords (optional)

Warps
32 threads each

Lanes
Thread within warp

Registers
Thread-local storage

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_2.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_2.svg new file mode 100644 index 0000000000..6f588a46c4 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_2.svg @@ -0,0 +1 @@ +

Output

Transformation Pipeline

Input

Thread Coordinates
(warpId, laneId)

P → Y
Thread to pattern

Y → X
Pattern to physical

Y → D
Pattern to register

Memory Coordinates
Global addresses

Register Indices
Local storage

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_3.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_3.svg new file mode 100644 index 0000000000..0974e138fd --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_3.svg @@ -0,0 +1 @@ +

Memory Pattern

Thread Assignment

Problem Space (256×256 Matrix)

Full Matrix
65,536 elements

Tile 1
32×32

Tile 2
32×32

Tile N
32×32

Warp 0
32 threads

Warp 1
32 threads

Lane 0-31
Individual threads

Coalesced Access
Sequential addresses
No bank conflicts

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_4.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_4.svg new file mode 100644 index 0000000000..894151380d --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_4.svg @@ -0,0 +1 @@ +

Level 3: Thread Distribution

Level 2: Warp Distribution

Level 1: Block Distribution

Thread Block
256 threads

Block Tile 1
64×64

Block Tile 2
64×64

Warp
32 threads

Warp Tile 1
16×16

Warp Tile 2
16×16

Thread

Thread Tile
2×2

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_5.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_5.svg new file mode 100644 index 0000000000..2e46ee58cf --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_5.svg @@ -0,0 +1 @@ +

Memory Access

Per Thread

Thread Grid (32×32)

Matrix C (128×128)

16,384 elements

1,024 threads

4×4 tile
16 elements

Coalesced reads
Efficient writes
No conflicts

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_6.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_6.svg new file mode 100644 index 0000000000..2195465e60 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_6.svg @@ -0,0 +1 @@ +

Output

Stage 3

Stage 2

Stage 1

Input

Thread ID
(0-1023)

P-coordinates
(warp, lane)

Y-coordinates
(tile position)

X-coordinates
(tensor indices)

Memory addresses
Register indices

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_distribution_7.svg b/docs/conceptual/ck_tile/diagrams/tile_distribution_7.svg new file mode 100644 index 0000000000..e9ec5a5780 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_distribution_7.svg @@ -0,0 +1 @@ +

Performance

With TileDistribution

Manual Implementation

Calculate indices manually

Handle boundary conditions

Ensure coalescing

Manage bank conflicts

~200 lines of code

make_tile_distribution()

Automatic optimization

~10 lines of code

Same performance

Fewer bugs

Portable across GPUs

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_window_1.svg b/docs/conceptual/ck_tile/diagrams/tile_window_1.svg new file mode 100644 index 0000000000..6c2203c332 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_window_1.svg @@ -0,0 +1 @@ +

Optimizations

Operations

Components

TensorView
Data source

TileDistribution
Thread mapping

TileWindow
Access gateway

LoadStoreTraits
Access optimizer

DistributedTensor
Register storage

Load
Global → Registers

Compute
In registers

Store
Registers → Global

Coalescing
Adjacent access

Vectorization
Multi-element ops

Bank conflict
avoidance

Space-filling
curve traversal

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_window_2.svg b/docs/conceptual/ck_tile/diagrams/tile_window_2.svg new file mode 100644 index 0000000000..60ec2dd1ce --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_window_2.svg @@ -0,0 +1 @@ +

Snake Access Pattern

0,1,2,3

7,6,5,4

8,9,10,11

15,14,13,12

Linear Access Pattern

0,1,2,3

4,5,6,7

8,9,10,11

12,13,14,15

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_window_3.svg b/docs/conceptual/ck_tile/diagrams/tile_window_3.svg new file mode 100644 index 0000000000..9b2293d295 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_window_3.svg @@ -0,0 +1 @@ +

Step 3: Load Data

Step 2: Apply Distribution

Step 1: Create Window

load()

Tensor
[256, 256]

Origin
(64, 64)

Window Size
[32, 32]

TileDistribution
Thread mapping

TileWindow
Created

Global Memory
Window region

Registers
Distributed tensor

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_window_4.svg b/docs/conceptual/ck_tile/diagrams/tile_window_4.svg new file mode 100644 index 0000000000..f031c7778b --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_window_4.svg @@ -0,0 +1 @@ +

Result

Memory Transaction

Vectorization

Load Analysis

Analyze access pattern
Detect coalescing opportunities

Scalar: 4 loads

Vector2: 2 loads

Vector4: 1 load

Coalesced access
32 threads → 1 transaction

Non-coalesced
32 threads → 32 transactions

Thread registers
Local data

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/tile_window_5.svg b/docs/conceptual/ck_tile/diagrams/tile_window_5.svg new file mode 100644 index 0000000000..16ae1d01cc --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/tile_window_5.svg @@ -0,0 +1 @@ +

Hardware Utilization

Memory Access Optimization

Vectorization
4x fewer transactions

Coalescing
32x bandwidth efficiency

Precomputation
Zero overhead addressing

Space-filling
Optimal cache usage

Memory Bandwidth
Near 100% utilization

Latency Hiding
Overlapped operations

Register Reuse
Minimal spills

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_1.svg b/docs/conceptual/ck_tile/diagrams/transforms_1.svg new file mode 100644 index 0000000000..3f00bbee54 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_1.svg @@ -0,0 +1 @@ +

Tensor Coordinate Transformation

Forward Transform

Inverse Transform

Same data,
different views

Same data,
different views

Lower Dimension Space
Source coordinate system

Upper Dimension Space
Target coordinate system

Linear Data in Memory
Layout determined by tensor
shape & strides

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_10.svg b/docs/conceptual/ck_tile/diagrams/transforms_10.svg new file mode 100644 index 0000000000..34f7b7c04b --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_10.svg @@ -0,0 +1 @@ +

XorTransform: 2D → 2D XOR Mapping

Forward Transform
apply XOR reverse

Inverse Transform
apply XOR mapping

XOR pattern
view

Normal
view

Lower Coordinate Space
2D: [4, 8]
XOR-transformed coords

Upper Coordinate Space
2D: [4, 8]
Normal coords

Same Tensor Data

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_11.svg b/docs/conceptual/ck_tile/diagrams/transforms_11.svg new file mode 100644 index 0000000000..688dcab9ca --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_11.svg @@ -0,0 +1 @@ +

SliceTransform: 1D → 1D Sub-region

Forward Transform
idx + slice_begin

Inverse Transform
idx - slice_begin

Full tensor
view

Sub-region
view

Lower Coordinate Space
1D: [0, 9] (original range)

Upper Coordinate Space
1D: [0, 4] (slice range)

Tensor Data in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_12.svg b/docs/conceptual/ck_tile/diagrams/transforms_12.svg new file mode 100644 index 0000000000..f754ba4964 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_12.svg @@ -0,0 +1 @@ +

ModuloTransform: 1D → 1D Cyclic

Forward Transform
idx * cycle_count

Inverse Transform
idx % modulus

Lower Coordinate Space
1D: [0, 3] (modulus range)

Upper Coordinate Space
1D: [0, 15] (full range)

Tensor Data in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_2.svg b/docs/conceptual/ck_tile/diagrams/transforms_2.svg new file mode 100644 index 0000000000..26b40010bb --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_2.svg @@ -0,0 +1 @@ +

Operations

Transform Types

EmbedTransform
Linear → Multi-D Strided

MergeTransform
Multi-D → Linear

UnmergeTransform
Linear → Multi-D

ReplicateTransform
0D → Multi-D Broadcast

OffsetTransform
Translation

PassThroughTransform
Identity

PadTransform
Boundaries

Forward
calculate_lower_index()

Backward
calculate_upper_index()

Update
update_lower_index()

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_3.svg b/docs/conceptual/ck_tile/diagrams/transforms_3.svg new file mode 100644 index 0000000000..acd9de4a23 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_3.svg @@ -0,0 +1 @@ +

MergeTransform: Multi-D → Linear

Forward Transform
2×5 + 3 = 13

Inverse Transform
13÷5=2, 13%5=3

Multi-dimensional
view

Linear
view

Lower Coordinate Space
2D: [4, 5]
Coord: (2, 3)

Upper Coordinate Space
1D Linear
Index: 13

Same Tensor Data
Layout: row-major
Size: 20 elements

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_4.svg b/docs/conceptual/ck_tile/diagrams/transforms_4.svg new file mode 100644 index 0000000000..0bbf78430a --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_4.svg @@ -0,0 +1 @@ +

UnmergeTransform: Linear → Multi-D

Forward Transform
14 = 1×8 + 3×2 + 0

Inverse Transform
linearize back

Linear
view

Multi-dimensional
view

Lower Coordinate Space
1D Linear
Index: 14

Upper Coordinate Space
3D: [3, 4, 2]
Coord: (1, 3, 0)

Same Tensor Data
Layout: row-major
Size: 24 elements

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_5.svg b/docs/conceptual/ck_tile/diagrams/transforms_5.svg new file mode 100644 index 0000000000..3f57a2b675 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_5.svg @@ -0,0 +1 @@ +

EmbedTransform: Linear → Multi-D Strided

Forward Transform
Strides: [12, 1]
14 ÷ 12 = 1, 14 % 12 = 2

Inverse Transform
1×12 + 2×1 = 14

Linear
index view

Multi-dimensional
strided view

Lower Coordinate Space
1D Linear
Index: 14

Upper Coordinate Space
2D: [2, 3]
Coord: (1, 2)

Linear Buffer in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_6.svg b/docs/conceptual/ck_tile/diagrams/transforms_6.svg new file mode 100644 index 0000000000..014fa90176 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_6.svg @@ -0,0 +1 @@ +

ReplicateTransform: 0D → Multi-D Broadcasting

Forward Transform
[] → (i,j) for any i,j

Inverse Transform
(i,j) → [] for any i,j

One scalar
value

Broadcasted view
at all positions

Lower Coordinate Space
0D: Scalar
Empty coordinate []

Upper Coordinate Space
2D: [3, 4]
All coords: (i, j)

Single Scalar Value

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_7.svg b/docs/conceptual/ck_tile/diagrams/transforms_7.svg new file mode 100644 index 0000000000..676196744d --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_7.svg @@ -0,0 +1 @@ +

OffsetTransform: 1D → 1D Translation

Forward Transform
idx → idx + 16

Inverse Transform
idx + 16 → idx

Lower
view

Upper
view

Lower Coordinate Space
1D: [0, 63]
Coord: index + offset

Upper Coordinate Space
1D: [0, 47]
Coord: index

Linear Buffer in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_8.svg b/docs/conceptual/ck_tile/diagrams/transforms_8.svg new file mode 100644 index 0000000000..ddb41be8fe --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_8.svg @@ -0,0 +1 @@ +

PassThroughTransform: 1D → 1D Identity

Perfect Identity
idx → idx

Perfect Identity
idx → idx

Same buffer
same view

Same buffer
same view

Lower Coordinate Space
1D: [0, 59]
Coord: index

Upper Coordinate Space
1D: [0, 59]
Coord: index

Linear Buffer in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/diagrams/transforms_9.svg b/docs/conceptual/ck_tile/diagrams/transforms_9.svg new file mode 100644 index 0000000000..5b219099b0 --- /dev/null +++ b/docs/conceptual/ck_tile/diagrams/transforms_9.svg @@ -0,0 +1 @@ +

PadTransform: 1D → 1D with Padding

Forward Transform
idx + left_pad

Inverse Transform
idx - left_pad

Original view

Padded view

Lower Coordinate Space
1D: [0, 2] (original data)

Upper Coordinate Space
1D: [0, 4] (with padding)

Tensor Data in Memory

\ No newline at end of file diff --git a/docs/conceptual/ck_tile/encoding_internals.rst b/docs/conceptual/ck_tile/encoding_internals.rst new file mode 100644 index 0000000000..499ec0bd4a --- /dev/null +++ b/docs/conceptual/ck_tile/encoding_internals.rst @@ -0,0 +1,489 @@ +.. meta:: + :description: CK Tile encoding internals documentation + :keywords: CK Tile, encoding, tile distribution, GPU programming, compile-time computation + +.. _ck_tile_encoding_internals: + +****************** +Encoding Internals +****************** + +Overview +======== + +The tile distribution encoding system represents the core mathematical framework that transforms high-level tensor distribution specifications into concrete, optimized GPU kernel implementations. This advanced compile-time machinery bridges the gap between abstract mathematical descriptions and executable coordinate transformations, enabling the Composable Kernel framework to generate highly efficient code for complex tensor operations. + +At its heart, the encoding system defines how multi-dimensional tensor data is distributed across GPU processing elements through a hierarchical decomposition scheme. By specifying relationships between different coordinate spaces of replication (R), hierarchical (H), partition (P), and yield (Y) dimension, the encoding provides a complete blueprint for data layout and access patterns that can be resolved entirely at compile time. This is the internal mechanism behind :ref:`ck_tile_tile_distribution`. See :ref:`ck_tile_coordinate_systems` for more information about coordinate spaces. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Encoding Components" + RS["R-space Lengths
Replication dimensions"] + HS["H-space Lengths
Hierarchical decomposition
[[2,2],[2,2]]"] + P2RH["P→RH Mappings
Thread to hierarchy
Major/Minor"] + Y2RH["Y→RH Mappings
Element to hierarchy
Major/Minor"] + end + + subgraph "Generated Components" + ADAPTOR["ps_ys_to_xs_adaptor
Coordinate transformer"] + DESC["ys_to_d_descriptor
Memory linearizer"] + ENC["Encoding
Original specification"] + end + + subgraph "Transformation Chain" + T1["Replicate
Transform"] + T2["Unmerge
Transform"] + T3["Merge
Transform"] + end + + RS --> T1 + HS --> T2 + P2RH --> ADAPTOR + Y2RH --> ADAPTOR + + T1 --> T2 + T2 --> T3 + T3 --> ADAPTOR + + HS --> DESC + Y2RH --> DESC + + style RS fill:#fce4ec,stroke:#c2185b,stroke-width:2px + style HS fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style ADAPTOR fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style DESC fill:#fff3e0,stroke:#f57c00,stroke-width:3px + + + +.. image:: diagrams/encoding_internals_1.svg + :alt: Diagram + :align: center + +Encoding Structure +================== + +The tile distribution encoding employs a template-based type system that captures the complete specification of tensor distribution patterns at compile time: + +.. code-block:: cpp + + template // Y to RH mapping (minor) + struct tile_distribution_encoding + { + // All computations resolved at compile time + static constexpr index_t NDimX = HsLengthss::size(); + static constexpr index_t NDimP = Ps2RHssMajor::size(); + static constexpr index_t NDimY = Ys2RHsMajor::size(); + static constexpr index_t NDimR = RsLengths::size(); + + // Static member functions for compile-time access + __host__ __device__ static constexpr auto get_rs_lengths() { + return RsLengths_{}; + } + + __host__ __device__ static constexpr auto get_hs_lengthss() { + return HsLengthss_{}; + } + + // Nested detail struct performs complex compile-time calculations + struct detail { + // Precomputed mappings and transformations + static constexpr auto get_h_dim_lengths_prefix_sum(); + static constexpr auto get_uniformed_idx_y_to_h(); + // ... compile-time computation ... + }; + }; + +Key Template Features +--------------------- + +1. **Template Metaprogramming**: All parameters are types, not values, enabling compile-time optimization +2. **Constexpr Functions**: Everything is computed at compile time +3. **Type Aliases**: Clean access to template parameters +4. **Static Member Functions**: No runtime overhead + +Parameter Breakdown +=================== + +R-Dimensions: Replication Specification +--------------------------------------- + +The ``RsLengths`` parameter defines dimensions that are replicated across processing units, enabling data sharing patterns essential for many tensor operations: + +.. code-block:: cpp + + // Example: GEMM with warp-level replication + using RsLengths = Sequence; + + // This creates replication pattern: + // - NWarpPerBlock warps share the same A data + // - MWarpPerBlock warps share the same B data + +Replication serves several purposes: + +- **Data Reuse**: Same input data needed by multiple output computations +- **Reduction Operations**: Multiple threads collaborate on single result +- **Memory Efficiency**: Reduces global memory bandwidth requirements + +H-Dimensions: Hierarchical Decomposition +---------------------------------------- + +The ``HsLengthss`` parameter represents hierarchical decomposition of tensor dimensions: + +.. 
code-block:: cpp + + // Example: Block-level GEMM decomposition + using HsLengthss = Tuple< + Sequence, // M-dimension + Sequence // N-dimension + >; + + // This creates hierarchy: + // - MRepeat: iterations per thread in M + // - MWarp: warps assigned to M + // - MThread: threads per warp for M + // - MVec: vector size for M + +The decomposition enables: + +- **Memory Coalescing**: Aligning with warp/thread organization +- **Register Blocking**: Tile sizes that fit in register file +- **Shared Memory Utilization**: Tiles that exploit data reuse + +P-Dimensions: Partition Mapping +------------------------------- + +The ``Ps2RHssMajor`` and ``Ps2RHssMinor`` parameters define work assignment: + +.. code-block:: cpp + + // Example: 2D thread block mapping + // P0 = warp_id, P1 = lane_id + using Ps2RHssMajor = Tuple< + Sequence<1>, // P0 maps to H1 (warp dimension) + Sequence<2> // P1 maps to H2 (thread dimension) + >; + using Ps2RHssMinor = Tuple< + Sequence<1>, // Use second component of H1 + Sequence<2> // Use third component of H2 + >; + +The mapping mechanism: + +- **Major Index**: Which RH-dimension group (0=R, 1-N=H) +- **Minor Index**: Component within that group + +Y-Dimensions: Logical View Mapping +---------------------------------- + +The ``Ys2RHsMajor`` and ``Ys2RHsMinor`` define the user-facing interface: + +.. code-block:: cpp + + // Example: 2D tile access pattern + using Ys2RHsMajor = Sequence<1, 1, 2, 2>; // Y→H mapping + using Ys2RHsMinor = Sequence<0, 1, 0, 1>; // Component selection + + // Creates 2x2 logical view: + // Y[0,0] → H1[0], H2[0] + // Y[0,1] → H1[1], H2[0] + // Y[1,0] → H1[0], H2[1] + // Y[1,1] → H1[1], H2[1] + +Transformation Pipeline +======================= + +The encoding generates a transformation pipeline that converts coordinates using the concepts from :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors`: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "Input Coordinates" + P["P-coordinates
[warp_id, lane_id]"] + Y["Y-coordinates
[y0, y1, y2, y3]"] + end + + subgraph "Transformation Pipeline" + C1["Combine P+Y"] + T1["Replicate
Transform
(if R-dims exist)"] + T2["Unmerge
Transform
(break into H-dims)"] + T3["Merge
Transform
(combine to X-dims)"] + end + + subgraph "Output" + X["X-coordinates
[x0, x1]
Tensor position"] + end + + P --> C1 + Y --> C1 + C1 --> T1 + T1 --> T2 + T2 --> T3 + T3 --> X + + style P fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style Y fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + +.. image:: diagrams/encoding_internals_2.svg + :alt: Diagram + :align: center + +Building the Transformation Chain +--------------------------------- + +.. code-block:: cpp + + template + __host__ __device__ auto make_ps_ys_to_xs_adaptor(const Encoding& encoding) + { + // Step 1: Create individual transforms + constexpr auto replicate_transform = make_replicate_transform( + encoding.get_rs_lengths()); + + constexpr auto unmerge_transform = make_unmerge_transform( + encoding.get_hs_lengthss()); + + constexpr auto merge_transform = make_merge_transform( + encoding.get_rhs_to_xs_mapping()); + + // Step 2: Chain transforms together + constexpr auto transform_chain = chain_transforms( + replicate_transform, + unmerge_transform, + merge_transform); + + // Step 3: Create adaptor with the chain + return make_tile_adaptor( + transform_chain, + encoding.get_lower_dimension_hidden_idss()); + } + +Transform Implementation Example +-------------------------------- + +.. code-block:: cpp + + // Replicate transform implementation + template + struct replicate_transform + { + static constexpr index_t num_of_upper_dimension = size(Lengths{}); + static constexpr index_t num_of_lower_dimension = 2 * num_of_upper_dimension; + + template + __host__ __device__ constexpr auto + calculate_lower_index(const UpperIndex& idx_upper) const + { + // Replicate each coordinate: [a,b] -> [a,b,0,0] + auto idx_lower = make_zero_multi_index(); + + static_for<0, num_of_upper_dimension, 1>{}([&](auto i) { + idx_lower(i) = idx_upper[i]; + idx_lower(i + num_of_upper_dimension) = 0; + }); + + return idx_lower; + } + }; + +Y to D Linearization +==================== + +The Y→D descriptor handles memory layout within each thread, building on :ref:`ck_tile_descriptors` concepts: + +.. code-block:: cpp + + template + struct ys_to_d_descriptor + { + static constexpr index_t num_of_dimension = size(YLengths{}); + + // Calculate linear offset from Y coordinates + template + __host__ __device__ constexpr index_t + calculate_offset(const YIndex& idx_y) const + { + index_t offset = 0; + + static_for<0, num_of_dimension, 1>{}([&](auto i) { + offset += idx_y[i] * YStrides{}[i]; + }); + + return offset; + } + + // Get element space size (total elements per thread) + __host__ __device__ static constexpr index_t + get_element_space_size() + { + return reduce_on_sequence( + YLengths{}, + multiplies{}, + number<1>{}); + } + }; + +Memory Layout Optimization +-------------------------- + +.. code-block:: cpp + + // Optimized layout for vector operations + template + struct make_ys_to_d_descriptor_for_gemm + { + // Layout: [M/VectorSize][N][VectorSize] + // This ensures vector loads are contiguous in memory + using type = tile_descriptor< + Sequence, + Sequence>; + }; + +Integration in Distributed Tensor +--------------------------------- + +This shows how the encoding integrates with :ref:`ck_tile_static_distributed_tensor`: + +.. 
code-block:: cpp + + template + struct static_distributed_tensor + { + using ys_to_d_descriptor = typename TileDistribution::ys_to_d_descriptor; + + // Thread-local storage + static constexpr index_t thread_buffer_size = + ys_to_d_descriptor::get_element_space_size(); + + DataType thread_buffer_[thread_buffer_size]; + + // Access element at Y coordinate + template + __host__ __device__ DataType& at(const YIndex& idx_y) + { + const index_t offset = ys_to_d_descriptor{}.calculate_offset(idx_y); + return thread_buffer_[offset]; + } + }; + +Practical Examples +================== + +Example 1: Simple 2x2 Distribution +---------------------------------- + +.. code-block:: cpp + + // No replication, simple hierarchy + using SimpleEncoding = tile_distribution_encoding< + Sequence<>, // rs_lengths: no replication + Tuple< // hs_lengthss: 2x2 hierarchy + Sequence<2>, + Sequence<2> + >, + Tuple, Sequence<>>, // ps_to_rhss_major + Tuple, Sequence<>>, // ps_to_rhss_minor + Sequence<1, 2>, // ys_to_rhs_major + Sequence<0, 0> // ys_to_rhs_minor + >; + +Example 2: GEMM Distribution +---------------------------- + +.. code-block:: cpp + + // Complex GEMM distribution with replication + template + using GemmBlockEncoding = tile_distribution_encoding< + Sequence<>, // No block-level replication + Tuple< // Hierarchical decomposition + Sequence, // M + Sequence // N + >, + Tuple< // Warp assignment + Sequence<1, 2>, // [warp_m, warp_n] + Sequence<> + >, + Tuple< + Sequence<1, 0>, // Major indices + Sequence<> + >, + Sequence<1, 1, 2, 2>, // Y mapping + Sequence<0, 1, 0, 1> // Y components + >; + +Performance Implications +======================== + +The encoding system is designed for maximum GPU performance. See :ref:`ck_tile_gpu_basics` for hardware fundamentals. + +Memory Access Patterns +---------------------- + +- **Coalescing**: Hierarchical decomposition ensures adjacent threads access adjacent memory +- **Bank Conflicts**: Careful dimension ordering prevents shared memory conflicts. See :ref:`ck_tile_lds_bank_conflicts` for more information. +- **Vectorization**: Natural support for vector loads and stores. See :ref:`ck_tile_load_store_traits` for more information. + +Register Efficiency +------------------- + +- **Optimal Allocation**: Y→D linearization minimizes register usage +- **Spill Avoidance**: Compile-time sizing prevents register spills +- **Reuse Patterns**: Encoding enables efficient register reuse + +Compile-Time Optimization +------------------------- + +.. 
code-block:: cpp + + // All encoding operations resolve at compile time + template + struct encoding_optimizer { + // Compute all derived values at compile time + static constexpr auto total_elements = /* computed */; + static constexpr auto access_pattern = /* computed */; + static constexpr auto memory_layout = /* computed */; + + // Generate optimized code paths + template + __device__ void apply_optimized(Func&& f) { + if constexpr (is_simple_pattern) { + // Direct access path + } else if constexpr (is_strided_pattern) { + // Strided access path + } else { + // General access path + } + } + }; + +Summary +======= + +The tile distribution encoding system demonstrates compile-time computation: + +- **Mathematical Foundation**: Complete specification through dimensional relationships +- **Zero Overhead**: All computations resolve at compile time +- **Composable Design**: Individual transforms compose into complex mappings +- **Hardware Alignment**: Natural mapping to GPU execution hierarchy +- **Performance Focus**: Every design decision optimizes for GPU efficiency + +The encoding internals show how CK Tile achieves practical performance. By leveraging C++ template metaprogramming and careful architectural design, the framework generates code that rivals hand-optimized implementations while maintaining clarity and composability. + +For practical examples of how the encoding system is used, see :ref:`ck_tile_thread_mapping`. For coordinate operations that build on these encodings, see :ref:`ck_tile_coordinate_movement`. diff --git a/docs/conceptual/ck_tile/hardware/gemm_optimization.rst b/docs/conceptual/ck_tile/hardware/gemm_optimization.rst new file mode 100644 index 0000000000..a31b6b7803 --- /dev/null +++ b/docs/conceptual/ck_tile/hardware/gemm_optimization.rst @@ -0,0 +1,385 @@ +.. meta:: + :description: Block GEMM optimization on MI300 using CK Tile + :keywords: GEMM, matrix multiplication, MI300, CK, Composable Kernel, GPU optimization + +.. _ck_tile_gemm_optimization: + +******************************************************************** +A Block GEMM on MI300 +******************************************************************** + +Introduction to GEMMs +===================== + +This document illustrates key concepts of implementing a block GEMM (General Matrix Multiplication) kernel on AMD's MI300 GPU. GEMM is a fundamental building block for many machine learning workloads, including attention mechanisms and Mixture of Experts (MoE) models. + +The problem addressed here is the standard matrix multiplication: :math:`C = A \cdot B`, where matrix A has dimensions **M x K** and matrix B has dimensions **K x N**. The resulting matrix C will have dimensions **M x N**. For simplicity and a better memory access pattern, it will be assumed that matrix B is in a column-major format, which means its shape is logically represented as **N x K**. + +Format and Dimensions +===================== + +The first step in designing the kernel is to select the data format and dimensions. + +Data Format: bf16 +----------------- + +While ``float32`` is a common choice, its high precision is computationally expensive and can be unnecessary for model convergence. A more suitable alternative is a half-precision floating-point format. We will use **bfloat16 (bf16)**. + +Bfloat16 is a 16-bit format that uses the same 8-bit exponent as ``float32``. This allows it to have the same dynamic range, which is critical for avoiding overflow and underflow during training. 
The key difference is that ``bf16`` uses only 7 bits for the mantissa (versus 23 bits in ``float32``), which makes it functionally equivalent to a simple right bit-shift of a 32-bit float: ``(float32 >> 16)``. + +Dimensions: M=4864, N=4096 +-------------------------- + +To maximize hardware utilization, the dimensions are chosen so that the work divides evenly across the GPU's resources. For this example, **M = 4864** and **N = 4096** are used. The rationale behind these particular values will be explained later. + +Input data +---------- + +The input will be uniformly distributed random data on the interval [-1, 1]: + +.. code-block:: cpp + + initializeMatrix(A.data(), M, K, -1.0, 1.0); + initializeMatrix(B.data(), N, K, -1.0, 1.0); + +Simple Matmul +============= + +On the AMD **MI300** GPU series (see :ref:`ck_tile_gpu_basics`), each Compute Unit (CU) contains **four SIMD units**. Each SIMD unit can execute a single **wavefront** of 64 threads in parallel. Since there are four wavefronts per CU, a CU can therefore sustain the execution of up to **256 concurrent threads**. + +These 256 threads can then be logically grouped into a **thread block**, which is responsible for computing a **sub-block (tile)** of the output matrix ``C``. A block of 256 threads can be arranged as a **16×16 thread block**, where each thread computes one element of a **16×16 tile** of the result matrix ``C``. Multiple thread blocks are then organized into a **grid**, such that the collection of blocks covers the entire output matrix. + +Consider a baseline matrix multiplication kernel where **each thread computes one output element** of ``C``. The HIP launch configuration can be defined as: + +.. code-block:: cpp + + dim3 blockSizeRef(16, 16); + dim3 gridSizeRef((N + blockSizeRef.x - 1) / blockSizeRef.x, + (M + blockSizeRef.y - 1) / blockSizeRef.y); + + matrixMulHIP<<<gridSizeRef, blockSizeRef>>>(d_A, d_B, d_C); + +And the GPU kernel: + +.. code-block:: cpp + + __global__ void matrixMulHIP(s_type* __restrict__ A, + s_type* __restrict__ B, + float* __restrict__ C) + { + // Calculate global thread coordinates in output matrix C + int row = blockIdx.y * blockDim.y + threadIdx.y; + int col = blockIdx.x * blockDim.x + threadIdx.x; + + // Boundary check for valid threads + if (row < M && col < N) { + float value = 0.0f; + // Perform the dot product of row from A and column from B + for (int k = 0; k < K; ++k) { + value += A[row * K + k] * B[col * K + k]; + } + // Store computed value in output matrix + C[row * N + col] = value; + } + } + +This kernel has very low compute throughput according to ``rocprofv3`` profiler output. It stalls on global memory read transactions, effectively starving the rest of the pipeline of the data it needs to proceed. + +Memory Bandwidth Analysis +------------------------- + +In a naïve implementation of matrix multiplication, **pressure on global memory loads** quickly becomes the bottleneck. To understand why, it is necessary to look at how a single **16×16 block** of the destination matrix ``C`` is computed by one block of threads within a compute unit. + +Each thread in the block is responsible for computing a single element of ``C``. To do so, it loops over the ``K`` dimension and, in every iteration, fetches **two values** from global memory: + +- one from a row of ``A`` +- one from a column of ``B`` + +This means: + +- Number of threads in a 16×16 block is 256. 
+- Each thread performs 2K global loads
+- **Total global loads** = 256 × 2K = 512K
+- **Total global stores** = 256 (one per output element in ``C``)
+
+To reuse each element of ``A`` and ``B`` perfectly (loading each only once), the unique data required would be:
+
+- Unique ``A`` elements: 16 × K = 16K
+- Unique ``B`` elements: 16 × K = 16K
+- **Total unique loads** = 16K + 16K = 32K
+- **Total stores** = 256
+
+Comparing the two:
+
+- **Naïve kernel**: 512K global loads + 256 stores
+- **Ideal reuse**: 32K global loads + 256 stores
+
+This illustrates a **16× difference in memory traffic** for the same computation on a small, 16×16 block.
+
+What is Tiling?
+===============
+
+Cooperative Loading with LDS
+----------------------------
+
+In the naïve implementation, threads within the same compute unit (CU) do not cooperate with each other at all. Each thread independently and greedily loads the row elements of ``A`` and the column elements of ``B`` that it needs in order to compute its corresponding value in ``C``.
+
+Each CU on the MI300 has **64 KB of Local Data Share (LDS)** (see :ref:`ck_tile_lds_bank_conflicts` for optimization techniques) that acts as a shared memory space accessible by all threads in that CU. This opens the possibility of **cooperative loading**.
+
+Instead of having every thread repeatedly fetch its own data directly from global memory, threads can **collaboratively preload** a block of data into LDS. Once in LDS, this data can be reused by many threads, reducing redundant global memory fetches.
+
+Entire rows or columns of ``A`` and ``B`` can't be preloaded into LDS, since they might be very large and LDS has a fixed capacity. The solution is to load **small blocks (tiles)** of data at a time. For example:
+
+- Load a **16×16 tile** from ``A`` and ``B`` into LDS
+- Allow all threads in the CU to reuse the data from that tile to compute their portion of the result
+- Once done, move the tile window forward along the ``K`` dimension
+- Repeat until the entire **16×16 output block** of ``C`` is computed
+
+This technique of **tiling with cooperative loading** reduces global memory traffic and improves GPU efficiency by using the fast, on-chip LDS for data reuse.
+
+Tiling Mathematics
+------------------
+
+How many elements of matrices A and B need to be loaded with the tiling approach?
+
+For a thread block computing a ``TILE_M × TILE_N`` output tile with K-blocking:
+
+- Elements of **A** loaded per block:
+
+  .. math::
+     \text{A\_loads} = \mathrm{TILE\_M} \cdot K
+
+- Elements of **B** loaded per block:
+
+  .. math::
+     \text{B\_loads} = \mathrm{TILE\_N} \cdot K
+
+- Total outputs produced per block:
+
+  .. math::
+     \text{outputs} = \mathrm{TILE\_M} \cdot \mathrm{TILE\_N}
+
+The **average loads per output element** (ignoring C traffic) are:
+
+.. math::
+   \text{loads per output} = \frac{\mathrm{TILE\_M}\cdot K + \mathrm{TILE\_N}\cdot K}{\mathrm{TILE\_M} \cdot \mathrm{TILE\_N}} = K \left(\frac{1}{\mathrm{TILE\_M}} + \frac{1}{\mathrm{TILE\_N}}\right)
+
+To simplify the formula, consider a square tile of size T. To compute one value in C:
+
+- Naïve (no tiling) = 2K loads per output.
+- With tiling = 2K/T.
+- **Reduction factor = T**.
+
+Example: T=16
+
+.. math::
+   \text{loads per output} = \frac{2K}{16} = \frac{K}{8}
+
+Compared to the naïve 2K, this gives a **16× reduction** in global memory traffic per output element.
+
+LDS Usage and Tiling Efficiency
+-------------------------------
+
+How much space in LDS would this tiling use?
+Matrices **A** and **B** store data in **bf16** format. For a small 16×16 tile:
+
+- Each matrix contains 16 × 16 = 256 elements.
+- At 2 bytes per element, each matrix occupies 256 × 2 = 512 bytes.
+- Total for A and B: 512 × 2 = 1 KB.
+
+There is much more space in LDS, so why not try a bigger tile size? 32 KB can be used for each matrix, which allows the tile size to be increased to **256×64**. With this tile size, each compute unit (CU) will output a **256×256 block in C**. With this approach, the number of global memory reads will be **256 times smaller per element in C** compared to the brute-force approach.
+
+Variation of the GEMM in Inference
+----------------------------------
+
+When implementing GEMM for inference, the B matrix holds the weights and is therefore static, so it can be preshuffled ahead of time into the warp-level GEMM MFMA shape, giving the registers faster access for the MFMA operations. This strategy enables the following optimizations:
+
+- Shared memory bypass for the B matrix.
+- Loop over the A matrix stored in shared memory while B stays in registers.
+- Ping-pong buffering for the GEMM pipeline.
+
+
+Utilization Considerations
+--------------------------
+
+This section explains why the input dimensions **M = 4864** and **N = 4096** are convenient choices.
+
+The MI300 has **304 compute units (CUs)**. If a tile size of **256×64** is chosen, where the **K dimension** is iterated over, then the output grid size is:
+
+.. code-block:: text
+
+    M / 256 × N / 256 = 4864 / 256 × 4096 / 256 = 19 × 16 = 304
+
+This matches the total number of compute units on the GPU. That means every CU can be fully occupied with one tile of work, and imbalance or underutilization is not as much of a concern.
+
+Advanced Optimizations
+======================
+
+Matrix Fused Multiply-Add
+-------------------------
+
+Because the compute-to-memory-access ratio can also be a bottleneck, optimizing for bandwidth alone isn't enough.
+
+GPUs offer dedicated **matrix (or tensor) cores** for multiplication tasks. These cores are specifically designed to accelerate matrix operations.
+
+To take full advantage of these specialized cores, intrinsic instructions can be used. Intrinsic instructions are hardware-specific functions that allow for direct access to the matrix core pipelines. This example uses ``__builtin_amdgcn_mfma_f32_16x16x16f16``, which has a low latency of only 16 cycles.
+
+The instruction takes 16×16 matrices as input and produces a 16×16 matrix as output. These instructions work as *accumulate-add*; effectively they compute ``D = A*B + C``. This is useful in this example because results are accumulated over multiple tiles along the K dimension.
+
+Optimizing Data Flow with Pipelining
+------------------------------------
+
+To maximize performance, this kernel uses a **pipeline** with **double buffering** to keep the compute units continuously fed with data, reducing idle time. The pipeline consists of a series of stages that process data concurrently:
+
+* **Stage 1: Global Memory to Registers:** The first stage involves pre-loading data directly from **global memory** into Vector General Purpose Registers (VGPR). This is the slowest part of the pipeline. Because of this, the operation is issued as early as possible.
+
+* **Stage 2: Registers to LDS (Shared Memory):** As data is being loaded from global memory, the next stage of the pipeline moves the data from the VGPRs into **LDS (Local Data Share)**, or shared memory. This is an intermediate step that makes the data accessible to all threads within the workgroup at very low latency.
+
+* **Stage 3: LDS to Registers:** With the data now in LDS, the data is transferred from LDS back into a different set of VGPR registers, which serve as the direct input for the compute operations.
+
+* **Stage 4: Computation with MFMA:** The Matrix-FMA (MFMA) intrinsic uses the data from the VGPRs to perform the actual matrix multiplication and accumulation.
+
+By using this pipelined approach, the different stages of data movement and computation happen in parallel. While the current VGPRs are being consumed by the MFMA operation, the next set of data is already being moved from LDS to another set of VGPRs, and the next tile of data is being loaded from global memory into a third set of VGPRs. This overlapping of operations is key to keeping the GPU's compute units fully utilized.
+
+CK Tile Implementation
+======================
+
+Here's how CK Tile implements an optimized GEMM kernel:
+
+.. code-block:: cpp
+
+    template <typename ADataType, typename BDataType, typename CDataType,
+              index_t MPerBlock, index_t NPerBlock, index_t KPerBlock>
+    __global__ void ck_tile_gemm_kernel(const ADataType* __restrict__ a_global,
+                                        const BDataType* __restrict__ b_global,
+                                        CDataType* __restrict__ c_global,
+                                        index_t M,
+                                        index_t N,
+                                        index_t K)
+    {
+        // Define tile distribution encoding
+        // See :ref:`ck_tile_encoding_internals` and :ref:`ck_tile_tile_distribution`
+        using Encoding = tile_distribution_encoding<
+            sequence<>,                             // No replication
+            tuple<sequence<4, 2, 8, 4>,             // M dimension hierarchy
+                  sequence<4, 2, 8, 4>>,            // N dimension hierarchy
+            tuple<sequence<1, 2>, sequence<1, 2>>,  // Thread mapping
+            tuple<sequence<2, 2>, sequence<2, 2>>,  // Minor indices
+            sequence<1, 1, 2, 2>,                   // Y-space mapping
+            sequence<0, 3, 0, 3>                    // Y-space minor
+        >;
+
+        constexpr auto tile_dist = make_static_tile_distribution(Encoding{});
+
+        // Create tensor views for global memory
+        // See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_buffer_views`
+        auto a_global_view = make_naive_tensor_view(
+            a_global, make_tuple(M, K), make_tuple(K, 1));
+        auto b_global_view = make_naive_tensor_view(
+            b_global, make_tuple(N, K), make_tuple(K, 1));
+        auto c_global_view = make_naive_tensor_view(
+            c_global, make_tuple(M, N), make_tuple(N, 1));
+
+        // Calculate block offset
+        const index_t block_m_id = blockIdx.y;
+        const index_t block_n_id = blockIdx.x;
+
+        // Create tile windows for loading
+        // See :ref:`ck_tile_tile_window` for tile window details
+        auto a_window = make_tile_window(
+            a_global_view,
+            make_tuple(number<MPerBlock>{}, number<KPerBlock>{}),
+            make_tuple(block_m_id * MPerBlock, 0),
+            tile_dist);
+
+        auto b_window = make_tile_window(
+            b_global_view,
+            make_tuple(number<NPerBlock>{}, number<KPerBlock>{}),
+            make_tuple(block_n_id * NPerBlock, 0),
+            tile_dist);
+
+        // Allocate LDS storage
+        // See :ref:`ck_tile_static_distributed_tensor` for distributed tensors
+        auto a_lds = make_static_distributed_tensor<ADataType, decltype(tile_dist)>();
+        auto b_lds = make_static_distributed_tensor<BDataType, decltype(tile_dist)>();
+
+        // Initialize accumulator
+        auto c_reg = make_static_distributed_tensor<CDataType, decltype(tile_dist)>();
+        // See :ref:`ck_tile_sweep_tile` for sweep operations
+        sweep_tile(c_reg, [](auto idx, auto& val) { val = 0; });
+
+        // Main GEMM loop with pipelining
+        const index_t num_k_tiles = K / KPerBlock;
+
+        // Preload first tile
+        a_window.load(a_lds);
+        b_window.load(b_lds);
+        __syncthreads();
+
+        // Pipeline loop
+        for(index_t k_tile = 0; k_tile < num_k_tiles - 1; ++k_tile) {
+            // Move windows for next iteration
+            // See :ref:`ck_tile_coordinate_movement` for window movement
+            a_window.move_slice_window(make_tuple(0, KPerBlock));
+            b_window.move_slice_window(make_tuple(0, KPerBlock));
+
+            // Prefetch next tile while computing current
+            auto a_lds_next = make_static_distributed_tensor<ADataType, decltype(tile_dist)>();
+            auto b_lds_next = make_static_distributed_tensor<BDataType, decltype(tile_dist)>();
+
+            a_window.load_async(a_lds_next);
+            b_window.load_async(b_lds_next);
+
+            // Compute with current tile
+            gemm_tile(a_lds, b_lds, c_reg);
+
+            // Wait for prefetch and swap buffers
+            __syncthreads();
+            a_lds = a_lds_next;
+            b_lds = b_lds_next;
+        }
+
+        // Last tile computation
+        gemm_tile(a_lds, b_lds, c_reg);
+
+        // Store result
+        auto c_window = make_tile_window(
+            c_global_view,
+            make_tuple(number<MPerBlock>{}, number<NPerBlock>{}),
+            make_tuple(block_m_id * MPerBlock, block_n_id * NPerBlock),
+            tile_dist);
+
+        c_window.store(c_reg);
+    }
+
+
+Key Takeaways
+=============
+
+1. **Tiling is essential**: Reduces memory traffic by orders of magnitude
+2. **Use specialized hardware**: MFMA instructions provide massive speedup
+3. **Pipeline operations**: Hide memory latency with computation
+4. **CK Tile abstractions**: Automatically handle complex optimizations
+5. **Hardware-aware dimensions**: Choose problem sizes that map well to CU count
+
+By understanding these optimization techniques and using CK Tile's high-level abstractions, developers can improve performance on GPUs without manual low-level optimization.
+
+Related Topics
+==============
+
+- :ref:`ck_tile_tile_distribution` - Core distribution mechanism used in GEMM
+- :ref:`ck_tile_tile_window` - Window-based data access patterns
+- :ref:`ck_tile_static_distributed_tensor` - LDS memory management for tiles
+- :ref:`ck_tile_lds_bank_conflicts` - Avoiding bank conflicts in GEMM
+- :ref:`ck_tile_thread_mapping` - How threads map to GEMM computation
+- :ref:`ck_tile_load_store_traits` - Optimized memory access patterns
+- :ref:`ck_tile_space_filling_curve` - Advanced traversal patterns
+- :ref:`ck_tile_sweep_tile` - Iterating over distributed data
+- :ref:`ck_tile_gpu_basics` - Understanding the hardware
+- :ref:`ck_tile_coordinate_systems` - Mathematical foundation
diff --git a/docs/conceptual/ck_tile/hardware/gpu_basics.rst b/docs/conceptual/ck_tile/hardware/gpu_basics.rst
new file mode 100644
index 0000000000..c8109c8200
--- /dev/null
+++ b/docs/conceptual/ck_tile/hardware/gpu_basics.rst
@@ -0,0 +1,38 @@
+.. meta::
+   :description: Introduction to AMD CDNA Architecture for CK developers
+   :keywords: CDNA, RDNA, ROCm, CK, Composable Kernel, GPU architecture, compute units
+
+.. _ck_tile_gpu_basics:
+
+********************************************************************
+Intro to AMD CDNA Architecture
+********************************************************************
+
+The AMD CDNA architecture is a specialized GPU design for high-performance computing (HPC) and AI workloads. Unlike the RDNA architecture used in gaming GPUs, CDNA is optimized for data center tasks, prioritizing compute density, memory bandwidth, and scalability. This is achieved through several key architectural features.
+
+For more information about the AMD GPU architecture, see the `GPU architecture documentation `_.
+
+Implications for CK Tile
+========================
+
+Understanding the CDNA architecture is crucial for effective use of CK Tile:
+
+1. **Thread Organization**: CK Tile's hierarchical :ref:`ck_tile_thread_mapping` (blocks → warps → threads) directly maps to CDNA's hardware organization.
+
+2. **Memory Hierarchy**: CK Tile's :ref:`ck_tile_buffer_views` and :ref:`ck_tile_tile_window` are designed to efficiently utilize the L2, Infinity Cache, and LDS hierarchy.
+
+3. **Register Pressure**: CK Tile's compile-time optimizations help minimize VGPR usage, preventing spills to slower memory.
+
+4. **Warp Execution**: CK Tile's :ref:`ck_tile_tile_distribution` ensures that threads within a warp access contiguous memory for optimal SIMD execution.
+
+5. **LDS Utilization**: CK Tile's :ref:`ck_tile_static_distributed_tensor` and :ref:`ck_tile_tile_window` make effective use of the 64KB LDS per CU.
+
+By understanding these architectural features, developers can better appreciate how CK Tile's abstractions map to hardware capabilities and why certain design decisions were made in the framework.
+
+Related Topics
+==============
+
+- :ref:`ck_tile_thread_mapping` - How threads are organized and mapped to hardware
+- :ref:`ck_tile_coordinate_systems` - Mathematical foundation for data distribution
+- :ref:`ck_tile_lds_bank_conflicts` - Optimizing shared memory access patterns
+- :ref:`ck_tile_load_store_traits` - Memory access optimization strategies
+- :ref:`ck_tile_gemm_optimization` - Practical application of architecture knowledge
diff --git a/docs/conceptual/ck_tile/hardware/index.rst b/docs/conceptual/ck_tile/hardware/index.rst
new file mode 100644
index 0000000000..d9191c7298
--- /dev/null
+++ b/docs/conceptual/ck_tile/hardware/index.rst
@@ -0,0 +1,127 @@
+.. meta::
+   :description: CK Tile Hardware-Specific Documentation
+   :keywords: CDNA, GPU architecture, LDS, GEMM, CK, Composable Kernel
+
+.. _ck_tile_hardware:
+
+********************************************************************
+CK Tile Hardware Documentation
+********************************************************************
+
+This section provides in-depth coverage of hardware-specific concepts and optimizations for CK Tile on AMD GPUs.
+
+Overview
+========
+
+Understanding the underlying hardware architecture is crucial for achieving optimal performance with CK Tile. This documentation covers:
+
+- AMD CDNA architecture fundamentals
+- Memory hierarchy and optimization techniques
+- Practical examples of high-performance kernels
+
+Documentation Structure
+=======================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Hardware Topics
+
+   gpu_basics
+   lds_bank_conflicts
+   gemm_optimization
+
+GPU Architecture Basics
+-----------------------
+
+:ref:`ck_tile_gpu_basics` provides an introduction to AMD CDNA architecture.
+
+LDS and Bank Conflicts
+----------------------
+
+:ref:`ck_tile_lds_bank_conflicts` explains Local Data Share (LDS) optimization.
+
+GEMM Optimization Case Study
+----------------------------
+
+:ref:`ck_tile_gemm_optimization` demonstrates a complete optimization journey.
+
+
+Key Hardware Considerations
+===========================
+
+
+Memory Hierarchy
+----------------
+
+1. **Global Memory**: High latency, high bandwidth
+
+   - Optimize with coalesced access patterns
+   - Use tile windows for automatic optimization
+
+2. **L2/Infinity Cache**: Intermediate storage
+
+   - Benefits from spatial and temporal locality
+   - CK Tile's tiling naturally improves cache hit rates
+
+3. **LDS**: Low latency, shared within CU
+
+   - 64KB per CU, organized in 32 banks
+   - CK Tile handles bank conflict avoidance
+
+4. **Registers**: Lowest latency, per-thread storage
+
+   - 512 VGPRs available per wavefront
+   - CK Tile's compile-time optimization minimizes usage
+
+Compute Resources
+-----------------
+
+1. **Wavefront Execution**: 64 threads in lockstep
+
+   - CK Tile ensures coalesced memory access
+   - Automatic warp-level synchronization
+
+2. **Matrix Units**: Specialized MFMA instructions
+
+   - 16x16x16 operations in 16 cycles
+   - CK Tile can leverage these automatically
+
+3. **Occupancy**: Balancing threads vs. resources
+
+   - Register pressure affects occupancy
+   - CK Tile helps through efficient register use
+
+Performance Guidelines
+======================
+
+To achieve optimal performance with CK Tile:
+
+1. **Choose appropriate tile sizes**:
+
+   - Match hardware capabilities (e.g., 256x256 for GEMM)
+   - Consider LDS capacity and register pressure
+
+2. **Align problem dimensions**:
+
+   - Match CU count when possible (304 for MI300)
+   - Use padding for non-aligned sizes
+
+3. **Enable pipelining**:
+
+   - Use double buffering for latency hiding
+   - CK Tile supports async operations
+
+4. **Profile and verify**:
+
+   - Use rocprof to check for bottlenecks
+   - Verify bank conflict avoidance
+   - Monitor occupancy and register usage
+
+Next Steps
+==========
+
+- Review :ref:`ck_tile_gpu_basics` for architecture fundamentals
+- Study :ref:`ck_tile_lds_bank_conflicts` for shared memory optimization
+- Explore :ref:`ck_tile_gemm_optimization` for a complete optimization example
+
+For practical implementation, refer back to the main :ref:`ck_tile_conceptual` documentation to see how these hardware concepts integrate with CK Tile's abstractions.
diff --git a/docs/conceptual/ck_tile/hardware/lds_bank_conflicts.rst b/docs/conceptual/ck_tile/hardware/lds_bank_conflicts.rst
new file mode 100644
index 0000000000..8802fba9e8
--- /dev/null
+++ b/docs/conceptual/ck_tile/hardware/lds_bank_conflicts.rst
@@ -0,0 +1,209 @@
+.. meta::
+   :description: Understanding AMD GPU LDS and Bank Conflicts in CK Tile
+   :keywords: LDS, bank conflicts, shared memory, CK, Composable Kernel, GPU optimization
+
+.. _ck_tile_lds_bank_conflicts:
+
+********************************************************************
+Understanding AMD GPU LDS and Bank Conflicts
+********************************************************************
+
+Introduction
+============
+
+Local Data Share (**LDS**) is AMD's shared memory within a compute unit (see :ref:`ck_tile_gpu_basics` for architecture details). It is organized into **32 or 64 banks** depending on the hardware architecture, and each bank is 4 bytes wide. Understanding how memory addresses map to banks is key to avoiding **bank conflicts**.
+
+Bank Mapping
+============
+
+For the AMD GCN architecture, the LDS bank mapping is typically:
+
+.. math::
+
+   \text{bank} = \left( \frac{\text{address in bytes}}{4} \right) \bmod 32
+
+This means:
+
+- Addresses that differ by a multiple of ``number of banks * 4 bytes`` map to the same bank.
+- Conflicts occur when multiple threads in the same wave access the same bank **in the same cycle**.
+
+Not every pair of lanes can produce a bank conflict. The hardware divides a wavefront's LDS access into phases, and which lanes fall into each phase depends on the width of the instruction. Consider ``ds_write_b128`` as an example, since it is the widest write instruction and offers the highest performance. Here the access of a 64-lane wavefront is divided into 8 phases. If no two threads within a phase access the same bank, there is no bank conflict:
+
+- lane0~lane7
+- lane8~lane15
+- lane16~lane23
+- lane24~lane31
+- lane32~lane39
+- lane40~lane47
+- lane48~lane55
+- lane56~lane63
+
+If there is no conflict within each group of lanes, the write access is free of LDS bank conflicts.
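+
+To make this phase model concrete, the following host-side sketch (illustrative only, not CK code) checks one wavefront's worth of per-lane LDS byte addresses for conflicts, assuming 32 banks of 4 bytes and a 16-byte (``ds_write_b128``) access per lane:
+
+.. code-block:: cpp
+
+    #include <array>
+
+    // Illustrative only: detect LDS bank conflicts for a 64-lane wavefront,
+    // assuming 32 banks x 4 bytes, 8 lanes per phase, 16 bytes written per lane.
+    bool has_bank_conflict(const std::array<unsigned, 64>& lane_byte_addr)
+    {
+        constexpr unsigned num_banks       = 32;
+        constexpr unsigned lanes_per_phase = 8;
+        constexpr unsigned bytes_per_lane  = 16; // ds_write_b128 touches 4 consecutive banks
+
+        for (unsigned phase = 0; phase < 64 / lanes_per_phase; ++phase) {
+            unsigned banks_used = 0; // bitmask of banks touched in this phase
+            for (unsigned lane = 0; lane < lanes_per_phase; ++lane) {
+                unsigned base = lane_byte_addr[phase * lanes_per_phase + lane] / 4;
+                for (unsigned b = 0; b < bytes_per_lane / 4; ++b) {
+                    unsigned bank = (base + b) % num_banks;
+                    if (banks_used & (1u << bank)) {
+                        return true; // two accesses in the same phase hit the same bank
+                    }
+                    banks_used |= (1u << bank);
+                }
+            }
+        }
+        return false;
+    }
+
+For the fully contiguous case, where lane ``i`` writes to byte address ``i * 16``, every phase touches 32 distinct banks and the check reports no conflict, matching the description above.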
+
+Bank Access Patterns
+====================
+
+LDS bank access can be simulated for a given set of thread addresses. With a 32-bank LDS and 4 bytes per bank, each thread writes 8 two-byte elements (16 bytes total), consuming 4 banks in LDS. fp16 and bf16 are the formats GPU kernels most commonly have to deal with. With a phase access pattern like the one above, the LDS write access is bank conflict free by default.
+
+Write Access Pattern
+--------------------
+
+For LDS write instructions like ``ds_write_b128``, the hardware provides conflict-free access when threads write to consecutive addresses. Each phase of 8 lanes writes to different banks, avoiding conflicts.
+
+Read Access Pattern
+-------------------
+
+Similarly, for the LDS read instruction ``ds_read_b128``, when there is no bank conflict within any of these 8 lane groups:
+
+- 0:3+20:23
+- 4:7+16:19
+- 8:11+28:31
+- 12:15+24:27
+- 32:35+52:55
+- 36:39+48:51
+- 40:43+60:63
+- 44:47+56:59
+
+then the LDS read is bank conflict free.
+
+The reason for accessing the data vertically is that in most cases the LDS read feeds an MFMA instruction in the next step, and the MFMA operands are required to be gathered vertically, as in the grouping above.
+
+This kind of LDS read access pattern is typical for LDS usage in machine learning workloads. The read pattern can generate 4-way bank conflicts in every phase of access. You can experiment with ``row_padding`` (padding by a number of banks) to see whether the problem can be solved this way, but remember that in practice this requires additional LDS storage. The bigger the padding, the more additional storage is necessary.
+
+XOR Preshuffle: An Alternative to Padding
+=========================================
+
+Another technique to reduce LDS bank conflicts is **XOR preshuffling** (see :ref:`ck_tile_lds_index_swapping` for detailed implementation). Instead of adding padding between rows, we can permute the column indices for each row using XOR. This method helps avoid bank conflicts without allocating extra storage in LDS.
+
+For a wavefront of 64 threads, if each thread writes a vector of 8 fp16 elements (16 bytes), and the row size is 64 elements, the column index for each element in a row is adjusted as follows:
+
+- ``KTypeSize = 2``
+- ``KPerBlock = 64`` // 64 elements per row
+- ``KPack = 8`` // 8 elements per thread
+
+The adjusted column position for element ``(x, y)`` is:
+
+.. math::
+
+   x' = \left( y \bmod \frac{\text{KPerBlock}}{\text{KPack}} \right) \oplus x
+
+where :math:`\oplus` is the bitwise XOR, and :math:`x, y` are the original positions of a vector element with respect to the LDS banks.
+
+C++ Implementation
+==================
+
+Here's how CK implements XOR preshuffling:
+
+.. code-block:: cpp
+
+    // XOR-based column index adjustment
+    template <index_t KPerBlock, index_t KPack>
+    __device__ constexpr index_t xor_preshuffle(index_t row, index_t col)
+    {
+        constexpr index_t num_cols = KPerBlock / KPack;
+        return (row % num_cols) ^ col;
+    }
+
+    // LDS write with XOR preshuffle
+    template <typename DataType, index_t RowStride>
+    __device__ void lds_write_with_xor(DataType* lds_ptr,
+                                       const DataType* src,
+                                       index_t row,
+                                       index_t col)
+    {
+        // Apply XOR preshuffle to column index
+        index_t col_xor = xor_preshuffle<64, 8>(row, col);
+
+        // Write to LDS with adjusted column
+        index_t offset = row * RowStride + col_xor * 8;
+
+        // Vectorized write (assuming 128-bit write)
+        *reinterpret_cast<float4*>(lds_ptr + offset) =
+            *reinterpret_cast<const float4*>(src);
+    }
+
+    // LDS read with XOR preshuffle
+    template <typename DataType, index_t RowStride>
+    __device__ void lds_read_with_xor(DataType* dst,
+                                      const DataType* lds_ptr,
+                                      index_t row,
+                                      index_t col)
+    {
+        // Apply same XOR preshuffle for read
+        index_t col_xor = xor_preshuffle<64, 8>(row, col);
+
+        // Read from LDS with adjusted column
+        index_t offset = row * RowStride + col_xor * 8;
+
+        // Vectorized read
+        *reinterpret_cast<float4*>(dst) =
+            *reinterpret_cast<const float4*>(lds_ptr + offset);
+    }
+
+Integration with CK Tile
+========================
+
+CK Tile handles LDS bank conflict avoidance through its abstractions:
+
+1. **TileWindow** (:ref:`ck_tile_tile_window`): Automatically applies XOR preshuffling when loading/storing to LDS
+2. **StaticDistributedTensor** (:ref:`ck_tile_static_distributed_tensor`): Manages LDS allocation with proper alignment
+3. **LoadStoreTraits** (:ref:`ck_tile_load_store_traits`): Selects optimal access patterns to minimize conflicts
+
+Example usage in CK Tile:
+
+.. code-block:: cpp
+
+    // CK Tile automatically handles bank conflict avoidance
+    template <typename element_type>
+    __device__ void gemm_kernel()
+    {
+        // Create tile window with automatic XOR preshuffle
+        auto a_window = make_tile_window(
+            a_tensor_view,
+            tile_size,
+            origin,
+            tile_distribution);
+
+        // Load to LDS - XOR preshuffle applied automatically
+        auto a_lds_tensor = make_static_distributed_tensor<
+            element_type,
+            decltype(tile_distribution)>();
+
+        a_window.load(a_lds_tensor);
+
+        // Subsequent reads from LDS are conflict-free
+        // See :ref:`ck_tile_sweep_tile` for sweep operations
+        sweep_tile(a_lds_tensor, [](auto idx, auto& val) {
+            // Process data...
+        });
+    }
+
+Performance Impact
+==================
+
+Proper LDS bank conflict avoidance can have significant performance impact:
+
+- **4-way conflicts**: Can reduce effective LDS bandwidth by 75%
+- **XOR preshuffle**: Restores full bandwidth with zero storage overhead
+- **Padding**: Also effective but requires 12.5-25% more LDS storage
+
+Best Practices
+==============
+
+1. **Use CK Tile abstractions**: They automatically handle bank conflict avoidance
+2. **Prefer XOR preshuffle**: No storage overhead compared to padding
+3. **Verify with profiling**: Use rocprof to check for LDS bank conflicts
+4. **Consider access patterns**: Design algorithms with bank-friendly patterns
+
+By understanding LDS bank conflicts and using CK Tile's automatic conflict avoidance mechanisms, developers can achieve optimal shared memory performance without manual optimization.
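+
+As a quick sanity check of the XOR preshuffle described above, the following host-side sketch (illustrative only, not CK code) applies the remapping with ``KPerBlock/KPack = 8`` and verifies that, within every row, it is a pure permutation of the vector columns, which is why no extra LDS storage is needed:
+
+.. code-block:: cpp
+
+    #include <array>
+    #include <cassert>
+
+    // Illustrative only: verify that x' = (y % (KPerBlock / KPack)) ^ x
+    // permutes the KPerBlock/KPack vector columns within every row y.
+    int main()
+    {
+        constexpr int num_cols = 64 / 8; // KPerBlock / KPack = 8 vector columns per row
+        constexpr int num_rows = 32;     // arbitrary number of LDS rows to check
+
+        for (int y = 0; y < num_rows; ++y) {
+            std::array<bool, num_cols> seen{};
+            for (int x = 0; x < num_cols; ++x) {
+                int x_shuffled = (y % num_cols) ^ x; // XOR preshuffle of the column index
+                assert(x_shuffled >= 0 && x_shuffled < num_cols);
+                assert(!seen[x_shuffled]);           // each column is hit exactly once
+                seen[x_shuffled] = true;
+            }
+        }
+        return 0;
+    }
+
+Because XOR with a fixed value is a bijection on a power-of-two range, every row still contains exactly one copy of each column, just in a different order.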
+ +Related Topics +============== + +- :ref:`ck_tile_lds_index_swapping` - Detailed XOR preshuffle implementation +- :ref:`ck_tile_swizzling_example` - Morton ordering for memory swizzling +- :ref:`ck_tile_gpu_basics` - Understanding AMD GPU architecture +- :ref:`ck_tile_tile_window` - Automatic conflict avoidance in data access +- :ref:`ck_tile_static_distributed_tensor` - LDS memory management +- :ref:`ck_tile_gemm_optimization` - Practical application in GEMM kernels +- :ref:`ck_tile_transforms` - Coordinate transformations for conflict avoidance diff --git a/docs/conceptual/ck_tile/index.rst b/docs/conceptual/ck_tile/index.rst new file mode 100644 index 0000000000..287143d6de --- /dev/null +++ b/docs/conceptual/ck_tile/index.rst @@ -0,0 +1,108 @@ +.. _ck_tile_conceptual: + +CK Tile Conceptual Documentation +================================ + +Welcome to the conceptual documentation for CK Tile, the core abstraction layer of Composable Kernel that enables efficient GPU programming through compile-time coordinate transformations and tile-based data distribution. + +See the :ref:`ck_tile_index` for the complete CK Tile documentation structure. + +Overview +-------- + +CK Tile provides a mathematical framework for expressing complex GPU computations through: + +- **Automatic Memory Coalescing**: Ensures optimal memory access patterns without manual optimization +- **Thread Cooperation**: Coordinates work distribution across the GPU's hierarchical execution model +- **Zero-Overhead Abstractions**: Compile-time optimizations ensure no runtime performance penalty +- **Portable Performance**: Same code achieves high performance across different GPU architectures + +Why CK Tile? +------------ + +Traditional GPU programming requires manual management of: + +- Thread-to-data mapping calculations +- Memory coalescing patterns +- Bank conflict avoidance +- Boundary condition handling + +CK Tile automates all of these concerns through a unified abstraction that maps logical problem coordinates to physical GPU resources. + + +Learning Path +------------- + +1. **Start Here**: :ref:`ck_tile_introduction` + + The fundamental problems CK Tile solves and why it's essential for efficient GPU programming. + +2. **Foundation**: :ref:`ck_tile_buffer_views` + + How CK Tile provides structured access to raw GPU memory across different address spaces. + +3. **Multi-Dimensional Views**: :ref:`ck_tile_tensor_views` + + How to work with multi-dimensional data structures and memory layouts. + +4. **Core API**: :ref:`ck_tile_distribution` + + The tile distribution system that maps work to GPU threads. + +5. **Mathematical Framework**: :ref:`ck_tile_coordinate_systems` + + The coordinate transformation system that powers CK Tile's abstractions. + +6. **Reference**: :ref:`ck_tile_terminology` + + Glossary of all terms and concepts used in CK Tile. + + +Key Concepts at a Glance +------------------------ + +**Coordinate Spaces** + +- **P-space**: Processing element coordinates (thread, warp, block) +- **Y-space**: Local tile access patterns +- **X-space**: Physical tensor coordinates +- **D-space**: Linearized memory addresses + +**Core Components** + +- **BufferView**: Type-safe access to GPU memory +- **TileDistribution**: Automatic work distribution +- **TileWindow**: Efficient data loading/storing +- **Encoding**: Compile-time distribution specification + +Quick Example +------------- + +.. 
code-block:: cpp + + // Define how to distribute a 256x256 tile across threads + using Encoding = tile_distribution_encoding< + sequence<>, // No replication + tuple, // M dimension hierarchy + sequence<4,2,8,4>>, // N dimension hierarchy + tuple, sequence<1,2>>, // Thread mapping + tuple, sequence<2,2>>, // Minor indices + sequence<1,1,2,2>, // Y-space mapping + sequence<0,3,0,3> // Y-space minor + >; + + // Create distribution and load data + auto distribution = make_static_tile_distribution(Encoding{}); + auto window = make_tile_window(tensor_view, tile_size, origin, distribution); + auto tile = window.load(); + + // Process tile efficiently + sweep_tile(tile, [](auto idx) { /* computation */ }); + + +Next Steps +---------- + +To dive deeper, start with :ref:`ck_tile_introduction` to understand the motivation and core concepts behind CK Tile. + +For practical examples, see the `example/ck_tile `_ directory in the Composable Kernel repository. diff --git a/docs/conceptual/ck_tile/introduction_motivation.rst b/docs/conceptual/ck_tile/introduction_motivation.rst new file mode 100644 index 0000000000..9884901556 --- /dev/null +++ b/docs/conceptual/ck_tile/introduction_motivation.rst @@ -0,0 +1,309 @@ +.. _ck_tile_introduction: + +Introduction and Motivation - Why Tile Distribution Matters +=========================================================== + +Overview +-------- + +The evolution of GPU computing has brought unprecedented computational power to modern applications, yet harnessing this power efficiently remains one of the most challenging aspects of high-performance computing. At the heart of this challenge lies a fundamental mismatch between how developers conceptualize algorithms and how GPU hardware executes them. While developers think in terms of mathematical operations on multi-dimensional data structures, GPUs operate through thousands of threads accessing memory in complex patterns that must satisfy stringent hardware constraints. + +This conceptual gap manifests most acutely in memory access patterns. Modern GPUs achieve their high performance through massive parallelism, with thousands of threads executing simultaneously. However, this parallelism comes with a critical constraint: memory bandwidth. Despite continuous improvements in computational throughput, memory bandwidth has not scaled proportionally, creating what is often called the "memory wall." The efficiency with which threads access memory determines whether a GPU kernel achieves a few percent or near 100% of the hardware's theoretical performance. + +The Composable Kernel (CK) framework addresses this challenge through its tile distribution system, a compile-time abstraction that automatically generates optimal memory access patterns while preserving the natural expression of algorithms. This documentation explores the mathematical foundations and practical implementation of tile distribution, demonstrating how it bridges the gap between algorithmic intent and hardware reality. + +In this introduction, we establish the fundamental problems that tile distribution solves, explore why these problems are critical for GPU performance, and provide the conceptual framework necessary to understand the compile-time coordinate transformation system that powers CK's approach to efficient GPU computation. + +The GPU Memory Problem +---------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. 
mermaid:: + + graph TB + subgraph "Random Access Pattern (Inefficient)" + subgraph "Threads" + T0_R["Thread 0"] + T1_R["Thread 1"] + T2_R["Thread 2"] + T3_R["Thread 3"] + end + + subgraph "Memory" + M0["Mem[0]"] + M7["Mem[7]"] + M15["Mem[15]"] + M23["Mem[23]"] + M31["Mem[31]"] + M39["Mem[39]"] + M47["Mem[47]"] + M55["Mem[55]"] + end + + T0_R -.-> M23 + T1_R -.-> M7 + T2_R -.-> M47 + T3_R -.-> M15 + end + + subgraph "Tile Distribution Pattern (Efficient)" + subgraph "Threads_TD" + T0_TD["Thread 0"] + T1_TD["Thread 1"] + T2_TD["Thread 2"] + T3_TD["Thread 3"] + end + + subgraph "Memory_TD" + M0_TD["Mem[0]"] + M1_TD["Mem[1]"] + M2_TD["Mem[2]"] + M3_TD["Mem[3]"] + M4_TD["Mem[4]"] + M5_TD["Mem[5]"] + M6_TD["Mem[6]"] + M7_TD["Mem[7]"] + end + + T0_TD --> M0_TD + T0_TD --> M1_TD + T1_TD --> M2_TD + T1_TD --> M3_TD + T2_TD --> M4_TD + T2_TD --> M5_TD + T3_TD --> M6_TD + T3_TD --> M7_TD + end + + style T0_R fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style T1_R fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style T2_R fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style T3_R fill:#fee2e2,stroke:#ef4444,stroke-width:2px + + style T0_TD fill:#d1fae5,stroke:#10b981,stroke-width:2px + style T1_TD fill:#d1fae5,stroke:#10b981,stroke-width:2px + style T2_TD fill:#d1fae5,stroke:#10b981,stroke-width:2px + style T3_TD fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + +.. image:: diagrams/introduction_motivation_1.svg + :alt: Diagram + :align: center + +Why Random Memory Access is Slow +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The architecture of modern GPUs represents a study in trade-offs. While these devices can execute thousands of threads simultaneously and perform trillions of floating-point operations per second, they remain fundamentally constrained by the physics of memory access. Understanding this constraint is crucial to appreciating why tile distribution is not merely an optimization technique but an essential component of high-performance GPU computing. + +GPU memory systems are designed around the assumption of regular, predictable access patterns. The memory controller can service requests from 32 threads (a warp on AMD GPUs) in a single transaction when these threads access consecutive memory locations. This optimization, known as memory coalescing, can improve effective memory bandwidth by up to 32x compared to random access patterns. However, when threads within a warp access memory locations that are scattered throughout the address space, each access requires a separate memory transaction, reducing the effective bandwidth to a fraction of the theoretical maximum. + +The impact extends beyond raw bandwidth. Modern GPUs employ cache hierarchies to reduce memory latency, but these caches are effective only when access patterns exhibit spatial or temporal locality. Random access patterns defeat these optimizations, causing frequent cache misses that expose the full latency of global memory access, which can be hundreds of cycles. During these stalls, the computational units sit idle, unable to hide the latency even with the GPU's massive thread count. + +Furthermore, the GPU's Single Instruction, Multiple Thread (SIMT) execution model requires that all threads in a warp execute the same instruction at the same time. When threads access memory in unpredictable patterns, the memory controller cannot optimize the requests, leading to serialization of what should be parallel operations. 
This serialization effect compounds with each level of the memory hierarchy, from L1 cache through L2 cache to global memory, multiplying the performance impact. + +The Thread Cooperation Challenge +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The challenge of efficient thread cooperation becomes particularly evident when examining a fundamental operation like matrix multiplication. Consider a scenario where 256 threads must cooperate to multiply two matrices. The naive approach, where each thread computes one element of the output matrix, illustrates precisely why GPU programming requires compile-time abstractions. + +.. code-block:: cpp + + // Inefficient: Random access pattern + __device__ void naive_matrix_multiply() + { + int thread_id = threadIdx.x + blockIdx.x * blockDim.x; + + // Get this thread's output position + int row = thread_id / MATRIX_WIDTH; + int col = thread_id % MATRIX_WIDTH; + + // Each thread computes one element of C = A * B + float result = 0.0f; + for (int k = 0; k < MATRIX_WIDTH; k++) + { + // Random access pattern - threads in a warp access non-contiguous memory + // Thread 0: A[0,0], A[0,1], A[0,2]... + // Thread 1: A[1,0], A[1,1], A[1,2]... + // These are far apart in memory! + float a_element = global_memory_A[row * MATRIX_WIDTH + k]; + + // Even worse for B - accessing column-wise causes strided access + // Thread 0: B[0,0], B[1,0], B[2,0]... + // Thread 1: B[0,1], B[1,1], B[2,1]... + // Massive stride between accesses! + float b_element = global_memory_B[k * MATRIX_WIDTH + col]; + + result += a_element * b_element; + } + + // Write result - adjacent threads write to adjacent locations (at least this is good) + global_memory_C[row * MATRIX_WIDTH + col] = result; + } + +This seemingly straightforward implementation suffers from fundamental inefficiencies that stem from the mismatch between the algorithm's logical structure and the hardware's physical constraints. The memory access pattern is essentially random from the hardware's perspective, as adjacent threads access memory locations separated by large strides. This pattern prevents the memory controller from coalescing accesses, forcing it to issue separate transactions for each thread. + +The lack of coordination between threads exacerbates the problem. While all threads in a warp execute the same instruction, they operate on completely different data with no sharing or reuse. This independence, which might seem desirable in traditional parallel programming, actually works against GPU architecture. The hardware cannot exploit any commonality in the access patterns, leading to severe underutilization of memory bandwidth. + +Cache utilization suffers dramatically under this access pattern. Each thread traces a unique path through memory, with no overlap between threads' working sets. The L1 and L2 caches, designed to capture and exploit locality, instead thrash continuously as each thread's accesses evict data needed by others. The effective cache capacity approaches zero, exposing every memory access to the full latency of global memory. + +Perhaps most critically, this approach fails to utilize the available memory bandwidth efficiently. Modern GPUs can achieve memory bandwidths exceeding 1 TB/s, but only when accesses are properly structured. The random access pattern of the naive implementation might achieve less than 10% of this theoretical maximum, effectively reducing a high-performance GPU to the performance level of a much simpler processor. 
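+
+To make the cost of uncoalesced access concrete, the following host-side sketch (illustrative only; the 64-lane wavefront and 128-byte transaction size are simplifying assumptions, and real hardware details vary) counts how many memory transactions one wavefront needs for contiguous versus strided 4-byte loads:
+
+.. code-block:: cpp
+
+    #include <cstdio>
+    #include <set>
+
+    // Illustrative only: count 128-byte memory transactions needed by one
+    // 64-lane wavefront when each lane loads a 4-byte float.
+    int count_transactions(int stride_in_floats)
+    {
+        constexpr int wavefront_size    = 64;
+        constexpr int transaction_bytes = 128;
+
+        std::set<long> segments; // distinct 128-byte segments touched
+        for (int lane = 0; lane < wavefront_size; ++lane) {
+            long byte_addr = static_cast<long>(lane) * stride_in_floats * sizeof(float);
+            segments.insert(byte_addr / transaction_bytes);
+        }
+        return static_cast<int>(segments.size());
+    }
+
+    int main()
+    {
+        std::printf("contiguous (stride 1):  %d transactions\n", count_transactions(1));
+        std::printf("strided (stride 1024):  %d transactions\n", count_transactions(1024));
+        return 0;
+    }
+
+With a stride of one, the whole wavefront is served by two 128-byte transactions, while a large stride (for example, a full matrix row length) requires one transaction per lane, a 32× increase in memory traffic for the same amount of useful data.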
+ +The Tile Distribution Solution +------------------------------ + +Structured Mapping from Logical to Physical Coordinates +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The fundamental innovation of tile distribution lies in its approach to the memory access problem. Rather than attempting to optimize the naive access patterns after the fact, tile distribution provides a mathematical framework that generates optimized patterns from the outset. This framework establishes a structured mapping between logical coordinates and physical coordinates that respect hardware constraints. + +The essence of tile distribution is the recognition that efficient GPU computation requires a careful choreography of thread cooperation. Instead of each thread operating independently, threads are organized into hierarchical groups that work together on tiles of data. This organization ensures that when threads access memory, they do so in patterns that the hardware can optimize. + +.. code-block:: cpp + + // Efficient: Tile-based distribution using CK Tile + template + __device__ void tile_distributed_matrix_multiply() + { + // 1. Define tile distribution encoding at compile time + using Encoding = tile_distribution_encoding< + sequence<>, // No replication + tuple, // M dimension hierarchy + sequence<4, 2, 8, 4>>, // N dimension hierarchy + tuple, sequence<1, 2>>, // P to RH major + tuple, sequence<2, 2>>, // P to RH minor + sequence<1, 1, 2, 2>, // Y to RH major + sequence<0, 3, 0, 3> // Y to RH minor + >; + + // 2. Create the distribution + constexpr auto distribution = make_static_tile_distribution(Encoding{}); + + // 3. Create tile window for efficient memory access + auto tile_window = make_tile_window( + tensor_view, + window_lengths, + origin, + distribution + ); + + // 4. Load data with coalesced access pattern + auto loaded_tensor = tile_window.load(); + + // 5. Process tile data efficiently + sweep_tile(loaded_tensor, [](auto y_indices) { + auto value = loaded_tensor(y_indices); + // ... efficient computation + }); + } + +The transformation from inefficient to efficient memory access is profound. Where the naive implementation scattered memory requests across the address space, tile distribution ensures that adjacent threads access adjacent memory locations. This transformation happens through an advanced encoding system that captures the hierarchical nature of both the computation and the hardware. + +The encoding shown above demonstrates the multi-level hierarchy that tile distribution employs. The sequence<4, 2, 8, 4> represents a four-level decomposition: four repetitions per thread, two warps per block, eight threads per warp, and four elements per vector operation. This hierarchical structure maps directly to the GPU's hardware organization, ensuring that each level of the hierarchy operates at maximum efficiency. + +Memory access patterns become predictable and regular under tile distribution. The hardware's memory coalescing logic can now combine the requests from all threads in a warp into a single transaction, achieving the full memory bandwidth. The predictability extends beyond individual accesses to entire access sequences, enabling the hardware's prefetching mechanisms to anticipate and prepare data before it's needed. + +Thread cooperation emerges naturally from the tile distribution structure. Threads within a warp work on adjacent data, enabling efficient data sharing through register shuffle operations. 
Warps within a block coordinate through shared memory, with access patterns that avoid bank conflicts. This cooperation transforms what was a collection of independent computations into a unified, efficient operation. + +Cache utilization improves as well. The structured access patterns ensure that data loaded into cache by one thread is likely to be used by neighboring threads. Temporal locality emerges from the tile-based processing, where all operations on a tile complete before moving to the next tile. This locality transforms the cache from a liability into a high performance accelerator. + +The scalability of tile distribution across different GPU architectures represents one of its most key features. The same high-level code can achieve near-optimal performance on GPUs with different numbers of compute units, different cache sizes, and different memory bandwidths. The compile-time nature of the encoding allows the compiler to generate architecture-specific optimizations while maintaining portable source code. + +The Coordinate Mapping Insight +------------------------------ + +At the heart of tile distribution lies a profound mathematical insight: efficient GPU computation requires a systematic framework for mapping between different coordinate spaces. This framework transforms the complex problem of thread-to-data assignment into a series of well-defined mathematical transformations, each serving a specific purpose in the journey from abstract algorithm to concrete hardware execution. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Coordinate Spaces" + P["P-space
Thread Position
(thread_x, thread_y,
warp_id, block_id)"] + Y["Y-space
Local Data
(y0, y1, y2, y3)"] + X["X-space
Global Position
(x0, x1)"] + D["D-space
Memory Address
(linearized)"] + end + + subgraph "Transformations" + T1["P + Y → X
Thread data mapping"] + T2["X → D
Memory linearization"] + end + + P --> T1 + Y --> T1 + T1 --> X + X --> T2 + T2 --> D + + style P fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style Y fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style X fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px + style T1 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + style T2 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + + + +.. image:: diagrams/introduction_motivation_2.svg + :alt: Diagram + :align: center + +The elegance of this approach emerges from its separation of concerns. Each coordinate space represents a distinct aspect of the computation, and the transformations between them encapsulate specific optimization strategies. This separation allows developers to reason about their algorithms in natural terms while the framework handles the complex mapping to efficient hardware execution patterns. + +**Thread Position Space (P-space)** represents the physical organization of threads on the GPU. This space captures the hierarchical nature of GPU execution, from individual threads identified by their x and y coordinates within a block, to warps that execute in lockstep, to thread blocks that share resources. The coordinates in P-space—thread_x, thread_y, warp_id, and block_id—directly correspond to the hardware's execution model. Understanding P-space is crucial because it determines which threads can cooperate efficiently through shared memory and which threads will execute their memory accesses simultaneously. + +**Local Data Space (Y-space)** embodies the algorithm's perspective on data organization. In this space, each thread reasons about its local portion of work using coordinates like y0, y1, y2, and y3. These coordinates are algorithm-specific and represent the natural way to index the data being processed. For matrix multiplication, Y-space might represent the local tile coordinates within a larger matrix. For convolution, it might represent the spatial dimensions and channels of a local receptive field. The beauty of Y-space is that it allows algorithms to be expressed in their most natural form, without concern for hardware-specific optimizations. + +**Global Position Space (X-space)** serves as the bridge between algorithmic intent and physical reality. This space represents the actual global coordinates of data in the problem domain, such as the row and column indices in a matrix or the spatial coordinates in an image. X-space is where the distributed nature of the computation becomes explicit, as each thread's local Y-space coordinates combine with its position in P-space to determine which global data elements it accesses. + +**Memory Address Space (D-space)** represents the final destination: linearized memory addresses that the hardware actually uses. This space accounts for the fact that multi-dimensional data structures must ultimately be stored in linear memory. The transformation to D-space incorporates layout optimizations such as padding for alignment, interleaving for better cache utilization, and address space considerations for different memory types (global, shared, or constant memory). + +The transformative power of tile distribution emerges from the composition of these mappings. The **P + Y → X** transformation combines a thread's position with its local data coordinates to determine global data positions. This transformation encodes the distribution strategy, determining how work is partitioned across threads. 
The subsequent **X → D** transformation converts these logical positions into physical memory addresses, incorporating layout optimizations that ensure efficient memory access patterns. + +The mathematical rigor of this framework enables critical optimizations. Because each transformation is well-defined and composable, the compiler can analyze the complete transformation chain and generate optimal code. The framework can automatically ensure memory coalescing by structuring the P + Y → X transformation appropriately. It can minimize bank conflicts in shared memory by carefully designing the X → D mapping. Most importantly, it can adapt these optimizations to different hardware architectures by adjusting the transformation parameters while keeping the high-level algorithm description unchanged. + +What's Coming Next +------------------ + +Having established the fundamental motivation for tile distribution and its coordinate mapping framework, this documentation now embarks on a systematic journey through the complete CK Tile system. This journey is carefully structured to build understanding layer by layer, starting from the most basic abstractions and progressing to advanced optimization techniques. + +The foundation of the exploration begins with raw memory access through :ref:`ck_tile_buffer_views`, the fundamental abstraction that provides type-safe, address-space-aware access to GPU memory. Understanding BufferView is crucial because it establishes the patterns and principles that permeate the entire CK Tile system. From there, it progresses to :ref:`ck_tile_tensor_views`, which adds multi-dimensional structure to raw memory, enabling natural expression of algorithms while maintaining the efficiency of the underlying buffer operations. + +With these foundational concepts established, the documentation delves into the :ref:`ck_tile_coordinate_systems` that powers tile distribution. This engine implements the mathematical framework that have been introduced, providing compile-time transformations between P-space, Y-space, X-space, and D-space. Understanding these transformations at a deep level enables developers to reason about performance implications and design custom distribution strategies for novel algorithms. The :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors` provide the building blocks for these transformations. + +The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_window` abstraction provides the gateway for efficient data access. + +The exploration of coordinate systems goes beyond the basic P, Y, X, D framework to encompass advanced topics such as multi-level tiling, replication strategies, and specialized coordinate systems for specific algorithm classes. The :ref:`ck_tile_encoding_internals` reveals the mathematical foundations, while :ref:`ck_tile_thread_mapping` shows how these abstractions map to hardware. This comprehensive treatment ensures that developers can handle not just common cases but also novel algorithms that require custom distribution strategies. 
+ +The implementation details reveal the template metaprogramming techniques that enable CK Tile's zero-overhead abstractions. Topics like :ref:`ck_tile_descriptors`, :ref:`ck_tile_load_store_traits`, and :ref:`ck_tile_static_distributed_tensor` show how these abstractions achieve zero overhead. By understanding these implementation strategies, advanced developers can extend the framework, contribute optimizations, and debug performance issues at the deepest level. + +The connection between abstract coordinate transformations and concrete hardware thread mapping represents a critical piece of the puzzle. The documentation will examine how logical thread organizations map to physical GPU resources, how to avoid common pitfalls like bank conflicts (see :ref:`ck_tile_lds_bank_conflicts` and :ref:`ck_tile_lds_index_swapping`) and divergent execution, and how to structure computations for maximum hardware utilization. The :ref:`ck_tile_hardware` section provides deep dives into architecture-specific optimizations. + +Finally, the advanced topics section explores cutting-edge optimization techniques, including :ref:`ck_tile_space_filling_curve` for optimal memory traversal, :ref:`ck_tile_sweep_tile` for clean iteration patterns, and practical examples like :ref:`ck_tile_convolution_example` and :ref:`ck_tile_gemm_optimization`. These topics prepare developers to push the boundaries of GPU performance and contribute to the ongoing evolution of high-performance computing. + +Summary +------- + +The journey through this introduction has revealed tile distribution as a fundamental paradigm shift in how GPU programming is approached. By establishing a mathematical framework for coordinate transformation, tile distribution bridges the gap between algorithmic elegance and hardware efficiency. + +The significance of this approach extends beyond mere performance optimization. Tile distribution enables developers to express algorithms in their natural mathematical form while achieving performance that approaches the theoretical limits of the hardware. This reconciliation of abstraction and efficiency has been a goal of high-performance computing, and tile distribution provides a step towards this goal. + +The structured, predictable mappings between logical and physical coordinates that tile distribution provides yield multiple benefits. Efficient memory access emerges naturally from the framework, with coalesced access patterns and cache-friendly layouts arising from the mathematical structure rather than manual optimization. Thread cooperation becomes an inherent property of the system, with the distribution encoding automatically organizing threads into efficient collaborative patterns. The scalability across different hardware architectures demonstrates the power of abstraction—the same high-level code achieves near-optimal performance whether running on a small mobile GPU or a massive datacenter accelerator. + +Perhaps most importantly, tile distribution provides a predictable optimization framework grounded in mathematical principles. Performance characteristics can be analyzed and predicted based on the transformation structure, enabling systematic optimization rather than trial-and-error tuning. This predictability transforms GPU optimization from an art practiced by a few experts into a science accessible to a broader community of developers. + +The systematic mapping through P-space, Y-space, X-space, and D-space provides a mental model that clarifies the entire GPU computation process. 
This model enables developers to reason about their code at multiple levels of abstraction simultaneously, understanding both the high-level algorithmic behavior and the low-level hardware execution patterns.
+
+As the documentation dives deeper into the implementation details, starting with the foundational BufferView abstraction, it is important to remember that each component serves the larger purpose of enabling efficient, scalable GPU computation. The journey from raw memory to advanced tile distributions mirrors the evolution of GPU programming itself, from low-level, hardware-specific optimizations to high-level, portable abstractions that preserve efficiency.
+
+By providing a framework for achieving optimal memory access patterns, tile distribution enables developers to take advantage of the computing power of GPUs without having to know the details of the underlying architecture.
+
+Next Steps
+----------
+
+Continue to :ref:`ck_tile_buffer_views` to start building your understanding from the ground up.
diff --git a/docs/conceptual/ck_tile/lds_index_swapping.rst b/docs/conceptual/ck_tile/lds_index_swapping.rst
new file mode 100644
index 0000000000..891b32f9ed
--- /dev/null
+++ b/docs/conceptual/ck_tile/lds_index_swapping.rst
@@ -0,0 +1,462 @@
+.. meta::
+   :description: CK Tile LDS index swapping documentation
+   :keywords: CK Tile, LDS, index swapping, XOR preshuffle, bank conflicts, GPU optimization
+
+.. _ck_tile_lds_index_swapping:
+
+********************************
+Local Data Share Index Swapping
+********************************
+
+Overview
+========
+
+Local Data Share (LDS) index swapping, also known as XOR preshuffle, is a critical optimization technique in CK Tile for resolving bank conflicts in shared memory. Bank conflicts occur when multiple threads in a warp attempt to access different addresses within the same memory bank simultaneously, causing serialization and performance degradation. CK Tile generalizes the XOR preshuffle technique through a compile-time coordinate transformation system that automatically handles complex access patterns.
+
+The key insight is that transforming the logical 2D coordinates used to access LDS into a different 2D coordinate space ensures that threads accessing data simultaneously access different memory banks. This transformation is implemented through CK Tile's composable transform system, making it both flexible and efficient. See :ref:`ck_tile_transforms` and :ref:`ck_tile_coordinate_systems` for more information about the composable transform system.
+
+Coordinate Transformation Pipeline
+==================================
+
+CK Tile performs coordinate transformations to bring LDS access from the original 2D position (M, K dimensions) into transformed (M', K') coordinates:
+
+Step 1: XOR Transform
+---------------------
+
+The original K coordinate is split into K0 and K1, where K1 represents the thread vector size along the K dimension (KPack) and K0 is KPerBlock/KPack.
+
+..
+   Original mermaid diagram (edit here, then run update_diagrams.py)
+
+   .. mermaid::
+
+      graph TB
+          subgraph "3D LDS coordinate [K0, M, K1]"
+              K0["KPerBlock/KPack * MLdsLayer
K0"] + M["MPerBlock/MLdsLayer
M"] + K1["KPack
K1"] + end + + subgraph "XOR Transform" + XT["make_xor_transform"] + end + + subgraph "Update K0 with XOR transformation" + K01["KPerBlock/KPack * MLdsLayer
K0'"] + M1["MPerBlock/MLdsLayer
M"] + K11["KPack
K1"] + end + + K0 --> XT + M --> XT + K1 --> K11 + + XT --> K01 + XT --> M1 + + style K0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style K01 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style M fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style M1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + + style K1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style K11 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + +.. image:: diagrams/lds_index_swapping_1.svg + :alt: Diagram + :align: center + +The XOR transformation updates the K0 coordinate using the formula: + +.. code-block:: cpp + + K0' = K0 ^ (M % (KPerBlock / KPack * MLdsLayer)) + +This XOR operation redistributes accesses across memory banks by mixing bits from the M and K dimensions. + +Step 2: Unmerge Transform +------------------------- + +The transformed K0' is split into L and K0'' components, creating an intermediate 4D coordinate space. This is necessary when MLdsLayer > 1, allowing multiple rows to share the same set of memory banks for better utilization with smaller tile sizes. + + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "3D LDS coordinate [K0', M, K1]" + K0["KPerBlock/KPack * MLdsLayer
K0'"] + M["MPerBlock/MLdsLayer
M"] + K1["KPack
K1"] + end + + subgraph "Unmerge into 2 components" + UM["make_unmerge_transform"] + end + + subgraph "4D intermediate transformation space" + L["MLdsLayer
L"] + M1["MPerBlock/MLdsLayer
M"] + K01["KPerBlock/KPack
K0''"] + K11["KPack
K1"] + end + + K0 --> UM + M --> M1 + K1 --> K11 + + UM --> L + UM --> K01 + + style K0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style L fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style K01 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + + style M fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style M1 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + style K1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style K11 fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + + +.. image:: diagrams/lds_index_swapping_2.svg + :alt: Diagram + :align: center + +The unmerge operation: + +.. code-block:: cpp + + L = K0' / (KPerBlock/KPack) + K0'' = K0' % (KPerBlock/KPack) + +When MLdsLayer == 1, this simplifies to L=0 and K0''=K0'. + +Step 3: Merge Transform +----------------------- + +The final step merges the 4D coordinates back into 2D transformed coordinates (M', K'). + + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "4D LDS Coordinates [L, M, K0'', K1]" + L["MLdsLayer
L"] + M1["MPerBlock/MLdsLayer
M"] + K0["KPerBlock/KPack
K0''"] + K1["KPack
K1"] + end + + subgraph "Merge into 1 component" + ME0["make_merge_transform"] + end + + subgraph "Merge into 1 component" + ME1["make_merge_transform"] + end + + subgraph "Transformed 2D coordinates [M', K']" + M11["MPerBlock
M'"] + K01["KPerBlock
K'"] + end + + L --> ME0 + M1 --> ME0 + + K0 --> ME1 + K1 --> ME1 + + ME0 --> M11 + ME1 --> K01 + + style K0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style K1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style K01 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + + style M1 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style L fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style M11 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + +.. image:: diagrams/lds_index_swapping_3.svg + :alt: Diagram + :align: center + + +C++ Implementation +================== + +Here's how the complete transformation chain is implemented in CK Tile using :ref:`ck_tile_descriptors` and transforms: + +.. code-block:: cpp + + template + struct LdsIndexSwapping { + static constexpr index_t KPerBlock_over_KPack = KPerBlock / KPack; + static constexpr index_t MPerBlock_over_MLdsLayer = MPerBlock / MLdsLayer; + + // Step 1: Create base descriptor + using BaseLengths = Sequence< + KPerBlock_over_KPack * MLdsLayer, + MPerBlock_over_MLdsLayer, + KPack + >; + using BaseStrides = Sequence< + KPack, + KPerBlock * MLdsLayer, + 1 + >; + + using BaseDescriptor = TensorDescriptor; + + // Step 2: Apply XOR transform + using PermutedDescriptor = decltype( + transform_tensor_descriptor( + BaseDescriptor{}, + make_tuple( + make_xor_transform( + Sequence{} + ), + make_pass_through_transform(Number{}) + ), + Sequence<1, 0>{}, // XOR on dims [1,0] + Sequence<2>{} // Pass through dim 2 + ) + ); + + // Step 3: Apply unmerge and final transforms + using FinalDescriptor = decltype( + transform_tensor_descriptor( + PermutedDescriptor{}, + make_tuple( + make_unmerge_transform( + Sequence{} + ), + make_pass_through_transform(Number{}), + make_pass_through_transform(Number{}) + ), + Sequence<0>{}, // Unmerge dim 0 + Sequence<1>{}, // Pass through dim 1 + Sequence<2>{}, // Pass through dim 2 + Sequence<0, 2>{}, // Output dims from unmerge + Sequence<1>{}, // Output dim 1 + Sequence<3>{} // Output dim 3 + ) + ); + }; + + + + +Practical Usage in GEMM +========================== + +Here's how LDS index swapping is used in a real GEMM kernel. See :ref:`ck_tile_gemm_optimization` for more information about GEMM optimization. + +.. 
code-block:: cpp + + template + __global__ void gemm_kernel_with_lds_swapping( + const DataType* __restrict__ a_global, + const DataType* __restrict__ b_global, + DataType* __restrict__ c_global, + index_t M, index_t N, index_t K) + { + // Shared memory allocation + __shared__ DataType a_lds[BlockM * BlockK]; + __shared__ DataType b_lds[BlockK * BlockN]; + + // Create LDS descriptor with index swapping + constexpr index_t MLdsLayer = 2; // Typical value for bank conflict avoidance + + using ALdsDesc = typename LdsIndexSwapping< + BlockK, KPack, MLdsLayer, BlockM + >::FinalDescriptor; + + // Load from global to LDS with swapped indices + auto load_a_to_lds = [&](index_t k_offset) { + // Each thread loads its portion + index_t tid = threadIdx.x; + constexpr index_t NumThreads = blockDim.x; + constexpr index_t ElementsPerThread = (BlockM * BlockK) / NumThreads; + + #pragma unroll + for (index_t i = 0; i < ElementsPerThread; ++i) { + index_t linear_idx = tid * ElementsPerThread + i; + + // Convert linear index to 2D coordinates + index_t m_idx = linear_idx / BlockK; + index_t k_idx = linear_idx % BlockK; + + // Load from global memory + DataType value = a_global[ + (blockIdx.y * BlockM + m_idx) * K + k_offset + k_idx + ]; + + // Store to LDS using swapped coordinates + ALdsDesc desc; + index_t lds_offset = desc.calculate_offset({ + 0, // L component (for this example) + m_idx / MLdsLayer, // M component + k_idx / KPack, // K0 component + k_idx % KPack // K1 component + }); + + a_lds[lds_offset] = value; + } + }; + + // Main GEMM computation loop + for (index_t k = 0; k < K; k += BlockK) { + // Load tiles to LDS with index swapping + load_a_to_lds(k); + __syncthreads(); + + // Compute using swapped LDS layout + // ... (matrix multiplication using transformed coordinates) + } + } + +Bank Conflict Analysis +====================== + +The effectiveness of index swapping can be analyzed by examining access patterns: + +.. code-block:: cpp + + template + struct BankConflictAnalyzer { + static constexpr index_t NumBanks = 32; + static constexpr index_t BankWidth = 4; // 4 bytes per bank + + template + static void analyze_access_pattern() { + // Simulate warp access pattern + index_t bank_access[NumBanks] = {0}; + + // Each thread in warp accesses one element + for (index_t tid = 0; tid < WarpSize; ++tid) { + // Calculate coordinates for this thread + index_t m_coord = tid / 8; // Example mapping + index_t k_coord = tid % 8; + + // Get LDS offset using descriptor + LdsDescriptor desc; + index_t offset = desc.calculate_offset({m_coord, k_coord}); + + // Determine bank + index_t bank = (offset * sizeof(float) / BankWidth) % NumBanks; + bank_access[bank]++; + } + + // Check for conflicts + index_t max_conflict = 0; + for (index_t bank = 0; bank < NumBanks; ++bank) { + max_conflict = max(max_conflict, bank_access[bank]); + } + + printf("Max bank conflict: %d-way\n", max_conflict); + } + }; + +Performance Benefits +==================== + +LDS index swapping provides several key benefits: + +1. **Conflict-Free Access**: Eliminates or significantly reduces bank conflicts +2. **Higher Throughput**: Enables full memory bandwidth utilization +3. **Automatic Optimization**: Transformation parameters can be tuned per architecture +4. **Composability**: Integrates seamlessly with other CK Tile transformations + +Advanced Configurations +======================= + +Different configurations can be used based on tile sizes and data types: + +.. 
code-block:: cpp + + // Configuration for different scenarios + template + struct LdsSwappingConfig { + // Smaller tiles may need different MLdsLayer + static constexpr index_t MLdsLayer = + (TileSize <= 32) ? 1 : + (TileSize <= 64) ? 2 : 4; + + // Adjust KPack based on data type + static constexpr index_t KPack = + sizeof(DataType) == 2 ? 8 : // FP16/BF16 + sizeof(DataType) == 4 ? 4 : 2; // FP32 + + // Validate configuration + static_assert(TileSize % (MLdsLayer * KPack) == 0, + "Tile size must be divisible by MLdsLayer * KPack"); + }; + + +Integration with Tile Distribution +================================== + +LDS index swapping works seamlessly with CK Tile's distribution system. See :ref:`ck_tile_tile_distribution` for more information about CK Tile's distribution system. + +.. code-block:: cpp + + template + struct DistributedLdsAccess { + using LdsDesc = typename LdsIndexSwapping<...>::FinalDescriptor; + + __device__ void load_from_lds( + const float* lds_ptr, + StaticDistributedTensor& reg_tensor) + { + // Each thread loads its distributed portion + auto coord = make_tensor_coordinate(LdsDesc{}, {0, 0, 0, 0}); + + #pragma unroll + for (index_t i = 0; i < reg_tensor.size(); ++i) { + // Calculate swapped LDS coordinates for this element + auto [m, k] = TileDistribution::get_local_tile_index(i); + + // Move to correct position + move_tensor_coordinate(LdsDesc{}, coord, {0, m, k/4, k%4}); + + // Load with transformed coordinates + reg_tensor[i] = lds_ptr[coord.get_offset()]; + } + } + }; + +Summary +======= + +LDS index swapping in CK Tile provides a effective and flexible solution to the bank conflict problem: + +- **Generalized XOR Preshuffle**: Extends the basic XOR technique through composable transforms +- **Multi-Step Pipeline**: Coordinates flow through XOR → Unmerge → Merge transformations +- **Automatic Optimization**: Parameters like MLdsLayer adapt to tile sizes and data types +- **Zero Overhead**: All transformations resolve at compile time +- **Seamless Integration**: Works naturally with other CK Tile components + +By understanding and utilizing LDS index swapping, kernels can achieve maximum shared memory bandwidth, which is often the limiting factor in GPU kernel performance. The transformation-based approach makes it easy to experiment with different swapping strategies while maintaining code clarity. + +For practical examples of how index swapping is used in complete kernels, see :ref:`ck_tile_swizzling_example`. For more on coordinate operations used here, see :ref:`ck_tile_coordinate_movement` and :ref:`ck_tile_tensor_coordinates`. diff --git a/docs/conceptual/ck_tile/load_store_traits.rst b/docs/conceptual/ck_tile/load_store_traits.rst new file mode 100644 index 0000000000..f9555a8bfe --- /dev/null +++ b/docs/conceptual/ck_tile/load_store_traits.rst @@ -0,0 +1,480 @@ +.. _ck_tile_load_store_traits: + +LoadStoreTraits - Memory Access Optimization Engine +=================================================== + +Overview +-------- + +LoadStoreTraits is a critical optimization component that analyzes :ref:`tile distributions ` to determine the most efficient memory access patterns. It serves as the engine behind :ref:`TileWindow's ` high-performance data movement, automatically identifying the best dimension for vectorization and creating optimized access sequences using :ref:`space-filling curves `. + +At compile time, LoadStoreTraits performs compile-time analysis of the distribution pattern to extract key information about memory access opportunities. 
This analysis determines how many elements can be loaded or stored in a single instruction, which dimension provides the best vectorization opportunity, and what traversal order maximizes cache utilization. The result is a set of compile-time constants and methods that guide the runtime execution of load and store operations. + +Key Concepts +------------ + +Vectorization Selection +~~~~~~~~~~~~~~~~~~~~~~~ + +LoadStoreTraits analyzes tensor dimensions to find the optimal one for vectorized loads and stores, prioritizing: + +- **Contiguous memory access** (stride = 1) +- **Maximum vector length** based on data type and :ref:`hardware capabilities ` +- **Alignment requirements** for efficient memory transactions + +Space-Filling Curve Integration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The system automatically creates a :ref:`space-filling curve ` that maximizes cache utilization while respecting vectorization constraints. This ensures that consecutive memory accesses are spatially close, reducing cache misses and improving memory bandwidth utilization. + +Access Pattern Optimization +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +LoadStoreTraits manages the trade-off between vector size and number of memory accesses, finding a solution that minimizes total memory transactions while maximizing bandwidth utilization. + +C++ Implementation +------------------ + +The LoadStoreTraits class analyzes distribution patterns at compile time: + +.. code-block:: cpp + + template + struct load_store_traits + { + // Compile-time analysis results + static constexpr index_t ndim_y = Distribution::ndim_y; + static constexpr index_t ndim_x = Distribution::ndim_x; + + // Find which Y dimension has stride 1 (best for vectorization) + static constexpr index_t vector_dim_y = []() { + // Complex compile-time analysis to find optimal dimension + const auto strides = Distribution::calculate_y_strides(); + for (index_t i = 0; i < ndim_y; ++i) { + if (strides[i] == 1) return i; + } + return ndim_y - 1; // Default to last dimension + }(); + + // Calculate how many scalars fit in a vector + static constexpr index_t scalar_per_vector = []() { + // Determine based on data type and hardware capabilities + if constexpr (sizeof(DataType) == 4) { // float32 + return min(Distribution::get_y_length(vector_dim_y), 4); + } else if constexpr (sizeof(DataType) == 2) { // float16 + return min(Distribution::get_y_length(vector_dim_y), 8); + } + return 1; + }(); + + // Total scalars accessed per memory operation + static constexpr index_t scalars_per_access = scalar_per_vector; + + // Space-filling curve for optimal traversal + // See :ref:`ck_tile_space_filling_curve` for details + using sfc_type = space_filling_curve; + static constexpr sfc_type sfc_ys = make_space_filling_curve(); + + // Total number of accesses needed + static constexpr index_t num_access = + Distribution::get_num_of_element_y() / scalars_per_access; + + // Get Y indices for a given access + CK_TILE_DEVICE constexpr auto get_y_indices(index_t i_access) const + { + return sfc_ys.get_index(i_access); + } + + // Get detailed vectorized access information + CK_TILE_DEVICE constexpr auto get_vectorized_access_info(index_t i_access) const + { + const auto base_indices = get_y_indices(i_access); + // Return structure with base indices, vector dimension, and size + return vectorized_access_info{ + base_indices, + vector_dim_y, + scalar_per_vector + }; + } + }; + +Vectorization Selection Algorithm +--------------------------------- + +LoadStoreTraits employs an advanced algorithm to select 
the best dimension for vectorization: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TD + A[Analyze Distribution] --> B{Check Each Dimension} + B --> C[Calculate Stride] + C --> D{Stride == 1?} + D -->|Yes| E[Candidate for Vectorization] + D -->|No| F[Skip Dimension] + E --> G[Check Alignment] + G --> H[Check Vector Size] + H --> I[Score Dimension] + F --> B + I --> J[Select Best Dimension] + J --> K[Configure Vector Access] + + style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style J fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style K fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + +.. image:: diagrams/load_store_traits_1.svg + :alt: Diagram + :align: center + +**Example: Comparing Different Memory Layouts** + +.. code-block:: cpp + + // Row-major layout [4×16] + using RowMajorDist = tile_distribution_encoding< + sequence<>, // No replication + tuple, sequence<4, 4>>, // 4x16 total + tuple, sequence<1>>, // Thread mapping + tuple, sequence<0>>, // Minor indices + sequence<2, 4>, // Y-space per thread + sequence<1, 1> // Y-space minor + >; + + // Column-major layout [16×4] + using ColMajorDist = tile_distribution_encoding< + sequence<>, // No replication + tuple, sequence<2, 2>>, // 16x4 total + tuple, sequence<1>>, // Thread mapping + tuple, sequence<0>>, // Minor indices + sequence<4, 2>, // Y-space per thread + sequence<1, 1> // Y-space minor + >; + + // LoadStoreTraits analysis + using RowTraits = load_store_traits; + using ColTraits = load_store_traits; + + // Row-major: vectorizes dimension 1 (4 elements) + static_assert(RowTraits::vector_dim_y == 1); + static_assert(RowTraits::scalar_per_vector == 4); + + // Column-major: vectorizes dimension 1 (2 elements) + static_assert(ColTraits::vector_dim_y == 1); + static_assert(ColTraits::scalar_per_vector == 2); + +Memory Access Patterns +---------------------- + +LoadStoreTraits creates efficient access patterns using space-filling curves: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Linear Traversal" + L1["0→1→2→3"] + L2["4→5→6→7"] + L3["Cache miss"] + L4["8→9→10→11"] + end + + subgraph "Snake Pattern" + S1["0→1→2→3"] + S2["7←6←5←4"] + S3["Cache hit!"] + S4["8→9→10→11"] + end + + L1 --> L2 + L2 --> L3 + L3 --> L4 + + S1 --> S2 + S2 --> S3 + S3 --> S4 + + style L3 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style S3 fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + +.. image:: diagrams/load_store_traits_2.svg + :alt: Diagram + :align: center + +**C++ Access Pattern Example:** + +.. 
code-block:: cpp + + // Create a 6x8 tile distribution + using TileDist = tile_distribution_encoding< + sequence<>, + tuple, sequence<2, 4>>, // 6x8 tile + tuple, sequence<1>>, + tuple, sequence<0>>, + sequence<3, 4>, // 3x4 per thread + sequence<1, 1> + >; + + using Traits = load_store_traits; + + // Access pattern visualization + template + CK_TILE_DEVICE void visualize_access_pattern() + { + printf("Tile: %dx%d\n", TileDist::get_tile_m(), TileDist::get_tile_n()); + printf("Vector dimension: %d\n", Traits::vector_dim_y); + printf("Scalars per access: %d\n", Traits::scalars_per_access); + printf("\nAccess sequence:\n"); + + // Show first few accesses + static_for<0, min(6, Traits::num_access), 1>{}([](auto i) { + const auto indices = Traits::get_y_indices(i); + const auto info = Traits::get_vectorized_access_info(i); + + printf("Access %d: Base=[%d,%d], Vector size=%d\n", + i, indices[0], indices[1], info.vector_size); + }); + } + +Performance Analysis +-------------------- + +Memory Access Efficiency +~~~~~~~~~~~~~~~~~~~~~~~~ + +LoadStoreTraits optimizes for several performance metrics: + +.. code-block:: cpp + + template + struct memory_access_analyzer + { + using Traits = load_store_traits; + + // Calculate memory bandwidth utilization + static constexpr float bandwidth_utilization() + { + constexpr index_t bytes_per_access = Traits::scalar_per_vector * sizeof(float); + constexpr index_t cache_line_size = 64; // bytes + return static_cast(bytes_per_access) / cache_line_size * 100.0f; + } + + // Calculate total memory transactions + static constexpr index_t total_transactions() + { + return Traits::num_access; + } + + // Check coalescing efficiency (see :ref:`ck_tile_gpu_basics`) + static constexpr bool is_perfectly_coalesced() + { + // Perfect coalescing when adjacent threads access adjacent memory + return Traits::vector_dim_y == Distribution::ndim_y - 1 && + Traits::scalar_per_vector >= 4; + } + }; + +Comparing Different Configurations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Configuration 1: Simple 8x8 tile + using Simple8x8 = tile_distribution_encoding< + sequence<>, + tuple, sequence<2, 4>>, + tuple, sequence<1>>, + tuple, sequence<0>>, + sequence<4, 4>, + sequence<1, 1> + >; + + // Configuration 2: Optimized for vectorization + using OptimizedVector = tile_distribution_encoding< + sequence<>, + tuple, sequence<2, 8>>, + tuple, sequence<1>>, + tuple, sequence<0>>, + sequence<2, 8>, // 2x8 per thread for better vectorization + sequence<1, 1> + >; + + // Analysis + using SimpleAnalyzer = memory_access_analyzer; + using OptimizedAnalyzer = memory_access_analyzer; + + static_assert(SimpleAnalyzer::bandwidth_utilization() == 25.0f); // 4*4/64 + static_assert(OptimizedAnalyzer::bandwidth_utilization() == 50.0f); // 8*4/64 + + // Better bandwidth utilization leads to improved performance + // See :ref:`ck_tile_gemm_optimization` for real-world examples + +Integration with Space-Filling Curves +------------------------------------- + +LoadStoreTraits automatically configures space-filling curves for optimal access: + +.. 
code-block:: cpp + + template + struct space_filling_curve_optimizer + { + using Traits = load_store_traits; + + static constexpr auto create_optimized_curve() + { + // Move vector dimension to end of access order + array dim_order; + + // Fill non-vector dimensions first + index_t pos = 0; + for (index_t i = 0; i < Distribution::ndim_y; ++i) { + if (i != Traits::vector_dim_y) { + dim_order[pos++] = i; + } + } + + // Vector dimension last for contiguous access + dim_order[pos] = Traits::vector_dim_y; + + // Create space-filling curve with optimized order + return space_filling_curve{ + Distribution::get_y_lengths(), + dim_order, + Traits::scalar_per_vector, + true // Enable snake pattern + }; + } + }; + +Advanced Optimizations +---------------------- + +Multi-Level Vectorization +~~~~~~~~~~~~~~~~~~~~~~~~~ + +For complex :ref:`distributions `, LoadStoreTraits can identify multiple levels of vectorization: + +.. code-block:: cpp + + template + struct multi_level_vectorization + { + // Primary vector dimension (innermost, stride 1) + static constexpr index_t primary_vector_dim = + load_store_traits::vector_dim_y; + + // Secondary vector dimension (next best option) + static constexpr index_t secondary_vector_dim = []() { + const auto strides = Distribution::calculate_y_strides(); + for (index_t i = 0; i < Distribution::ndim_y; ++i) { + if (i != primary_vector_dim && + strides[i] <= 4) { // Small stride + return i; + } + } + return -1; + }(); + + // Can use 2D vectorization? + static constexpr bool supports_2d_vector = secondary_vector_dim >= 0; + }; + +Adaptive Vector Size Selection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +LoadStoreTraits adapts vector size based on multiple factors: + +.. code-block:: cpp + + template + struct adaptive_vector_size + { + static constexpr index_t calculate_optimal_vector_size() + { + constexpr index_t dim_length = + Distribution::get_y_length(load_store_traits::vector_dim_y); + + // Hardware-specific vector sizes + constexpr array valid_sizes = {8, 4, 2, 1}; + + // Find largest valid size that divides dimension length + for (auto size : valid_sizes) { + if (dim_length % size == 0 && + size * sizeof(DataType) <= 32) { // Max vector register size + return size; + } + } + return 1; + } + }; + +Best Practices +-------------- + +1. **Design Distributions for Vectorization** + + .. code-block:: cpp + + // Good: Inner dimension is power of 2 + using GoodDist = tile_distribution_encoding< + sequence<>, + tuple, sequence<2, 8>>, // Inner dim = 16 + tuple, sequence<1>>, + tuple, sequence<0>>, + sequence<2, 8>, // 8 elements for vectorization + sequence<1, 1> + >; + +2. **Consider Data Type Size** + + .. code-block:: cpp + + // Adjust distribution based on data type + template + using AdaptiveDist = std::conditional_t< + sizeof(DataType) == 2, // FP16 + tile_distribution_encoding<...>, // 8-wide vectors + tile_distribution_encoding<...> // 4-wide vectors for FP32 + >; + +3. **Align for Cache Lines** + + .. code-block:: cpp + + // Ensure tile dimensions align with cache lines + static_assert(TileDist::get_tile_n() * sizeof(float) % 64 == 0, + "Tile width should align to cache lines"); + + For more optimization techniques, see :ref:`ck_tile_lds_bank_conflicts` and :ref:`ck_tile_lds_index_swapping`. + +Summary +------- + +LoadStoreTraits provides: + +- **Automatic vectorization analysis**: Identifies optimal dimensions and vector sizes +- **Space-filling curve optimization**: Creates cache-friendly access patterns. See :ref:`ck_tile_space_filling_curve` for more information. 
+- **Compile-time optimization**: All analysis done at compile time for zero overhead +- **Hardware adaptation**: Adjusts to different data types and :ref:`architectures ` +- **Performance transparency**: Clear metrics for memory efficiency + +The compile-time analysis performed by LoadStoreTraits ensures that every memory operation in CK Tile achieves near-optimal performance, making it a critical component in the high-performance computing stack. + +Next Steps +---------- + +- :ref:`ck_tile_space_filling_curve` - Deep dive into traversal patterns +- :ref:`ck_tile_tile_window` - How LoadStoreTraits enables efficient data access +- :ref:`ck_tile_static_distributed_tensor` - The target of optimized loads/stores +- :ref:`ck_tile_coordinate_systems` - Understanding the coordinate transformations +- :ref:`ck_tile_gemm_optimization` - Real-world application of LoadStoreTraits diff --git a/docs/conceptual/ck_tile/space_filling_curve.rst b/docs/conceptual/ck_tile/space_filling_curve.rst new file mode 100644 index 0000000000..4b95f71a69 --- /dev/null +++ b/docs/conceptual/ck_tile/space_filling_curve.rst @@ -0,0 +1,511 @@ +.. _ck_tile_space_filling_curve: + +Space-Filling Curves - Optimal Memory Traversal +=============================================== + +Overview +-------- + +The SpaceFillingCurve (SFC) class provides a systematic way to traverse multi-dimensional tensors, supporting both scalar and vectorized access patterns. This is particularly important for optimizing memory access patterns in :ref:`GPU kernels `, where the order of memory accesses can dramatically impact performance through cache utilization, memory coalescing, and prefetching effectiveness. + +A space-filling curve is a continuous curve that visits every point in a multi-dimensional space exactly once. In the context of CK Tile, it defines a mapping from a 1D access index to multi-dimensional :ref:`tensor coordinates `, enabling efficient traversal patterns that maximize hardware utilization. + +Key Concepts +------------ + +Tensor Traversal +~~~~~~~~~~~~~~~~ + +The space-filling curve defines a mapping from a 1D access index to multi-dimensional tensor coordinates. This abstraction allows complex multi-dimensional access patterns to be expressed as simple linear iterations, while maintaining optimal memory access characteristics. + +Vectorized Access +~~~~~~~~~~~~~~~~~ + +:ref:`GPUs ` support vector load and store instructions that can access multiple consecutive elements in a single operation. SpaceFillingCurve supports this by allowing specification of how many elements to access per dimension (``scalars_per_access``), enabling efficient utilization of these hardware features. + +Dimension Ordering +~~~~~~~~~~~~~~~~~~ + +The order in which dimensions are traversed impacts memory access patterns. Row-major vs column-major ordering, for example, can mean the difference between the preferred sequential memory access and strided access which can potentially cause cache thrashing. + +Snake Patterns +~~~~~~~~~~~~~~ + +Snake, or serpentine, patterns reverse the traversal direction on alternate rows and planes, keeping consecutive accesses spatially close. This pattern is particularly effective for maintaining cache locality when moving between rows or higher-dimensional boundaries. + +Usage +~~~~~ + +SFC mainly uses Tile Transpose, Tile shuffling iteration, and CShuffle to access the tile data in the discrete way the application requires and have the best cache memory coherent hit. 
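Before looking at the full template in the next section, it can help to see the index arithmetic in isolation. The following is a minimal, standalone sketch in plain C++ (not the CK Tile API; the names are illustrative) of how a 1D access index is decomposed into 2D coordinates for a 4x8 tile with vector-4 access on the fast-changing dimension and row-major access order, matching the vectorized example shown later on this page.

.. code-block:: cpp

   #include <cstdio>

   // Standalone illustration (not the CK Tile API): decompose a 1D access
   // index into 2D coordinates for a 4x8 tile with vector-4 access on the
   // fast-changing dimension, using row-major access order.
   int main()
   {
       constexpr int lengths[2]            = {4, 8}; // tensor shape
       constexpr int scalars_per_access[2] = {1, 4}; // vector-4 on dimension 1

       // Accesses per dimension (ceiling division, as in access_lengths below)
       constexpr int access_lengths[2] = {
           (lengths[0] + scalars_per_access[0] - 1) / scalars_per_access[0],
           (lengths[1] + scalars_per_access[1] - 1) / scalars_per_access[1]};

       const int num_access = access_lengths[0] * access_lengths[1];
       for(int i = 0; i < num_access; ++i)
       {
           // Row-major decomposition of the 1D index, scaled by vector width
           const int row = (i / access_lengths[1]) * scalars_per_access[0];
           const int col = (i % access_lengths[1]) * scalars_per_access[1];
           std::printf("access %d -> row %d, cols [%d:%d]\n",
                       i, row, col, col + scalars_per_access[1] - 1);
       }
       return 0;
   }

The printed pattern is the same one the vectorized example below produces: access 0 covers row 0, cols [0:3], access 1 covers row 0, cols [4:7], access 2 covers row 1, cols [0:3], and so on.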
+ +C++ Implementation +------------------ + +The C++ template class provides compile-time optimization of traversal patterns: + +.. code-block:: cpp + + template + struct space_filling_curve + { + static constexpr index_t ndim = NDimSFC; + static constexpr auto tensor_lengths = SFCLengths{}; + static constexpr auto dim_access_order = DimAccessOrder{}; + static constexpr auto scalars_per_access = ScalarsPerAccess{}; + static constexpr bool snake_curved = IsSnakeCurved; + + // Calculate access dimensions (with ceiling division) + static constexpr auto access_lengths = []() { + array lengths; + for (index_t i = 0; i < ndim; ++i) { + lengths[i] = (tensor_lengths[i] + scalars_per_access[i] - 1) + / scalars_per_access[i]; + } + return lengths; + }(); + + // Total number of accesses needed + static constexpr index_t get_num_of_access() + { + index_t total = 1; + for (index_t i = 0; i < ndim; ++i) { + total *= access_lengths[i]; + } + return total; + } + + // Convert 1D access index to N-D coordinates + CK_TILE_DEVICE constexpr auto get_index(index_t i_access) const + { + array indices; + + // Calculate indices in access space + index_t remaining = i_access; + for (index_t i = ndim - 1; i >= 0; --i) { + const index_t dim = dim_access_order[i]; + indices[dim] = remaining % access_lengths[dim]; + remaining /= access_lengths[dim]; + } + + // Apply snake pattern if enabled + if constexpr (snake_curved) { + apply_snake_pattern(indices); + } + + // Scale by scalars_per_access + for (index_t i = 0; i < ndim; ++i) { + indices[i] *= scalars_per_access[i]; + } + + return indices; + } + + // Calculate step between two accesses + CK_TILE_DEVICE constexpr auto get_step_between( + index_t start, index_t end) const + { + const auto start_idx = get_index(start); + const auto end_idx = get_index(end); + + array step; + for (index_t i = 0; i < ndim; ++i) { + step[i] = end_idx[i] - start_idx[i]; + } + return step; + } + }; + +Basic Usage Examples +-------------------- + +Scalar Access Patterns +~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Row-major traversal of 4x6 matrix + using RowMajorCurve = space_filling_curve< + 2, // 2D + sequence<4, 6>, // Shape: 4x6 + sequence<0, 1>, // Dimension order: row then column + sequence<1, 1>, // Scalar access + false // No snake pattern + >; + + // Total accesses needed + constexpr index_t num_access = RowMajorCurve::get_num_of_access(); // 24 + + // Access pattern (first 10) + static_for<0, 10, 1>{}([](auto i) { + constexpr auto indices = RowMajorCurve{}.get_index(i); + printf("Access %d: [%d, %d]\n", i, indices[0], indices[1]); + }); + // Output: [0,0], [0,1], [0,2], [0,3], [0,4], [0,5], [1,0], [1,1], ... + +Vectorized Access Patterns +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Vector-4 access on dimension 1 + using VectorizedCurve = space_filling_curve< + 2, // 2D + sequence<4, 8>, // Shape: 4x8 + sequence<0, 1>, // Row-major + sequence<1, 4>, // Vector-4 on dimension 1 + false + >; + + // Access pattern visualization + static_for<0, VectorizedCurve::get_num_of_access(), 1>{}([](auto i) { + constexpr auto indices = VectorizedCurve{}.get_index(i); + printf("Access %d: row %d, cols [%d:%d]\n", + i, indices[0], indices[1], indices[1] + 3); + }); + // Output: row 0, cols [0:3], row 0, cols [4:7], row 1, cols [0:3], ... + +Column-Major vs Row-Major +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: cpp + + // Compare access patterns + using RowMajor = space_filling_curve<2, sequence<4, 6>, + sequence<0, 1>, sequence<1, 1>, false>; + using ColMajor = space_filling_curve<2, sequence<4, 6>, + sequence<1, 0>, sequence<1, 1>, false>; + + // Row-major: [0,0], [0,1], [0,2], ... (traverse rows) + // Col-major: [0,0], [1,0], [2,0], ... (traverse columns) + +Advanced Patterns +----------------- + +Snake Pattern for Cache Optimization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The snake pattern reverses traversal direction on alternate rows, minimizing the distance between consecutive accesses: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Linear Pattern" + L1["Row 0: →"] + L2["Row 1: →"] + L3["Jump back"] + L4["Row 2: →"] + end + + subgraph "Snake Pattern" + S1["Row 0: →"] + S2["Row 1: ←"] + S3["Continue"] + S4["Row 2: →"] + end + + L1 --> L3 + L3 --> L2 + L2 --> L3 + L3 --> L4 + + S1 --> S2 + S2 --> S4 + + style L3 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style S3 fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + +.. image:: diagrams/space_filling_curve.svg + :alt: Diagram + :align: center + +.. code-block:: cpp + + using SnakeCurve = space_filling_curve< + 2, + sequence<4, 8>, + sequence<0, 1>, + sequence<1, 1>, + true // Enable snake pattern + >; + + // Access pattern with snake: + // Row 0: [0,0], [0,1], [0,2], ..., [0,7] + // Row 1: [1,7], [1,6], [1,5], ..., [1,0] (reversed!) + // Row 2: [2,0], [2,1], [2,2], ..., [2,7] + // Row 3: [3,7], [3,6], [3,5], ..., [3,0] (reversed!) + +GEMM Tile Access Pattern +~~~~~~~~~~~~~~~~~~~~~~~~ + +For :ref:`matrix multiplication `, optimal access patterns are crucial: + +.. code-block:: cpp + + // GEMM tile: 16x32 with vector-8 loads + // Column-major for coalesced access in GEMM + // See :ref:`ck_tile_gemm_optimization` for complete example + using GemmTileCurve = space_filling_curve< + 2, + sequence<16, 32>, // Tile size + sequence<1, 0>, // Column-major + sequence<1, 8>, // Vector-8 loads + false + >; + + // This creates a pattern where: + // - Each access loads 8 consecutive elements + // - Accesses proceed down columns (coalesced for column-major storage) + // - Total accesses: 16 * (32/8) = 64 + +3D Tensor Patterns +~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // 3D tensor with mixed vectorization + using Tensor3D = space_filling_curve< + 3, + sequence<4, 8, 16>, // 4x8x16 tensor + sequence<0, 1, 2>, // Access order + sequence<1, 2, 4>, // Different vector sizes per dimension + false + >; + + // Access pattern: + // - Dimension 0: scalar access + // - Dimension 1: vector-2 access + // - Dimension 2: vector-4 access + // Total accesses: 4 * (8/2) * (16/4) = 64 + +Performance Analysis +-------------------- + +Step Analysis for Memory Patterns +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Understanding step patterns between accesses is crucial for performance: + +.. 
code-block:: cpp + + template + struct access_pattern_analyzer + { + static constexpr void analyze_locality() + { + index_t sequential_steps = 0; + index_t cache_line_jumps = 0; + index_t large_jumps = 0; + + constexpr SFC sfc{}; + + for (index_t i = 0; i < SFC::get_num_of_access() - 1; ++i) { + const auto step = sfc.get_step_between(i, i + 1); + + // Calculate Manhattan distance + index_t distance = 0; + for (index_t d = 0; d < SFC::ndim; ++d) { + distance += abs(step[d]); + } + + if (distance <= 1) { + sequential_steps++; + } else if (distance <= 16) { // Within cache line + cache_line_jumps++; + } else { + large_jumps++; + } + } + + // Report statistics... + } + }; + +Optimizing for Hardware +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Optimize for GPU memory coalescing (see :ref:`ck_tile_gpu_basics`) + template + struct coalesced_access_pattern + { + // For coalescing, adjacent threads should access adjacent memory + static constexpr index_t vector_size = sizeof(float4) / sizeof(DataType); + + using OptimalPattern = space_filling_curve< + 2, + sequence, + sequence<1, 0>, // Column-major for coalescing + sequence<1, vector_size>, // Vectorized on fast-changing dimension + false + >; + }; + +Handling Edge Cases +------------------- + +Non-Divisible Dimensions +~~~~~~~~~~~~~~~~~~~~~~~~ + +When tensor dimensions aren't evenly divisible by vector size: + +.. code-block:: cpp + + // 5x7 tensor with 2x3 access pattern + using EdgeCaseCurve = space_filling_curve< + 2, + sequence<5, 7>, + sequence<0, 1>, + sequence<2, 3>, + false + >; + + // Access lengths use ceiling division: ceil(5/2) x ceil(7/3) = 3x3 + static_assert(EdgeCaseCurve::access_lengths[0] == 3); + static_assert(EdgeCaseCurve::access_lengths[1] == 3); + + // Boundary handling needed for accesses that exceed tensor bounds + template + CK_TILE_DEVICE void safe_access(index_t i_access) + { + const auto indices = SFC{}.get_index(i_access); + + // Check bounds for each dimension + bool in_bounds = true; + for (index_t d = 0; d < SFC::ndim; ++d) { + if (indices[d] + SFC::scalars_per_access[d] > SFC::tensor_lengths[d]) { + in_bounds = false; + break; + } + } + + if (in_bounds) { + // Full vector access + } else { + // Partial access with masking + } + } + +Integration with CK Tile +------------------------ + +LoadStoreTraits Integration +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:ref:`LoadStoreTraits ` uses space-filling curves to optimize memory access: + +.. code-block:: cpp + + template + struct load_store_traits + { + // Create optimized space-filling curve + // See :ref:`ck_tile_tile_distribution` for Distribution details + using sfc_type = space_filling_curve< + Distribution::ndim_y, + typename Distribution::y_lengths, + optimized_dim_order, // Computed order + optimized_scalars_per_access, + true // Enable snake for cache optimization + >; + + static constexpr sfc_type sfc_ys{}; + }; + +TileWindow Usage +~~~~~~~~~~~~~~~~ + +:ref:`TileWindow ` leverages space-filling curves for systematic tile traversal: + +.. code-block:: cpp + + template + CK_TILE_DEVICE void process_tile(const TileWindow& window) + { + using Traits = typename TileWindow::traits_type; + constexpr auto sfc = Traits::sfc_ys; + + // Traverse tile using space-filling curve + static_for<0, sfc.get_num_of_access(), 1>{}([&](auto i) { + const auto indices = sfc.get_index(i); + // Process element at indices... + }); + } + +Best Practices +-------------- + +1. **Choose Appropriate Dimension Order** + + .. 
code-block:: cpp + + // For row-major storage, use row-major traversal + using RowMajorSFC = space_filling_curve<2, Shape, sequence<0, 1>, ...>; + + // For column-major storage, use column-major traversal + using ColMajorSFC = space_filling_curve<2, Shape, sequence<1, 0>, ...>; + +2. **Optimize Vector Size** + + .. code-block:: cpp + + // Match vector size to cache line for optimal bandwidth + // See :ref:`ck_tile_lds_bank_conflicts` for cache optimization + constexpr index_t optimal_vector = min( + tensor_length_fast_dim, + cache_line_size / sizeof(DataType) + ); + +3. **Enable Snake Pattern for Large Tensors** + + .. code-block:: cpp + + // Snake pattern helps when jumping between rows/planes + using CacheFriendlySFC = space_filling_curve< + NDim, Lengths, Order, Scalars, + true // Enable snake + >; + +4. **Consider Memory Layout** + + .. code-block:: cpp + + // Align access patterns with physical memory layout + static_assert( + SFC::scalars_per_access[fastest_dim] * sizeof(DataType) + % cache_line_size == 0, + "Vector access should align with cache lines" + ); + +Summary +------- + +Space-filling curves provide: + +- **Systematic traversal**: Convert N-D access to 1D iteration +- **Vectorization support**: Efficient use of vector load and store instructions +- **Cache optimization**: Snake patterns and dimension ordering for locality +- **Flexibility**: Adaptable to different :ref:`tensor shapes ` and access patterns +- **Performance**: Compile-time optimization with zero runtime overhead + +The advanced traversal patterns enabled by space-filling curves are fundamental to achieving high performance in GPU kernels, ensuring that memory access patterns align with :ref:`hardware capabilities `. + +Next Steps +---------- + +- :ref:`ck_tile_load_store_traits` - How curves optimize memory access +- :ref:`ck_tile_sweep_tile` - Traversing distributed tensors +- :ref:`ck_tile_static_distributed_tensor` - The data structures being traversed +- :ref:`ck_tile_tile_window` - Integration with data loading +- :ref:`ck_tile_gemm_optimization` - Real-world application example diff --git a/docs/conceptual/ck_tile/static_distributed_tensor.rst b/docs/conceptual/ck_tile/static_distributed_tensor.rst new file mode 100644 index 0000000000..bfd50c0899 --- /dev/null +++ b/docs/conceptual/ck_tile/static_distributed_tensor.rst @@ -0,0 +1,429 @@ +.. meta:: + :description: CK Tile static distributed tensor documentation + :keywords: CK Tile, static distributed tensor, thread-local storage, GPU programming, ROCM + +.. _ck_tile_static_distributed_tensor: + +************************* +Static Distributed Tensor +************************* + +Overview +======== + +Static distributed tensors represent the thread-local data containers in CK Tile's programming model. Unlike traditional GPU programming where developers manually manage thread-local arrays and coordinate access patterns, static distributed tensors provide a high-level abstraction that automatically handles data distribution across threads while maintaining the performance characteristics of register-based storage. + +Each thread in a workgroup owns a portion of the overall tensor data, stored in its registers or local memory. The distribution pattern follows the :ref:`tile distribution ` rules, ensuring that collective operations across all threads reconstruct the complete logical tensor while individual threads operate only on their local portions. 
+ +This design enables three critical optimizations: + + * It maximizes register utilization by keeping frequently accessed data in the fastest memory hierarchy. + * It eliminates redundant memory accesses since each thread maintains its own working set. + * It provides a clean abstraction for complex algorithms like matrix multiplication where each thread accumulates partial results that eventually combine into the final output. + +Thread-Local Storage Model +========================== + +The static distributed tensor implements an advanced storage model that maps multi-dimensional tensor data to thread-local arrays: + +.. code-block:: cpp + + template + struct StaticDistributedTensor { + // Each thread stores its portion of the tensor + static constexpr index_t kNumElements = + TileDistribution::GetNumElementsPerThread(); + + // Thread-local storage - typically maps to registers + DataType data_[kNumElements]; + + // Access using Y-space coordinates (see :ref:`ck_tile_coordinate_systems`) + __device__ DataType& operator()(const YIndex& idx) { + // Convert Y coordinate to local buffer offset + index_t offset = TileDistribution::YToLocalOffset(idx); + return data_[offset]; + } + }; + +The storage layout follows these principles: + +1. **Contiguous Storage**: Each thread's data is stored in a contiguous array, optimizing register allocation and enabling vectorized operations. + +2. **Deterministic Mapping**: The Y-coordinate to buffer offset mapping is computed at compile time, eliminating runtime overhead. + +3. **Alignment Guarantees**: The storage layout respects hardware alignment requirements for efficient memory operations. + +Memory Layout and Access Patterns +================================= + +Understanding how static distributed tensors organize memory is important for performance optimization. Consider a 2D tensor distributed across a 2D thread block: + +.. code-block:: cpp + + // Define a 64x64 tensor distributed across 16x16 threads + using TileDist = TileDistribution< + Sequence<64, 64>, // Tensor dimensions + Sequence<16, 16> // Thread block dimensions + >; + + // Each thread owns a 4x4 subtensor + using MyTensor = StaticDistributedTensor; + + __device__ void example_kernel() { + MyTensor accumulator; + + // Initialize thread-local data + for(index_t i = 0; i < 4; ++i) { + for(index_t j = 0; j < 4; ++j) { + // Y-space coordinates for this thread's elements + YIndex y_idx = make_tuple( + threadIdx.y * 4 + i, + threadIdx.x * 4 + j + ); + accumulator(y_idx) = 0.0f; + } + } + } + +The memory layout follows a hierarchical pattern: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TD + A[Global Tensor 64x64] --> B[Thread Block 16x16] + B --> C[Thread 0,0
Elements 0:3,0:3] + B --> D[Thread 0,1
Elements 0:3,4:7] + B --> E[Thread 1,0
Elements 4:7,0:3] + B --> F[...] + + C --> G[Local Array
16 elements] + D --> H[Local Array
16 elements] + E --> I[Local Array
16 elements] + + + + + +.. image:: diagrams/static_distributed_tensor.svg + :alt: Diagram + :align: center + +Element Access and Indexing +=========================== + +Static distributed tensors provide multiple indexing modes to support different access patterns: + +.. code-block:: cpp + + template + class StaticDistributedTensor { + public: + // Y-space indexing (most common) - see :ref:`ck_tile_coordinate_systems` + __device__ DataType& operator()(const YIndex& y_idx) { + return data_[YToOffset(y_idx)]; + } + + // Direct buffer indexing (for vectorized operations) + __device__ DataType& operator[](index_t offset) { + return data_[offset]; + } + + // Structured access for multi-dimensional patterns + template + __device__ DataType& at(Coords... coords) { + YIndex y_idx = make_tuple(coords...); + return (*this)(y_idx); + } + + // Vectorized access for performance + template + __device__ auto get_vector(index_t offset) { + using VectorType = vector_type_t; + return *reinterpret_cast(&data_[offset]); + } + }; + +The indexing system supports several optimization strategies: + +1. **Compile-Time Resolution**: When indices are known at compile time, the compiler can optimize away all indexing calculations. + +2. **Vectorized Access**: Accessing multiple elements as vectors enables efficient register-to-register transfers. + +3. **Boundary Checking**: Debug builds include automatic boundary checking to catch indexing errors early. + +Thread Coordination and Synchronization +======================================= + +Static distributed tensors excel at patterns where threads cooperate to process larger data structures: + +.. code-block:: cpp + + // Matrix multiplication accumulator pattern + // See :ref:`ck_tile_gemm_optimization` for complete example + template + __device__ void gemm_accumulate( + const TileWindow& a_window, + const TileWindow& b_window, + StaticDistributedTensor& c_accumulator) + { + // Each thread accumulates its portion + constexpr index_t kInnerTiles = 8; + + for(index_t k = 0; k < kInnerTiles; ++k) { + // Load tiles from global memory + auto a_tile = a_window.load(k); + auto b_tile = b_window.load(k); + + // Synchronize to ensure all loads complete + __syncthreads(); + + // Local accumulation (no synchronization needed) + for(index_t i = 0; i < 4; ++i) { + for(index_t j = 0; j < 4; ++j) { + CType sum = 0; + for(index_t kk = 0; kk < 4; ++kk) { + sum += a_tile(i, kk) * b_tile(kk, j); + } + c_accumulator.at(i, j) += sum; + } + } + } + } + +Key coordination patterns include: + +1. **Accumulation**: Each thread maintains partial results that combine to form the final answer. + +2. **Scatter/Gather**: Threads can efficiently reorganize data through coordinated read/write patterns. + +3. **Reduction**: Tree-based reduction algorithms naturally map to the distributed storage model. + +Practical Usage Patterns +======================== + +Static distributed tensors are useful in many common GPU programming patterns: + +**1. Register Blocking for Matrix Operations** + +.. code-block:: cpp + + // Optimize register usage for small matrix tiles + template + struct RegisterTile { + using Distribution = TileDistribution< + Sequence, + Sequence<1, 1> // Single thread owns entire tile + >; + using Tensor = StaticDistributedTensor; + + __device__ void compute() { + Tensor tile; + // All M*N elements in registers of one thread + // Enables aggressive unrolling and scheduling + } + }; + +**2. Warp-Level Primitives** + +.. 
code-block:: cpp + + // Distribute across warp for collaborative operations + template + struct WarpDistributedVector { + using Distribution = TileDistribution< + Sequence<32>, // 32 elements + Sequence<32> // 32 threads in warp + >; + using Tensor = StaticDistributedTensor; + + __device__ T warp_reduce_sum() { + Tensor data; + // Each thread has one element + // Use warp shuffle for reduction + T value = data[0]; + for(int offset = 16; offset > 0; offset /= 2) { + value += __shfl_down_sync(0xffffffff, value, offset); + } + return value; + } + }; + +**3. Shared Memory Staging** + +.. code-block:: cpp + + // Combine with shared memory for complex patterns + // See :ref:`ck_tile_lds_bank_conflicts` for LDS optimization + template + struct StagedComputation { + using RegTensor = StaticDistributedTensor; + using LdsTensor = StaticDistributedTensor; + + __device__ void process() { + RegTensor reg_data; + __shared__ T shared_buffer[1024]; + + // Stage 1: Compute in registers + compute_local(reg_data); + + // Stage 2: Exchange through shared memory + store_to_lds(reg_data, shared_buffer); + __syncthreads(); + + // Stage 3: Load different pattern + LdsTensor lds_data; + load_from_lds(shared_buffer, lds_data); + } + }; + +Performance Considerations +========================== + +Optimizing static distributed tensor usage requires understanding several :ref:`performance factors `: + +**Register Pressure**: Each thread's local storage typically maps to registers. Excessive storage requirements can cause register spilling: + +.. code-block:: cpp + + // Monitor register usage + template + struct RegisterAnalysis { + static constexpr index_t kRegistersPerElement = sizeof(T) / 4; + static constexpr index_t kTotalRegisters = Size * kRegistersPerElement; + + static_assert(kTotalRegisters <= 64, + "Exceeds typical register budget"); + }; + +**Memory Coalescing**: When loading/storing distributed tensors, ensure access patterns promote coalescing. See :ref:`ck_tile_gpu_basics` for more information about coalescing. + +.. code-block:: cpp + + // Good: Coalesced access pattern + template + __device__ void coalesced_store(Tensor& tensor, float* global_ptr) { + index_t tid = threadIdx.x + blockIdx.x * blockDim.x; + #pragma unroll + for(index_t i = 0; i < Tensor::kNumElements; ++i) { + global_ptr[tid + i * gridDim.x * blockDim.x] = tensor[i]; + } + } + +**Instruction Scheduling**: Organize operations to maximize instruction-level parallelism: + +.. code-block:: cpp + + // Interleave independent operations + template + __device__ void optimized_accumulate(Tensor& acc, + const Tensor& a, + const Tensor& b) { + #pragma unroll + for(index_t i = 0; i < Tensor::kNumElements; i += 4) { + // Group independent operations + float tmp0 = a[i+0] * b[i+0]; + float tmp1 = a[i+1] * b[i+1]; + float tmp2 = a[i+2] * b[i+2]; + float tmp3 = a[i+3] * b[i+3]; + + // Accumulate after multiplies complete + acc[i+0] += tmp0; + acc[i+1] += tmp1; + acc[i+2] += tmp2; + acc[i+3] += tmp3; + } + } + +Integration with CK Tile Ecosystem +================================== + +Static distributed tensors integrate seamlessly with other CK Tile components: + +.. 
code-block:: cpp + + // Complete example: Distributed GEMM kernel + template + __global__ void distributed_gemm_kernel( + const float* __restrict__ a_ptr, + const float* __restrict__ b_ptr, + float* __restrict__ c_ptr, + index_t M, index_t N, index_t K) + { + // Define distributions + constexpr index_t kTileM = 128; + constexpr index_t kTileN = 128; + constexpr index_t kTileK = 32; + + using ATileDist = TileDistribution< + Sequence, + Sequence<32, 8> + >; + using BTileDist = TileDistribution< + Sequence, + Sequence<8, 32> + >; + using CTileDist = TileDistribution< + Sequence, + Sequence<32, 32> + >; + + // Create distributed accumulator + StaticDistributedTensor c_accumulator; + + // Initialize to zero + #pragma unroll + for(index_t i = 0; i < c_accumulator.kNumElements; ++i) { + c_accumulator[i] = 0.0f; + } + + // Main GEMM loop + for(index_t k_tile = 0; k_tile < K; k_tile += kTileK) { + // Create tile windows for this iteration + // See :ref:`ck_tile_tile_window` for details + auto a_window = make_tile_window( + a_ptr, ALayout{M, K}, + ATileDist{}, + {blockIdx.y * kTileM, k_tile} + ); + + auto b_window = make_tile_window( + b_ptr, BLayout{K, N}, + BTileDist{}, + {k_tile, blockIdx.x * kTileN} + ); + + // Load tiles to distributed tensors + // See :ref:`ck_tile_load_store_traits` for optimized loading + auto a_tile = a_window.load(); + auto b_tile = b_window.load(); + + // Distributed matrix multiply + distributed_gemm_accumulate(a_tile, b_tile, c_accumulator); + } + + // Store results + auto c_window = make_tile_window( + c_ptr, CLayout{M, N}, + CTileDist{}, + {blockIdx.y * kTileM, blockIdx.x * kTileN} + ); + c_window.store(c_accumulator); + } + +Summary +======= + +Static distributed tensors provide the foundation for high-performance thread-local computation in CK Tile. By abstracting the complexities of register allocation, thread coordination, and memory access patterns, they enable developers to write clear, maintainable code that achieves hardware efficiency. The key benefits include: + +- **Automatic Distribution**: The :ref:`tile distribution ` system handles all thread-to-data mapping +- **Register Efficiency**: Thread-local storage maps directly to registers when possible +- **Zero-Overhead Abstraction**: All distribution logic resolves at compile time +- **Seamless Integration**: Works naturally with :ref:`tile windows `, :ref:`descriptors `, and other CK Tile components +- **Performance Transparency**: The storage model makes performance characteristics clear and predictable + +When combined with the broader CK Tile ecosystem, static distributed tensors enable the construction of complex GPU kernels that match hand-tuned assembly performance while maintaining the clarity of high-level mathematical expressions. diff --git a/docs/conceptual/ck_tile/sweep_tile.rst b/docs/conceptual/ck_tile/sweep_tile.rst new file mode 100644 index 0000000000..4dfb6a2ad1 --- /dev/null +++ b/docs/conceptual/ck_tile/sweep_tile.rst @@ -0,0 +1,560 @@ +.. meta:: + :description: CK Tile sweep operations documentation + :keywords: CK Tile, sweep operations, tile iteration, GPU programming + +.. _ck_tile_sweep_tile: + +********** +Sweep Tile +********** + +Overview +======== + +Sweep operations are the clean way to iterate over distributed data in CK Tile. They complete the tile distribution workflow by providing clean, efficient iteration patterns that automatically handle all the complex indexing details. Sweep operations are similar to ``forEach()`` operation. 
They call a function for every data element. + +Sweep operations use the "load once, use many times" pattern. Load X data once into registers, then sweep through Y positions while keeping X in fast memory. This maximizes data reuse and minimizes memory bandwidth requirements. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "X-Tile (Reused)" + XT["X data loaded once
Stays in registers"] + end + + subgraph "Y-Sweep" + Y1["Y position 0"] + Y2["Y position 1"] + Y3["Y position 2"] + YN["Y position N"] + end + + subgraph "Computation" + C["Process(X, Y)"] + end + + XT --> C + Y1 --> C + Y2 --> C + Y3 --> C + YN --> C + + style XT fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style C fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + + + + + +.. image:: diagrams/sweep_tile_1.svg + :alt: Diagram + :align: center + +The Complete GPU Workflow +========================= + +Sweep operations are the final piece of the distributed computing puzzle: + +1. **TileDistribution**: "Here's how to divide work" +2. **TileWindow**: "Here's the data, loaded efficiently" +3. **Sweep Operations**: "Here's how to process every element" +4. **User code**: "Thanks! *does computation*" + +Without sweep operations, manual nested loops and complex index calculations are required, increasing the risk of missing elements or double-processing. Sweep operations provide lambda-based iteration with automatic handling of all elements. + +See :ref:`ck_tile_coordinate_systems` for more information about coordinate systems. + +Basic Sweep Implementation +========================== + +The fundamental sweep pattern in C++: + +.. code-block:: cpp + + template + __device__ void sweep_tile( + const DistributedTensor& tensor, + Func&& func) + { + // Get Y-space dimensions + constexpr auto y_lengths = tensor.get_tile_distribution() + .get_y_vector_lengths(); + + // Generate nested loops at compile time + static_for<0, y_lengths.size(), 1>{}([&](auto i) { + sweep_tile_impl(tensor, func, make_tuple()); + }); + } + + // Recursive implementation for arbitrary dimensions + template + __device__ void sweep_tile_impl( + const DistributedTensor& tensor, + Func&& func, + tuple indices) + { + constexpr auto y_lengths = tensor.get_tile_distribution() + .get_y_vector_lengths(); + + if constexpr (Dim == y_lengths.size()) { + // Base case: call function with complete indices + func(make_multi_index(indices...)); + } else { + // Recursive case: iterate this dimension + static_for<0, y_lengths[Dim], 1>{}([&](auto i) { + sweep_tile_impl( + tensor, func, + tuple_cat(indices, make_tuple(i)) + ); + }); + } + } + + + +Memory Efficiency Pattern +========================= + +The sweep pattern provides significant memory efficiency benefits. This is particularly important for GPU architectures (see :ref:`ck_tile_gpu_basics`) where memory bandwidth is often the limiting factor: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Traditional Approach" + T1["Load X[0]"] --> P1["Process"] + T2["Load Y[0]"] --> P1 + T3["Load X[0]"] --> P2["Process"] + T4["Load Y[1]"] --> P2 + T5["Load X[0]"] --> P3["Process"] + T6["Load Y[2]"] --> P3 + Note1["X loaded 3 times!"] + end + + subgraph "Sweep Approach" + S1["Load X[0]"] --> SP["Process with
Y[0], Y[1], Y[2]"] + S2["Load Y[0,1,2]"] --> SP + Note2["X loaded once!"] + end + + style Note1 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style Note2 fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + +.. image:: diagrams/sweep_tile_2.svg + :alt: Diagram + :align: center + +Practical Sweep Patterns +======================== + +Pattern 1: Simple Element Processing +------------------------------------ + +This pattern demonstrates the basic usage with :ref:`ck_tile_static_distributed_tensor`: + +.. code-block:: cpp + + template + __device__ void simple_sweep_example( + StaticDistributedTensor& input, + StaticDistributedTensor& output) + { + // Process each element + sweep_tile(input, [&](auto y_indices) { + DataType value = input.get_element(y_indices); + DataType result = compute_function(value); + output.set_element(y_indices, result); + }); + } + +Pattern 2: Accumulation +----------------------- + +.. code-block:: cpp + + template + __device__ DataType sweep_accumulate( + const StaticDistributedTensor& tensor) + { + DataType sum = 0; + + sweep_tile(tensor, [&](auto y_indices) { + sum += tensor.get_element(y_indices); + }); + + return sum; + } + +Pattern 3: Conditional Processing +--------------------------------- + +.. code-block:: cpp + + template + __device__ void conditional_sweep( + StaticDistributedTensor& tensor, + DataType threshold) + { + sweep_tile(tensor, [&](auto y_indices) { + DataType value = tensor.get_element(y_indices); + if (value > threshold) { + // Process only values above threshold + tensor.set_element(y_indices, process_large_value(value)); + } + }); + } + + +GEMM Sweep Pattern +================== + +The sweep pattern is fundamental to high-performance matrix multiplication. See :ref:`ck_tile_gemm_optimization` for more information about GEMM optimization details. + +.. code-block:: cpp + + template + __device__ void gemm_sweep_tile( + const TileWindow& a_window, + const TileWindow& b_window, + TileWindow& c_window) + { + // Phase 1: Load A tile into registers (X dimension) + auto a_tile = make_static_distributed_tensor(); + a_window.load(a_tile); // Load once, reuse many times + + // Phase 2: Create C accumulator + auto c_accumulator = make_static_distributed_tensor(); + + // Initialize accumulator + sweep_tile(c_accumulator, [&](auto y_indices) { + c_accumulator.set_element(y_indices, 0); + }); + + // Phase 3: Sweep through B positions (Y dimension) + constexpr index_t k_per_block = BDistribution::get_lengths()[1]; + + for (index_t k = 0; k < k_per_block; ++k) { + // Load current B slice + auto b_slice = make_static_distributed_tensor(); + b_window.load_slice(b_slice, k); + + // Compute C += A * B for this slice + sweep_tile(c_accumulator, [&](auto c_indices) { + CDataType sum = c_accumulator.get_element(c_indices); + + // Inner product for this C element + constexpr index_t inner_dim = ADistribution::get_lengths()[1]; + for (index_t i = 0; i < inner_dim; ++i) { + auto a_indices = make_multi_index(c_indices[0], i); + auto b_indices = make_multi_index(i, c_indices[1]); + + sum += a_tile.get_element(a_indices) * + b_slice.get_element(b_indices); + } + + c_accumulator.set_element(c_indices, sum); + }); + } + + // Phase 4: Store result + c_window.store(c_accumulator); + } + +Advanced Sweep Patterns +======================= + +Multi-Dimensional Sweep +----------------------- + +.. 
code-block:: cpp + + template + __device__ void tensor_3d_sweep( + StaticDistributedTensor& tensor) + { + // Sweep through 3D tensor with nested structure + sweep_tile(tensor, [&](auto indices) { + // indices is MultiIndex<3> with [d0, d1, d2] + index_t d0 = indices[0]; + index_t d1 = indices[1]; + index_t d2 = indices[2]; + + // Process based on 3D position + DataType value = tensor.get_element(indices); + + // Example: Different processing for different planes + if (d2 == 0) { + // First plane: special processing + value = special_process(value); + } else { + // Other planes: normal processing + value = normal_process(value); + } + + tensor.set_element(indices, value); + }); + } + +Strided Sweep +------------- + +.. code-block:: cpp + + template + __device__ void strided_sweep( + const DistributedTensor& tensor, + Func&& func) + { + constexpr auto y_lengths = tensor.get_tile_distribution() + .get_y_vector_lengths(); + + // Sweep with stride in first dimension + static_for<0, y_lengths[0], Stride>{}([&](auto i) { + // Create indices for this strided position + auto indices = make_multi_index(i); + + // Complete remaining dimensions normally + sweep_remaining_dims<1>(tensor, func, indices); + }); + } + +Block Sweep for Cache Optimization +---------------------------------- + +This pattern leverages shared memory to avoid :ref:`ck_tile_lds_bank_conflicts`: + +.. code-block:: cpp + + template + __device__ void block_sweep_pattern( + StaticDistributedTensor& tensor) + { + constexpr auto y_lengths = tensor.get_tile_distribution() + .get_y_vector_lengths(); + constexpr index_t num_blocks = (y_lengths[0] + BlockSize - 1) / BlockSize; + + // Process in blocks for better cache utilization + static_for<0, num_blocks, 1>{}([&](auto block_id) { + constexpr index_t block_start = block_id * BlockSize; + constexpr index_t block_end = min(block_start + BlockSize, y_lengths[0]); + + // Load block data into shared memory + __shared__ DataType block_cache[BlockSize][y_lengths[1]]; + + // Cooperative load + static_for{}([&](auto i) { + static_for<0, y_lengths[1], 1>{}([&](auto j) { + auto indices = make_multi_index(i, j); + block_cache[i - block_start][j] = tensor.get_element(indices); + }); + }); + + __syncthreads(); + + // Process from cache + static_for<0, block_end - block_start, 1>{}([&](auto i) { + static_for<0, y_lengths[1], 1>{}([&](auto j) { + DataType value = block_cache[i][j]; + value = complex_process(value); + + auto indices = make_multi_index(block_start + i, j); + tensor.set_element(indices, value); + }); + }); + }); + } + +Performance Characteristics +=========================== + +Sweep operations provide several performance benefits: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Sweep Performance Benefits" + B1["Zero runtime overhead
Compile-time unrolling"] + B2["Perfect memory coalescing
Sequential access patterns"] + B3["Automatic vectorization
Compiler optimizations"] + B4["Register reuse
X data stays in VGPR"] + end + + subgraph "Use Cases" + U1["Matrix Multiplication
Reuse A columns"] + U2["Convolution
Reuse filter weights"] + U3["Reduction
Accumulate over Y"] + U4["Broadcast
Apply X to all Y"] + end + + B1 --> Performance["High Performance"] + B2 --> Performance + B3 --> Performance + B4 --> Performance + + Performance --> U1 + Performance --> U2 + Performance --> U3 + Performance --> U4 + + style Performance fill:#d1fae5,stroke:#10b981,stroke-width:3px + + + + + +.. image:: diagrams/sweep_tile_3.svg + :alt: Diagram + :align: center + +Compiler Optimizations +---------------------- + +Using :ref:`ck_tile_load_store_traits` and :ref:`ck_tile_space_filling_curve` enables optimal memory access patterns: + +.. code-block:: cpp + + // The compiler can optimize sweep patterns effectively + template + __device__ void optimized_sweep_example( + StaticDistributedTensor& tensor) + { + // This sweep pattern: + sweep_tile(tensor, [&](auto indices) { + tensor.set_element(indices, tensor.get_element(indices) * 2.0f); + }); + + // Compiles to something like: + // #pragma unroll + // for (index_t i = 0; i < tensor.size(); ++i) { + // tensor[i] *= 2.0f; + // } + + // With: + // - Complete unrolling for small tensors + // - Vectorized loads/stores + // - No function call overhead + // - Perfect instruction scheduling + } + +Integration with CK Tile Components +=================================== + +Complete workflow example: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart TB + subgraph "Complete Workflow" + TD["TileDistribution
Define data layout"] + TW["TileWindow
Create view"] + DT["DistributedTensor
Load X data"] + ST["SweepTile
Iterate Y positions"] + R["Results
Store outputs"] + end + + TD --> TW + TW --> DT + DT --> ST + ST --> R + + style TD fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style ST fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style R fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + +.. image:: diagrams/sweep_tile_4.svg + :alt: Diagram + :align: center + +.. code-block:: cpp + + template + __global__ void complete_tile_kernel( + const DataType* input, + DataType* output, + index_t M, index_t N) + { + // 1. Define distribution + constexpr index_t BlockM = 64; + constexpr index_t BlockN = 64; + + using Distribution = TileDistribution< + Sequence, + Sequence<16, 16> + >; + + // 2. Create tile windows + auto input_window = make_tile_window( + input, make_tuple(M, N), + make_tuple(blockIdx.y * BlockM, blockIdx.x * BlockN), + Distribution{} + ); + + auto output_window = make_tile_window( + output, make_tuple(M, N), + make_tuple(blockIdx.y * BlockM, blockIdx.x * BlockN), + Distribution{} + ); + + // 3. Load input tile + auto input_tile = make_static_distributed_tensor(); + input_window.load(input_tile); + + // 4. Create output tile + auto output_tile = make_static_distributed_tensor(); + + // 5. Process with sweep + sweep_tile(input_tile, [&](auto indices) { + DataType value = input_tile.get_element(indices); + DataType result = complex_computation(value); + output_tile.set_element(indices, result); + }); + + // 6. Store results + output_window.store(output_tile); + } + + +Summary +======= + +SweepTile provides clean and efficient iteration over distributed data: + +- **Efficiency**: Load once, use many times pattern +- **Simplicity**: Clean lambda-based iteration abstraction +- **Performance**: Zero overhead with perfect access patterns +- **Flexibility**: Various sweep patterns for different algorithms + +Key benefits: + +1. **Memory Bandwidth**: Optimal reuse of loaded data +2. **Register Pressure**: Keep hot data in fastest memory +3. **Code Clarity**: Express algorithms naturally +4. **Compiler Optimization**: Enable aggressive optimizations + +The sweep pattern is fundamental to high-performance GPU kernels, turning complex iteration patterns into simple, efficient operations. Combined with TileDistribution and TileWindow, sweep operations complete the toolkit for clean and performant GPU computing. diff --git a/docs/conceptual/ck_tile/swizzling_example.rst b/docs/conceptual/ck_tile/swizzling_example.rst new file mode 100644 index 0000000000..f74038c954 --- /dev/null +++ b/docs/conceptual/ck_tile/swizzling_example.rst @@ -0,0 +1,495 @@ +.. meta:: + :description: CK Tile memory swizzling with Morton ordering example + :keywords: CK Tile, swizzling, Morton ordering, Z-order curve, GPU optimization + +.. _ck_tile_swizzling_example: + +************************************** +Memory Swizzling with Morton Ordering +************************************** + +Overview +======== + +This chapter demonstrates a practical application of tensor descriptors for implementing memory swizzling patterns, specifically Morton ordering (Z-order curve) within tiles. Memory swizzling is used to optimize GPU memory access patterns and reduce :ref:`bank conflicts `. Morton ordering provides a space-filling curve that maintains spatial locality while enabling efficient parallel access. See :ref:`ck_tile_space_filling_curve` for more information about parallel access. 
+ +Morton ordering is widely used in: + +- **GPU Texture Memory**: Optimizing cache efficiency for 2D texture access +- **Matrix Operations**: Reducing memory bank conflicts in shared memory +- **Image Processing**: Improving locality for block-based algorithms +- **Scientific Computing**: Enhancing data access patterns for stencil operations + +Understanding Morton Ordering +============================= + +Morton ordering interleaves the bits of 2D coordinates to create a 1D ordering that preserves spatial locality. For a 2D coordinate (y, x), we split each coordinate into its binary bits and interleave them: + +- y = y₁y₀ (2 bits) +- x = x₁x₀ (2 bits) +- Morton index = y₁x₁y₀x₀ (4 bits) + +This creates a Z-shaped traversal pattern within each tile: + +.. code-block:: cpp + + // Morton encoding for 2D coordinates + template + __host__ __device__ index_t morton_encode_2d(index_t y, index_t x) { + index_t result = 0; + for (index_t i = 0; i < NumBits; ++i) { + index_t bit_y = (y >> i) & 1; + index_t bit_x = (x >> i) & 1; + result |= (bit_y << (2*i + 1)) | (bit_x << (2*i)); + } + return result; + } + + // Morton decoding back to 2D coordinates + template + __host__ __device__ void morton_decode_2d( + index_t morton_idx, + index_t& y, + index_t& x) + { + y = 0; + x = 0; + for (index_t i = 0; i < NumBits; ++i) { + y |= ((morton_idx >> (2*i + 1)) & 1) << i; + x |= ((morton_idx >> (2*i)) & 1) << i; + } + } + +Morton Pattern Analysis +----------------------- + +The Morton index layout in a 4×4 tile follows this pattern: + +.. code-block:: text + + Morton Index Layout: + 0 1 4 5 + 2 3 6 7 + 8 9 12 13 + 10 11 14 15 + +Bit pattern breakdown: + +.. code-block:: text + + (0,0) = (00, 00) → 0 = 0000 + (0,1) = (00, 01) → 1 = 0001 + (0,2) = (00, 10) → 4 = 0100 + (0,3) = (00, 11) → 5 = 0101 + (1,0) = (01, 00) → 2 = 0010 + (1,1) = (01, 01) → 3 = 0011 + (1,2) = (01, 10) → 6 = 0110 + (1,3) = (01, 11) → 7 = 0111 + +Stage 1: Tiling with UnmergeTransform +====================================== + +First, we split our texture into tiles using tensor descriptors (see :ref:`ck_tile_descriptors` and :ref:`ck_tile_transforms`). This creates a hierarchical structure: (Y_blk, y_in, X_blk, x_in). + +.. code-block:: cpp + + template + struct TiledTextureDescriptor { + static constexpr index_t NumTilesY = H / TileSize; + static constexpr index_t NumTilesX = W / TileSize; + + // Original descriptor for H×W texture + using BaseDesc = TensorDescriptor< + Sequence, + Sequence // Row-major layout + >; + + // Stage 1: Split into tiles + // Transform: [H, W] → [NumTilesY, TileSize, NumTilesX, TileSize] + using TiledDesc = decltype( + transform_tensor_descriptor( + BaseDesc{}, + make_tuple( + make_unmerge_transform(Sequence{}), + make_unmerge_transform(Sequence{}) + ), + Sequence<0>{}, // Y dimension + Sequence<1>{}, // X dimension + Sequence<0, 1>{}, // Y → (Y_blk, y_in) + Sequence<2, 3>{} // X → (X_blk, x_in) + ) + ); + }; + +Example usage for an 8×8 texture with 4×4 tiles: + +.. 
code-block:: cpp + + // Create tiled descriptor + using TiledDesc8x8 = TiledTextureDescriptor<8, 8, 4>::TiledDesc; + + // Access pattern: iterate tile by tile + template + __device__ void process_tiled_texture(const DataType* texture) { + TiledDesc8x8 desc; + + // Process each tile + for (index_t y_blk = 0; y_blk < 2; ++y_blk) { + for (index_t x_blk = 0; x_blk < 2; ++x_blk) { + // Process elements within tile + for (index_t y_in = 0; y_in < 4; ++y_in) { + for (index_t x_in = 0; x_in < 4; ++x_in) { + // Calculate offset using descriptor + index_t offset = desc.calculate_offset({ + y_blk, y_in, x_blk, x_in + }); + + DataType value = texture[offset]; + // Process value... + } + } + } + } + } + +Stage 2: Morton Ordering with MergeTransform +============================================ + +The key insight is that MergeTransform enables Morton ordering by reordering and merging coordinate bits. The transformation involves: + +1. Split coordinates into individual bits using UnmergeTransform +2. Reorder and merge bits using MergeTransform to create the Morton index + +This leverages the coordinate transformation system described in :ref:`ck_tile_coordinate_systems`. + +Mathematical Foundation +----------------------- + +.. code-block:: cpp + + template + struct MortonTransform { + static_assert(TileSize == 4, "This example assumes 4x4 tiles"); + + // Split 4 → (2, 2) for bit extraction + using SplitTransform = UnmergeTransform>; + + // Merge bits in Morton order: (y₀, x₀, y₁, x₁) → Morton + using MortonMergeTransform = MergeTransform>; + + // The merge operation computes: + // morton_idx = y₁×8 + x₁×4 + y₀×2 + x₀ + // This matches the bit interleaving pattern! + }; + +Complete Morton Implementation +------------------------------ + +Here's a complete implementation combining both stages: + +.. code-block:: cpp + + template + struct MortonSwizzledTexture { + static constexpr index_t NumTilesY = H / TileSize; + static constexpr index_t NumTilesX = W / TileSize; + + // Manual Morton implementation for reliability + __device__ static void apply_morton_swizzling( + const DataType* input, + DataType* output) + { + // Process each tile + for (index_t tile_y = 0; tile_y < NumTilesY; ++tile_y) { + for (index_t tile_x = 0; tile_x < NumTilesX; ++tile_x) { + // Apply Morton ordering within tile + for (index_t morton_idx = 0; morton_idx < TileSize * TileSize; ++morton_idx) { + // Decode Morton index to tile coordinates + index_t y_in, x_in; + morton_decode_2d<2>(morton_idx, y_in, x_in); + + // Calculate global coordinates + index_t global_y = tile_y * TileSize + y_in; + index_t global_x = tile_x * TileSize + x_in; + + // Calculate linear indices + index_t src_idx = global_y * W + global_x; + index_t dst_idx = (tile_y * NumTilesX + tile_x) * TileSize * TileSize + morton_idx; + + output[dst_idx] = input[src_idx]; + } + } + } + } + }; + +Memory Access Pattern Analysis +============================== + +An analysis of the benefits of Morton ordering for different access patterns: + +.. 
code-block:: cpp + + template + struct AccessPatternAnalyzer { + // Analyze spatial locality + __host__ static void analyze_morton_locality() { + printf("Morton Order Spatial Locality Analysis:\n"); + printf("Adjacent indices and their 2D distance:\n"); + + for (index_t i = 0; i < TileSize * TileSize - 1; ++i) { + index_t y1, x1, y2, x2; + morton_decode_2d<2>(i, y1, x1); + morton_decode_2d<2>(i + 1, y2, x2); + + index_t manhattan_dist = abs(y2 - y1) + abs(x2 - x1); + printf("Morton %2d→%2d: (%d,%d)→(%d,%d), distance: %d\n", + i, i+1, y1, x1, y2, x2, manhattan_dist); + } + } + + // Compare cache line usage + __host__ static void analyze_cache_efficiency() { + constexpr index_t CacheLineSize = 128; // bytes + constexpr index_t ElementSize = sizeof(float); + constexpr index_t ElementsPerCacheLine = CacheLineSize / ElementSize; + + printf("\nCache Efficiency Analysis:\n"); + printf("Cache line size: %d bytes (%d floats)\n", + CacheLineSize, ElementsPerCacheLine); + + // Row-major access + index_t row_major_lines = 0; + for (index_t y = 0; y < TileSize; ++y) { + for (index_t x = 0; x < TileSize; x += ElementsPerCacheLine) { + row_major_lines++; + } + } + + // Morton access + index_t morton_lines = 0; + index_t current_line = -1; + for (index_t i = 0; i < TileSize * TileSize; ++i) { + index_t y, x; + morton_decode_2d<2>(i, y, x); + index_t linear_idx = y * TileSize + x; + index_t cache_line = linear_idx / ElementsPerCacheLine; + + if (cache_line != current_line) { + morton_lines++; + current_line = cache_line; + } + } + + printf("Row-major: %d cache lines\n", row_major_lines); + printf("Morton: %d cache lines\n", morton_lines); + } + }; + +GPU Kernel Implementation +========================= + +A complete GPU kernel using Morton ordering for optimized memory access: + +.. code-block:: cpp + + template + __global__ void morton_optimized_kernel( + const DataType* __restrict__ input, + DataType* __restrict__ output, + index_t H, index_t W) + { + // Shared memory with Morton layout + __shared__ DataType smem[BlockSize * BlockSize]; + + // Thread and block indices + const index_t tid_x = threadIdx.x; + const index_t tid_y = threadIdx.y; + const index_t bid_x = blockIdx.x; + const index_t bid_y = blockIdx.y; + + // Global position + const index_t global_x = bid_x * BlockSize + tid_x; + const index_t global_y = bid_y * BlockSize + tid_y; + + // Load to shared memory with coalescing + if (global_x < W && global_y < H) { + smem[tid_y * BlockSize + tid_x] = input[global_y * W + global_x]; + } + __syncthreads(); + + // Process tiles with Morton ordering + constexpr index_t TilesPerBlock = BlockSize / TileSize; + + // Each thread processes one element in Morton order + const index_t tile_id = (tid_y / TileSize) * TilesPerBlock + (tid_x / TileSize); + const index_t morton_in_tile = (tid_y % TileSize) * TileSize + (tid_x % TileSize); + + // Decode Morton index + index_t y_in_tile, x_in_tile; + morton_decode_2d<2>(morton_in_tile, y_in_tile, x_in_tile); + + // Calculate position in shared memory + const index_t tile_y = tile_id / TilesPerBlock; + const index_t tile_x = tile_id % TilesPerBlock; + const index_t smem_y = tile_y * TileSize + y_in_tile; + const index_t smem_x = tile_x * TileSize + x_in_tile; + + // Process with Morton access pattern + DataType value = smem[smem_y * BlockSize + smem_x]; + + // Apply computation... 
+ value = compute_function(value); + + // Store result + if (global_x < W && global_y < H) { + output[global_y * W + global_x] = value; + } + } + +Bank Conflict Reduction +======================= + +Morton ordering is particularly effective for reducing shared memory bank conflicts (complementing the XOR preshuffle technique described in :ref:`ck_tile_lds_index_swapping`): + +.. code-block:: cpp + + template + struct BankConflictAnalysis { + static constexpr index_t NumBanks = 32; + static constexpr index_t BankWidth = 4; // bytes + + template + __host__ static void analyze_bank_conflicts( + const char* pattern_name, + AccessPattern access_func) + { + index_t bank_access[NumBanks] = {0}; + + // Simulate warp access + for (index_t tid = 0; tid < WarpSize; ++tid) { + index_t offset = access_func(tid); + index_t bank = (offset * sizeof(float) / BankWidth) % NumBanks; + bank_access[bank]++; + } + + // Find maximum conflict + index_t max_conflict = 0; + for (index_t bank = 0; bank < NumBanks; ++bank) { + max_conflict = max(max_conflict, bank_access[bank]); + } + + printf("%s: %d-way bank conflict\n", pattern_name, max_conflict); + } + + __host__ static void compare_access_patterns() { + printf("Bank Conflict Analysis for 4x4 Tile Access:\n"); + + // Row-major access + analyze_bank_conflicts("Row-major", [](index_t tid) { + return (tid / 4) * 4 + (tid % 4); + }); + + // Morton access + analyze_bank_conflicts("Morton", [](index_t tid) { + index_t y, x; + morton_decode_2d<2>(tid % 16, y, x); + return y * 4 + x; + }); + } + }; + +Practical Applications +====================== + +Real-world usage of Morton ordering in CK Tile: + +**1. Texture Cache Optimization** + +.. code-block:: cpp + + template + struct TextureCacheOptimized { + static constexpr index_t TextureTileSize = 8; + + __device__ static DataType sample_2d_morton( + const DataType* texture, + float u, float v, + index_t width, index_t height) + { + // Convert normalized coordinates to texel coordinates + index_t x = u * width; + index_t y = v * height; + + // Determine tile + index_t tile_x = x / TextureTileSize; + index_t tile_y = y / TextureTileSize; + + // Position within tile + index_t x_in_tile = x % TextureTileSize; + index_t y_in_tile = y % TextureTileSize; + + // Convert to Morton index + index_t morton_idx = morton_encode_2d<3>(y_in_tile, x_in_tile); + + // Calculate final offset + index_t tile_offset = (tile_y * (width / TextureTileSize) + tile_x) + * TextureTileSize * TextureTileSize; + + return texture[tile_offset + morton_idx]; + } + }; + +**2. Matrix Multiplication with Swizzled Tiles** + +For complete GEMM optimization techniques, see :ref:`ck_tile_gemm_optimization`. + +.. 
code-block:: cpp + + template + struct SwizzledGEMM { + __device__ static void load_tile_morton( + const DataType* matrix, + DataType* tile, + index_t row_offset, + index_t col_offset, + index_t ld) + { + // Load tile with Morton ordering for better LDS bank utilization + #pragma unroll + for (index_t i = 0; i < TileM * TileN; ++i) { + index_t row_in_tile, col_in_tile; + morton_decode_2d<3>(i, row_in_tile, col_in_tile); + + if (row_in_tile < TileM && col_in_tile < TileN) { + index_t global_row = row_offset + row_in_tile; + index_t global_col = col_offset + col_in_tile; + tile[i] = matrix[global_row * ld + global_col]; + } + } + } + }; + +Summary +======= + +Morton ordering with CK Tile provides memory optimization capabilities: + +- **Spatial Locality**: Z-order curve maintains 2D locality in 1D memory layout +- **Bank Conflict Reduction**: Distributed access patterns across memory banks +- **Cache Efficiency**: Better utilization of cache lines for 2D access patterns +- **Mathematical Framework**: Tensor descriptors express swizzling cleanly +- **Practical Implementation**: Bit manipulation provides reliable results + +Key implementation insights: + +1. **MergeTransform** is essential for expressing Morton bit interleaving +2. **Manual bit manipulation** provides reliable and efficient implementation +3. **Tiling + Morton** combines hierarchical locality with local optimization +4. **GPU-specific tuning** adapts patterns to hardware characteristics + +The tensor descriptor approach provides the mathematical framework for expressing these complex memory patterns, while practical implementations often use direct bit manipulation for efficiency and reliability. + +For more examples of practical CK Tile usage, see :ref:`ck_tile_convolution_example`. For the underlying buffer and tensor abstractions, see :ref:`ck_tile_buffer_views` and :ref:`ck_tile_tensor_views`. diff --git a/docs/conceptual/ck_tile/tensor_coordinates.rst b/docs/conceptual/ck_tile/tensor_coordinates.rst new file mode 100644 index 0000000000..4e9240b83c --- /dev/null +++ b/docs/conceptual/ck_tile/tensor_coordinates.rst @@ -0,0 +1,459 @@ +.. meta:: + :description: CK Tile tensor coordinates and MultiIndex documentation + :keywords: CK Tile, MultiIndex, tensor coordinates, GPU programming + +.. _ck_tile_tensor_coordinates: + +******************* +Tensor Coordinates +******************* + +Overview +======== + +Before diving into transforms and adaptors (see :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors`), it's essential to understand the basic coordinate system in CK Tile. MultiIndex is a container that extends the C++ array with additional operations for multi-dimensional indexing. It is the fundamental building block used throughout the system. + +MultiIndex serves as the common currency between different coordinate spaces (see :ref:`ck_tile_coordinate_systems`), enabling seamless transformation and navigation through complex tensor layouts. Every transform, adaptor, and descriptor in CK Tile operates on these coordinate containers. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "MultiIndex Structure" + MI["MultiIndex
Container for N integers"] + D0["Dimension 0"] + D1["Dimension 1"] + D2["Dimension 2"] + DN["Dimension N-1"] + end + + subgraph "Usage Context" + T["Transforms
"] + A["Adaptors
"] + TV["Tensors
"] + end + + MI --> D0 + MI --> D1 + MI --> D2 + MI --> DN + + T --> MI + A --> MI + TV --> MI + + style MI fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px + style D0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style D1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style D2 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style DN fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style T fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style A fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style TV fill:#ffebee,stroke:#d32f2f,stroke-width:2px + + + +.. image:: diagrams/tensor_coordinates_1.svg + :alt: Diagram + :align: center + +MultiIndex Implementation +========================= + +The C++ implementation provides both compile-time and runtime flexibility: + +.. code-block:: cpp + + // Basic MultiIndex structure + template + struct MultiIndex { + static constexpr index_t kNDim = NDim; + + // Storage for coordinate values + array data_; + + // Constructors + __host__ __device__ constexpr MultiIndex() : data_{} {} + + __host__ __device__ constexpr MultiIndex( + const array& values) : data_(values) {} + + // Element access + __host__ __device__ constexpr index_t& operator[](index_t i) { + return data_[i]; + } + + __host__ __device__ constexpr const index_t& operator[](index_t i) const { + return data_[i]; + } + + // Size query + __host__ __device__ static constexpr index_t size() { + return NDim; + } + }; + +Creating and Using MultiIndex +============================= + +CK Tile provides convenient factory functions for creating MultiIndex objects: + +.. code-block:: cpp + + #include + + __device__ void example_multiindex_usage() { + // Create 3D coordinate with runtime values + auto coord = make_multi_index(1, 2, 3); + + // Access dimensions + auto x = coord[0]; // Returns 1 + auto y = coord[1]; // Returns 2 + auto z = coord[2]; // Returns 3 + + // For compile-time coordinates, use number<> + auto coord_static = make_multi_index( + number<1>{}, number<2>{}, number<3>{} + ); + + // Create from tuple + auto shape = make_tuple(128, 256, 64); + auto coord2 = to_multi_index(shape); + + // Modify coordinate + auto new_coord = coord; + new_coord[0] = 5; // Set X to 5 + + // Use in tensor access + auto tensor = make_naive_tensor_view( + data_ptr, shape, strides + ); + + // Create tensor coordinate for access + auto tensor_coord = make_tensor_coordinate( + tensor.get_tensor_descriptor(), coord + ); + } + +For more advanced coordinate operations and movement patterns, see :ref:`ck_tile_coordinate_movement`. + +Compile-Time Optimization +------------------------- + +CK Tile leverages C++ templates for zero-overhead abstractions: + +.. code-block:: cpp + + // Compile-time MultiIndex operations + template + __host__ __device__ constexpr auto make_static_multi_index() { + return MultiIndex{array{Is...}}; + } + + // Example: Matrix access pattern + template + __device__ void optimized_matrix_access(float* matrix) { + // Compile-time coordinates + constexpr auto origin = make_static_multi_index<0, 0>(); + constexpr auto corner = make_static_multi_index(); + + // Loop unrolling with compile-time indices + #pragma unroll + for (index_t i = 0; i < M; ++i) { + #pragma unroll + for (index_t j = 0; j < N; ++j) { + auto coord = make_multi_index(i, j); + // Compiler can optimize based on known bounds + process_element(matrix[i * N + j]); + } + } + } + +MultiIndex in Coordinate Flow +============================= + +MultiIndex serves as the interface between user code and the transformation pipeline: + +.. 
+ Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart TB + subgraph CF ["Coordinate Flow"] + direction LR + UI["User Input
[1, 2, 3]"] --> MI["MultiIndex
Storage"] + MI --> TR["Transform
Processing"] + TR --> MO["MultiIndex
Output"] + MO --> TA["Tensor Access
element(coord)"] + end + + subgraph EX ["Example: 3D Tensor Access"] + direction LR + T3D["3D Tensor
shape=[4,5,6]"] --> COORD["MultiIndex(3, [1,2,3])"] + COORD --> ELEM["Element at
position [1,2,3]"] + end + + style UI fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style MI fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px + style MO fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px + style COORD fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + +.. image:: diagrams/tensor_coordinates_2.svg + :alt: Diagram + :align: center + +Common Usage Patterns +===================== + +Pattern 1: Tensor Iteration +--------------------------- + +.. code-block:: cpp + + template + __device__ void iterate_2d_tensor(DataType* tensor) { + // Iterate through tensor using MultiIndex + for (index_t i = 0; i < M; ++i) { + for (index_t j = 0; j < N; ++j) { + auto coord = make_multi_index(i, j); + + // Use coordinate for structured access + DataType& element = tensor[coord[0] * N + coord[1]]; + + // Process element + element = process_value(element); + } + } + } + +Pattern 2: Boundary Checking +---------------------------- + +.. code-block:: cpp + + template + __device__ bool is_valid_coordinate( + const MultiIndex& coord, + const MultiIndex& shape) + { + for (index_t i = 0; i < NDim; ++i) { + if (coord[i] < 0 || coord[i] >= shape[i]) { + return false; + } + } + return true; + } + + // Usage in kernel + __global__ void safe_tensor_kernel(float* tensor, index_t H, index_t W) { + auto coord = make_multi_index( + blockIdx.y * blockDim.y + threadIdx.y, + blockIdx.x * blockDim.x + threadIdx.x + ); + + auto shape = make_multi_index(H, W); + + if (is_valid_coordinate(coord, shape)) { + tensor[coord[0] * W + coord[1]] = compute_value(coord); + } + } + +Pattern 3: Transform Chaining +----------------------------- + +.. code-block:: cpp + + // Apply multiple transformations to coordinates + template + __device__ auto apply_transform_chain( + const MultiIndex<2>& input_coord, + const Transform1& t1, + const Transform2& t2) + { + // First transformation + auto intermediate = t1.calculate_bottom_index(input_coord); + + // Second transformation + auto final = t2.calculate_bottom_index(intermediate); + + return final; + } + +Advanced MultiIndex Operations +============================== + +Arithmetic Operations +--------------------- + +.. code-block:: cpp + + template + struct MultiIndexOps { + // Element-wise addition + __device__ static MultiIndex add( + const MultiIndex& a, + const MultiIndex& b) + { + MultiIndex result; + #pragma unroll + for (index_t i = 0; i < NDim; ++i) { + result[i] = a[i] + b[i]; + } + return result; + } + + // Scalar multiplication + __device__ static MultiIndex scale( + const MultiIndex& coord, + index_t factor) + { + MultiIndex result; + #pragma unroll + for (index_t i = 0; i < NDim; ++i) { + result[i] = coord[i] * factor; + } + return result; + } + + // Dot product (for linear indexing) + __device__ static index_t dot( + const MultiIndex& coord, + const MultiIndex& strides) + { + index_t result = 0; + #pragma unroll + for (index_t i = 0; i < NDim; ++i) { + result += coord[i] * strides[i]; + } + return result; + } + }; + +Specialized Coordinates +----------------------- + +.. 
code-block:: cpp + + // Thread coordinate helper + struct ThreadCoordinate { + __device__ static auto get_thread_coord_1d() { + return make_multi_index( + blockIdx.x * blockDim.x + threadIdx.x + ); + } + + __device__ static auto get_thread_coord_2d() { + return make_multi_index( + blockIdx.y * blockDim.y + threadIdx.y, + blockIdx.x * blockDim.x + threadIdx.x + ); + } + + __device__ static auto get_thread_coord_3d() { + return make_multi_index( + blockIdx.z * blockDim.z + threadIdx.z, + blockIdx.y * blockDim.y + threadIdx.y, + blockIdx.x * blockDim.x + threadIdx.x + ); + } + }; + +Integration with Tensor Operations +================================== + +MultiIndex is the foundation for all tensor operations in CK Tile (see :ref:`ck_tile_tensor_views` and :ref:`ck_tile_buffer_views` for tensor abstractions): + +.. code-block:: cpp + + template + __device__ void tensor_operation_example(TensorView& tensor) { + // Get tensor shape as MultiIndex + auto shape = tensor.get_tensor_descriptor().get_lengths(); + + // Create coordinate for center element + MultiIndex center; + #pragma unroll + for (index_t i = 0; i < TensorView::kNDim; ++i) { + center[i] = shape[i] / 2; + } + + // Access center element + auto center_value = tensor(center); + + // Create stencil pattern using MultiIndex + constexpr auto offsets = make_tuple( + make_multi_index(-1, 0), // North + make_multi_index( 1, 0), // South + make_multi_index( 0, -1), // West + make_multi_index( 0, 1) // East + ); + + // Apply stencil + auto sum = center_value; + static_for<0, 4, 1>{}([&](auto i) { + auto neighbor = MultiIndexOps<2>::add(center, get(offsets)); + if (is_valid_coordinate(neighbor, shape)) { + sum += tensor(neighbor); + } + }); + } + +Performance Considerations +========================== + +MultiIndex is designed for zero-overhead abstraction (see :ref:`ck_tile_gpu_basics` for GPU performance fundamentals): + +1. **Compile-Time Resolution**: When dimensions are known at compile time, all operations are inlined +2. **Register Allocation**: Small fixed-size arrays typically stay in registers +3. **Vectorization**: Compiler can vectorize operations on MultiIndex arrays +4. **Memory Layout**: Contiguous storage enables efficient cache usage + +.. code-block:: cpp + + // Performance-optimized coordinate operations + template + struct OptimizedCoordOps { + // Fused multiply-add for linear indexing + __device__ __forceinline__ static index_t + compute_offset(const MultiIndex& coord, + const MultiIndex& strides) + { + index_t offset = 0; + + // Unroll for small dimensions + if constexpr (NDim <= 4) { + #pragma unroll + for (index_t i = 0; i < NDim; ++i) { + offset = __fma_rn(coord[i], strides[i], offset); + } + } else { + // Partial unrolling for larger dimensions + #pragma unroll 4 + for (index_t i = 0; i < NDim; ++i) { + offset += coord[i] * strides[i]; + } + } + + return offset; + } + }; + +Summary +======= + +MultiIndex is the foundation of CK Tile's coordinate system: + +- **Simple Abstraction**: Container for N integers representing position +- **Universal Usage**: Every transform and adaptor operates on MultiIndex +- **Type-Safe**: Compile-time size and bounds checking in C++ +- **Zero-Overhead**: Template metaprogramming ensures no runtime cost +- **Flexible**: Supports both compile-time and runtime coordinates + +Understanding MultiIndex is crucial before moving to transforms and adaptors, as they all build upon this fundamental coordinate representation. 
MultiIndex is the common language that allows all CK Tile components to work together seamlessly. + +For the complete picture of how MultiIndex fits into the CK Tile coordinate system, see :ref:`ck_tile_coordinate_systems`. For practical usage in tile distribution, see :ref:`ck_tile_tile_distribution`. diff --git a/docs/conceptual/ck_tile/tensor_views.rst b/docs/conceptual/ck_tile/tensor_views.rst new file mode 100644 index 0000000000..0c46e1e593 --- /dev/null +++ b/docs/conceptual/ck_tile/tensor_views.rst @@ -0,0 +1,482 @@ +.. _ck_tile_tensor_views: + +Tensor Views - Multi-Dimensional Structure +========================================== + +Overview +-------- + +While :ref:`BufferView ` provides the foundation for raw memory access, TensorView adds multi-dimensional structure to flat memory regions. This abstraction bridges the gap between how developers conceptualize data and how that data is physically stored in linear memory. TensorView enables coordinate-based access patterns that match the natural structure of algorithms while maintaining the performance characteristics necessary for efficient GPU computation. + +TensorView presents different logical views of the same underlying memory without copying data. A single memory region can be viewed as a row-major matrix, a column-major matrix, or a transposed matrix, using different TensorView configurations. This zero-copy abstraction enables flexible transformations and access patterns while maintaining optimal memory bandwidth utilization. + +TensorView Architecture +----------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Memory Foundation" + Memory["Flat Memory Array
0 1 2 3 4 5 6 7 8 9 10 11"] + end + + subgraph "Access Layer" + BufferView["BufferView
Linear Memory Access"] + Descriptor["TensorDescriptor
Shape & Stride Info"] + end + + subgraph "Tensor Layer" + TensorView["TensorView
Multi-dimensional Access"] + end + + subgraph "Logical View" + Matrix["2D Matrix View
[3×4]
[[0,1,2,3]
[4,5,6,7]
[8,9,10,11]]"] + end + + Memory --> BufferView + Memory --> Descriptor + BufferView --> TensorView + Descriptor --> TensorView + TensorView --> Matrix + + style Memory fill:#d1fae5,stroke:#10b981,stroke-width:2px + style BufferView fill:#dbeafe,stroke:#3b82f6,stroke-width:2px + style Descriptor fill:#fed7aa,stroke:#f59e0b,stroke-width:2px + style TensorView fill:#fce7f3,stroke:#ec4899,stroke-width:2px + style Matrix fill:#e9d5ff,stroke:#9333ea,stroke-width:2px + + + + + + +.. image:: diagrams/tensor_views_1.svg + :alt: Diagram + :align: center + +The Foundation: BufferView and TensorDescriptor +------------------------------------------------ + +TensorView builds upon two fundamental components that work in concert to provide structured access to memory. The :ref:`BufferView ` component handles the low-level memory access, providing type-safe operations with address space awareness. The :ref:`TensorDescriptor ` component encodes the multi-dimensional structure, including shape information and stride patterns that determine how coordinates map to memory offsets. + +This separation of concerns enables optimizations. The BufferView can optimize for the specific memory space while the TensorDescriptor can encode complex access patterns without concern for the underlying memory type. Together, they provide a complete abstraction for multi-dimensional data access. + +C++ Implementation +------------------ + +**File**: ``include/ck_tile/core/tensor/tensor_view.hpp`` + +Creating TensorViews +~~~~~~~~~~~~~~~~~~~~ + +The creation of a TensorView involves combining a BufferView with a TensorDescriptor. This process can be done explicitly for maximum control or through convenience functions for common patterns: + +.. code-block:: cpp + + #include + #include + #include + + // The actual C++ template signature from tensor_view.hpp: + // template + // struct tensor_view + + __device__ void example_tensor_creation() + { + // Create a 3x4 matrix in global memory + float data[12] = {0,1,2,3,4,5,6,7,8,9,10,11}; + + // Method 1: Create buffer and descriptor separately + auto buffer = make_buffer_view(data, 12); + auto desc = make_tensor_descriptor( + make_tuple(3, 4), // shape: 3 rows, 4 columns + make_tuple(4, 1) // strides: row stride=4, col stride=1 + ); + + // Create tensor view + auto tensor = make_tensor_view(buffer, desc); + + // Method 2: Use convenience function for packed layout + auto tensor2 = make_naive_tensor_view_packed( + data, // pointer + make_tuple(3, 4) // shape (strides calculated automatically) + ); + + // Access element at (1, 2) + float value = tensor(make_tuple(1, 2)); // Returns 6 + + // Update element + tensor(make_tuple(2, 1)) = 99.0f; + } + +Coordinate-Based Access +~~~~~~~~~~~~~~~~~~~~~~~ + +The fundamental operation of TensorView is translating multi-dimensional coordinates into memory accesses. This translation happens through an advanced pipeline that maintains efficiency while providing flexibility: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "User Input" + Coord["Coordinate
(1, 2)"] + end + + subgraph "TensorView Processing" + Shape["Shape Check
row < 3?
col < 4?"] + Stride["Apply Strides
offset = 1×4 + 2×1"] + Buffer["BufferView Access
buffer[6]"] + end + + subgraph "Result" + Value["Value: 6"] + end + + Coord --> Shape + Shape -->|Valid| Stride + Stride --> Buffer + Buffer --> Value + + style Coord fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style Shape fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + style Stride fill:#dcfce7,stroke:#10b981,stroke-width:2px + style Buffer fill:#dbeafe,stroke:#3b82f6,stroke-width:2px + style Value fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + + +.. image:: diagrams/tensor_views_2.svg + :alt: Diagram + :align: center + +Memory Layouts and Strides +-------------------------- + +A key feature of TensorView is its ability to represent different memory layouts through stride manipulation. This capability enables zero-copy transformations that would otherwise require expensive memory operations: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Row-Major Layout (C-style)" + RM["Memory: [0,1,2,3,4,5,6,7,8,9,10,11]
Shape: (3,4)
Strides: (4,1)"] + RMMatrix["[[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]]"] + RM --> RMMatrix + end + + subgraph "Column-Major Layout (Fortran-style)" + CM["Memory: [0,3,6,9,1,4,7,10,2,5,8,11]
Shape: (3,4)
Strides: (1,3)"] + CMMatrix["[[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]]"] + CM --> CMMatrix + end + + subgraph "Custom Stride (Transposed View)" + TV["Memory: [0,1,2,3,4,5,6,7,8,9,10,11]
Shape: (4,3)
Strides: (1,4)"] + TVMatrix["[[0, 4, 8]
[1, 5, 9]
[2, 6, 10]
[3, 7, 11]]"] + TV --> TVMatrix + end + + style RM fill:#e0f2fe,stroke:#0284c7,stroke-width:2px + style CM fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + style TV fill:#f3e8ff,stroke:#9333ea,stroke-width:2px + + + + + + +.. image:: diagrams/tensor_views_3.svg + :alt: Diagram + :align: center + +Row-Major vs Column-Major Layouts +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The choice of memory layout has profound implications for performance. Row-major layout, where consecutive elements in a row are stored contiguously, optimizes for row-wise traversal. Column-major layout optimizes for column-wise traversal. CK's TensorView abstraction allows algorithms to work with their natural access patterns regardless of the underlying storage: + +.. code-block:: cpp + + __device__ void example_memory_layouts() + { + float data[12] = {0,1,2,3,4,5,6,7,8,9,10,11}; + + // Row-major layout (default) + auto row_major = make_naive_tensor_view_packed( + data, make_tuple(3, 4) + ); + // Strides: (4, 1) - moving one row advances by 4 elements + + // Column-major layout through custom strides + auto col_major = make_tensor_view( + make_buffer_view(data, 12), + make_tensor_descriptor( + make_tuple(3, 4), // shape + make_tuple(1, 3) // strides: row stride=1, col stride=3 + ) + ); + + // Transposed view (no data copy!) + auto transposed = make_tensor_view( + make_buffer_view(data, 12), + make_tensor_descriptor( + make_tuple(4, 3), // transposed shape + make_tuple(1, 4) // transposed strides + ) + ); + + // All three views access the same memory, just differently + // row_major(1,2) == col_major(2,1) == transposed(2,1) + } + +Advanced Operations +------------------- + +Slicing and Subviews +~~~~~~~~~~~~~~~~~~~~ + +TensorView supports advanced slicing operations that create new views of subsets of the data. These operations are essential for algorithms that process data in blocks or tiles. See :ref:`ck_tile_tile_window` for production use. + +.. code-block:: cpp + + __device__ void example_slicing_operations() + { + // Create a larger tensor + float data[100]; + auto tensor = make_naive_tensor_view_packed( + data, make_tuple(10, 10) + ); + + // Create a subview using transforms + // This would typically be done with tile_window in production code + auto subview = make_tensor_view( + tensor.get_buffer_view(), + transform_tensor_descriptor( + tensor.get_tensor_descriptor(), + make_tuple( + make_pass_through_transform(number<5>{}), // 5 rows + make_pass_through_transform(number<5>{}) // 5 columns + ), + make_tuple(number<2>{}, number<3>{}) // offset (2,3) + ) + ); + + // subview now represents a 5x5 region starting at (2,3) + } + +Vectorized Access +~~~~~~~~~~~~~~~~~ + +GPUs achieve maximum memory bandwidth through vectorized operations. TensorView provides native support for vector loads and stores. See :ref:`ck_tile_load_store_traits` for more details. + +.. 
code-block:: cpp + + __device__ void example_vectorized_access() + { + float data[256]; + auto tensor = make_naive_tensor_view_packed( + data, make_tuple(16, 16) + ); + + // Create coordinate for vectorized access + auto coord = make_tensor_coordinate( + tensor.get_tensor_descriptor(), + make_tuple(4, 0) // row 4, starting at column 0 + ); + + // Load 4 consecutive elements as float4 + using float4 = vector_type::type; + auto vec4 = tensor.get_vectorized_elements(coord, 0); + + // Process vector data + vec4.x *= 2.0f; + vec4.y *= 2.0f; + vec4.z *= 2.0f; + vec4.w *= 2.0f; + + // Store back + tensor.set_vectorized_elements(coord, 0, vec4); + } + +Performance Considerations +-------------------------- + +Memory Access Patterns +~~~~~~~~~~~~~~~~~~~~~~ + +The efficiency of TensorView operations depends on memory access patterns. Understanding these patterns is important for achieving optimal performance. See :ref:`ck_tile_gpu_basics` for hardware considerations. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Memory Access Patterns" + Seq["Sequential Access
(Good cache usage)"] + Stride["Strided Access
(May cause cache misses)"] + Random["Random Access
(Poor cache usage)"] + end + + subgraph "Optimization Strategies" + Opt1["Use row-major for row iteration"] + Opt2["Use col-major for column iteration"] + Opt3["Minimize stride between accesses"] + Opt4["Vectorize when possible"] + end + + Seq --> Opt1 + Stride --> Opt2 + Stride --> Opt3 + Random --> Opt4 + + style Seq fill:#d1fae5,stroke:#10b981,stroke-width:2px + style Stride fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + style Random fill:#fee2e2,stroke:#ef4444,stroke-width:2px + + + + + + +.. image:: diagrams/tensor_views_4.svg + :alt: Diagram + :align: center + +Compile-Time Optimization +~~~~~~~~~~~~~~~~~~~~~~~~~ + +CK's TensorView leverages compile-time optimization to achieve zero-overhead abstraction. When tensor dimensions and strides are known at compile time, the entire coordinate-to-offset calculation can be resolved during compilation: + +.. code-block:: cpp + + // Compile-time known dimensions enable optimization + constexpr auto shape = make_tuple(number<256>{}, number<256>{}); + constexpr auto strides = make_tuple(number<256>{}, number<1>{}); + + auto tensor = make_tensor_view( + buffer, + make_tensor_descriptor(shape, strides) + ); + + // This access compiles to a single memory instruction + constexpr auto coord = make_tuple(number<5>{}, number<10>{}); + auto value = tensor(coord); // Offset calculated at compile time + +TensorView vs BufferView +------------------------ + +Understanding when to use TensorView versus BufferView is crucial for writing efficient code: + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "BufferView" + BV1["Linear indexing only"] + BV2["buffer[5]"] + BV3["No shape information"] + BV4["Direct memory access"] + end + + subgraph "TensorView" + TV1["Multi-dimensional indexing"] + TV2["tensor(1, 2)"] + TV3["Shape-aware operations"] + TV4["Coordinate transformations"] + end + + subgraph "Use Cases" + UC1["BufferView: Low-level memory ops"] + UC2["TensorView: Matrix/tensor algorithms"] + end + + BV1 --> UC1 + TV1 --> UC2 + + style BV1 fill:#dbeafe,stroke:#3b82f6,stroke-width:2px + style TV1 fill:#fce7f3,stroke:#ec4899,stroke-width:2px + + + + + + +.. image:: diagrams/tensor_views_5.svg + :alt: Diagram + :align: center + +BufferView excels at raw memory operations where linear access is natural or where the overhead of coordinate calculation would be prohibitive. TensorView is best suited for algorithms that operate in terms of multi-dimensional coordinates, such as matrix operations, image processing, or tensor contractions. + +Integration with Tile Distribution +---------------------------------- + +TensorView serves as the foundation for :ref:`tile distribution's ` higher-level abstractions. When combined with :ref:`tile windows ` and distribution patterns, TensorView enables the automatic generation of efficient access patterns: + +.. code-block:: cpp + + // TensorView provides the base abstraction + auto tensor_view = make_naive_tensor_view_packed( + global_memory, make_tuple(M, N) + ); + + // Tile window builds on TensorView for distributed access + auto tile_window = make_tile_window( + tensor_view, + tile_shape, + origin, + distribution + ); + + // The distribution automatically generates optimal access patterns + auto distributed_tensor = tile_window.load(); + +Summary +------- + +TensorView bridges the gap between logical multi-dimensional data structures and physical memory layout. 
Through its advanced design, TensorView provides: + +**Multi-dimensional Indexing**: Natural coordinate-based access to data, matching how algorithms conceptualize their operations. This abstraction eliminates error-prone manual index calculations while maintaining performance. + +**Flexible Memory Layouts**: Support for row-major, column-major, and custom stride patterns enables algorithms to work with data in its most natural form. Zero-copy transformations like transposition become stride manipulations. + +**Zero-Copy Views**: The ability to create different logical views of the same physical memory enables flexible transformations without the overhead of data movement. This capability is essential for efficient GPU programming where memory bandwidth is often the limiting factor. + +**Type Safety**: Dimensions and memory spaces are encoded in the type system, catching errors at compile time rather than runtime. This safety comes without performance overhead thanks to template metaprogramming. + +**Seamless Integration**: TensorView works harmoniously with :ref:`BufferView ` for low-level access and serves as the foundation for higher-level abstractions like :ref:`tile windows ` and :ref:`distributed tensors `. + +The abstraction enables writing dimension-agnostic algorithms while maintaining high performance through compile-time optimizations. + +Next Steps +---------- + +Continue to :ref:`ck_tile_coordinate_systems` to understand the mathematical foundation of coordinate transformations in CK Tile. diff --git a/docs/conceptual/ck_tile/terminology.rst b/docs/conceptual/ck_tile/terminology.rst new file mode 100644 index 0000000000..7d5fc87fe9 --- /dev/null +++ b/docs/conceptual/ck_tile/terminology.rst @@ -0,0 +1,383 @@ +.. _ck_tile_terminology: + +Terminology Reference - Key Concepts and Definitions +==================================================== + +Overview +-------- + +The Composable Kernel framework introduces concepts and abstractions that form the foundation of its approach to high-performance GPU computing. This terminology reference serves as a comprehensive guide to the language of CK, providing detailed explanations of each term along with practical examples of their usage in C++ code. + +The terminology of CK reflects its layered architecture, with concepts building upon one another in a logical progression. From the fundamental notion of tiles and distributions to the compile-time coordinate transformation systems, each term represents a carefully designed abstraction that serves a specific purpose in the overall framework. This reference is organized to mirror this conceptual hierarchy, starting with core concepts and progressing through increasingly specialized terminology. + +As you explore this reference, you'll notice that many terms are interconnected, reflecting the holistic nature of the CK design. A tile is not just a block of data but a fundamental unit of work distribution. A distribution is not merely a pattern but a mathematical framework for optimal resource utilization. These interconnections are intentional and understanding them is crucial for effective use of the framework. + +Core Concepts +------------- + +Tile +~~~~ +The concept of a tile represents the fundamental unit of data organization in the CK framework. A tile is a contiguous block of data that is processed as a cohesive unit by a coordinated group of threads. This abstraction serves multiple critical purposes in achieving high performance on GPU architectures. 
By organizing data into tiles, the framework ensures that memory accesses exhibit spatial locality, enabling efficient use of cache hierarchies. The tile size is chosen to balance several competing factors: it must be large enough to amortize the overhead of memory transactions, yet small enough to fit within the limited on-chip memory resources. Furthermore, tiles are designed to align with the :ref:`GPU's execution model <ck_tile_gpu_basics>`, ensuring that threads within a warp access contiguous memory locations for optimal bandwidth utilization.

**C++ Usage**: ``using TileShape = sequence<256, 256>;``

Distribution
~~~~~~~~~~~~
The distribution pattern represents one of the most important compile-time abstractions in the CK framework, defining the precise mapping between logical data elements and the physical processing resources that will operate on them. A distribution is far more than an assignment scheme; it embodies a strategy for achieving optimal performance on GPU hardware. The distribution determines which threads access which data elements, how those accesses are ordered to maximize memory bandwidth, and how intermediate results are shared between cooperating threads. By encoding these decisions at compile time, distributions enable the generation of highly optimized code that respects hardware constraints while maintaining algorithmic clarity. For a detailed exploration of distribution concepts, see :ref:`ck_tile_distribution`.

**C++ Type**: ``tile_distribution<...>``

Encoding
~~~~~~~~
An encoding in CK represents a compile-time specification that captures the strategy for distributing tensor data across GPU processing elements. This specification is not merely a configuration but a mathematical description of the transformation between coordinate spaces. The encoding defines the hierarchical decomposition of work, the mapping between thread indices and data elements, and the patterns by which threads cooperate to process their assigned data. By expressing these concepts as compile-time constants, encodings enable aggressive compiler optimizations while ensuring that distribution strategies can be verified for correctness before execution.

**C++ Type**: ``tile_distribution_encoding<...>``

Coordinate Spaces
-----------------

For a comprehensive mathematical treatment of coordinate systems, see :ref:`ck_tile_coordinate_systems`.

P-Space (Partition Space)
~~~~~~~~~~~~~~~~~~~~~~~~~
The Partition Space, or P-space, represents the fundamental abstraction for identifying processing elements within the GPU's execution hierarchy. This coordinate space captures the multi-level organization of GPU computation, from individual threads to warps to thread blocks. P-space typically manifests as either a one-dimensional space containing only lane identifiers for simple distributions, or a two-dimensional space incorporating both warp and lane identifiers for more complex hierarchical distributions. The significance of P-space extends beyond mere thread identification; it forms the foundation for all work distribution decisions, determining which processing elements will collaborate on specific data tiles and how they will coordinate their efforts.

The dimensions of P-space directly reflect the hardware's execution model. In a one-dimensional P-space, threads are identified solely by their lane ID within a warp, suitable for algorithms where inter-warp coordination is minimal.
Two-dimensional P-space adds warp-level coordination, enabling tiling strategies that leverage both intra-warp and inter-warp parallelism. The values in P-space are always hardware thread indices, providing a direct mapping to the physical execution resources.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // Get current thread's P coordinates
+   auto p_idx = Distribution::_get_partition_index();
+
+Y-Space (Yield Space)
+~~~~~~~~~~~~~~~~~~~~~
+The Yield Space, or Y-space, embodies the logical structure of computation within each tile, representing the pattern by which threads traverse their assigned data. Unlike P-space, which identifies threads, Y-space defines what each thread does with its assigned work. This abstraction enables the expression of complex access patterns, from simple linear traversals to space-filling curves, in a hardware-independent manner. The dimensionality of Y-space varies with the algorithm's requirements, typically ranging from two dimensions for matrix operations to four or more for complex tensor contractions.
+
+Y-space serves as the primary iteration space for computational kernels. When a thread processes its assigned tile, it iterates through Y-space coordinates, with each coordinate mapping to specific data elements within the tile. This abstraction enables critical optimizations: the Y-space traversal order can be designed to maximize data reuse, minimize register pressure, or optimize for specific hardware characteristics, all without changing the fundamental algorithm.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // Iterate over Y-space
+   sweep_tile(tensor, [](auto y_idx) { /*...*/ });
+
+X-Space (Physical Tensor Space)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Physical Tensor Space, or X-space, represents the ground truth of data organization: the actual coordinates within the global tensor. This space directly corresponds to how data is laid out in memory, with dimensions matching those of the tensor being processed. For a matrix, X-space is two-dimensional with row and column coordinates. For a 4D convolution tensor, X-space encompasses batch, channel, height, and width dimensions. X-space serves as the target of the coordinate transformation pipeline, where abstract thread and pattern coordinates are converted into concrete memory addresses.
+
+The relationship between X-space and physical memory is direct but not necessarily trivial. While X-space coordinates identify logical positions within a tensor, the actual memory layout may involve padding, striding, or other transformations for alignment and performance. The CK framework handles these low-level details transparently, allowing algorithms to work with logical X-space coordinates while ensuring efficient physical memory access.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // Calculate X coordinates from P+Y
+   auto x_idx = distribution.calculate_index(p_idx);
+
+R-Space (Replication Space)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Replication Space, or R-space, introduces a mechanism for expressing redundant computation patterns that enhance performance through data sharing. Unlike the other coordinate spaces, which map to unique data elements, R-space enables multiple processing elements to compute the same values, facilitating efficient communication patterns.
This replication serves multiple purposes: it can reduce global memory traffic by computing values locally rather than loading them, enable efficient reduction operations by providing private workspace for each thread group, and facilitate complex data exchange patterns that would otherwise require expensive synchronization.
+
+R-space dimensions are optional and algorithm-specific. A matrix multiplication might use R-space to replicate portions of the input matrices across thread groups, enabling each group to compute partial products independently. The framework automatically manages the complexities of replication, including the allocation of private storage and the coordination of replicated computations.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // R-dimensions in encoding
+   using Encoding = tile_distribution_encoding<
+       sequence<2>,  // rs_lengths: 2-way replication
+       /*...*/
+   >;
+
+D-Space (Data Space)
+~~~~~~~~~~~~~~~~~~~~
+The Data Space, or D-space, represents the final stage of the coordinate transformation pipeline, namely the linearization of multi-dimensional tile data for efficient storage in thread-local registers. This one-dimensional space serves a critical role in managing the GPU's most precious resource: register files. By transforming the potentially complex Y-space coordinates into a linear D-space index, the framework enables efficient register allocation and access patterns that minimize register bank conflicts and maximize instruction-level parallelism.
+
+The transformation from Y-space to D-space is more than a simple flattening operation. It incorporates register layout strategies that consider the GPU's register file organization, the kernel's register pressure, and the access patterns of the computation. This transformation ensures that frequently accessed elements are kept in registers, that register bank conflicts are minimized, and that the compiler can generate efficient code for register access.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // Y-to-D descriptor linearizes storage
+   auto d_idx = ys_to_d_descriptor.calculate_offset(y_idx);
+
+Dimension Types
+---------------
+
+H-Dimensions (Hierarchical Dimensions)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The concept of Hierarchical Dimensions, or H-dimensions, represents one of the key aspects of the CK framework's approach to work distribution. These dimensions encode a multi-level decomposition strategy that mirrors the hierarchical nature of GPU hardware, from individual vector operations up through threads, warps, and thread blocks. Each H-dimension group captures how a single tensor dimension is partitioned across these hardware levels, enabling fine-grained control over data access patterns and computational efficiency.
+
+The structure of H-dimensions follows a specific pattern that reflects the GPU's execution hierarchy. Each H-dimension is expressed as a sequence of factors, where each factor corresponds to a specific level of the hierarchy. Consider the example ``sequence<4, 2, 8, 4>``. This seemingly simple sequence encodes a complete distribution strategy: the rightmost factor (4) represents vector width, indicating that each memory operation processes 4 elements simultaneously. Moving left, the factor 8 indicates that 8 threads within a warp collaborate on the data. The factor 2 specifies that 2 warps within a block work together. Finally, the leftmost factor 4 indicates that each thread performs 4 iterations, enabling instruction-level parallelism and register reuse.
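+
+To make this arithmetic concrete, the short sketch below multiplies out the per-level factors of the ``sequence<4, 2, 8, 4>`` example; the variable names are illustrative only and are not part of the CK Tile API.
+
+.. code-block:: cpp
+
+   // Per-level factors of the example H-dimension sequence<4, 2, 8, 4>,
+   // read from left to right: repeat, warps per block, threads per warp, vector width.
+   constexpr int repeat           = 4;
+   constexpr int warps_per_block  = 2;
+   constexpr int threads_per_warp = 8;
+   constexpr int vector_width     = 4;
+
+   // Elements covered along this tensor dimension by one thread block: 4 * 2 * 8 * 4 = 256.
+   constexpr int elements_per_dim = repeat * warps_per_block * threads_per_warp * vector_width;
+   static_assert(elements_per_dim == 256, "one block tile spans 256 elements in this dimension");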
+
+This hierarchical decomposition enables critical optimizations. By explicitly encoding the distribution strategy at compile time, the framework can generate code that matches the hardware's capabilities. The vector width aligns with the GPU's memory transaction size. The thread count per warp matches the hardware's SIMD width. The warp count per block balances parallelism with resource constraints. The repetition factor enables loop unrolling and software pipelining. Together, these factors create a distribution strategy that achieves near-optimal performance.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   using HsLengthss = tuple<
+       sequence<4, 2, 8, 4>,  // H0: M dimension
+       sequence<4, 2, 8, 4>   // H1: N dimension
+   >;
+
+RH-Dimensions (R + H Dimensions Combined)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The RH-dimensions represent the unified coordinate space that combines both replication (R) and hierarchical (H) dimensions into a single, coherent framework. This combined space serves as the internal representation used by the coordinate transformation machinery, enabling seamless handling of both replicated and non-replicated data patterns. The unification of these dimensions simplifies the mathematical framework while maintaining the flexibility to express complex distribution strategies.
+
+Within the RH-dimension framework, coordinates are identified by two components: major and minor indices. The major index identifies which dimension group a coordinate belongs to, with 0 reserved for R-dimensions and subsequent values (1, 2, ...) identifying H-dimension groups. The minor index specifies the position within the identified group. This two-level addressing scheme enables efficient navigation through the combined coordinate space while maintaining clear separation between replication and hierarchical decomposition strategies.
+
+The power of RH-dimensions becomes apparent when considering complex algorithms that require both data replication and hierarchical distribution. By providing a unified coordinate system, the framework can express transformations that simultaneously handle replicated data sharing and hierarchical work distribution, all within a single mathematical formalism. This unification is key to achieving both expressiveness and efficiency in the CK framework.
+
+Transformations
+---------------
+
+Adaptor
+~~~~~~~
+An adaptor in the CK framework represents a composed chain of coordinate transformations that bridges different coordinate spaces. Rather than simple one-to-one mappings, adaptors embody complex mathematical transformations that can involve permutations, embeddings, projections, and non-linear mappings. These transformations are composed at compile time, enabling the generation of highly optimized code that performs the complete transformation in a single step without intermediate representations. For detailed information about adaptors and their implementation, see :ref:`ck_tile_adaptors`.
+
+The framework provides several specialized adaptor types, each serving a specific role in the coordinate transformation pipeline. The ``ps_ys_to_xs_adaptor`` performs the critical transformation from processing element and yield space coordinates to physical tensor coordinates, implementing the core logic of tile distribution. This adaptor encodes decisions about how threads are assigned to data, how data is traversed within each thread's assignment, and how these patterns map to the global tensor layout.
Similarly, the ``ys_to_d_adaptor`` handles the transformation from multi-dimensional yield space to linearized data space, optimizing the layout of data in thread-local registers. + +The power of adaptors lies in their composability. Complex transformations can be built by chaining simpler adaptors, with the framework automatically optimizing the composition. This design enables the expression of advanced access patterns—such as transposed access, strided access, or space-filling curves—through the composition of elementary transformations. The compile-time nature of this composition ensures zero runtime overhead while maintaining mathematical clarity. + +**C++ Type**: ``tensor_adaptor<...>`` + +Descriptor +~~~~~~~~~~ +A descriptor in CK provides a complete specification of tensor layout, encompassing not just the logical structure of the data but also all transformations and physical memory layout details. This comprehensive specification serves as the contract between different components of the system, ensuring that all parts of a kernel have a consistent view of how data is organized and accessed. Descriptors combine multiple aspects of tensor representation: the logical shape and dimensions, the physical memory layout including padding and alignment, the coordinate transformations for different access patterns, and optimization hints for the compiler. For comprehensive coverage of descriptors, see :ref:`ck_tile_descriptors`. + +The sophistication of descriptors enables them to represent complex data layouts that arise in real-world applications. A descriptor might specify that a logically 4D tensor is physically stored with padding for alignment, uses a custom stride pattern for the channel dimension, and should be accessed using a space-filling curve for optimal cache utilization. All these details are encoded in the descriptor's type, enabling compile-time verification and optimization. + +Descriptors play a crucial role in achieving performance portability. By abstracting the details of data layout behind a well-defined interface, descriptors enable algorithms to be written once and automatically adapted to different data layouts. This abstraction is particularly valuable when dealing with different hardware architectures that may have different alignment requirements, cache line sizes, or memory access patterns. + +**C++ Type**: ``tensor_descriptor<...>`` + +Operations +---------- + +Load Tile +~~~~~~~~~ +The load tile operation represents a fundamental building block of GPU kernel design in the CK framework, orchestrating the complex process of transferring data from global memory to thread-local registers. This operation is far more advanced than a simple memory copy—it implements the complete distribution strategy encoded in the tile distribution, ensuring that each thread loads exactly the data it needs for its portion of the computation. The load operation automatically handles memory coalescing to maximize bandwidth utilization, coordinates between threads to avoid redundant loads, manages boundary conditions for tiles that extend beyond tensor bounds, and optimizes the access pattern based on the specific distribution strategy. + +The efficiency of the load tile operation stems from its deep integration with the distribution framework. By knowing at compile time exactly which threads will access which data elements, the operation can generate optimal memory access patterns that fully utilize the GPU's memory subsystem. 
For matrix multiplication, this might mean loading data in a pattern that ensures perfect coalescing. For convolution, it might involve complex patterns that minimize the number of redundant loads while respecting the GPU's cache hierarchy. + +**C++ Function**: ``tile_window.load()`` + +Store Tile +~~~~~~~~~~ +The store tile operation provides the complementary functionality to load tile, transferring computed results from thread-local registers back to global memory. Like its counterpart, the store operation implements optimized strategies that go beyond simple memory writes. It ensures that writes are coalesced for maximum bandwidth efficiency, coordinates between threads to handle overlapping write regions correctly, manages atomic operations when multiple threads write to the same location, and optimizes write patterns to minimize memory traffic. + +The store operation must handle additional complexities compared to loads. While loads can often ignore synchronization issues (reading stale data is usually harmless), stores must ensure correctness when multiple threads write to overlapping regions. The framework provides different store modes for different scenarios: exclusive stores where each element is written by exactly one thread, atomic stores where multiple threads may update the same element, and reduction stores where partial results are accumulated. The choice of store mode is encoded in the distribution strategy and verified at compile time. + +**C++ Function**: ``tile_window.store(tile)`` + +Sweep Tile +~~~~~~~~~~ +The sweep tile operation embodies a key programming paradigm for distributed tensor computation, providing a high-level iteration abstraction over the complex distribution patterns. Rather than requiring manual index calculations and nested loops, sweep tile automatically visits each element in a distributed tensor exactly once, invoking a user-provided function with the appropriate coordinates. This abstraction hides the complexity of the distribution while enabling advanced optimizations such as automatic loop unrolling, software pipelining, and register rotation. + +The implementation of sweep tile leverages the compile-time knowledge of the distribution pattern to generate highly optimized iteration code. For simple distributions, this might result in a single unrolled loop. For complex hierarchical distributions, it might generate nested loops with carefully chosen iteration orders that maximize data reuse and minimize register pressure. The beauty of the abstraction is that these optimizations happen transparently—the user simply provides the computation to perform on each element, and the framework handles the rest. + +**C++ Function**: ``sweep_tile(tensor, lambda)`` + +Shuffle Tile +~~~~~~~~~~~~ +The shuffle tile operation provides efficient intra-warp communication, enabling threads within a warp to exchange data without going through shared memory. This operation leverages the GPU's hardware shuffle instructions, which allow any thread in a warp to read registers from any other thread in the same warp. Shuffle operations are particularly valuable for reduction operations, transpose operations within a warp, and collaborative loading patterns where threads cooperate to load contiguous data and then redistribute it according to the computation pattern. + +The framework provides various shuffle patterns optimized for different use cases. Butterfly shuffles enable efficient reductions and FFT-like operations. 
Broadcast shuffles allow one thread to share data with all others in the warp. Rotation shuffles enable cyclic data exchange patterns. The shuffle tile operation automatically selects the appropriate hardware instructions based on the data type and shuffle pattern, ensuring optimal performance while maintaining portability across different GPU architectures.
+
+**C++ Function**: ``shuffle_tile(tensor, shuffle_pattern)``
+
+Memory Concepts
+---------------
+
+Coalescing
+~~~~~~~~~~
+The property where adjacent threads access adjacent memory locations, maximizing memory bandwidth utilization.
+
+Bank Conflict
+~~~~~~~~~~~~~
+A performance degradation that occurs when multiple threads in a warp access different addresses in the same memory bank. For detailed information about bank conflicts and mitigation strategies, see :ref:`ck_tile_lds_bank_conflicts`.
+
+Vectorization
+~~~~~~~~~~~~~
+The technique of loading/storing multiple elements in a single memory transaction.
+
+**C++ Example**:
+
+.. code-block:: cpp
+
+   // Vector load of 4 elements
+   using float4 = vector_type<float, 4>::type;
+   float4 data = tensor_view.template get_vectorized_elements<4>(x_idx);
+
+Distribution Components
+-----------------------
+
+Window
+~~~~~~
+A view into a subset of a tensor that respects the distribution pattern. For detailed information about tile windows and their usage, see :ref:`ck_tile_tile_window`.
+
+**C++ Type**: ``tile_window<...>``
+
+Static Distributed Tensor
+~~~~~~~~~~~~~~~~~~~~~~~~~
+A thread-local tensor stored in registers, distributed according to a tile distribution. For in-depth coverage of static distributed tensors, see :ref:`ck_tile_static_distributed_tensor`.
+
+**C++ Type**: ``static_distributed_tensor<...>``
+
+Spans
+~~~~~
+Iteration ranges over distributed dimensions, used by sweep operations.
+
+**C++ Type**: ``tile_distributed_span<...>``
+
+GPU Hardware Terms
+------------------
+
+Warp
+~~~~
+A group of threads that execute in lockstep (a wavefront: 32 or 64 threads, depending on the AMD GPU architecture).
+
+Lane
+~~~~
+An individual thread within a warp, indexed from 0 to the warp size minus 1.
+
+Block
+~~~~~
+A group of warps that can cooperate through shared memory.
+
+Grid
+~~~~
+The complete set of blocks launched for a kernel.
+
+Template Parameters
+-------------------
+
+sequence<...>
+~~~~~~~~~~~~~
+A compile-time integer sequence used to specify dimensions and lengths.
+
+**Example**: ``sequence<256, 256>`` for a 256×256 tile
+
+tuple<...>
+~~~~~~~~~~
+A heterogeneous collection of types, often used for grouping sequences.
+
+**Example**: ``tuple<sequence<4,4>, sequence<4,4>>``
+
+number<N>
+~~~~~~~~~
+A compile-time integer constant.
+
+**Example**: ``number<16>`` represents the value 16
+
+Optimization Terms
+------------------
+
+Register Spilling
+~~~~~~~~~~~~~~~~~
+When a kernel uses more registers than available, causing data to spill to slower memory.
+
+Occupancy
+~~~~~~~~~
+The ratio of active warps to maximum possible warps on a GPU multiprocessor.
+
+Memory Bandwidth Utilization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The percentage of theoretical memory bandwidth achieved by a kernel.
+
+Instruction-Level Parallelism (ILP)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The ability to execute multiple independent instructions simultaneously.
+
+Common Patterns
+---------------
+
+GEMM (General Matrix Multiplication)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A fundamental operation where C = αA×B + βC. For a complete optimization case study, see :ref:`ck_tile_gemm_optimization`.
+
+Reduction
+~~~~~~~~~
+An operation that combines multiple values into a single result (e.g., sum, max).
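+
+Reductions in CK Tile kernels are typically composed from warp-level shuffles followed by a block-level combine through shared memory. The following is a minimal, illustrative sketch of a warp-level sum reduction written with the generic HIP/CUDA ``__shfl_down`` intrinsic and the built-in ``warpSize``; it is an assumption of how such a helper can look, not the CK Tile implementation.
+
+.. code-block:: cpp
+
+   // Sketch: sum the value held by every lane of the current warp.
+   // Each iteration halves the number of distinct partial sums; lane 0 ends up with the total.
+   __device__ float warp_reduce_sum(float val)
+   {
+       for (int offset = warpSize / 2; offset > 0; offset /= 2)
+       {
+           // Add the partial sum of the lane "offset" positions above this one.
+           val += __shfl_down(val, offset);
+       }
+       return val; // meaningful in lane 0
+   }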
+
+Broadcast
+~~~~~~~~~
+An operation that replicates a value across multiple processing elements.
+
+Transpose
+~~~~~~~~~
+An operation that swaps dimensions of a tensor.
+
+Performance Metrics
+-------------------
+
+FLOPS (Floating-Point Operations Per Second)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Measure of computational throughput.
+
+Bandwidth
+~~~~~~~~~
+Rate of data transfer, typically measured in GB/s.
+
+Latency
+~~~~~~~
+Time delay between issuing an operation and its completion.
+
+Throughput
+~~~~~~~~~~
+Rate of operation completion, often measured in operations per second.
+
+Usage Examples
+--------------
+
+Creating a Distribution
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: cpp
+
+   // Define encoding
+   using MyEncoding = tile_distribution_encoding<
+       sequence<>,                           // No replication
+       tuple<sequence<4,2,8,4>,              // M dimension
+             sequence<4,2,8,4>>,             // N dimension
+       tuple<sequence<1,2>, sequence<1,2>>,  // P mappings
+       tuple<sequence<1,1>, sequence<2,2>>,  // P minor
+       sequence<1,1,2,2>,                    // Y major
+       sequence<0,3,0,3>                     // Y minor
+   >;
+
+   // Create distribution
+   auto distribution = make_static_tile_distribution(MyEncoding{});
+
+Using Tile Window
+~~~~~~~~~~~~~~~~~
+
+.. code-block:: cpp
+
+   // Create window
+   auto window = make_tile_window(
+       tensor_view,
+       TileShape{},
+       origin,
+       distribution
+   );
+
+   // Load-compute-store pattern
+   auto tile = window.load();
+   sweep_tile(tile, compute_func);
+   window.store(tile);
+
+Related Documentation
+---------------------
+
+- :ref:`ck_tile_introduction` - Introduction and motivation
+- :ref:`ck_tile_buffer_views` - Raw memory access
+- :ref:`ck_tile_distribution` - Core distribution concepts
+
+
diff --git a/docs/conceptual/ck_tile/thread_mapping.rst b/docs/conceptual/ck_tile/thread_mapping.rst
new file mode 100644
index 0000000000..cff4f727ff
--- /dev/null
+++ b/docs/conceptual/ck_tile/thread_mapping.rst
@@ -0,0 +1,551 @@
+.. meta::
+   :description: CK Tile thread mapping - connecting mathematical abstractions to GPU hardware
+   :keywords: CDNA, RDNA, ROCm, CK, Composable Kernel, thread mapping, GPU programming
+
+.. _ck_tile_thread_mapping:
+
+********************************************************************
+Thread Mapping - Connecting to Hardware
+********************************************************************
+
+This section explains how threads get their unique IDs, how those IDs map to specific data, and how the mathematical abstractions connect to the physical hardware.
+
+Thread mapping is the bridge between the mathematical abstraction and the physical hardware that executes the code. Thread mapping works closely with :ref:`ck_tile_distribution` to ensure optimal performance.
+
+Thread Identification and Partition Indices
+===========================================
+
+Before threads can process data, they need to know who they are and what work they're responsible for.
+
+Hardware Thread Identification
+------------------------------
+
+In GPU hardware, threads are organized hierarchically:
+
+.. 
code-block:: cpp + + // CUDA/HIP thread identification + __device__ void get_thread_coordinates() + { + // Grid-level coordinates (which block) + int block_x = blockIdx.x; + int block_y = blockIdx.y; + int block_z = blockIdx.z; + + // Block-level coordinates (which thread in block) + int thread_x = threadIdx.x; + int thread_y = threadIdx.y; + int thread_z = threadIdx.z; + + // Warp identification + int warp_id = threadIdx.x / 32; // 32 threads per warp + int lane_id = threadIdx.x % 32; // Position within warp + + // Global thread ID calculation + int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x; + } + +C++ Thread Mapping in CK +------------------------ + +Composable Kernel abstracts thread identification into partition indices, building on the :ref:`ck_tile_coordinate_systems` foundation: + +.. code-block:: cpp + + // From tile_partition.hpp + template + struct tile_partition + { + CK_TILE_DEVICE static constexpr index_t get_thread_idx() + { + return threadIdx.x; + } + + CK_TILE_DEVICE static constexpr index_t get_block_idx() + { + return blockIdx.x; + } + + // Convert to multi-dimensional partition index + template + CK_TILE_DEVICE static constexpr auto get_partition_index() + { + constexpr auto thread_layout = ThreadLayout{}; + + // Convert linear thread ID to multi-dimensional index + return thread_layout.template get_index(get_thread_idx()); + } + }; + + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "GPU Device" + subgraph "Thread Block" + subgraph "Warp 0" + T0["Thread 0
lane_id=0"] + T1["Thread 1
lane_id=1"] + T2["..."] + T31["Thread 31
lane_id=31"] + end + + subgraph "Warp 1" + T32["Thread 32
lane_id=0"] + T33["Thread 33
lane_id=1"] + T34["..."] + T63["Thread 63
lane_id=31"] + end + + W2["Warp 2"] + W3["..."] + W7["Warp 7"] + end + end + + subgraph "Thread Identification" + TID["Thread ID = blockIdx.x * blockDim.x + threadIdx.x"] + WID["Warp ID = threadIdx.x / 32"] + LID["Lane ID = threadIdx.x % 32"] + end + + subgraph "P-space Mapping" + P["P-coordinates
NDimP=1: [thread_id]
NDimP=2: [warp_id, lane_id]"] + end + + T0 --> TID + TID --> WID + TID --> LID + WID --> P + LID --> P + + style T0 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style T32 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style P fill:#fff3e0,stroke:#f57c00,stroke-width:3px + + + + + +.. image:: diagrams/thread_mapping_1.svg + :alt: Diagram + :align: center + + +Thread Hierarchy Structure +-------------------------- + +The hardware organizes threads in a specific hierarchy. See :ref:`ck_tile_gpu_basics` for hardware details. + +**Block Level**: Groups of warps working together + +- Warps per block defined by encoding, for example, 2×2 warps +- Shared memory and synchronization scope +- Block-level coordination possible + +**Warp Level**: Groups of threads executing in lockstep + +- Threads per warp defined by encoding, for example, 8×8 threads +- SIMD execution (all threads execute same instruction) +- Warp-level primitives (shuffle, vote, etc.) + +**Thread Level**: Individual execution units + +- Vector size per thread, for example, 4×4 elements +- Independent register space +- Vector operations on multiple elements + +Thread ID Mapping +----------------- + +Each thread gets a unique ID that maps to its position in the hierarchy. For example, in an RMSNorm configuration: + +- **Repeat (M, N)**: (4, 4) - Number of iterations +- **Warps per block (M, N)**: (2, 2) - 4 warps total +- **Threads per warp (M, N)**: (8, 8) - 64 threads per warp +- **Vector size (M, N)**: (4, 4) - 16 elements per thread + +This gives us: + +- **Threads per block**: 256 (4 warps × 64 threads/warp) +- **Elements per thread**: 16 (4×4 vector) +- **Total elements**: 4096 per block + +Thread-to-Data Mapping +====================== + +Once threads know their IDs, they need to map those IDs to specific data elements. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Thread to Data Mapping" + subgraph "Thread Grid" + T00["Thread[0,0]
Warp 0"] + T01["Thread[0,1]
Warp 0"] + T10["Thread[1,0]
Warp 1"] + T11["Thread[1,1]
Warp 1"] + end + + subgraph "Data Tiles" + D00["Data[0:4, 0:4]
16 elements"] + D01["Data[0:4, 4:8]
16 elements"] + D10["Data[4:8, 0:4]
16 elements"] + D11["Data[4:8, 4:8]
16 elements"] + end + + subgraph "Memory Access" + MA["Coalesced Access
Adjacent threads → Adjacent memory"] + end + end + + T00 --> D00 + T01 --> D01 + T10 --> D10 + T11 --> D11 + + D00 --> MA + D01 --> MA + + style T00 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style D00 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style MA fill:#fff3e0,stroke:#f57c00,stroke-width:2px + + + + + +.. image:: diagrams/thread_mapping_2.svg + :alt: Diagram + :align: center + +Data Distribution Pattern +------------------------- + +The RMSNorm operation distributes tensor data across threads in a structured pattern: + +**Hierarchical Data Distribution:** + +- **Block Level**: Multiple iterations (repeat factor) +- **Warp Level**: Warps process different regions +- **Thread Level**: Threads within warp handle adjacent data +- **Vector Level**: Each thread processes multiple elements + +Thread Work Assignment +---------------------- + +Each thread is assigned a specific rectangular region of the tensor. For example: + +- Thread in Warp[0,0] Thread[0,0] might process: + + - Data region (M): [0:4) + - Data region (N): [0:4) + - Total elements: 16 + +- Thread in Warp[0,0] Thread[0,1] might process: + + - Data region (M): [0:4) + - Data region (N): [4:8) + - Total elements: 16 + +This pattern ensures adjacent threads access adjacent memory for optimal coalescing. The :ref:`ck_tile_load_store_traits` system further optimizes these access patterns. + +Thread Cooperation Patterns +=========================== + +Threads don't work in isolation. Threads cooperate at different levels to achieve optimal performance. + +Warp-Level Cooperation +---------------------- + +Threads within a warp execute in lockstep (SIMD): + +- **Synchronization**: Automatic SIMD execution +- **Data sharing**: Warp shuffle instructions +- **Collective ops**: Warp-level reductions +- **Memory access**: Coalesced patterns + +Block-Level Cooperation +----------------------- + +Threads within a block can share data and synchronize: + +- **Shared memory**: All threads in block can access (see :ref:`ck_tile_lds_bank_conflicts` for optimization) +- **Synchronization**: ``__syncthreads()`` barriers +- **Data exchange**: Through shared memory +- **Collective operations**: Block-wide reductions + +Vector-Level Processing +----------------------- + +Each thread processes multiple elements: + +- **Register efficiency**: Multiple elements in registers +- **Memory coalescing**: Vectorized loads/stores +- **Instruction efficiency**: SIMD operations on vectors +- **Bandwidth utilization**: Maximum memory throughput + +Memory Access Patterns +====================== + +The thread mapping directly affects memory access. + +C++ Implementation of Memory Access +----------------------------------- + +Here's how CK implements memory access patterns: + +.. 
code-block:: cpp
+
+   // Coalesced memory access pattern
+   template <typename DataType, index_t VectorSize>
+   __device__ void coalesced_load(const DataType* __restrict__ src,
+                                  DataType* __restrict__ dst,
+                                  index_t tid)
+   {
+       // Each thread loads VectorSize elements
+       // Adjacent threads access adjacent memory
+       const index_t stride = blockDim.x;
+
+       // Vectorized load for efficiency
+       using vector_t = vector_type_t<DataType, VectorSize>;
+
+       // Calculate aligned address
+       const vector_t* src_vec = reinterpret_cast<const vector_t*>(
+           src + tid * VectorSize);
+
+       // Single vectorized load instruction
+       vector_t data = *src_vec;
+
+       // Store to registers
+       reinterpret_cast<vector_t*>(dst)[0] = data;
+   }
+
+   // CK's distributed tensor load implementation
+   template <typename DistributedTensor>
+   __device__ void load_tile_window(DistributedTensor& dist_tensor,
+                                    const auto& tile_window)
+   {
+       // Get thread's partition index
+       constexpr auto partition = tile_partition::get_partition_index();
+
+       // Each thread loads its assigned data
+       tile_window.load(dist_tensor, partition);
+
+       // Hardware automatically coalesces adjacent thread accesses
+   }
+
+Memory Access Optimization Techniques
+-------------------------------------
+
+CK uses several techniques to optimize memory access:
+
+.. code-block:: cpp
+
+   // 1. Vector loads for maximum bandwidth
+   //    (illustrative selection: pick the widest vector type that divides the element count)
+   template <typename T, index_t VectorSize>
+   using vector_load_t = conditional_t<VectorSize % 4 == 0,
+                                       vector_type_t<T, 4>,
+                                       conditional_t<VectorSize % 2 == 0, vector_type_t<T, 2>, T>>;
+
+   // 2. Swizzling to avoid bank conflicts
+   // See :ref:`ck_tile_lds_index_swapping` and :ref:`ck_tile_swizzling_example`
+   template <index_t BankSize>
+   __device__ index_t swizzle_offset(index_t tid, index_t offset)
+   {
+       // Rotate access pattern to avoid conflicts
+       return (offset + (tid / BankSize)) % BankSize;
+   }
+
+   // 3. Prefetching for latency hiding
+   __device__ void prefetch_next_tile(const float* src, index_t offset)
+   {
+       // Prefetch to L2 cache
+       __builtin_prefetch(src + offset, 0, 3);
+   }
+
+Memory Efficiency Benefits
+--------------------------
+
+The structured thread mapping provides several memory efficiency benefits:
+
+**Memory Coalescing Benefits:**
+
+- **Adjacent access**: Threads in same warp access adjacent memory locations
+- **Cache efficiency**: Related data loaded together into cache lines
+- **Bandwidth utilization**: Maximum memory bandwidth achieved
+- **Reduced latency**: Fewer memory transactions needed
+
+**Performance Characteristics:**
+
+- **Predictable patterns**: Access patterns known at compile time
+- **Vectorization**: Hardware can optimize vector operations
+- **Reduced overhead**: No complex address calculations at runtime
+- **Scalability**: Pattern scales efficiently with thread count
+
+Practical Thread Mapping Example
+================================
+
+Complete C++ Kernel Example
+---------------------------
+
+The following example shows how thread mapping works in a CK kernel:
+
+.. code-block:: cpp
+
+   // RMSNorm kernel using CK's thread mapping
+   template <typename DataType,
+             typename ComputeType,
+             index_t BlockSize,
+             index_t WarpSize,
+             index_t VectorSize>
+   __global__ void rmsnorm_kernel(const DataType* __restrict__ x,
+                                  DataType* __restrict__ y,
+                                  const DataType* __restrict__ weight,
+                                  ComputeType epsilon,
+                                  index_t hidden_size)
+   {
+       // 1. Thread identification
+       const index_t tid = threadIdx.x;
+       const index_t bid = blockIdx.x;
+
+       // 2. 
Create tile distribution encoding
+       // This would be defined based on your specific RMSNorm pattern
+       using Encoding = tile_distribution_encoding<
+           sequence<>,                             // No replication
+           tuple<sequence<4, 2>, sequence<4, 2>>,  // H dimensions
+           tuple<sequence<1>, sequence<2>>,        // P to RH major
+           tuple<sequence<0>, sequence<0>>,        // P to RH minor
+           sequence<1, 2>,                         // Y to RH major
+           sequence<0, 0>                          // Y to RH minor
+       >;
+       constexpr auto tile_dist = make_static_tile_distribution(Encoding{});
+
+       // 3. Get thread's partition index from distribution
+       const auto partition_idx = tile_dist._get_partition_index();
+
+       // 4. Shared memory for reduction
+       __shared__ ComputeType shared_sum[BlockSize];
+
+       // 5. Create tensor view and tile window
+       // See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_tile_window`
+       auto x_view = make_naive_tensor_view(
+           x + bid * hidden_size,
+           make_tuple(hidden_size),
+           make_tuple(number<1>{})
+       );
+
+       auto x_window = make_tile_window(
+           x_view,
+           make_tuple(hidden_size),
+           make_tuple(number<0>{}),
+           tile_dist);
+
+       // 6. Each thread processes its assigned elements
+       ComputeType thread_sum = 0;
+       static_for<0, VectorSize, 1>{}([&](auto i) {
+           // Access pattern would depend on your tile window setup
+           // This is conceptual - actual implementation varies
+           thread_sum += val * val;
+       });
+
+       // 7. Warp-level reduction
+       thread_sum = warp_reduce_sum(thread_sum);
+
+       // 8. Block-level reduction
+       if (tid % WarpSize == 0) {
+           shared_sum[tid / WarpSize] = thread_sum;
+       }
+       __syncthreads();
+
+       // 9. Final reduction by first warp
+       if (tid < BlockSize / WarpSize) {
+           thread_sum = shared_sum[tid];
+           thread_sum = warp_reduce_sum(thread_sum);
+       }
+
+       // 10. Compute RMS and normalize
+       if (tid == 0) {
+           shared_sum[0] = rsqrt(thread_sum / hidden_size + epsilon);
+       }
+       __syncthreads();
+
+       const ComputeType rms_recip = shared_sum[0];
+
+       // 11. Write normalized output
+       auto y_window = make_tile_window(
+           make_tensor_view(y + bid * hidden_size),
+           tile_dist);
+
+       static_for<0, VectorSize, 1>{}([&](auto i) {
+           auto idx = tile_dist.get_tensor_coordinate(partition_idx, i);
+           ComputeType val = static_cast<ComputeType>(x_window.get(idx));
+           ComputeType w   = static_cast<ComputeType>(weight[idx[1]]);
+           y_window.set(idx, static_cast<DataType>(val * rms_recip * w));
+       });
+   }
+
+Key Thread Mapping Concepts in Action
+-------------------------------------
+
+1. **Thread-to-Data Assignment**: Each thread gets a unique ``partition_idx``
+2. **Vectorized Access**: Each thread processes ``VectorSize`` elements
+3. **Warp Cooperation**: Threads within a warp perform reductions
+4. **Block Synchronization**: All threads synchronize for final result
+5. **Coalesced Memory**: Adjacent threads access adjacent memory
+
+Key Takeaways
+=============
+
+Thread mapping is the bridge between mathematical abstractions and physical hardware execution:
+
+**Thread Identification:**
+
+1. **Hierarchical Organization**: Threads organized in blocks → warps → threads → vectors
+
+   - Each level has specific cooperation capabilities
+   - Hardware provides efficient primitives at each level
+   - Thread IDs map directly to data regions
+   - Predictable and efficient execution patterns
+
+2. **Data Assignment**: Each thread gets a specific rectangular region
+
+   - Work distributed evenly across threads
+   - Memory access patterns optimized for coalescing
+   - Vector operations maximize throughput
+   - Scalable across different hardware configurations
+
+3. 
**Cooperation Patterns**: Threads cooperate at multiple levels + + - Warp-level SIMD execution for efficiency + - Block-level shared memory and synchronization + - Vector-level processing for maximum throughput + - Hierarchical coordination for complex operations + +**Performance Benefits:** + +- **Memory Coalescing**: Adjacent threads access adjacent memory for optimal bandwidth +- **Cache Efficiency**: Related data loaded together, reducing memory latency +- **Vectorization**: Hardware can optimize multiple operations per thread +- **Predictable Patterns**: Compile-time optimization of access patterns + +**Why This Matters:** + +Thread mapping connects encodings, transformations, and distributions to hardware execution. + +The RMSNorm example shows how a real operation uses these concepts to achieve optimal performance on GPU hardware. Every thread knows exactly what data to process, how to access it efficiently, and how to cooperate with other threads. + + +Related Topics + +- :ref:`ck_tile_descriptors` - Complete tensor specifications that thread mapping uses +- :ref:`ck_tile_coordinate_movement` - Advanced coordinate operations for thread navigation +- :ref:`ck_tile_sweep_tile` - How threads iterate over distributed data +- :ref:`ck_tile_gemm_optimization` - Real-world application of thread mapping in GEMM kernels +- :ref:`ck_tile_space_filling_curve` - Optimal traversal patterns for thread access diff --git a/docs/conceptual/ck_tile/tile_distribution.rst b/docs/conceptual/ck_tile/tile_distribution.rst new file mode 100644 index 0000000000..c57a87e5ce --- /dev/null +++ b/docs/conceptual/ck_tile/tile_distribution.rst @@ -0,0 +1,627 @@ +.. _ck_tile_distribution: + +Tile Distribution - The Core API +================================ + +Overview +-------- + +At the heart of Composable Kernel's approach to efficient GPU computation lies TileDistribution, a compile-time abstraction that transforms how developers approach parallel programming on GPUs. Rather than requiring programmers to manually manage thread coordination, memory access patterns, and data distribution, TileDistribution provides an mathematical framework that automatically maps logical computational coordinates to physical execution resources. + +The architectural foundation of tile distribution in CK rests upon the :ref:`coordinate transformation system ` that bridges multiple abstract spaces. This system manages the interaction between four primary coordinate dimensions, each serving a distinct purpose in the overall computation model. The X dimensions represent the physical tensor coordinates, capturing the actual layout of data in memory. The Y dimensions encode the tile access patterns, defining how threads traverse their assigned data. The P dimensions map to processing elements, representing the hierarchical organization of threads, warps, and blocks in the :ref:`GPU's execution model `. Additionally, the optional R dimensions enable replication strategies for algorithms that benefit from redundant computation to reduce communication overhead. + +This multi-dimensional mapping framework enables CK to express arbitrarily complex data access patterns through a mathematically formalism. The power of this approach becomes evident when considering how traditional GPU programming requires developers to manually calculate memory addresses, ensure coalescing constraints, :ref:`avoid bank conflicts `, and manage thread cooperation. 
TileDistribution handles all these concerns within a unified abstraction that can be analyzed, optimized, and verified at compile time. + +The ``tile_distribution`` template class integrates three essential components that work together to deliver optimal performance. The ``PsYs2XsAdaptor`` component performs :ref:`coordinate transformations ` from processing and pattern dimensions to physical tensor coordinates, implementing the mathematical mappings that ensure efficient memory access. The ``Ys2DDescriptor`` component handles the linearization of Y dimensions, transforming multi-dimensional tile patterns into register allocation schemes that maximize register reuse and minimize register pressure. The ``StaticTileDistributionEncoding`` captures the hierarchical decomposition of work across the GPU's compute resources, encoding decisions about how work is partitioned across thread blocks, warps, and individual threads. + +This design adapts to diverse computational scenarios without manual intervention. The same high-level code can execute on GPUs with different numbers of streaming multiprocessors, varying warp sizes, or distinct memory hierarchies. The compile-time nature of the abstraction ensures that all coordination logic is resolved during compilation, resulting in machine code that is comparable hand-optimized implementations. This adaptability enables a single implementation to achieve improved performance across a wide range of tensor sizes, shapes, and computational patterns without the combinatorial explosion of specialized kernels. + +Complete Tile Distribution System Overview +------------------------------------------ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Logical View" + T["Tensor
Multi-dimensional data"] + TD["TileDistribution
Work assignment"] + TW["TileWindow
Data view"] + end + + subgraph "Coordinate Spaces" + X["X: Physical tensor coords"] + Y["Y: Tile pattern coords"] + P["P: Processing element coords"] + R["R: Replication coords (optional)"] + end + + subgraph "GPU Execution" + W["Warps
32 threads each"] + L["Lanes
Thread within warp"] + REG["Registers
Thread-local storage"] + end + + T --> TD + TD --> TW + + TD --> X + TD --> Y + TD --> P + TD --> R + + P --> W + P --> L + TW --> REG + + style TD fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style P fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style REG fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_1.svg + :alt: Diagram + :align: center + +Coordinate System Architecture +------------------------------ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "Input" + TC["Thread Coordinates
(warpId, laneId)"] + end + + subgraph "Transformation Pipeline" + P2Y["P → Y
Thread to pattern"] + Y2X["Y → X
Pattern to physical"] + Y2D["Y → D
Pattern to register"] + end + + subgraph "Output" + MC["Memory Coordinates
Global addresses"] + RI["Register Indices
Local storage"] + end + + TC --> P2Y + P2Y --> Y2X + P2Y --> Y2D + Y2X --> MC + Y2D --> RI + + style TC fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style MC fill:#d1fae5,stroke:#10b981,stroke-width:2px + style RI fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_2.svg + :alt: Diagram + :align: center + +What is Tile Distribution? +-------------------------- + +In GPU programming, distributing work across thousands of parallel threads is an important challenge. Consider a 256×256 matrix multiplication operation and 64 GPU threads organized in warps. The question becomes how to divide this computational work in a way that maximizes memory bandwidth utilization, minimizes bank conflicts, and ensures coalesced memory accesses. + +The traditional approach without a tile distribution framework requires programmers to manually calculate global memory addresses for each thread, implement complex index arithmetic that accounts for thread hierarchy (threads within warps, warps within blocks), handle edge cases for non-divisible matrix dimensions, and create different implementations for various matrix sizes. This manual approach is not only error-prone but also fails to adapt to different GPU architectures and their specific memory access patterns. + +TileDistribution solves these challenges through a systematic approach to work distribution. It automatically assigns work to threads based on a hierarchical decomposition of the problem space, generates memory access patterns that respect GPU hardware constraints, provides a uniform interface that works across different tensor sizes and shapes, and ensures optimal thread cooperation by automatically managing data movement to thread-local registers. + +TileDistribution abstracts the mapping between logical problem coordinates and physical execution resources. Given a thread's position in the GPU's execution hierarchy (specified by warp ID and lane ID within the warp), TileDistribution computes two critical pieces of information: the global memory addresses that this thread should access, and the specific access pattern that ensures efficient memory transactions. This abstraction is implemented in C++ through the following core structure: + +.. code-block:: cpp + + template + struct tile_distribution + { + // Core functionality: map thread coordinates to data + CK_TILE_HOST_DEVICE static auto _get_partition_index() + { + if constexpr(NDimP == 1) + return array{get_lane_id()}; + else if constexpr(NDimP == 2) + return array{get_warp_id(), get_lane_id()}; + } + + // Calculate which tensor elements this thread accesses + template + CK_TILE_HOST_DEVICE static auto calculate_tile_Ys_index(const PartitionIndex& ps_idx) + { + return detail::calculate_tile_Ys_index( + StaticTileDistributionEncoding{}, ps_idx); + } + }; + +Problem Space Mapping +--------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + + graph TB + subgraph "Problem Space (256×256 Matrix)" + M["Full Matrix
65,536 elements"] + T1["Tile 1
32×32"] + T2["Tile 2
32×32"] + TN["Tile N
32×32"] + end + + subgraph "Thread Assignment" + W0["Warp 0
32 threads"] + W1["Warp 1
32 threads"] + L0["Lane 0-31
Individual threads"] + end + + subgraph "Memory Pattern" + MP["Coalesced Access
Sequential addresses
No bank conflicts"] + end + + M --> T1 + M --> T2 + M --> TN + + T1 --> W0 + T1 --> W1 + W0 --> L0 + L0 --> MP + + style M fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style MP fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_3.svg + :alt: Diagram + :align: center + +Creating a TileDistribution +--------------------------- + +Creating and using a TileDistribution: + +.. code-block:: cpp + + // SPDX-License-Identifier: MIT + // Copyright (c) Advanced Micro Devices, Inc. All rights reserved. + + #include "ck_tile/host.hpp" + #include "ck_tile/core.hpp" + #include + #include + #include + + namespace ck_tile { + + struct TileDistributionExample + { + CK_TILE_DEVICE void operator()(float* global_data, + ck_tile::index_t global_shape_0, + ck_tile::index_t global_shape_1) const + { + if(threadIdx.x == 0 && blockIdx.x == 0) { + printf("\n=== Tile Distribution Example (Device Kernel) ===\n"); + } + block_sync_lds(); + + // Create a tile distribution encoding + // This defines how a tensor is distributed across threads + auto encoding = tile_distribution_encoding< + sequence<>, // rs_lengths=[] - No replication dimensions + tuple< + sequence<2, 2>, // hs_lengthss=[[2, 2], [2, 2]] - Hierarchical lengths for each X dimension + sequence<2, 2>>, + tuple, sequence<2>>, // ps_to_rhss_major=[[1], [2]] - P to RH major mappings + tuple, sequence<0>>, // ps_to_rhss_minor=[[0], [0]] - P to RH minor mappings + sequence<1, 2>, // ys_to_rhs_major=[1, 2] - Y to RH major mappings + sequence<1, 1>>{}; // ys_to_rhs_minor=[1, 1] - Y to RH minor mappings + + // Create the tile distribution from the encoding + auto distribution = make_static_tile_distribution(encoding); + + // Calculate sizes from the distribution encoding + // x0_size = np.prod(distribution.encoding.hs_lengthss[0]) + constexpr auto hs_lengths_0 = encoding.hs_lengthss_[number<0>{}]; // sequence<2, 2> + constexpr auto hs_lengths_1 = encoding.hs_lengthss_[number<1>{}]; // sequence<2, 2> + + constexpr index_t x0_size = reduce_on_sequence(hs_lengths_0, multiplies{}, number<1>{}); + constexpr index_t x1_size = reduce_on_sequence(hs_lengths_1, multiplies{}, number<1>{}); + + // Print distribution info (only from thread 0) + if(threadIdx.x == 0 && blockIdx.x == 0) { + printf("\n- Tile distribution created:\n"); + printf(" X dimensions: %d\n", distribution.get_num_of_dimension_x()); + printf(" Y dimensions: %d\n", distribution.get_num_of_dimension_y()); + printf(" P dimensions: %d\n", distribution.get_num_of_dimension_p()); + printf(" X lengths: [%d, %d]\n", x0_size, x1_size); + } + block_sync_lds(); + + // Create packed tensor view (contiguous row-major) using helper + auto global_view = make_naive_tensor_view_packed( + global_data, + make_tuple(global_shape_0, global_shape_1)); + + // Window configuration + auto window_lengths = make_tuple(x0_size, x1_size); + + // Get current thread's warp and thread indices + index_t warp_id = threadIdx.x / get_warp_size(); + index_t thread_id = threadIdx.x % get_warp_size(); + + // Window origin - small offset from origin + auto window_origin = make_tuple(1, 3); // Small offset from origin + + // Create tile window + auto tile_window = make_tile_window( + global_view, + window_lengths, + {1, 3}, // Window origin as initializer list + distribution + ); + + // Load distributed tensor + auto distributed_tensor = tile_window.load(); + + // Collect values by sweeping through the distributed tensor + constexpr index_t max_elements = x0_size*x1_size; + float 
collected_values[max_elements]; + index_t value_count = 0; + + // Sweep through the distributed tensor and collect values using sweep_tile API + sweep_tile(distributed_tensor, [&](auto idx) { + if(value_count(threadIdx.x) == sel) { + printf("Partition index: (warp=%d, thread=%d)\n", static_cast(warp_id), static_cast(thread_id)); + printf("Collected values: "); + for(index_t i = 0; i < value_count; i++) { + printf("%.0f", collected_values[i]); + if(i < value_count - 1) printf(", "); + } + printf("\n\n"); + } + block_sync_lds(); + } + } + }; + } + + int main() + { + // Host-side allocation & initialization of pattern data + // Reproduce the compile-time sizes used in the kernel: hs_lengths = [2,2] => x sizes=4; global = 4+5 = 9 + constexpr ck_tile::index_t global_shape_0 = 9; // x0_size(4) + 5 + constexpr ck_tile::index_t global_shape_1 = 9; // x1_size(4) + 5 + constexpr ck_tile::index_t total_elems = global_shape_0 * global_shape_1; // 81 + + std::vector h_global_data(total_elems); + for(ck_tile::index_t i = 0; i < global_shape_0; ++i) { + for(ck_tile::index_t j = 0; j < global_shape_1; ++j) { + h_global_data[i * global_shape_1 + j] = static_cast(i * 100 + j); + } + } + + ck_tile::DeviceMem d_global_data(sizeof(float) * total_elems); + d_global_data.ToDevice(h_global_data.data()); + + std::cout << "\nGlobal data (host print, to be used by device) shape=(" + << static_cast(global_shape_0) << "," << static_cast(global_shape_1) << ")\n\n"; + for(ck_tile::index_t i = 0; i < global_shape_0; ++i) { + for(ck_tile::index_t j = 0; j < global_shape_1; ++j) { + std::cout << h_global_data[i * global_shape_1 + j]; + if(j + 1 < global_shape_1) std::cout << "\t"; + } + std::cout << '\n'; + } + std::cout << '\n'; + + constexpr ck_tile::index_t kBlockSize = 128; + constexpr ck_tile::index_t kBlockPerCu = 1; + constexpr ck_tile::index_t kGridSize = 1; + + using Kernel = ck_tile::TileDistributionExample; + float ave_time = launch_kernel(ck_tile::stream_config{nullptr, true, 0, 0, 1}, + ck_tile::make_kernel( + Kernel{}, + kGridSize, + kBlockSize, + 0, + static_cast(d_global_data.GetDeviceBuffer()), + global_shape_0, + global_shape_1)); + + std::cout << "Kernel execution completed. Average time: " << ave_time << " ms" << std::endl; + + return 0; + } + +Hierarchical Decomposition +-------------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Level 1: Block Distribution" + B["Thread Block
256 threads"] + BT1["Block Tile 1
64×64"] + BT2["Block Tile 2
64×64"] + end + + subgraph "Level 2: Warp Distribution" + W["Warp
32 threads"] + WT1["Warp Tile 1
16×16"] + WT2["Warp Tile 2
16×16"] + end + + subgraph "Level 3: Thread Distribution" + T["Thread"] + TT["Thread Tile
2×2"] + end + + B --> BT1 + BT1 --> W + W --> WT1 + WT1 --> T + T --> TT + + style B fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style W fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style T fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_4.svg + :alt: Diagram + :align: center + +Advanced Example: Matrix Multiplication Distribution +---------------------------------------------------- + +.. code-block:: cpp + + // Real GEMM kernel pattern using TileDistribution + template + __global__ void gemm_kernel( + const AType* __restrict__ a_ptr, + const BType* __restrict__ b_ptr, + CType* __restrict__ c_ptr, + index_t M, index_t N, index_t K) + { + // Define the tile distribution encoding at compile time + using Encoding = tile_distribution_encoding< + sequence<>, // R: no replication + tuple, // H for M dimension + sequence<4, 2, 8, 4>>, // H for N dimension + tuple, sequence<1, 2>>, // P to RH major + tuple, sequence<2, 2>>, // P to RH minor + sequence<1, 1, 2, 2>, // Y to RH major + sequence<0, 3, 0, 3> // Y to RH minor + >; + + // Create the distribution + constexpr auto distribution = make_static_tile_distribution(Encoding{}); + + // Create tensor views + auto a_view = make_tensor_view( + a_ptr, + make_naive_tensor_descriptor_packed(make_tuple(M, K))); + + // Create tile window for this thread block + auto a_window = make_tile_window( + a_view, + make_tuple(number<256>{}, number<64>{}), // window size + {blockIdx.x * 256, 0}, // origin + distribution); + + // Load data to distributed tensor (registers) + auto a_reg = make_static_distributed_tensor(distribution); + + a_window.load(a_reg); + + // Computation happens in registers + // Results written back through another window + } + +Work Distribution Pattern +------------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart TB + subgraph "Matrix C (128×128)" + C["16,384 elements"] + end + + subgraph "Thread Grid (32×32)" + TG["1,024 threads"] + end + + subgraph "Per Thread" + PT["4×4 tile
16 elements"] + end + + subgraph "Memory Access" + MA["Coalesced reads
Efficient writes
No conflicts"] + end + + C --> TG + TG --> PT + PT --> MA + + style C fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style TG fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style PT fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style MA fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_5.svg + :alt: Diagram + :align: center + +Memory Access Patterns +---------------------- + +One of the key benefits of TileDistribution is generating optimal memory access patterns. The encoding parameters control how threads access memory: + +- **H-dimensions**: Define hierarchical decomposition (Repeat, WarpPerBlock, ThreadPerWarp, Vector) +- **P-to-RH mappings**: Control how thread IDs map to the hierarchy +- **Y-to-RH mappings**: Define the access pattern within each thread's tile + +Transformation Pipeline +----------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Input" + TID["Thread ID
(0-1023)"] + end + + subgraph "Stage 1" + P["P-coordinates
(warp, lane)"] + end + + subgraph "Stage 2" + Y["Y-coordinates
(tile position)"] + end + + subgraph "Stage 3" + X["X-coordinates
(tensor indices)"] + end + + subgraph "Output" + ADDR["Memory addresses
Register indices"] + end + + TID --> P + P --> Y + Y --> X + X --> ADDR + + style TID fill:#e0e7ff,stroke:#4338ca,stroke-width:2px + style ADDR fill:#d1fae5,stroke:#10b981,stroke-width:2px + + + + + +.. image:: diagrams/tile_distribution_6.svg + :alt: Diagram + :align: center + +Performance Comparison +---------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Manual Implementation" + M1["Calculate indices manually"] + M2["Handle boundary conditions"] + M3["Ensure coalescing"] + M4["Manage bank conflicts"] + M5["~200 lines of code"] + end + + subgraph "With TileDistribution" + T1["make_tile_distribution()"] + T2["Automatic optimization"] + T3["~10 lines of code"] + end + + subgraph "Performance" + P1["Same performance"] + P2["Fewer bugs"] + P3["Portable across GPUs"] + end + + M1 --> M5 + T1 --> T3 + + M5 --> P1 + T3 --> P1 + P1 --> P2 + P2 --> P3 + + style M5 fill:#fee2e2,stroke:#ef4444,stroke-width:2px + style T3 fill:#d1fae5,stroke:#10b981,stroke-width:2px + style P3 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px + + + + + + +.. image:: diagrams/tile_distribution_7.svg + :alt: Diagram + :align: center + +Summary +------- + +The automatic work distribution capabilities of TileDistribution eliminate one of the most error-prone aspects of GPU programming. TileDistribution's mathematical framework ensures that every thread knows which data elements it should process and automatically handles complex index arithmetic. + +Memory access pattern optimization is a performance benefit of the TileDistribution approach. GPUs achieve their computational throughput only when memory accesses follow specific patterns that enable hardware optimizations such as coalescing and broadcast. TileDistribution automatically generates these patterns, such that threads within a warp access contiguous memory locations, that bank conflicts in shared memory are reduced, and that the memory subsystem operates efficiently. This optimization happens transparently, without manual memory pattern analysis. + +By encoding the natural hierarchy of threads, warps, and blocks directly into the distribution strategy, the framework ensures that each level of the hierarchy operates optimally. This hierarchical approach enables tiling strategies that would be impractical to implement manually, such as multi-level tiling that simultaneously optimizes for L1 cache, L2 cache, and register file usage. + +The zero-overhead nature of TileDistribution, achieved through use of C++ template metaprogramming and compile-time computation, ensures that the abstraction's benefits come without runtime cost. Every aspect of the distribution strategy is resolved at compile time, resulting in machine code that is comparable to hand-written implementations. The compiler's ability to see through the abstraction enables optimizations that aren't typically available to runtime-based approaches. + +The same source code can execute efficiently on GPUs with different warp sizes, different numbers of registers per thread, or different shared memory capacities. This portability includes performance portability, with the framework adapting its strategies to match the characteristics of the target architecture. + +TileDistribution provides a solid foundation for the CK ecosystem. This abstraction provides a programming model that insulates developers from the complexity of the underlying hardware while enabling them to use hardware capabilities. 
+ +Next Steps +---------- + +See :ref:`ck_tile_terminology` for a glossary of key concepts and terminology used in CK Tile. diff --git a/docs/conceptual/ck_tile/tile_window.rst b/docs/conceptual/ck_tile/tile_window.rst new file mode 100644 index 0000000000..87d2f39b01 --- /dev/null +++ b/docs/conceptual/ck_tile/tile_window.rst @@ -0,0 +1,701 @@ +.. _ck_tile_tile_window: + +Tile Window - Data Access Gateway +================================= + +Overview +-------- + +While :ref:`TileDistribution ` determines the mapping between threads and tensor coordinates, TileWindow provides the mechanism for loading and storing data with memory access patterns. This abstraction encapsulates coalesced memory accesses, vectorization, and boundary handling into an interface. + +TileWindow implements a distribution-aware windowing mechanism that views a subset of a larger tensor through the lens of a tile distribution. This windowing is a distribution-aware view that automatically generates memory access patterns for the underlying hardware. The system combines knowledge of the :ref:`tensor's layout `, the distribution pattern, and the :ref:`GPU's memory subsystem ` characteristics to generate optimized load and store operations. + +TileWindow Architecture +----------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Components" + TV["TensorView
Data source"] + TD["TileDistribution
Thread mapping"] + TW["TileWindow
Access gateway"] + LT["LoadStoreTraits
Access optimizer"] + DT["DistributedTensor
Register storage"] + end + + subgraph "Operations" + Load["Load
Global → Registers"] + Compute["Compute
In registers"] + Store["Store
Registers → Global"] + end + + subgraph "Optimizations" + Coal["Coalescing
Adjacent access"] + Vec["Vectorization
Multi-element ops"] + Bank["Bank conflict
avoidance"] + SFC["Space-filling
curve traversal"] + end + + TV --> TW + TD --> TW + TW --> LT + LT --> DT + + TW --> Load + Load --> Compute + Compute --> Store + + Load --> Coal + Load --> Vec + Load --> SFC + Store --> Bank + + style TW fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style LT fill:#fff3e0,stroke:#f57c00,stroke-width:2px + style DT fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + +.. image:: diagrams/tile_window_1.svg + :alt: Diagram + :align: center + +What is a TileWindow? +--------------------- + +The challenge in GPU programming lies in the gap between logical tensor operations and the physical realities of memory access. While :ref:`TileDistribution ` solves the problem of work assignment by mapping threads to :ref:`tensor coordinates `, it does not address how threads access the data at those coordinates. TileWindow serves as the critical bridge between logical work assignment and physical memory operations. + +TileWindow implements a distribution-aware windowing mechanism that transforms abstract coordinate mappings into concrete memory access patterns. The abstraction takes into account the data elements each thread needs and also how to access them in a way that maximizes memory bandwidth utilization. This involves optimized techniques such as memory coalescing, where adjacent threads access adjacent memory locations, and vectorization, where multiple elements are loaded or stored in a single transaction. + +**C++ Implementation Overview:** + +.. code-block:: cpp + + // From ck_tile/core/tensor/tile_window.hpp + #include + #include + #include + + template + struct tile_window_with_static_distribution + { + using TensorView = remove_cvref_t; + using Distribution = remove_cvref_t; + using DataType = typename TensorView::DataType; + + // Core components that define the window + TensorView tensor_view_; // View into the underlying tensor + Distribution distribution_; // How to distribute data across threads + array origin_; + + // Window-specific information + static constexpr auto window_lengths = WindowLengths{}; + static constexpr index_t num_of_dimension = TensorView::get_num_of_dimension(); + + // Constructor + CK_TILE_HOST_DEVICE constexpr tile_window_with_static_distribution( + const TensorView& tensor_view, + const WindowLengths& /*window_lengths*/, + const array& origin, + const Distribution& distribution) + : tensor_view_{tensor_view}, + distribution_{distribution}, + origin_{origin} + {} + + // Load operation with automatic coalescing + template + CK_TILE_DEVICE void load(DistributedTensor& dst_tensor) const + { + // Sophisticated load implementation that: + // 1. Calculates optimal access pattern + // 2. Handles vectorization automatically + // 3. Ensures coalesced memory access + // 4. Manages boundary conditions + } + }; + +LoadStoreTraits - The Access Pattern Engine +------------------------------------------- + +Behind every TileWindow operation lies :ref:`LoadStoreTraits `, a compile-time analysis engine that determines an optimized way to access memory. This component bridges the gap between the logical distribution pattern and the physical memory subsystem, analyzing the distribution to find opportunities for vectorization and coalescing. 
+ +LoadStoreTraits performs several analyses: + +- **Vector dimension identification**: Finds which Y dimension has stride 1 for optimal vectorization +- **Access pattern calculation**: Determines the number and order of memory operations +- **Space-filling curve construction**: Creates an optimal traversal order for cache efficiency + +**C++ LoadStoreTraits Analysis:** + +.. code-block:: cpp + + // LoadStoreTraits analyzes the distribution pattern + template + struct load_store_traits + { + static constexpr index_t ndim_y = Distribution::ndim_y; + + // Analyze which Y dimension has stride 1 (best for vectorization) + static constexpr index_t vector_dim_y = []() { + // Complex compile-time analysis to find optimal dimension + return find_vector_dimension(); + }(); + + // Calculate vectorization potential + static constexpr index_t scalar_per_vector = []() { + // Determine how many elements can be loaded in one instruction + return calculate_vector_size(); + }(); + + // Space-filling curve for optimal traversal + using sfc_type = space_filling_curve; + static constexpr sfc_type sfc_ys = make_space_filling_curve(); + + // Get Y indices for a given access + CK_TILE_DEVICE constexpr auto get_y_indices(index_t i_access) const + { + return sfc_ys.get_index(i_access); + } + }; + +Space-Filling Curves for Memory Access +-------------------------------------- + +TileWindow uses :ref:`space-filling curves ` to determine the order in which memory is accessed. Space-filling curves provide cache-friendly traversal patterns that help maximize hardware utilization. The "snake" pattern minimizes the distance between consecutive accesses, keeping data in cache longer. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Linear Access Pattern" + L1["0,1,2,3"] + L2["4,5,6,7"] + L3["8,9,10,11"] + L4["12,13,14,15"] + end + + subgraph "Snake Access Pattern" + S1["0,1,2,3"] + S2["7,6,5,4"] + S3["8,9,10,11"] + S4["15,14,13,12"] + end + + L1 --> L2 + L2 --> L3 + L3 --> L4 + + S1 --> S2 + S2 --> S3 + S3 --> S4 + + style S1 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style S2 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + +.. image:: diagrams/tile_window_2.svg + :alt: Diagram + :align: center + +**C++ Space-Filling Curve Implementation:** + +.. code-block:: cpp + + // Space-filling curve for optimal memory traversal + template + struct space_filling_curve + { + array tensor_lengths; + array dim_access_order; + bool snake_curved; + + // Get coordinates for the i-th access + CK_TILE_DEVICE constexpr auto get_index(index_t i_access) const + { + array indices; + + // Snake pattern logic for cache-friendly access + if (snake_curved) { + // Implement snake curve traversal + // Minimizes distance between consecutive accesses + } + + return indices; + } + }; + +TileWindow Data Flow +-------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + flowchart LR + subgraph "Step 1: Create Window" + T["Tensor
[256, 256]"] + O["Origin
(64, 64)"] + W["Window Size
[32, 32]"] + end + + subgraph "Step 2: Apply Distribution" + TD["TileDistribution
Thread mapping"] + TW["TileWindow
Created"] + end + + subgraph "Step 3: Load Data" + GM["Global Memory
Window region"] + REG["Registers
Distributed tensor"] + end + + T --> TW + O --> TW + W --> TW + TD --> TW + + TW --> GM + GM -->|"load()"| REG + + style TW fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style REG fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + +.. image:: diagrams/tile_window_3.svg + :alt: Diagram + :align: center + +Creating and Using TileWindow +----------------------------- + +.. code-block:: cpp + + using namespace ck_tile; + + // Create a tensor view for input data (see :ref:`ck_tile_tensor_views`) + auto tensor_view = make_naive_tensor_view( + data_ptr, + make_tuple(256, 256), // Shape + make_tuple(256, 1) // Strides + ); + + // Define window parameters + constexpr auto window_size = make_tuple(32, 32); + auto window_origin = make_tuple(64, 64); + + // Create distribution for the window + auto distribution = make_static_tile_distribution< + tile_distribution_encoding< + sequence<>, // No replication + tuple, sequence<4, 2>>, // 8x8 threads + tuple, sequence<1>>, // Thread mapping + tuple, sequence<0>>, // Minor indices + sequence<1, 1>, // Y-space: 2x2 per thread + sequence<1, 1> // Y-space minor + > + >{}; + + // Create the tile window + auto window = make_tile_window( + tensor_view, + window_size, + window_origin, + distribution + ); + + // Load data into distributed tensor (see :ref:`ck_tile_static_distributed_tensor`) + auto distributed_data = make_static_distributed_tensor(distribution); + window.load(distributed_data); + +The Load Operation Deep Dive +---------------------------- + +Calls to ``window.load()`` trigger the following sequence of operations: + +1. **Distributed tensor creation**: Automatically creates a :ref:`distributed tensor ` sized for the distribution +2. **Coordinate calculation**: Uses precomputed coordinates for efficiency +3. **Vectorized access**: Groups elements for vector loads based on :ref:`LoadStoreTraits ` analysis +4. **Memory coalescing**: Ensures adjacent threads access adjacent memory +5. **Boundary handling**: Manages edge cases automatically + +**C++ Load Implementation Details:** + +.. code-block:: cpp + + template + CK_TILE_DEVICE void load(DistributedTensor& dst_tensor) const + { + // Get LoadStoreTraits for optimal access pattern + using Traits = load_store_traits; + + // Iterate through all accesses determined by space-filling curve + static_for<0, Traits::num_access, 1>{}([&](auto i_access) { + // Get Y-space indices for this access + const auto y_indices = Traits::get_y_indices(i_access); + + // Calculate global coordinates + const auto x_indices = distribution_.calculate_x_from_y(y_indices); + const auto global_indices = add_arrays(origin_, x_indices); + + // Perform vectorized load if possible + if constexpr (Traits::scalar_per_vector > 1) { + // Vector load path + using VectorType = vector_type_t; + const auto vector_data = tensor_view_.template get_vectorized_elements( + global_indices, Traits::vector_dim_y); + dst_tensor.template set_vectorized_elements(y_indices, vector_data); + } else { + // Scalar load path + const auto scalar_data = tensor_view_.get_element(global_indices); + dst_tensor.set_element(y_indices, scalar_data); + } + }); + } + +Load Operation Architecture +--------------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Load Analysis" + Analyze["Analyze access pattern
Detect coalescing opportunities"] + end + + subgraph "Vectorization" + V1["Scalar: 4 loads"] + V2["Vector2: 2 loads"] + V4["Vector4: 1 load"] + end + + subgraph "Memory Transaction" + Coal["Coalesced access
32 threads → 1 transaction"] + NonCoal["Non-coalesced
32 threads → 32 transactions"] + end + + subgraph "Result" + Reg["Thread registers
Local data"] + end + + Analyze --> V1 + Analyze --> V2 + Analyze --> V4 + + V4 --> Coal + V1 --> NonCoal + + Coal --> Reg + NonCoal --> Reg + + style V4 fill:#d1fae5,stroke:#10b981,stroke-width:2px + style Coal fill:#d1fae5,stroke:#10b981,stroke-width:2px + style NonCoal fill:#fee2e2,stroke:#ef4444,stroke-width:2px + + + +.. image:: diagrams/tile_window_4.svg + :alt: Diagram + :align: center + +Memory Access Patterns +---------------------- + +One of TileWindow's key features is generating optimal memory access patterns. The system analyzes the distribution to ensure: + +- **Coalescing**: Adjacent threads access adjacent memory locations +- **Vectorization**: Multiple elements loaded in single instructions +- **Bank conflict avoidance**: Shared memory accesses avoid :ref:`conflicts ` +- **Cache optimization**: Access patterns maximize cache reuse + +**C++ Memory Pattern Analysis:** + +.. code-block:: cpp + + // Analyze memory access pattern for a distribution + template + struct memory_access_analyzer + { + static constexpr bool is_coalesced() + { + // Check if threads in a warp access consecutive memory + return Distribution::check_coalescing_pattern(); + } + + static constexpr index_t vector_size() + { + // Determine optimal vector size (1, 2, 4, 8) + return Distribution::calculate_vector_size(); + } + + static constexpr bool has_bank_conflicts() + { + // Analyze shared memory access pattern + return Distribution::detect_bank_conflicts(); + } + }; + +Window Movement and Updates +--------------------------- + +TileWindow supports efficient window movement for sliding window algorithms. The precomputed coordinate system makes updates more efficient: + +.. code-block:: cpp + + // Sliding window pattern + for (index_t row = 0; row < tensor_height; row += stride) { + for (index_t col = 0; col < tensor_width; col += stride) { + // Update window position - O(1) operation + window.set_window_origin(make_tuple(row, col)); + + // Load from new position + window.load(distributed_data); + + // Process data + process_tile(distributed_data); + + // Store results if needed + output_window.store(distributed_data); + } + } + +Store Operations with Vectorization +----------------------------------- + +Store operations use the same compile-time analysis as loads. The :ref:`LoadStoreTraits ` helps make stores as efficient as loads, with similar vectorization and coalescing benefits: + +.. code-block:: cpp + + template + CK_TILE_DEVICE void store(const DistributedTensor& src_tensor) const + { + using Traits = load_store_traits; + + // Same optimized pattern as load, but in reverse + static_for<0, Traits::num_access, 1>{}([&](auto i_access) { + const auto y_indices = Traits::get_y_indices(i_access); + const auto x_indices = distribution_.calculate_x_from_y(y_indices); + const auto global_indices = add_arrays(origin_, x_indices); + + if constexpr (Traits::scalar_per_vector > 1) { + // Vectorized store + const auto vector_data = src_tensor.template get_vectorized_elements( + y_indices, Traits::vector_dim_y); + tensor_view_.template set_vectorized_elements( + global_indices, vector_data, Traits::vector_dim_y); + } else { + // Scalar store + const auto scalar_data = src_tensor.get_element(y_indices); + tensor_view_.set_element(global_indices, scalar_data); + } + }); + } + +Complete Load-Compute-Store Pipeline +------------------------------------ + +.. 
code-block:: cpp + + template + __global__ void gemm_kernel_with_windows( + const AType* __restrict__ a_ptr, + const BType* __restrict__ b_ptr, + CType* __restrict__ c_ptr, + index_t M, index_t N, index_t K) + { + // Create tensor views + auto a_tensor = make_naive_tensor_view( + a_ptr, make_tuple(M, K), make_tuple(K, 1)); + auto b_tensor = make_naive_tensor_view( + b_ptr, make_tuple(K, N), make_tuple(N, 1)); + auto c_tensor = make_naive_tensor_view( + c_ptr, make_tuple(M, N), make_tuple(N, 1)); + + // Define tile sizes + constexpr index_t tile_m = 32; + constexpr index_t tile_n = 32; + constexpr index_t tile_k = 8; + + // Create distributions + auto a_dist = make_static_tile_distribution<...>(); + auto b_dist = make_static_tile_distribution<...>(); + auto c_dist = make_static_tile_distribution<...>(); + + // Calculate tile position + const index_t block_m = blockIdx.y * tile_m; + const index_t block_n = blockIdx.x * tile_n; + + // Create tile windows + auto a_window = make_tile_window( + a_tensor, + make_tuple(tile_m, tile_k), + make_tuple(block_m, 0), + a_dist); + + auto b_window = make_tile_window( + b_tensor, + make_tuple(tile_k, tile_n), + make_tuple(0, block_n), + b_dist); + + auto c_window = make_tile_window( + c_tensor, + make_tuple(tile_m, tile_n), + make_tuple(block_m, block_n), + c_dist); + + // Create distributed tensors for register storage + // See :ref:`ck_tile_static_distributed_tensor` for details + auto a_reg = make_static_distributed_tensor(a_dist); + auto b_reg = make_static_distributed_tensor(b_dist); + auto c_reg = make_static_distributed_tensor(c_dist); + + // Initialize accumulator + c_reg.clear(); + + // Main GEMM loop + for(index_t k = 0; k < K; k += tile_k) { + // Update window positions + a_window.set_window_origin(make_tuple(block_m, k)); + b_window.set_window_origin(make_tuple(k, block_n)); + + // Load tiles - LoadStoreTraits ensures optimal pattern + a_window.load(a_reg); + b_window.load(b_reg); + + // Compute + gemm(a_reg, b_reg, c_reg); + } + + // Store result - using same optimized pattern + c_window.store(c_reg); + } + +Performance Characteristics +--------------------------- + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph LR + subgraph "Memory Access Optimization" + V["Vectorization
4x fewer transactions"] + C["Coalescing
32x bandwidth efficiency"] + P["Precomputation
Zero overhead addressing"] + S["Space-filling
Optimal cache usage"] + end + + subgraph "Hardware Utilization" + BW["Memory Bandwidth
Near 100% utilization"] + L["Latency Hiding
Overlapped operations"] + R["Register Reuse
Minimal spills"] + end + + V --> BW + C --> BW + P --> L + S --> R + + style V fill:#e3f2fd,stroke:#1976d2,stroke-width:2px + style C fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + style BW fill:#d1fae5,stroke:#10b981,stroke-width:3px + + + +.. image:: diagrams/tile_window_5.svg + :alt: Diagram + :align: center +Best Practices +-------------- + +Window Size Selection +~~~~~~~~~~~~~~~~~~~~~ + +Choose window sizes that balance register usage with data reuse: + +.. code-block:: cpp + + // Optimal window size calculation + template + constexpr auto calculate_optimal_window_size() + { + // Consider register constraints + constexpr index_t elements_per_thread = RegistersPerThread / sizeof(DataType); + + // Common tile sizes that work well + constexpr array common_sizes = {8, 16, 32, 64, 128}; + + // Find largest size that fits in registers + for (auto size : common_sizes) { + if (size * size <= elements_per_thread) { + return size; + } + } + return 8; // Minimum reasonable size + } + +Access Pattern Optimization +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Design distributions for optimal memory access: + +.. code-block:: cpp + + // Create distribution optimized for coalescing + // This example shows a 32x32 tile distributed across threads + using OptimalDistribution = tile_distribution_encoding< + sequence<>, // RsLengths: No replication + tuple, sequence<4, 8>>, // HsLengthss: Hierarchical decomposition + tuple, sequence<1, 2>>, // Ps2RHssMajor: P to RH major mapping + tuple, sequence<1, 0>>, // Ps2RHssMinor: P to RH minor mapping + sequence<1, 1, 2, 2>, // Ys2RHsMajor: Y to RH major mapping + sequence<0, 1, 0, 1> // Ys2RHsMinor: Y to RH minor mapping + >; + +Summary +------- + +TileWindow provides: + +- **Automatic optimization**: Generates optimal memory access patterns through :ref:`LoadStoreTraits ` +- **Distribution awareness**: Works seamlessly with :ref:`TileDistribution ` +- **Space-filling curves**: Optimizes traversal order for cache efficiency (see :ref:`ck_tile_space_filling_curve`) +- **Vectorization**: Automatic multi-element operations +- **Precomputation**: Zero-overhead :ref:`coordinate transformations ` +- **Flexible windowing**: Supports various access patterns and window configurations +- **Safety**: Automatic boundary handling + +Key benefits: + +1. **Performance**: Improves memory bandwidth through coalescing and vectorization +2. **Productivity**: Reduces reliance manual memory management code +3. **Correctness**: Automatic boundary checking and handling +4. **Composability**: Integrates with other CK abstractions +5. **Intelligence**: LoadStoreTraits analyzes and optimizes access + +The TileWindow abstraction helps build high-performance GPU kernels, providing an interface for complex memory access patterns while helping maintain performance gains. The compile-time analysis performed by LoadStoreTraits ensures that memory operations are as efficient as possible, while the space-filling curve traversal maximizes cache utilization. 
+ +Next Steps +---------- + + +- :ref:`ck_tile_load_store_traits` - Deep dive into access pattern optimization +- :ref:`ck_tile_space_filling_curve` - Advanced traversal patterns +- :ref:`ck_tile_static_distributed_tensor` - Register-based tensor storage +- :ref:`ck_tile_lds_index_swapping` - Advanced shared memory optimization +- :ref:`ck_tile_sweep_tile` - Efficient tile-based algorithms diff --git a/docs/conceptual/ck_tile/transforms.rst b/docs/conceptual/ck_tile/transforms.rst new file mode 100644 index 0000000000..63b830563e --- /dev/null +++ b/docs/conceptual/ck_tile/transforms.rst @@ -0,0 +1,769 @@ +.. _ck_tile_transforms: + +Individual Transform Operations +=============================== + +The transformation engine is built from individual transform types that each handle specific coordinate conversions. + +What Are Transforms? +-------------------- + +Transform operations convert coordinates between different dimensional spaces. Each transform operates between two :ref:`coordinate spaces `: + +- **Lower Dimension Space**: The source coordinate system +- **Upper Dimension Space**: The target coordinate system + +Transform Direction +~~~~~~~~~~~~~~~~~~~ + +Transforms work bidirectionally: + +- **Forward Transform**: Converts coordinates from the lower dimension to the upper dimension +- **Inverse Transform**: Converts coordinates back from the upper dimension to the lower dimension + +Zero-Copy Logical Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Critical Understanding**: All transform operations happen in **logical coordinate space** only. This is a zero-copy system and there is **no data copying or movement** involved. + +- **Data Storage**: The actual tensor data remains stored in memory in linear fashion, exactly as specified by the original tensor shape and strides at creation time. See :ref:`ck_tile_buffer_views` for more information about raw memory access. +- **Logical Mapping**: Transforms create different logical views of the same underlying data and only change how access coordinates are interpreted. See :ref:`ck_tile_tensor_views` for more information about tensor views. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "Tensor Coordinate Transformation" + US["Lower Dimension Space
Source coordinate system"] + LS["Upper Dimension Space
Target coordinate system"] + + DATA["Linear Data in Memory
Layout determined by tensor
shape & strides"] + end + + US -->|"Forward Transform"| LS + LS -->|"Inverse Transform"| US + + DATA -.->|"Same data,
different views"| US + DATA -.->|"Same data,
different views"| LS + + style US fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style LS fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_1.svg + :alt: Diagram + :align: center + +Index Calculation Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The transform system provides two operations for coordinate conversion: + +- **calculate_lower_index()**: Takes a coordinate from the **upper dimension space** and transforms it to get the corresponding index or coordinate in the **lower dimension space**. This calculates where to find the actual tensor element using the transformed coordinate system. + +- **calculate_upper_index()**: Takes a coordinate from the **lower dimension space** and transforms it back to get the corresponding coordinate in the **upper dimension space**. This performs the inverse transformation to recover the original coordinate representation. + +These operations enable bidirectional navigation between different coordinate representations of the same underlying tensor data. + +Transform System Architecture +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + + subgraph "Transform Types" + EMB["EmbedTransform
Linear → Multi-D Strided"] + UNM["MergeTransform
Multi-D → Linear"] + MRG["UnmergeTransform
Linear → Multi-D"] + REP["ReplicateTransform
0D → Multi-D Broadcast"] + OFF["OffsetTransform
Translation"] + PAS["PassThroughTransform
Identity"] + PAD["PadTransform
Boundaries"] + end + + subgraph "Operations" + FWD["Forward
calculate_lower_index()"] + BWD["Backward
calculate_upper_index()"] + UPD["Update
update_lower_index()"] + end + + EMB --> FWD + UNM --> FWD + MRG --> FWD + REP --> FWD + OFF --> FWD + PAS --> FWD + PAD --> FWD + + style FWD fill:#e8f5e9,stroke:#388e3c,stroke-width:2px + + + + + +.. image:: diagrams/transforms_2.svg + :alt: Diagram + :align: center + +MergeTransform +-------------- + +MergeTransform collapses multiple dimensions from the lower coordinate space into a single dimension in the upper coordinate space, effectively reducing the dimensionality of the tensor representation while preserving data relationships. This transform is fundamental to the :ref:`tile distribution system `. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "MergeTransform: Multi-D → Linear" + LS["Lower Coordinate Space
2D: [4, 5]
Coord: (2, 3)"] + US["Upper Coordinate Space
1D Linear
Index: 13"] + + DATA["Same Tensor Data
Layout: row-major
Size: 20 elements"] + end + + LS -->|"Forward Transform
2×5 + 3 = 13"| US + US -->|"Inverse Transform
13÷5=2, 13%5=3"| LS + + DATA -.->|"Multi-dimensional
view"| LS + DATA -.->|"Linear
view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_3.svg + :alt: Diagram + :align: center + + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create MergeTransform for 4x5 tensor (20 elements total) + auto transform = make_merge_transform(make_tuple(4, 5)); + + // Forward: Lower (2D) → Upper (1D) - Manual calculation + int row = 2, col = 3; + int linear_index = row * 5 + col; // Result: 13 + printf("2D coord [%d, %d] → Linear index %d\n", row, col, linear_index); + printf("Calculation: %d×5 + %d = %d\n", row, col, linear_index); + + // Inverse: Upper (1D) → Lower (2D) - Using transform + multi_index<1> upper_coord; + upper_coord[number<0>{}] = 13; + + multi_index<2> lower_coord; + transform.calculate_lower_index(lower_coord, upper_coord); + + printf("Linear index %d → 2D coord [%d, %d]\n", + static_cast(upper_coord[number<0>{}]), + static_cast(lower_coord[number<0>{}]), + static_cast(lower_coord[number<1>{}])); + printf("Calculation: 13 ÷ 5 = %d remainder %d\n", + static_cast(lower_coord[number<0>{}]), + static_cast(lower_coord[number<1>{}])); + +UnmergeTransform +---------------- + +UnmergeTransform expands coordinates from a single dimension in the lower coordinate space into multiple dimensions in the upper coordinate space, effectively increasing the dimensionality of the tensor representation while preserving all data relationships. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "UnmergeTransform: Linear → Multi-D" + LS["Lower Coordinate Space
1D Linear
Index: 14"] + US["Upper Coordinate Space
3D: [3, 4, 2]
Coord: (1, 3, 0)"] + + DATA["Same Tensor Data
Layout: row-major
Size: 24 elements"] + end + + LS -->|"Forward Transform
14 = 1×8 + 3×2 + 0"| US + US -->|"Inverse Transform
linearize back"| LS + + DATA -.->|"Linear
view"| LS + DATA -.->|"Multi-dimensional
view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_4.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create UnmergeTransform for 3x4x2 tensor (24 elements total) + auto transform = make_unmerge_transform(make_tuple(3, 4, 2)); + + // Forward: Lower (1D) → Upper (3D) - Manual calculation + int linear_index = 14; + int dim0 = linear_index / (4 * 2); // 14 / 8 = 1 + int remainder = linear_index % (4 * 2); // 14 % 8 = 6 + int dim1 = remainder / 2; // 6 / 2 = 3 + int dim2 = remainder % 2; // 6 % 2 = 0 + + printf("Linear index %d → 3D coord [%d, %d, %d]\n", + linear_index, dim0, dim1, dim2); + printf("Calculation: 14 = %d×8 + %d×2 + %d\n", dim0, dim1, dim2); + + // Inverse: Upper (3D) → Lower (1D) - Using transform + multi_index<3> upper_coord; + upper_coord[number<0>{}] = 1; + upper_coord[number<1>{}] = 3; + upper_coord[number<2>{}] = 0; + + multi_index<1> lower_coord; + transform.calculate_lower_index(lower_coord, upper_coord); + + printf("3D coord [%d, %d, %d] → Linear index %d\n", + static_cast(upper_coord[number<0>{}]), + static_cast(upper_coord[number<1>{}]), + static_cast(upper_coord[number<2>{}]), + static_cast(lower_coord[number<0>{}])); + printf("Calculation: %d×8 + %d×2 + %d = %d\n", + static_cast(upper_coord[number<0>{}]), + static_cast(upper_coord[number<1>{}]), + static_cast(upper_coord[number<2>{}]), + static_cast(lower_coord[number<0>{}])); + +EmbedTransform +-------------- + +EmbedTransform expands linear indices from the lower coordinate space into multi-dimensional coordinates in the upper coordinate space using configurable strides, enabling flexible strided tensor layouts and sub-tensor views within larger buffers. + + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "EmbedTransform: Linear → Multi-D Strided" + LS["Lower Coordinate Space
1D Linear
Index: 14"] + US["Upper Coordinate Space
2D: [2, 3]
Coord: (1, 2)"] + + DATA["Linear Buffer in Memory"] + end + + LS -->|"Forward Transform
Strides: [12, 1]
14 ÷ 12 = 1, 14 % 12 = 2"| US + US -->|"Inverse Transform
1×12 + 2×1 = 14"| LS + + DATA -.->|"Linear
index view"| LS + DATA -.->|"Multi-dimensional
strided view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_5.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create embed transform for 2x3 tensor with strides [12, 1] + // This is commonly used in :ref:`descriptors ` + auto transform = make_embed_transform(make_tuple(2, 3), make_tuple(12, 1)); + + // Forward: Linear → 2D (Manual calculation) + int linear_idx = 14; + int row = linear_idx / 12; // 14 / 12 = 1 + int remainder = linear_idx % 12; // 14 % 12 = 2 + int col = remainder / 1; // 2 / 1 = 2 + printf("Linear index %d → 2D coord [%d, %d]\n", linear_idx, row, col); + + // Inverse: 2D → Linear (Using transform) + multi_index<2> upper_coord; + upper_coord[number<0>{}] = 1; + upper_coord[number<1>{}] = 2; + + multi_index<1> lower_coord; + transform.calculate_lower_index(lower_coord, upper_coord); + printf("2D coord [%d, %d] → Linear index %d\n", + static_cast(upper_coord[number<0>{}]), + static_cast(upper_coord[number<1>{}]), + static_cast(lower_coord[number<0>{}])); + +ReplicateTransform +------------------ + +ReplicateTransform creates a higher-dimensional tensor by replicating (broadcasting) a lower-dimensional tensor. It's essentially a broadcasting operation that takes a tensor with fewer dimensions and logically replicates it across new dimensions without data duplication. An example is taking a scalar (0-dimensional) input and broadcasting it across multiple dimensions, enabling efficient broadcasting patterns where a single value appears at every position in a multi-dimensional coordinate space. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "ReplicateTransform: 0D → Multi-D Broadcasting" + LS["Lower Coordinate Space
0D: Scalar
Empty coordinate []"] + US["Upper Coordinate Space
2D: [3, 4]
All coords: (i, j)"] + + DATA["Single Scalar Value"] + end + + LS -->|"Forward Transform
[] → (i,j) for any i,j"| US + US -->|"Inverse Transform
(i,j) → [] for any i,j"| LS + + DATA -.->|"One scalar
value"| LS + DATA -.->|"Broadcasted view
at all positions"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_6.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create replicate transform for 3x4 broadcasting + auto transform = make_replicate_transform(make_tuple(3, 4)); + + // Inverse: Upper (2D) → Lower (0D) - Using transform + // Any 2D coordinate maps to empty scalar coordinate + multi_index<2> upper_coord; + upper_coord[number<0>{}] = 1; + upper_coord[number<1>{}] = 2; + + multi_index<0> lower_coord; // Empty coordinate (0 dimensions) + transform.calculate_lower_index(lower_coord, upper_coord); + printf("2D [%d, %d] → Empty scalar [] (always empty)\n", + static_cast(upper_coord[number<0>{}]), + static_cast(upper_coord[number<1>{}])); + + // Forward: Scalar → 2D (Conceptual - no coordinate calculation needed) + // Broadcasting: Single scalar value appears at ALL positions + printf("Broadcasting: Scalar value appears at every [i,j] where 0≤i<3, 0≤j<4\n"); + printf("Total positions: 3×4 = 12 positions, all contain same scalar value\n"); + + // Test multiple coordinates - all map to empty scalar + int test_coords[][2] = {{0, 0}, {1, 2}, {2, 3}}; + for(int i = 0; i < 3; i++) + { + multi_index<2> test_upper; + test_upper[number<0>{}] = test_coords[i][0]; + test_upper[number<1>{}] = test_coords[i][1]; + + multi_index<0> test_lower; + transform.calculate_lower_index(test_lower, test_upper); + printf("2D [%d, %d] → Empty scalar []\n", + test_coords[i][0], test_coords[i][1]); + } + +OffsetTransform +--------------- + +OffsetTransform shifts coordinates by a fixed offset, creating a translated view of the coordinate space. It performs translation operations where each coordinate in the upper space is mapped to a coordinate in the lower space by adding a constant offset. + + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "OffsetTransform: 1D → 1D Translation" + LS["Lower Coordinate Space
1D: [0, 63]
Coord: index + offset"] + US["Upper Coordinate Space
1D: [0, 47]
Coord: index"] + + DATA["Linear Buffer in Memory"] + end + + LS -->|"Forward Transform
idx → idx + 16"| US + US -->|"Inverse Transform
idx + 16 → idx"| LS + + DATA -.->|"Lower
view"| LS + DATA -.->|"Upper
view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_7.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Create offset transform for coordinate translation + // CK Tile formula: lower = upper + offset + auto transform = make_offset_transform(48, 16); + + // Using Transform: Original → Translated (adds offset) + multi_index<1> upper_coord; + upper_coord[number<0>{}] = 5; // Original index 5 + + multi_index<1> lower_coord; + transform.calculate_lower_index(lower_coord, upper_coord); + printf("Original index %d → Translated index %d\n", + static_cast(upper_coord[number<0>{}]), + static_cast(lower_coord[number<0>{}])); + printf("Calculation: %d + 16 = %d\n", + static_cast(upper_coord[number<0>{}]), + static_cast(lower_coord[number<0>{}])); + + // Manual Reverse: Translated → Original (subtract offset) + int translated_idx = 21; + int original_idx = translated_idx - 16; + printf("Translated index %d → Original index %d\n", translated_idx, original_idx); + + // Test multiple coordinates + int test_indices[] = {0, 10, 20, 47}; + for(int i = 0; i < 4; i++) + { + multi_index<1> test_upper; + test_upper[number<0>{}] = test_indices[i]; + + multi_index<1> test_lower; + transform.calculate_lower_index(test_lower, test_upper); + printf("Original %d → Translated %d\n", + test_indices[i], static_cast(test_lower[number<0>{}])); + } + +PassThroughTransform - Identity +------------------------------- + +No-op transform that passes coordinates unchanged. The PassThrough transform is the simplest coordinate transformation in CK Tile, implementing a perfect identity mapping where input coordinates are passed through unchanged to the output. This transform is essential as a placeholder in transformation chains and for dimensions that require no modification. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "PassThroughTransform: 1D → 1D Identity" + LS["Lower Coordinate Space
1D: [0, 59]
Coord: index"] + US["Upper Coordinate Space
1D: [0, 59]
Coord: index"] + + DATA["Linear Buffer in Memory"] + end + + LS -.->|"Perfect Identity
idx → idx"| US + US -.->|"Perfect Identity
idx → idx"| LS + + DATA -->|"Same buffer
same view"| LS + DATA -->|"Same buffer
same view"| US + + style LS fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px + style US fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_8.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // Identity transform - no change + int length = 60; + + auto transform = make_pass_through_transform(length); + + printf("Length: %d\n", length); + + // Forward: Upper → Lower (identity) + multi_index<1> upper_coord; + upper_coord[number<0>{}] = 25; + + multi_index<1> lower_coord; + transform.calculate_lower_index(lower_coord, upper_coord); + + printf("\nForward: [%d] → [%d] (unchanged)\n", + static_cast(upper_coord[number<0>{}]), + static_cast(lower_coord[number<0>{}])); + + // Reverse: Lower → Upper (identity) + multi_index<1> lower_input; + lower_input[number<0>{}] = 42; + + multi_index<1> upper_result; + // Note: PassThrough is bidirectional identity, so we can use same function + transform.calculate_lower_index(upper_result, lower_input); + + printf("Reverse: [%d] → [%d] (unchanged)\n", + static_cast(lower_input[number<0>{}]), + static_cast(upper_result[number<0>{}])); + +PadTransform +------------ + +PadTransform adds padding to tensor dimensions, mapping coordinates from upper dimension space (with padding) to lower dimension space (original data). + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "PadTransform: 1D → 1D with Padding" + LS["Lower Coordinate Space
1D: [0, 2] (original data)"] + US["Upper Coordinate Space
1D: [0, 4] (with padding)"] + + DATA["Tensor Data in Memory"] + end + + LS -->|"Forward Transform
idx + left_pad"| US + US -->|"Inverse Transform
idx - left_pad"| LS + + DATA -.->|"Original view"| LS + DATA -.->|"Padded view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_9.svg + :alt: Diagram + :align: center + +**C++ Implementation:** + +.. code-block:: cpp + + using namespace ck_tile; + + // PadTransform for coordinate padding + int low_length = 3; // Original dimension length + int left_pad = 1; // Padding on left + int right_pad = 1; // Padding on right + + auto transform = make_pad_transform(low_length, left_pad, right_pad); + + printf("Low length: %d\n", low_length); + printf("Left pad: %d\n", left_pad); + printf("Right pad: %d\n", right_pad); + printf("Upper length: %d (total with padding)\n", low_length + left_pad + right_pad); + + // Test coordinate mapping + int test_coords[] = {0, 1, 2, 3, 4}; + for(int i = 0; i < 5; i++) + { + multi_index<1> upper; + upper[number<0>{}] = test_coords[i]; + + multi_index<1> lower; + transform.calculate_lower_index(lower, upper); + + int adjusted_idx = static_cast(lower[number<0>{}]); + bool is_valid = (adjusted_idx >= 0 && adjusted_idx < low_length); + + printf("Upper %d → Lower %d (%s)\n", + test_coords[i], adjusted_idx, + is_valid ? "valid" : "padding"); + } + +Additional Transform Types +-------------------------- + +XorTransform +~~~~~~~~~~~~ + +XorTransform applies a 2D XOR mapping for specialized memory access patterns. It performs XOR operations on coordinates to create transformed memory layouts for specific algorithmic optimizations, particularly useful for avoiding :ref:`LDS bank conflicts `. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "XorTransform: 2D → 2D XOR Mapping" + LS["Lower Coordinate Space
2D: [4, 8]
XOR-transformed coords"] + US["Upper Coordinate Space
2D: [4, 8]
Normal coords"] + + DATA["Same Tensor Data"] + end + + LS -->|"Forward Transform
apply XOR reverse"| US + US -->|"Inverse Transform
apply XOR mapping"| LS + + DATA -.->|"XOR pattern
view"| LS + DATA -.->|"Normal
view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_10.svg + :alt: Diagram + :align: center + +SliceTransform +~~~~~~~~~~~~~~ + +SliceTransform extracts a sub-region from a tensor dimension. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "SliceTransform: 1D → 1D Sub-region" + LS["Lower Coordinate Space
1D: [0, 9] (original range)"] + US["Upper Coordinate Space
1D: [0, 4] (slice range)"] + + DATA["Tensor Data in Memory"] + end + + LS -->|"Forward Transform
idx + slice_begin"| US + US -->|"Inverse Transform
idx - slice_begin"| LS + + DATA -.->|"Full tensor
view"| LS + DATA -.->|"Sub-region
view"| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + + + +.. image:: diagrams/transforms_11.svg + :alt: Diagram + :align: center + +ModuloTransform +~~~~~~~~~~~~~~~ + +ModuloTransform applies cyclic wrapping to coordinates using modulo operations. + +.. + Original mermaid diagram (edit here, then run update_diagrams.py) + + .. mermaid:: + + graph TB + subgraph "ModuloTransform: 1D → 1D Cyclic" + LS["Lower Coordinate Space
1D: [0, 3] (modulus range)"] + US["Upper Coordinate Space
1D: [0, 15] (full range)"] + + DATA["Tensor Data in Memory"] + end + + LS -->|"Forward Transform
idx * cycle_count"| US + US -->|"Inverse Transform
idx % modulus"| LS + + DATA -.->|" "| LS + DATA -.->|" "| US + + style LS fill:#e3f2fd,stroke:#1976d2,stroke-width:3px + style US fill:#fff3e0,stroke:#f57c00,stroke-width:3px + style DATA fill:#f0f9ff,stroke:#0284c7,stroke-width:2px,stroke-dasharray: 5 5 + + + +.. image:: diagrams/transforms_12.svg + :alt: Diagram + :align: center + +Summary +------- + +Individual transforms provide: + +- **Modularity**: Each transform does one thing +- **Composability**: Chain transforms for complex mappings (see :ref:`ck_tile_adaptors`) +- **Efficiency**: Compile-time optimization in C++ +- **Flexibility**: Handle any coordinate conversion need + +These transforms enable you to: + +1. Create custom tensor views +2. Implement efficient data access patterns +3. Handle padding and boundaries correctly +4. Optimize memory layouts for :ref:`GPU access ` + +The C++ implementations in Composable Kernel provide: + +- Zero-overhead abstractions through templates +- Compile-time composition and optimization +- Support for complex coordinate transformations +- Integration with GPU kernel generation +- Foundation for :ref:`tile windows ` and :ref:`load/store traits ` + +Next Steps +---------- + +- :ref:`ck_tile_adaptors` - How to chain transforms together +- :ref:`ck_tile_descriptors` - Complete tensor descriptions with transforms +- :ref:`ck_tile_tile_window` - Using transforms for efficient data loading +- :ref:`ck_tile_thread_mapping` - How transforms enable thread-to-data mapping +- :ref:`ck_tile_gemm_optimization` - Practical application in GEMM kernels diff --git a/docs/conceptual/ck_tile/update_diagrams.py b/docs/conceptual/ck_tile/update_diagrams.py new file mode 100644 index 0000000000..2fbe2ef5a9 --- /dev/null +++ b/docs/conceptual/ck_tile/update_diagrams.py @@ -0,0 +1,215 @@ +#!/usr/bin/env python3 +""" +Helper script to update SVG diagrams from commented mermaid sources in RST files. + +This script scans RST files for commented mermaid blocks (created by convert_mermaid_to_svg.py) +and regenerates the corresponding SVG files when the source has been modified. + +Usage: + python update_diagrams.py # Update all diagrams + python update_diagrams.py # Update diagrams in a specific file +""" + +import os +import re +import subprocess +import sys +import tempfile +from pathlib import Path + +# Configuration +DOCS_DIR = Path(__file__).parent +DIAGRAMS_DIR = DOCS_DIR / "diagrams" + +# Pattern to find commented mermaid blocks followed by image references +COMMENTED_MERMAID_PATTERN = re.compile( + r"\.\.\s*\n" # Comment start + r"(?: .*\n|\s*\n)*?" # Comment description lines (may have blank lines) + r"( \.\. mermaid::\s*\n" # Commented mermaid directive + r"(?: \n| .*\n|\s*\n)*?)" # Mermaid content (including blank lines) + r"\.\. image:: diagrams/([^\s]+)", # Image reference + re.MULTILINE, +) + + +def extract_mermaid_from_comment(commented_block): + """Extract mermaid code from a commented block.""" + # Remove the comment indentation (3 spaces at start of each line) + lines = commented_block.split("\n") + content_lines = [] + + for line in lines: + if line.startswith(" "): + # Remove the 3-space comment indentation + content_lines.append(line[3:]) + elif line.strip() == "": + content_lines.append("") + + # Now we have the mermaid block, extract the actual mermaid code + mermaid_content = "\n".join(content_lines) + + # Remove the ".. mermaid::" directive and extract the indented content + mermaid_match = re.search( + r"\.\. 
mermaid::\s*\n((?:(?:\n| .*))*)", mermaid_content + ) + if mermaid_match: + mermaid_code = mermaid_match.group(1) + # Remove RST indentation from mermaid code + code_lines = [] + for line in mermaid_code.split("\n"): + if line.startswith(" "): + code_lines.append(line[3:]) + elif line.strip() == "": + code_lines.append("") + return "\n".join(code_lines).strip() + + return None + + +def convert_mermaid_to_svg(mermaid_code, output_path): + """Convert mermaid code to SVG using mmdc.""" + # Create a temporary file for the mermaid code + with tempfile.NamedTemporaryFile( + mode="w", suffix=".mmd", delete=False, encoding="utf-8" + ) as tmp: + tmp.write(mermaid_code) + tmp_path = tmp.name + + try: + # Run mmdc to convert to SVG + subprocess.run( + [ + "mmdc", + "-i", + tmp_path, + "-o", + str(output_path), + "-t", + "neutral", + "-b", + "transparent", + ], + capture_output=True, + text=True, + check=True, + shell=True, # Required for Windows .cmd files + ) + return True, None + except subprocess.CalledProcessError as e: + return False, e.stderr + finally: + # Clean up temp file + os.unlink(tmp_path) + + +def process_file(file_path, force_update=False): + """Process a single RST file to update diagrams.""" + print(f"Checking {file_path.name}...") + + with open(file_path, "r", encoding="utf-8") as f: + content = f.read() + + # Find all commented mermaid blocks + matches = list(COMMENTED_MERMAID_PATTERN.finditer(content)) + + if not matches: + print(" No commented mermaid diagrams found.") + return 0, 0 + + updated_count = 0 + error_count = 0 + + for match in matches: + commented_mermaid = match.group(1) + svg_filename = match.group(2) + svg_path = DIAGRAMS_DIR / svg_filename + + # Extract mermaid code + mermaid_code = extract_mermaid_from_comment(commented_mermaid) + if not mermaid_code: + print(f" ⚠ Could not extract mermaid code for {svg_filename}") + error_count += 1 + continue + + # Check if SVG needs updating + needs_update = force_update or not svg_path.exists() + + if not needs_update: + # For a more sophisticated check, we could hash the mermaid code + # and compare with a stored hash, but for simplicity we just check existence + print(f" ✓ {svg_filename} exists (use --force to regenerate)") + continue + + # Generate SVG + success, error = convert_mermaid_to_svg(mermaid_code, svg_path) + + if success: + print(f" ✓ Updated: {svg_filename}") + updated_count += 1 + else: + print(f" ✗ Error updating {svg_filename}: {error}") + error_count += 1 + + return updated_count, error_count + + +def find_rst_files(): + """Find all RST files in the CK tile docs directory.""" + return list(DOCS_DIR.glob("*.rst")) + + +def main(): + """Main function.""" + print("CK Tile Diagram Updater") + print("=" * 50) + + # Verify mmdc is available + try: + subprocess.run( + ["mmdc", "--version"], capture_output=True, check=True, shell=True + ) + except (subprocess.CalledProcessError, FileNotFoundError): + print("Error: mermaid-cli (mmdc) not found. 
Please install it:") + print(" npm install -g @mermaid-js/mermaid-cli") + return 1 + + # Ensure diagrams directory exists + DIAGRAMS_DIR.mkdir(parents=True, exist_ok=True) + + # Parse command line arguments + force_update = "--force" in sys.argv or "-f" in sys.argv + specific_file = None + + for arg in sys.argv[1:]: + if arg not in ["--force", "-f"] and arg.endswith(".rst"): + specific_file = DOCS_DIR / arg + if not specific_file.exists(): + print(f"Error: File not found: {arg}") + return 1 + + # Get files to process + if specific_file: + files_to_process = [specific_file] + else: + files_to_process = find_rst_files() + + # Process files + total_updated = 0 + total_errors = 0 + + for file_path in files_to_process: + updated, errors = process_file(file_path, force_update) + total_updated += updated + total_errors += errors + + print("\n" + "=" * 50) + print("✓ Update complete!") + print(f" Updated: {total_updated} diagram(s)") + if total_errors > 0: + print(f" Errors: {total_errors}") + + return 0 if total_errors == 0 else 1 + + +if __name__ == "__main__": + exit(main()) diff --git a/docs/index.rst b/docs/index.rst index c28eb646b5..865914ab4c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -25,6 +25,7 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab * :doc:`Composable Kernel structure <./conceptual/Composable-Kernel-structure>` * :doc:`Composable Kernel mathematical basis <./conceptual/Composable-Kernel-math>` + * :doc:`CK Tile conceptual documentation <./conceptual/ck_tile/index>` .. grid-item-card:: Tutorials diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 33ad8d91f8..c82e07ced8 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -18,6 +18,8 @@ subtrees: title: Composable Kernel structure - file: conceptual/Composable-Kernel-math.rst title: Composable Kernel mathematical basis + - file: conceptual/ck_tile/index.rst + title: CK Tile conceptual documentation - caption: Tutorial entries: