2 changes: 1 addition & 1 deletion examples/python/CuTeDSL/ampere/call_bypass_dlpack.py
@@ -47,7 +47,7 @@

.. code-block:: bash

- python examples/ampere/call_bypass_dlpack.py
+ python examples/python/CuTeDSL/ampere/call_bypass_dlpack.py


It's worth to mention that by-passing dlpack protocol can resolve the issue that dlpack doesn't handle shape-1
2 changes: 1 addition & 1 deletion examples/python/CuTeDSL/ampere/call_from_jit.py
@@ -44,7 +44,7 @@

.. code-block:: bash

- python examples/ampere/call_from_jit.py
+ python examples/python/CuTeDSL/ampere/call_from_jit.py

Default configuration:
- Batch dimension (L): 16
8 changes: 4 additions & 4 deletions examples/python/CuTeDSL/ampere/elementwise_add.py
@@ -118,16 +118,16 @@

.. code-block:: bash

- python examples/ampere/elementwise_add.py --M 3 --N 12
- python examples/ampere/elementwise_add.py --M 1024 --N 512
- python examples/ampere/elementwise_add.py --M 1024 --N 1024 --benchmark --warmup_iterations 2 --iterations 1000
+ python examples/python/CuTeDSL/ampere/elementwise_add.py --M 3 --N 12
+ python examples/python/CuTeDSL/ampere/elementwise_add.py --M 1024 --N 512
+ python examples/python/CuTeDSL/ampere/elementwise_add.py --M 1024 --N 1024 --benchmark --warmup_iterations 2 --iterations 1000

To collect performance with NCU profiler:

.. code-block:: bash

# Don't iterate too many times when profiling with ncu
- ncu python examples/ampere/elementwise_add.py --M 2048 --N 2048 --benchmark --iterations 10 --skip_ref_check
+ ncu python examples/python/CuTeDSL/ampere/elementwise_add.py --M 2048 --N 2048 --benchmark --iterations 10 --skip_ref_check
"""


8 changes: 4 additions & 4 deletions examples/python/CuTeDSL/ampere/elementwise_apply.py
@@ -60,16 +60,16 @@
.. code-block:: bash

# Run with addition operation
- python examples/ampere/elementwise_apply.py --M 1024 --N 512 --op add
+ python examples/python/CuTeDSL/ampere/elementwise_apply.py --M 1024 --N 512 --op add

# Run with multiplication operation
- python examples/ampere/elementwise_apply.py --M 1024 --N 512 --op mul
+ python examples/python/CuTeDSL/ampere/elementwise_apply.py --M 1024 --N 512 --op mul

# Run with subtraction operation
- python examples/ampere/elementwise_apply.py --M 1024 --N 512 --op sub
+ python examples/python/CuTeDSL/ampere/elementwise_apply.py --M 1024 --N 512 --op sub

# Benchmark performance
- python examples/ampere/elementwise_apply.py --M 2048 --N 2048 --op add --benchmark --warmup_iterations 2 --iterations 10
+ python examples/python/CuTeDSL/ampere/elementwise_apply.py --M 2048 --N 2048 --op add --benchmark --warmup_iterations 2 --iterations 10

The example demonstrates how to express complex CUDA kernels with customizable operations
while maintaining high performance through efficient memory access patterns.
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/ampere/flash_attention_v2.py
@@ -67,7 +67,7 @@

.. code-block:: bash

- python examples/ampere/flash_attention_v2.py \
+ python examples/python/CuTeDSL/ampere/flash_attention_v2.py \
--dtype Float16 --head_dim 128 --m_block_size 128 --n_block_size 128 \
--num_threads 128 --batch_size 1 --seqlen_q 1280 --seqlen_k 1536 \
--num_head 16 --softmax_scale 1.0 --is_causal
@@ -81,7 +81,7 @@

.. code-block:: bash

- ncu python examples/ampere/flash_attention_v2.py \
+ ncu python examples/python/CuTeDSL/ampere/flash_attention_v2.py \
--dtype Float16 --head_dim 128 --m_block_size 128 --n_block_size 128 \
--num_threads 128 --batch_size 1 --seqlen_q 1280 --seqlen_k 1536 \
--num_head 16 --softmax_scale 1.0 --is_causal --skip_ref_check
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/ampere/sgemm.py
@@ -66,15 +66,15 @@

.. code-block:: bash

- python examples/ampere/sgemm.py \
+ python examples/python/CuTeDSL/ampere/sgemm.py \
--mnk 8192,8192,8192 \
--a_major m --b_major n --c_major n

To collect performance with NCU profiler:

.. code-block:: bash

- ncu python examples/ampere/sgemm.py \
+ ncu python examples/python/CuTeDSL/ampere/sgemm.py \
--mnk 8192,8192,8192 \
--a_major m --b_major n --c_major n \
--skip_ref_check --iterations 2
2 changes: 1 addition & 1 deletion examples/python/CuTeDSL/ampere/smem_allocator.py
@@ -52,7 +52,7 @@

.. code-block:: bash

- python examples/ampere/smem_allocator.py
+ python examples/python/CuTeDSL/ampere/smem_allocator.py

The example will allocate shared memory, perform tensor operations, and verify the results.
"""
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/ampere/tensorop_gemm.py
@@ -65,7 +65,7 @@

.. code-block:: bash

- python examples/ampere/tensorop_gemm.py \
+ python examples/python/CuTeDSL/ampere/tensorop_gemm.py \
--mnkl 8192,8192,8192,1 --atom_layout_mnk 2,2,1 \
--ab_dtype Float16 \
--c_dtype Float16 --acc_dtype Float32 \
@@ -80,7 +80,7 @@

.. code-block:: bash

- ncu python examples/ampere/tensorop_gemm.py \
+ ncu python examples/python/CuTeDSL/ampere/tensorop_gemm.py \
--mnkl 8192,8192,8192,1 --atom_layout_mnk 2,2,1 \
--ab_dtype Float16 \
--c_dtype Float16 --acc_dtype Float32 \
examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py
@@ -85,7 +85,7 @@

.. code-block:: bash

- python examples/blackwell/dense_blockscaled_gemm_persistent.py \
+ python examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py \
--ab_dtype Float4E2M1FN --sf_dtype Float8E8M0FNU --sf_vec_size 16 \
--c_dtype Float16 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
@@ -95,7 +95,7 @@

.. code-block:: bash

- ncu python examples/blackwell/dense_blockscaled_gemm_persistent.py \
+ ncu python examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py \
--ab_dtype Float4E2M1FN --sf_dtype Float8E8M0FNU --sf_vec_size 16 \
--c_dtype Float16 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/blackwell/dense_gemm.py
@@ -75,7 +75,7 @@

.. code-block:: bash

- python examples/blackwell/dense_gemm.py \
+ python examples/python/CuTeDSL/blackwell/dense_gemm.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
@@ -90,7 +90,7 @@

.. code-block:: bash

- ncu python examples/blackwell/dense_gemm.py \
+ ncu python examples/python/CuTeDSL/blackwell/dense_gemm.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py
@@ -76,7 +76,7 @@

.. code-block:: bash

- python examples/blackwell/dense_gemm_persistent.py \
+ python examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
@@ -86,7 +86,7 @@

.. code-block:: bash

- ncu python examples/blackwell/dense_gemm_persistent.py \
+ ncu python examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
examples/python/CuTeDSL/blackwell/dense_gemm_software_pipeline.py
@@ -74,7 +74,7 @@

.. code-block:: bash

- python examples/blackwell/dense_gemm_software_pipeline.py \
+ python examples/python/CuTeDSL/blackwell/dense_gemm_software_pipeline.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
@@ -89,7 +89,7 @@

.. code-block:: bash

- ncu python examples/blackwell/dense_gemm_software_pipeline.py \
+ ncu python examples/python/CuTeDSL/blackwell/dense_gemm_software_pipeline.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 256,128 --cluster_shape_mn 2,1 \
--mnkl 8192,8192,8192,1 \
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/blackwell/fmha.py
@@ -70,7 +70,7 @@

.. code-block:: bash

- python examples/blackwell/fmha.py \
+ python examples/python/CuTeDSL/blackwell/fmha.py \
--qk_acc_dtype Float32 --pv_acc_dtype Float32 \
--mma_tiler_mn 128,128 \
--q_shape 4,1024,8,64 --k_shape 4,1024,8,64 \
@@ -84,7 +84,7 @@

.. code-block:: bash

- ncu python examples/blackwell/fmha.py \
+ ncu python examples/python/CuTeDSL/blackwell/fmha.py \
--qk_acc_dtype Float32 --pv_acc_dtype Float32 \
--mma_tiler_mn 128,128 \
--q_shape 4,1024,8,64 --k_shape 4,1024,8,64 \
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/blackwell/grouped_gemm.py
@@ -58,7 +58,7 @@

.. code-block:: bash

- python examples/blackwell/grouped_gemm.py \
+ python examples/python/CuTeDSL/blackwell/grouped_gemm.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 128,64 --cluster_shape_mn 1,1 \
--problem_sizes_mnkl "(8192,1280,32,1),(16,384,1536,1),(640,1280,16,1),(640,160,16,1)" \
@@ -72,7 +72,7 @@

.. code-block:: bash

- ncu python examples/blackwell/grouped_gemm.py \
+ ncu python examples/python/CuTeDSL/blackwell/grouped_gemm.py \
--ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32 \
--mma_tiler_mn 128,64 --cluster_shape_mn 1,1 \
--problem_sizes_mnkl "(8192,1280,32,1),(16,384,1536,1),(640,1280,16,1),(640,160,16,1)" \
4 changes: 2 additions & 2 deletions examples/python/CuTeDSL/hopper/dense_gemm.py
@@ -69,7 +69,7 @@

.. code-block:: bash

- python examples/hopper/dense_gemm.py \
+ python examples/python/CuTeDSL/hopper/dense_gemm.py \
--mnkl 8192,8192,8192,1 --tile_shape_mn 128,256 \
--cluster_shape_mn 1,1 --a_dtype Float16 --b_dtype Float16 \
--c_dtype Float16 --acc_dtype Float32 \
@@ -84,7 +84,7 @@

.. code-block:: bash

- ncu python examples/hopper/dense_gemm.py \
+ ncu python examples/python/CuTeDSL/hopper/dense_gemm.py \
--mnkl 8192,8192,8192,1 --tile_shape_mn 128,256 \
--cluster_shape_mn 1,1 --a_dtype Float16 --b_dtype Float16 \
--c_dtype Float16 --acc_dtype Float32 \