This repository collects GPU kernel practice projects. It is meant for learning by implementing small Triton kernels by hand on RTX 4090 / RTX 5090-class GPUs. The projects are derived from reading public README files, tests, and API wrappers in production kernel repositories, but the teaching code intentionally avoids copying production kernels.
The implementation stack is PyTorch plus ordinary Triton. The exercises use concepts such as tl.program_id, tl.arange, pointer arithmetic, masks, tl.load, tl.store, tl.dot, and simple atomics. They intentionally exclude TileLang, SM90/SM100-only TMA, WGMMA, CuTe/CUTLASS template kernels, DeepGEMM production JIT backends, NVSHMEM/RDMA, and production distributed runtime dependencies.
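Every exercise builds on that same handful of primitives. As a minimal illustration (this kernel is not part of the repository), a Triton vector add already exercises most of them:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # per-lane indices for pointer arithmetic
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)                # masked global load
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # masked global store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```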
```
awesome-kernel-tutorial
├── TILEKERNEL_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
├── FlashMLA_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
└── DeepGEMM_PRACTICE
    ├── docs
    ├── tutorial_kernels
    ├── tests/tutorial
    └── benchmarks
```
Each practice project has complete PyTorch reference functions, Triton TODO skeletons, pytest acceptance tests, benchmark skeletons, and lecture-style tutorials. The reference functions are the correctness spec. The Triton wrappers initially raise NotImplementedError, so the full test suites are expected to fail until the corresponding kernels are implemented.
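The names below are hypothetical, but every project follows this shape: a PyTorch reference that defines correct behavior sits next to a Triton wrapper that starts life as a TODO:

```python
import torch

def transpose_ref(x: torch.Tensor) -> torch.Tensor:
    """PyTorch reference: the correctness spec the tests compare against."""
    return x.transpose(-2, -1).contiguous()

def transpose_triton(x: torch.Tensor) -> torch.Tensor:
    """Triton wrapper: shipped as a skeleton, so acceptance tests fail at first."""
    raise NotImplementedError("TODO: implement the Triton transpose kernel")
```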
| Project | Tutorial | Topic |
|---|---|---|
| TILEKERNEL_PRACTICE | docs/triton_transpose.md | 2D transpose and batched transpose |
| TILEKERNEL_PRACTICE | docs/triton_fp8_quantization.md | per-token, per-channel, per-block FP8-like quantization |
| TILEKERNEL_PRACTICE | docs/triton_moe_routing.md | MoE routing kernels: top-k, count, normalize, expand, reduce |
| FlashMLA_PRACTICE | docs/sparse_mla_prefill.md | sparse MLA prefill and online softmax |
| FlashMLA_PRACTICE | docs/dense_paged_mla_decoding.md | dense/paged MLA decoding with block table and cache lengths |
| DeepGEMM_PRACTICE | docs/bf16_gemm.md | BF16 NT GEMM, optional C accumulate |
| DeepGEMM_PRACTICE | docs/moe_grouped_gemm.md | contiguous and masked MoE grouped GEMM |
| DeepGEMM_PRACTICE | docs/scaled_fp8_gemm.md | simplified scaled FP8/int8-like GEMM |
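As a taste of the reference side of these tutorials, a per-token FP8-like quantization reference might look roughly like this. This is a sketch assuming the e4m3 max of 448, not the repository's actual code:

```python
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_per_token_ref(x: torch.Tensor):
    # One scale per token (row): map the row's absolute max onto FP8_MAX.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX)  # "FP8-like": values kept in float
    return q, scale
```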
Use a Python environment with PyTorch, Triton, and pytest installed. CUDA is required for the Triton acceptance tests and benchmark scripts. Verify the environment first:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
python -c "import triton; print(triton.__version__)"Run these commands from the repository root after cloning:
```
cd E:\GPUKernel\awesome-kernel-tutorial
cd .\TILEKERNEL_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
cd ..\FlashMLA_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
cd ..\DeepGEMM_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
```

Reference tests should pass before implementing any Triton TODO kernel.
```
cd E:\GPUKernel\awesome-kernel-tutorial\TILEKERNEL_PRACTICE
python -B -m pytest tests/tutorial/test_transpose_ref.py tests/tutorial/test_batched_transpose_ref.py tests/tutorial/test_quantize_with_scale_ref.py tests/tutorial/test_cast_back_ref.py tests/tutorial/test_swiglu_ref.py tests/tutorial/test_stable_topk_ref.py tests/tutorial/test_topk_gate_ref.py tests/tutorial/test_build_mapping_ref.py -q -o cacheprovider.enabled=false

cd E:\GPUKernel\awesome-kernel-tutorial\FlashMLA_PRACTICE
python -B -m pytest tests/tutorial/test_sparse_mla_prefill_ref.py tests/tutorial/test_sparse_mla_prefill_full_ref.py tests/tutorial/test_scaled_attention_one_batch_ref.py tests/tutorial/test_dense_paged_mla_decode_ref.py -q -o cacheprovider.enabled=false

cd E:\GPUKernel\awesome-kernel-tutorial\DeepGEMM_PRACTICE
python -B -m pytest tests/tutorial/test_bf16_gemm_ref.py tests/tutorial/test_grouped_gemm_contiguous_ref.py tests/tutorial/test_grouped_gemm_masked_ref.py tests/tutorial/test_quantize_fp8_like_ref.py tests/tutorial/test_dequantize_fp8_like_ref.py tests/tutorial/test_scaled_fp8_gemm_ref.py tests/tutorial/test_generators.py tests/tutorial/test_utils.py -q -o cacheprovider.enabled=false
```

Start from the tutorial document, then read the PyTorch reference, then inspect the Triton skeleton and tests. Each tutorial contains a Function-to-test map with exact commands for every public function. A good workflow is to implement one wrapper/kernel path at a time, run only that function's acceptance test, then broaden to the full project.
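When narrowing a run, a pytest `-k` filter works well; the filter below is illustrative, since the exact test names come from each tutorial's Function-to-test map:

```
python -B -m pytest tests/tutorial -k "transpose" -q -o cacheprovider.enabled=false
```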
When benchmarking, use triton.testing.do_bench or torch.cuda.Event with synchronization. CUDA launches are asynchronous, so ordinary Python wall-clock timing around a kernel call is usually misleading. Report latency, effective GB/s, and TFLOP/s using the formulas in each tutorial.
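A minimal timing harness along those lines might look like this; the transpose stand-in and shapes are placeholders, and the real GB/s and TFLOP/s formulas come from each tutorial:

```python
import torch
from triton.testing import do_bench

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

def fn():
    return x.t().contiguous()  # stand-in for the kernel under test

ms = do_bench(fn)  # handles warmup and device synchronization, returns milliseconds
bytes_moved = 2 * x.numel() * x.element_size()  # read input once, write output once
print(f"{ms:.3f} ms, {bytes_moved / (ms * 1e-3) / 1e9:.1f} GB/s")
```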
| Project | Reference tests | Triton acceptance tests |
|---|---|---|
| TILEKERNEL_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
| FlashMLA_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
| DeepGEMM_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
This is intentional. The repository is an assignment scaffold, not a finished kernel library.