Awesome Kernel Tutorial

This repository collects GPU kernel practice projects. It is meant for learning by implementing small Triton kernels by hand on RTX 4090 / RTX 5090-class GPUs. The projects are derived from reading public README files, tests, and API wrappers in production kernel repositories, but the teaching code intentionally avoids copying production kernels.

The implementation stack is PyTorch plus ordinary Triton. The exercises use concepts such as tl.program_id, tl.arange, pointer arithmetic, masks, tl.load, tl.store, tl.dot, and simple atomics. They intentionally exclude TileLang, SM90/SM100-only TMA, WGMMA, CuTe/CUTLASS template kernels, DeepGEMM production JIT backends, NVSHMEM/RDMA, and production distributed runtime dependencies.
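For orientation, here is a minimal sketch of those concepts in a 1D elementwise add. It is not a kernel from this repository: tl.program_id picks a block, tl.arange plus pointer arithmetic builds per-element offsets, and a mask guards the ragged tail for tl.load and tl.store.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # block index along the launch grid
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # per-lane element indices
    mask = offsets < n_elements                            # guard the ragged tail block
    x = tl.load(x_ptr + offsets, mask=mask)                # pointer arithmetic + masked load
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # masked store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out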

Repository Layout

awesome-kernel-tutorial
├── TILEKERNEL_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
├── FlashMLA_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
└── DeepGEMM_PRACTICE
    ├── docs
    ├── tutorial_kernels
    ├── tests/tutorial
    └── benchmarks

Each practice project has complete PyTorch reference functions, Triton TODO skeletons, pytest acceptance tests, benchmark skeletons, and lecture-style tutorials. The reference functions are the correctness spec. The Triton wrappers initially raise NotImplementedError, so the full test suites are expected to fail until the corresponding kernels are implemented.
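A skeleton of this kind typically pairs a reference with an unimplemented wrapper; the names below are illustrative, not the repository's actual APIs.

import torch

def transpose_ref(x: torch.Tensor) -> torch.Tensor:
    # Correctness spec: plain PyTorch, no Triton.
    return x.transpose(-2, -1).contiguous()

def transpose_triton(x: torch.Tensor) -> torch.Tensor:
    # TODO skeleton: replace with a Triton kernel launch that matches transpose_ref.
    raise NotImplementedError("implement the Triton transpose kernel")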

Tutorial Index

Project | Tutorial | Topic
TILEKERNEL_PRACTICE | docs/triton_transpose.md | 2D transpose and batched transpose
TILEKERNEL_PRACTICE | docs/triton_fp8_quantization.md | per-token, per-channel, per-block FP8-like quantization
TILEKERNEL_PRACTICE | docs/triton_moe_routing.md | MoE routing kernels: top-k, count, normalize, expand, reduce
FlashMLA_PRACTICE | docs/sparse_mla_prefill.md | sparse MLA prefill and online softmax
FlashMLA_PRACTICE | docs/dense_paged_mla_decoding.md | dense/paged MLA decoding with block table and cache lengths
DeepGEMM_PRACTICE | docs/bf16_gemm.md | BF16 NT GEMM, optional C accumulate
DeepGEMM_PRACTICE | docs/moe_grouped_gemm.md | contiguous and masked MoE grouped GEMM
DeepGEMM_PRACTICE | docs/scaled_fp8_gemm.md | simplified scaled FP8/int8-like GEMM

Setup Check

Use a Python environment with PyTorch, Triton, and pytest. CUDA is required for Triton acceptance tests and benchmark scripts.

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
python -c "import triton; print(triton.__version__)"

Static Validation

Run these commands from the repository root after cloning:

cd E:\GPUKernel\awesome-kernel-tutorial

cd .\TILEKERNEL_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false

cd ..\FlashMLA_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false

cd ..\DeepGEMM_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false

Reference Smoke Tests

Reference tests should pass before implementing any Triton TODO kernel.

cd E:\GPUKernel\awesome-kernel-tutorial\TILEKERNEL_PRACTICE
python -B -m pytest tests/tutorial/test_transpose_ref.py tests/tutorial/test_batched_transpose_ref.py tests/tutorial/test_quantize_with_scale_ref.py tests/tutorial/test_cast_back_ref.py tests/tutorial/test_swiglu_ref.py tests/tutorial/test_stable_topk_ref.py tests/tutorial/test_topk_gate_ref.py tests/tutorial/test_build_mapping_ref.py -q -o cacheprovider.enabled=false
cd E:\GPUKernel\awesome-kernel-tutorial\FlashMLA_PRACTICE
python -B -m pytest tests/tutorial/test_sparse_mla_prefill_ref.py tests/tutorial/test_sparse_mla_prefill_full_ref.py tests/tutorial/test_scaled_attention_one_batch_ref.py tests/tutorial/test_dense_paged_mla_decode_ref.py -q -o cacheprovider.enabled=false
cd E:\GPUKernel\awesome-kernel-tutorial\DeepGEMM_PRACTICE
python -B -m pytest tests/tutorial/test_bf16_gemm_ref.py tests/tutorial/test_grouped_gemm_contiguous_ref.py tests/tutorial/test_grouped_gemm_masked_ref.py tests/tutorial/test_quantize_fp8_like_ref.py tests/tutorial/test_dequantize_fp8_like_ref.py tests/tutorial/test_scaled_fp8_gemm_ref.py tests/tutorial/test_generators.py tests/tutorial/test_utils.py -q -o cacheprovider.enabled=false

Working Through an Assignment

Start from the tutorial document, then read the PyTorch reference, then inspect the Triton skeleton and tests. Each tutorial contains a Function-to-test map with exact commands for every public function. A good workflow is to implement one wrapper/kernel path at a time, run only that function's acceptance test, then broaden to the full project.
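For example, narrowing pytest to one function before running the whole project (the -k selection here is illustrative; use the exact commands from the tutorial's Function-to-test map):

cd E:\GPUKernel\awesome-kernel-tutorial\TILEKERNEL_PRACTICE
python -B -m pytest tests/tutorial -k "transpose" -q -o cacheprovider.enabled=false
python -B -m pytest tests -q -o cacheprovider.enabled=false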

When benchmarking, use triton.testing.do_bench or torch.cuda.Event with synchronization. CUDA launches are asynchronous, so ordinary Python wall-clock timing around a kernel call is usually misleading. Report latency, effective GB/s, and TFLOP/s using the formulas in each tutorial.
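A minimal timing sketch with triton.testing.do_bench, which handles warmup and device synchronization; the exact bytes-moved and FLOP counts for a given kernel come from the formulas in its tutorial.

import triton

def bench(fn, *args, bytes_moved=None, flops=None):
    ms = triton.testing.do_bench(lambda: fn(*args))  # latency in milliseconds
    print(f"latency: {ms:.3f} ms")
    if bytes_moved is not None:
        print(f"effective bandwidth: {bytes_moved / ms * 1e-6:.1f} GB/s")  # bytes / (ms * 1e-3) / 1e9
    if flops is not None:
        print(f"throughput: {flops / ms * 1e-9:.1f} TFLOP/s")              # flops / (ms * 1e-3) / 1e12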

Expected Initial Status

Project | Reference tests | Triton acceptance tests
TILEKERNEL_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented
FlashMLA_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented
DeepGEMM_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented

This is intentional. The repository is an assignment scaffold, not a finished kernel library.
