This repository collects GPU kernel practice projects. It is meant for learning by implementing small Triton kernels by hand on RTX 4090 / RTX 5090-class GPUs. The projects are derived from reading public README files, tests, and API wrappers in production kernel repositories, but the teaching code intentionally avoids copying production kernels.
The implementation stack is PyTorch plus ordinary Triton. The exercises use concepts such as tl.program_id, tl.arange, pointer arithmetic, masks, tl.load, tl.store, tl.dot, and simple atomics. They intentionally exclude TileLang, SM90/SM100-only TMA, WGMMA, CuTe/CUTLASS template kernels, DeepGEMM production JIT backends, NVSHMEM/RDMA, and production distributed runtime dependencies.
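Every exercise builds on that same handful of primitives. As a minimal illustration (this kernel is not part of the repository), a Triton vector add already exercises most of them:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # per-lane indices for pointer arithmetic
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)                # masked global load
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # masked global store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```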
```
awesome-kernel-tutorial
├── TILEKERNEL_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
├── FlashMLA_PRACTICE
│   ├── docs
│   ├── tutorial_kernels
│   ├── tests/tutorial
│   └── benchmarks
└── DeepGEMM_PRACTICE
    ├── docs
    ├── tutorial_kernels
    ├── tests/tutorial
    └── benchmarks
```
Each practice project has complete PyTorch reference functions, Triton TODO skeletons, pytest acceptance tests, benchmark skeletons, and lecture-style tutorials. The reference functions are the correctness spec. The Triton wrappers initially raise NotImplementedError, so the full test suites are expected to fail until the corresponding kernels are implemented.
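The names below are hypothetical, but every project follows this shape: a PyTorch reference that defines correct behavior sits next to a Triton wrapper that starts life as a TODO:

```python
import torch

def transpose_ref(x: torch.Tensor) -> torch.Tensor:
    """PyTorch reference: the correctness spec the tests compare against."""
    return x.transpose(-2, -1).contiguous()

def transpose_triton(x: torch.Tensor) -> torch.Tensor:
    """Triton wrapper: shipped as a skeleton, so acceptance tests fail at first."""
    raise NotImplementedError("TODO: implement the Triton transpose kernel")
```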
| Project | Tutorial | Topic |
|---|---|---|
| TILEKERNEL_PRACTICE | docs/triton_transpose.md | 2D transpose and batched transpose |
| TILEKERNEL_PRACTICE | docs/triton_fp8_quantization.md | per-token, per-channel, per-block FP8-like quantization |
| TILEKERNEL_PRACTICE | docs/triton_moe_routing.md | MoE routing kernels: top-k, count, normalize, expand, reduce |
| FlashMLA_PRACTICE | docs/sparse_mla_prefill.md | sparse MLA prefill and online softmax |
| FlashMLA_PRACTICE | docs/dense_paged_mla_decoding.md | dense/paged MLA decoding with block table and cache lengths |
| DeepGEMM_PRACTICE | docs/bf16_gemm.md | BF16 NT GEMM, optional C accumulate |
| DeepGEMM_PRACTICE | docs/moe_grouped_gemm.md | contiguous and masked MoE grouped GEMM |
| DeepGEMM_PRACTICE | docs/scaled_fp8_gemm.md | simplified scaled FP8/int8-like GEMM |
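As a taste of the reference side of these tutorials, a per-token FP8-like quantization reference might look roughly like this. This is a sketch assuming the e4m3 max of 448, not the repository's actual code:

```python
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_per_token_ref(x: torch.Tensor):
    # One scale per token (row): map the row's absolute max onto FP8_MAX.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX)  # "FP8-like": values kept in float
    return q, scale
```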
Use a Python environment with PyTorch, Triton, and pytest installed. CUDA is required for the Triton acceptance tests and benchmark scripts. Verify the environment first:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
python -c "import triton; print(triton.__version__)"Run these commands from the repository root after cloning:
```
cd E:\GPUKernel\awesome-kernel-tutorial
cd .\TILEKERNEL_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
cd ..\FlashMLA_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
cd ..\DeepGEMM_PRACTICE
python -m compileall tutorial_kernels
python -B -m pytest tests --collect-only -q -o cacheprovider.enabled=false
```

Reference tests should pass before implementing any Triton TODO kernel.
```
cd E:\GPUKernel\awesome-kernel-tutorial\TILEKERNEL_PRACTICE
python -B -m pytest tests/tutorial/test_transpose_ref.py tests/tutorial/test_batched_transpose_ref.py tests/tutorial/test_quantize_with_scale_ref.py tests/tutorial/test_cast_back_ref.py tests/tutorial/test_swiglu_ref.py tests/tutorial/test_stable_topk_ref.py tests/tutorial/test_topk_gate_ref.py tests/tutorial/test_build_mapping_ref.py -q -o cacheprovider.enabled=false

cd E:\GPUKernel\awesome-kernel-tutorial\FlashMLA_PRACTICE
python -B -m pytest tests/tutorial/test_sparse_mla_prefill_ref.py tests/tutorial/test_sparse_mla_prefill_full_ref.py tests/tutorial/test_scaled_attention_one_batch_ref.py tests/tutorial/test_dense_paged_mla_decode_ref.py -q -o cacheprovider.enabled=false

cd E:\GPUKernel\awesome-kernel-tutorial\DeepGEMM_PRACTICE
python -B -m pytest tests/tutorial/test_bf16_gemm_ref.py tests/tutorial/test_grouped_gemm_contiguous_ref.py tests/tutorial/test_grouped_gemm_masked_ref.py tests/tutorial/test_quantize_fp8_like_ref.py tests/tutorial/test_dequantize_fp8_like_ref.py tests/tutorial/test_scaled_fp8_gemm_ref.py tests/tutorial/test_generators.py tests/tutorial/test_utils.py -q -o cacheprovider.enabled=false
```

Start from the tutorial document, then read the PyTorch reference, then inspect the Triton skeleton and tests. Each tutorial contains a Function-to-test map with exact commands for every public function. A good workflow is to implement one wrapper/kernel path at a time, run only that function's acceptance test, then broaden to the full project.
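When narrowing a run, a pytest `-k` filter works well; the filter below is illustrative, since the exact test names come from each tutorial's Function-to-test map:

```
python -B -m pytest tests/tutorial -k "transpose" -q -o cacheprovider.enabled=false
```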
When benchmarking, use triton.testing.do_bench or torch.cuda.Event with synchronization. CUDA launches are asynchronous, so ordinary Python wall-clock timing around a kernel call is usually misleading. Report latency, effective GB/s, and TFLOP/s using the formulas in each tutorial.
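A minimal timing harness along those lines might look like this; the transpose stand-in and shapes are placeholders, and the real GB/s and TFLOP/s formulas come from each tutorial:

```python
import torch
from triton.testing import do_bench

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

def fn():
    return x.t().contiguous()  # stand-in for the kernel under test

ms = do_bench(fn)  # handles warmup and device synchronization, returns milliseconds
bytes_moved = 2 * x.numel() * x.element_size()  # read input once, write output once
print(f"{ms:.3f} ms, {bytes_moved / (ms * 1e-3) / 1e9:.1f} GB/s")
```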
| Project | Reference tests | Triton acceptance tests |
|---|---|---|
| TILEKERNEL_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
| FlashMLA_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
| DeepGEMM_PRACTICE | pass | fail with NotImplementedError until TODO kernels are implemented |
This is intentional. The repository is an assignment scaffold, not a finished kernel library.