A hands-on video series where we implement the Polish language model Bielik 1.5B from scratch using custom GPU kernels written in Triton. Every component - from matrix multiplication to text generation - is built step by step, optimized, and benchmarked against PyTorch.
Model: Bielik-1.5B-v3.0-Instruct (1.6B parameters, Polish)
| # | Episode | Key Result | Doc |
|---|---|---|---|
| 01 | Introduction - Bielik Architecture and Triton | Architecture overview, GQA, SwiGLU, why Triton | link |
| 02 | Matmul - Heart of the Transformer | Tiled matmul with Tensor Cores, matching PyTorch perf | link |
| 03 | Fused kernels - RMSNorm & Softmax | Fused single-pass RMSNorm and Softmax with causal mask | link |
| 04 | RoPE | RoPE - Rotary Position Embedding | link |
| 05 | Flash Attention v2 | Flash Attention | link |
| 06 | SwiGLU FFN | SwiGLU Feed Forward Network | link |
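To give a feel for the tiling idea behind episode 02, here is a minimal CPU-only sketch in plain Python. It is not the kernel from the series: the function name `tiled_matmul` and the `block` parameter are hypothetical, chosen only for illustration. It shows the loop structure a tiled GPU kernel parallelizes: the output is split into tiles, and each tile accumulates partial products while stepping through the shared K dimension in chunks.

```python
def tiled_matmul(a, b, block=2):
    """Multiply a (M x K) by b (K x N), one output tile at a time.

    Illustrative sketch only: `block` is a hypothetical tile size and the
    loops run sequentially on the CPU.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    c = [[0.0] * n for _ in range(m)]

    # Each (i0, j0) pair identifies one output tile. On a GPU, tiles like
    # these would be handled by separate program instances in parallel.
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # Step through the shared K dimension in chunks of `block`,
            # adding each chunk's partial products to the tile.
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds play the role that boundary masks play in a GPU kernel: they let the matrix dimensions be any size, not only multiples of the tile size. With small integer inputs the result is the same for every `block` value; only the order in which the work is visited changes.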
- How transformers work at the GPU instruction level
- Writing high-performance Triton kernels from scratch
- Tiling, Tensor Cores, kernel fusion, auto-tuning
- Python and basic ML/neural network knowledge
- A general understanding of how transformers work (helpful but not required)
- An NVIDIA GPU with CUDA support
embers/
├── kernels/ # Triton GPU kernels
│ ├── matmul/ # Matrix multiplication variants
├── benchmarks/ # Performance benchmarks
│ ├── matmul/ # Benchmarks for matmul kernels
└── docs/ # Episode docs
# Clone the repository
git clone https://github.com/qooba/bielik-anatomy-triton
cd bielik-anatomy-triton
# Install dependencies
pip install -r requirements.txt