Back to Series Overview | Next: Matmul - Heart of the Transformer
The opening episode of the Bielik Anatomy series. We look at the architecture of Bielik 1.5B - a Polish language model with 1.6 billion parameters - and explain why Triton is the tool of choice for writing custom GPU kernels.
- What is Bielik - a 1.6B-parameter Polish LLM based on a Qwen-like architecture, created by the SpeakLeash community
- Architecture walkthrough - embedding layer, 32 decoder layers, final RMSNorm, language model head
- Grouped Query Attention (GQA) - 12 query heads sharing only 2 KV heads, which cuts the Q/K/V projection parameters by roughly 55% compared with standard multi-head attention
- SwiGLU activation - the gated feed-forward network used in modern LLMs
- Why Triton over CUDA - Python-like syntax, automatic block management, compiler optimizations, cross-GPU portability
- Triton vs PyTorch - RMSNorm example: one fused Triton kernel replaces the 3-4 separate kernels PyTorch launches, making the operation 2-3x faster
- Series roadmap - basic kernels, attention, FFN, text generation
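
The 55% figure quoted for GQA can be checked with a little arithmetic. The sketch below counts the weights of the Q, K and V projections for full multi-head attention and for the 12-query / 2-KV layout. The head counts come from the text; `HIDDEN` and `HEAD_DIM` are assumed values picked for illustration, not read from the Bielik config, and reading the figure as "Q/K/V projections only" is also an assumption.

```python
# Head counts are taken from the text; the two sizes below are assumptions
# used only to make the arithmetic concrete.
HIDDEN = 1536      # assumed model width (hypothetical)
HEAD_DIM = 128     # assumed per-head dimension (hypothetical)
N_Q_HEADS = 12     # from the text
N_KV_HEADS = 2     # from the text


def qkv_params(hidden, head_dim, n_q_heads, n_kv_heads):
    """Weight count of the Q, K and V projections (biases ignored)."""
    q = hidden * n_q_heads * head_dim
    k = hidden * n_kv_heads * head_dim
    v = hidden * n_kv_heads * head_dim
    return q + k + v


# Multi-head attention: every query head has its own K and V head.
mha = qkv_params(HIDDEN, HEAD_DIM, N_Q_HEADS, N_Q_HEADS)
# Grouped query attention: 6 query heads share each K/V head.
gqa = qkv_params(HIDDEN, HEAD_DIM, N_Q_HEADS, N_KV_HEADS)
reduction = 1 - gqa / mha

print(f"MHA: {mha:,}  GQA: {gqa:,}  reduction: {reduction:.1%}")
```

The ratio works out to 1 - 16/36, about 55.6%, and does not depend on the assumed sizes as long as the head dimension is the same in both layouts. The key/value cache used during generation shrinks by the same 12-to-2 factor, since it stores one entry per KV head.
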
- Bielik uses modern architectural improvements (GQA, SwiGLU, RMSNorm) that we will implement one by one
- Triton lets you write high-performance GPU kernels without the boilerplate of raw CUDA
- The series is a learning journey - building every component from scratch
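
As a rough illustration of two of the components named above, here is a plain-Python CPU reference for RMSNorm and a SwiGLU feed-forward block. It is a minimal sketch, not a Triton kernel and not the series' own code: the function names, the `eps` default, and the tiny weight matrices are all hypothetical. A reference like this can serve as ground truth when testing a GPU kernel.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)             # step 1: mean of squares
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)             # step 2: reciprocal RMS
    return [v * inv_rms * w for v, w in zip(x, weight)]  # step 3: normalize, scale


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # gated branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)                 # project back down


# Toy inputs (hypothetical values, far smaller than any real layer).
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
ffn_out = swiglu_ffn([0.0, 1.0], identity, identity, w_down=[[1.0, 1.0]])
print(normed, ffn_out)
```

With unit weights and `eps=0.0`, the normalized vector has a mean square of exactly 1, which is a quick sanity check for any RMSNorm implementation. The three commented steps inside `rms_norm` are the kind of work that a fused kernel performs in a single pass over the data instead of in separate launches.
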