Still a work in progress and in a very early stage. This is a tutorial on LLM serving using MLX for system engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
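To give a sense of what "array/matrix APIs only" means, here is a minimal sketch of scaled dot-product attention written directly against `mlx.core`. This is not the course's exact code; the function name, shapes, and the smoke test below are illustrative assumptions.

```python
import math
import mlx.core as mx


def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    """Attention built only from mlx.core array ops (illustrative sketch).

    q, k, v: arrays of shape (..., seq_len, head_dim).
    """
    d = q.shape[-1]
    # (..., L, D) @ (..., D, L) -> (..., L, L) attention scores, scaled by sqrt(head_dim)
    scores = mx.matmul(q, mx.swapaxes(k, -2, -1)) / math.sqrt(d)
    # Normalize scores into attention weights along the key axis
    weights = mx.softmax(scores, axis=-1)
    # Weighted sum of values -> (..., L, D)
    return mx.matmul(weights, v)


# Tiny smoke test with random inputs (shapes chosen arbitrarily)
q = mx.random.normal((1, 4, 8))
k = mx.random.normal((1, 4, 8))
v = mx.random.normal((1, 4, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # expected: (1, 4, 8)
```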
The goal is to learn the techniques behind efficiently serving a large language model (here, the Qwen2 models).
Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.
Why Qwen2: it was the first LLM I interacted with, and it is the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.
The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.
You may join skyzh's Discord server and study with the tiny-llm community.
| Week + Chapter | Topic | Code | Test | Doc |
|---|---|---|---|---|
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
| 1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
| 1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
| 1.6 | Load the Model | ✅ | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
| 2.1 | KV Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention and Other Kernels | 🚧 | 🚧 | 🚧 |
| 2.5 | Continuous Batching | 🚧 | 🚧 | 🚧 |
| 2.6 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 2.7 | Prompt/Prefix Cache | 🚧 | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
| 3.4 | Scheduler | 🚧 | 🚧 | 🚧 |
| 3.5 | Parallelism | 🚧 | 🚧 | 🚧 |
| 3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
| 3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |
Other topics not covered: quantized/compressed KV cache