tiny-llm - LLM Serving in a Week

Still a WIP and in a very early stage. A tutorial on LLM serving using MLX for systems engineers. The codebase is based (almost) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
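To give a flavor of what "array APIs only" means, here is a minimal sketch of scaled dot-product attention (the Week 1.1 topic) written directly with `mlx.core` ops. This is an illustration under my own assumptions, not the course's reference solution; the function name and shapes below are chosen for the example only.

```python
# Minimal sketch: attention from raw MLX array ops (no mlx.nn), for illustration only.
import math
import mlx.core as mx

def scaled_dot_product_attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim)
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = mx.matmul(q, mx.swapaxes(k, -2, -1)) * scale  # (..., seq_len, seq_len)
    weights = mx.softmax(scores, axis=-1)                  # attention weights per query
    return mx.matmul(weights, v)                           # (..., seq_len, head_dim)

# Hypothetical shapes for a quick smoke test: batch=1, seq_len=8, head_dim=64.
q = k = v = mx.random.normal((1, 8, 64))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (1, 8, 64)
```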

The goal is to learn the techniques behind efficiently serving a large language model (in this case, the Qwen2 models).

Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.

Why Qwen2: this was the first LLM I interacted with -- it is the go-to example in the vllm documentation. I spent some time reading the vllm source code and built up some knowledge around it.

Book

The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.

Community

You may join skyzh's Discord server and study with the tiny-llm community.

Roadmap

| Week + Chapter | Topic | Code | Test | Doc |
|---|---|---|---|---|
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | 🚧 | 🚧 |
| 1.4 | RMSNorm and MLP | ✅ | 🚧 | 🚧 |
| 1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
| 1.6 | Load the Model | ✅ | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
| 2.1 | KV Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention and Other Kernels | 🚧 | 🚧 | 🚧 |
| 2.5 | Continuous Batching | 🚧 | 🚧 | 🚧 |
| 2.6 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 2.7 | Prompt/Prefix Cache | 🚧 | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | Prefill-Decode Separation | 🚧 | 🚧 | 🚧 |
| 3.4 | Scheduler | 🚧 | 🚧 | 🚧 |
| 3.5 | Parallelism | 🚧 | 🚧 | 🚧 |
| 3.6 | AI Agent | 🚧 | 🚧 | 🚧 |
| 3.7 | Streaming API Server | 🚧 | 🚧 | 🚧 |

Other topics not yet covered: quantized/compressed KV cache.