A lightweight LLM inference engine built from scratch, implementing core concepts from vLLM and Flash Attention.
```
mini-llm-engine/
├── pyproject.toml
├── requirements.txt
├── src/
│   ├── engine.py                  # Main LLMEngine class
│   ├── model/
│   │   ├── attention.py           # Multi-Head Attention + GQA
│   │   ├── embeddings.py          # Rotary Position Embeddings (RoPE)
│   │   └── transformer.py         # TransformerBlock, RMSNorm, SwiGLU
│   ├── kv_cache/
│   │   ├── cache.py               # Basic KV Cache
│   │   └── paged_attention.py     # PagedAttention + BlockManager
│   ├── scheduler/
│   │   └── batch_scheduler.py     # Continuous Batching Scheduler
│   └── kernels/
│       └── flash_attention.py     # Flash Attention in Triton
├── examples/
│   └── basic_usage.py
└── tests/
```
| Component | File | Features |
|---|---|---|
| KV Cache | src/kv_cache/cache.py | Pre-allocated tensor storage for autoregressive decoding |
| PagedAttention | src/kv_cache/paged_attention.py | Block-based memory management, eliminates fragmentation |
| RoPE | src/model/embeddings.py | Rotary position encoding with cached sin/cos |
| Attention | src/model/attention.py | MHA, GQA support, KV cache integration |
| Transformer | src/model/transformer.py | LLaMA-style: RMSNorm, SwiGLU, pre-norm |
| Flash Attention | src/kernels/flash_attention.py | Triton kernel with online softmax, O(N) memory |
| Scheduler | src/scheduler/batch_scheduler.py | FCFS, iteration-level batching, preemption |
- Efficient KV Caching: Basic contiguous cache and PagedAttention for memory efficiency
- PagedAttention: Block-based memory management inspired by vLLM, eliminates memory fragmentation
- Flash Attention: Memory-efficient attention in Triton with O(N) memory complexity
- Continuous Batching: Iteration-level scheduling for high throughput
- GQA Support: Grouped-Query Attention for reduced KV cache memory
- RoPE: Rotary Position Embeddings for length extrapolation
- LLaMA-style Architecture: RMSNorm, SwiGLU activation, pre-normalization
```bash
pip install -e .
```

Requirements:

- Python >= 3.9
- PyTorch >= 2.0.0
- Triton >= 2.1.0
- transformers >= 4.30.0
```python
from src import LLMEngine
from src.engine import GenerationConfig

# Load a pretrained model
engine = LLMEngine.from_pretrained("meta-llama/Llama-2-7b-hf")

# Generate text
outputs = engine.generate(
    ["Hello, how are you?"],
    generation_config=GenerationConfig(
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        top_p=0.95,
    ),
)
for output in outputs:
    print(output.generated_text)
```

- Pre-allocates memory for maximum sequence length
- Updates in-place during autoregressive decoding
- Supports both basic contiguous and paged memory layouts
- Divides KV cache into fixed-size blocks (default: 16 tokens)
- Maps logical positions to physical blocks via block tables
- Enables memory sharing across sequences (prefix caching)
- Eliminates memory fragmentation for variable-length sequences
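The logical-to-physical mapping can be sketched as below. This is a simplified stand-in for the BlockManager in src/kv_cache/paged_attention.py; the method names and bookkeeping details are assumptions:

```python
BLOCK_SIZE = 16  # tokens per block, matching the default above

class SketchBlockManager:
    """Maps logical token positions to fixed-size physical blocks (illustrative)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, position):
        """Ensure a physical block backs this logical position; return its slot."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        if logical_block >= len(table):
            table.append(self.free_blocks.pop())    # allocate on demand
        return table[logical_block], position % BLOCK_SIZE  # (block id, offset)

    def free(self, seq_id):
        # Return a finished sequence's blocks to the pool. Because every block
        # is the same fixed size, freed blocks are immediately reusable by any
        # sequence -- there is no external fragmentation.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = SketchBlockManager(num_blocks=8)
for pos in range(20):                       # 20 tokens span 2 logical blocks
    block_id, offset = mgr.append_token("seq0", pos)
print(mgr.block_tables["seq0"])             # → [7, 6]
mgr.free("seq0")
```

Prefix caching falls out of this design: two sequences sharing a prompt can point their block tables at the same physical blocks for the shared prefix.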
- Implements tiled attention computation in Triton
- Uses online softmax to avoid materializing N×N attention matrix
- Achieves O(N) memory complexity vs O(N²) for standard attention
- Includes causal masking support
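The online-softmax recurrence at the heart of the kernel can be shown in pure Python for a single query row (the Triton version does the same per tile of the score matrix, in registers). Function and variable names here are illustrative, not the kernel's:

```python
import math

def online_softmax_attention(scores, values, tile=4):
    """One query row of attention via online softmax (illustrative sketch).

    Processes `scores` tile by tile, keeping only a running max `m`, a running
    normalizer `l`, and a running weighted sum `acc` -- never the full
    softmax row, and never an N x N matrix across queries.
    """
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running softmax denominator
    acc = 0.0           # running weighted sum of values (scalar values here)
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)        # rescale earlier partial results
        l = l * scale + sum(math.exp(s - m_new) for s in s_tile)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_tile, v_tile))
        m = m_new
    return acc / l

scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = online_softmax_attention(scores, values)
# Mathematically identical to the untiled softmax-weighted average of `values`.
```

Because each tile's contribution is folded into `(m, l, acc)` and discarded, memory per query is constant in the key length, which is where the O(N) total (vs. O(N²) for materialized attention) comes from.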
- Schedules at iteration level (each token generation step)
- Dynamically adds new requests and removes completed ones
- Supports preemption for memory pressure handling
- FCFS scheduling with priority for running sequences
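The scheduling policy above can be sketched like this. It is a minimal illustration of the ideas, not the batch_scheduler.py API; the class and method names are assumptions:

```python
from collections import deque

class SketchScheduler:
    """Iteration-level FCFS scheduler with preemption (illustrative sketch)."""

    def __init__(self, max_batch_size):
        self.waiting = deque()   # FCFS queue of not-yet-started requests
        self.running = []        # sequences in the current batch
        self.max_batch_size = max_batch_size

    def add_request(self, seq):
        self.waiting.append(seq)

    def step(self):
        """Form the batch for one token-generation iteration."""
        # Completed sequences leave the batch immediately...
        self.running = [s for s in self.running if not s["done"]]
        # ...and waiting requests join as soon as slots open. Running
        # sequences keep priority; new ones are admitted FCFS.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return self.running

    def preempt(self):
        # Under memory pressure, evict the most recently admitted sequence
        # back to the head of the queue so it resumes first.
        if self.running:
            self.waiting.appendleft(self.running.pop())

sched = SketchScheduler(max_batch_size=2)
for i in range(3):
    sched.add_request({"id": i, "done": False})
batch = sched.step()       # ids 0 and 1 run; id 2 waits
batch[0]["done"] = True
batch = sched.step()       # id 0 leaves, id 2 joins mid-stream
```

The key difference from request-level (static) batching is that `step()` runs every iteration, so a finished sequence frees its slot after its final token rather than holding it until the whole batch drains.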