# Mini LLM Inference Engine

A lightweight LLM inference engine built from scratch, implementing core concepts from vLLM and Flash Attention.

## Project Structure

```
mini-llm-engine/
├── pyproject.toml
├── requirements.txt
├── src/
│   ├── engine.py                    # Main LLMEngine class
│   ├── model/
│   │   ├── attention.py             # Multi-Head Attention + GQA
│   │   ├── embeddings.py            # Rotary Position Embeddings (RoPE)
│   │   └── transformer.py           # TransformerBlock, RMSNorm, SwiGLU
│   ├── kv_cache/
│   │   ├── cache.py                 # Basic KV Cache
│   │   └── paged_attention.py       # PagedAttention + BlockManager
│   ├── scheduler/
│   │   └── batch_scheduler.py       # Continuous Batching Scheduler
│   └── kernels/
│       └── flash_attention.py       # Flash Attention in Triton
├── examples/
│   └── basic_usage.py
└── tests/
```

## Key Components

| Component | File | Features |
| --- | --- | --- |
| KV Cache | `src/kv_cache/cache.py` | Pre-allocated tensor storage for autoregressive decoding |
| PagedAttention | `src/kv_cache/paged_attention.py` | Block-based memory management; eliminates fragmentation |
| RoPE | `src/model/embeddings.py` | Rotary position encoding with cached sin/cos |
| Attention | `src/model/attention.py` | MHA, GQA support, KV cache integration |
| Transformer | `src/model/transformer.py` | LLaMA-style: RMSNorm, SwiGLU, pre-norm |
| Flash Attention | `src/kernels/flash_attention.py` | Triton kernel with online softmax, O(N) memory |
| Scheduler | `src/scheduler/batch_scheduler.py` | FCFS, iteration-level batching, preemption |

## Features

- **Efficient KV Caching**: Basic contiguous cache and PagedAttention for memory efficiency
- **PagedAttention**: Block-based memory management inspired by vLLM; eliminates memory fragmentation
- **Flash Attention**: Memory-efficient attention in Triton with O(N) memory complexity
- **Continuous Batching**: Iteration-level scheduling for high throughput
- **GQA Support**: Grouped-Query Attention for reduced KV cache memory
- **RoPE**: Rotary Position Embeddings for length extrapolation
- **LLaMA-style Architecture**: RMSNorm, SwiGLU activation, pre-normalization
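To illustrate how GQA reduces KV cache memory: with `n_q_heads` query heads but only `n_kv_heads` cached KV heads, each KV head is shared by a group of query heads, shrinking the cache by a factor of `n_q_heads / n_kv_heads`. The sketch below is illustrative code, not the actual implementation in `src/model/attention.py`:

```python
import torch

def expand_kv(kv: torch.Tensor, n_q_heads: int) -> torch.Tensor:
    """Repeat cached KV heads so they line up with the query heads.

    kv: (batch, n_kv_heads, seq_len, head_dim), n_q_heads % n_kv_heads == 0.
    """
    group = n_q_heads // kv.shape[1]  # query heads per KV head
    return kv.repeat_interleave(group, dim=1)

q = torch.randn(1, 8, 4, 16)   # 8 query heads
k = torch.randn(1, 2, 4, 16)   # only 2 KV heads stored in the cache
k_expanded = expand_kv(k, n_q_heads=8)
print(k_expanded.shape)        # torch.Size([1, 8, 4, 16])
```

Here the cache holds a quarter of the KV tensors that standard MHA would need, at the cost of a cheap repeat at attention time.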

## Installation

```bash
pip install -e .
```

## Requirements

- Python >= 3.9
- PyTorch >= 2.0.0
- Triton >= 2.1.0
- transformers >= 4.30.0

## Usage

```python
from src import LLMEngine
from src.engine import GenerationConfig

# Load a pretrained model
engine = LLMEngine.from_pretrained("meta-llama/Llama-2-7b-hf")

# Generate text
outputs = engine.generate(
    ["Hello, how are you?"],
    generation_config=GenerationConfig(
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        top_p=0.95,
    ),
)

for output in outputs:
    print(output.generated_text)
```

## Architecture Details

### KV Cache

- Pre-allocates memory for the maximum sequence length
- Updates in place during autoregressive decoding
- Supports both basic contiguous and paged memory layouts
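The pre-allocate-and-update-in-place pattern can be sketched as follows. This is a minimal illustration; the class and method names are assumptions, not necessarily the API in `src/kv_cache/cache.py`:

```python
import torch

class KVCache:
    """Minimal sketch: pre-allocated K/V tensors, filled in place per step."""

    def __init__(self, batch, n_heads, max_seq_len, head_dim):
        # Allocate once for the maximum sequence length; no reallocation later.
        self.k = torch.zeros(batch, n_heads, max_seq_len, head_dim)
        self.v = torch.zeros_like(self.k)
        self.seq_len = 0  # number of valid cached positions

    def update(self, k_new, v_new):
        # k_new / v_new: (batch, n_heads, n_new_tokens, head_dim)
        n_new = k_new.shape[2]
        self.k[:, :, self.seq_len:self.seq_len + n_new] = k_new
        self.v[:, :, self.seq_len:self.seq_len + n_new] = v_new
        self.seq_len += n_new
        # Return views over the valid prefix for the attention computation.
        return self.k[:, :, :self.seq_len], self.v[:, :, :self.seq_len]

cache = KVCache(batch=1, n_heads=4, max_seq_len=128, head_dim=8)
cache.update(torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8))  # prefill: 5 tokens
k, v = cache.update(torch.randn(1, 4, 1, 8), torch.randn(1, 4, 1, 8))  # one decode step
print(k.shape)  # torch.Size([1, 4, 6, 8])
```

The trade-off of the contiguous layout is that memory for `max_seq_len` is reserved up front even when sequences finish early, which is exactly what PagedAttention addresses.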

### PagedAttention

- Divides the KV cache into fixed-size blocks (default: 16 tokens)
- Maps logical positions to physical blocks via block tables
- Enables memory sharing across sequences (prefix caching)
- Eliminates memory fragmentation for variable-length sequences
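The block-table bookkeeping can be sketched in a few lines. The names below (`BlockManager.allocate`, `slot`, `free`) are illustrative and not necessarily the repo's actual API in `src/kv_cache/paged_attention.py`:

```python
BLOCK_SIZE = 16  # tokens per physical block

class BlockManager:
    """Sketch of vLLM-style block tables: logical position -> physical slot."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def allocate(self, seq_id: int, seq_len: int) -> None:
        """Ensure enough physical blocks for a sequence of seq_len tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) * BLOCK_SIZE < seq_len:
            table.append(self.free_blocks.pop())  # grab any free block

    def slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset in block)."""
        block = self.block_tables[seq_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))

mgr = BlockManager(num_blocks=8)
mgr.allocate(seq_id=0, seq_len=20)  # 20 tokens -> 2 blocks
print(mgr.slot(0, 17))              # token 17 = offset 1 in the sequence's second block
```

Because any free block can serve any sequence, variable-length sequences never leave unusable holes in the cache, and two sequences with a common prefix can point their block tables at the same physical blocks.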

### Flash Attention

- Implements tiled attention computation in Triton
- Uses online softmax to avoid materializing the N×N attention matrix
- Achieves O(N) memory complexity vs O(N²) for standard attention
- Includes causal masking support
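The online-softmax rescaling at the heart of Flash Attention can be shown in plain PyTorch. The repo's kernel is written in Triton; this sketch demonstrates only the math (a running row max `m` and normalizer `l` updated tile by tile, so the full N×N score matrix is never stored), with a single head and no causal mask for brevity:

```python
import torch

def online_softmax_attention(q, k, v, tile=4):
    """Tiled attention with online softmax. q: (n, d); k, v: (m, d)."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)              # unnormalized running output
    m = torch.full((n, 1), float("-inf"))  # running row max
    l = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, k.shape[0], tile):
        s = (q @ k[start:start + tile].T) * scale         # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                          # tile's unnormalized probs
        correction = torch.exp(m - m_new)                 # rescale previous state
        l = l * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v[start:start + tile]
        m = m_new
    return out / l  # normalize once at the end

# Matches the standard (materialized) softmax attention:
q, k, v = torch.randn(6, 8), torch.randn(10, 8), torch.randn(10, 8)
ref = torch.softmax((q @ k.T) * 8 ** -0.5, dim=-1) @ v
assert torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-5)
```

Memory per query row is O(1) regardless of key length, which is where the O(N) total comes from; the Triton version additionally fuses these steps into one kernel to avoid round trips to HBM.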

### Continuous Batching

- Schedules at the iteration level (each token-generation step)
- Dynamically adds new requests and removes completed ones
- Supports preemption to handle memory pressure
- FCFS scheduling with priority for running sequences
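The iteration-level loop can be sketched as below. Class and method names are illustrative, not the repo's actual `batch_scheduler.py` API, and preemption is omitted for brevity:

```python
from collections import deque

class ContinuousBatchScheduler:
    """Sketch: per-iteration FCFS admission, running sequences keep priority."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()  # FCFS queue of pending request ids
        self.running: list = []        # requests in the current batch

    def add_request(self, req_id) -> None:
        self.waiting.append(req_id)

    def step(self, finished: set) -> list:
        """Called once per token-generation iteration."""
        # 1. Retire sequences that emitted EOS or hit their length limit.
        self.running = [r for r in self.running if r not in finished]
        # 2. Admit waiting requests FCFS while the batch has room; new
        #    arrivals never displace sequences that are already running.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return self.running

sched = ContinuousBatchScheduler(max_batch_size=2)
for req in ("a", "b", "c"):
    sched.add_request(req)
print(sched.step(finished=set()))   # ['a', 'b']  ('c' waits)
print(sched.step(finished={"a"}))   # ['b', 'c']  ('c' joins mid-stream)
```

The key contrast with static batching: `"c"` starts generating as soon as `"a"` finishes, rather than waiting for the entire batch to drain, which is what drives the throughput gain.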

## References

- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023), the vLLM paper
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022)
