GitHub - amulil/cleanvllm: A single-file educational implementation for understanding vLLM's core concepts and running LLM inference.

🎯 Purpose - Why This Matters

Learn AI Infrastructure Fundamentals: This project provides a clean, educational implementation of vLLM's core concepts in a single Python file, making it easy to understand how modern LLM inference engines work under the hood.

Perfect for Learning: Whether you're a student, researcher, or engineer wanting to understand vLLM internals, this simplified implementation helps you grasp the fundamental concepts without getting lost in production complexity.

🚀 Quick Start - Run vLLM Inference

# 1. Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Run vLLM inference (dependencies will be installed automatically)
uv run qwen3_0_6B.py

Optional: Enable Flash Attention

For better performance (especially on newer GPUs), you can install Flash Attention 2 (compatible with RTX 30/40 series and A100/H100):

# Important: Activate the environment first to install into .venv, or use 'uv run'
source .venv/bin/activate
uv pip install "flash-attn>=2.0.0" --no-build-isolation

Note: This requires CUDA compilation tools and may take a few minutes.
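To confirm that the optional install landed in the active environment, a quick check along these lines can help. This is a minimal sketch: the helper name `pick_attention_backend` and the two backend labels are hypothetical and are not taken from qwen3_0_6B.py. It only tests whether the `flash_attn` package can be found, not whether its CUDA kernels load on your GPU.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" if the package is installed, else "fallback".

    Hypothetical helper for illustration; the scripts in this repository
    may detect Flash Attention differently.
    """
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "fallback"


if __name__ == "__main__":
    print(f"attention backend: {pick_attention_backend()}")
```

If this reports the fallback after installing, the package most likely went into a different environment than the one `uv run` uses.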

That's it! You're now running vLLM inference!

🎓 Interactive Learning System

Want to understand the core concepts behind this codebase? We provide an interactive learning system built with Vue and Flask!

It includes:

🎮 Challenge Mode: 6 levels of code-reading challenges.
🗺️ Global Architecture: Interactive diagrams explaining how components interact.
⚙️ Dynamic Inference Flow: Step-by-step visualization of the Prefill and Decode phases.
💡 Core Concepts: Visual explanations of KV Cache, Paged Attention, and MoE.
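As a taste of the KV Cache and Paged Attention concepts listed above, the block-table bookkeeping can be sketched in a few lines. This is a toy under stated assumptions, not the code from qwen3_0_6B.py: the class name `BlockAllocator` and its methods are hypothetical, and it tracks only which fixed-size blocks each sequence owns, without storing any key/value tensors.

```python
class BlockAllocator:
    """Toy block-table bookkeeping in the spirit of PagedAttention.

    Illustrative only: it records which fixed-size blocks a sequence owns
    and stores no actual key/value tensors.
    """

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused blocks
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_lens: dict[int, int] = {}  # seq_id -> tokens cached so far

    def append_tokens(self, seq_id: int, num_tokens: int) -> None:
        """Reserve cache slots for num_tokens new tokens of one sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        blocks_needed = -(-new_len // self.block_size)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("not enough free KV cache blocks")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8, block_size=4)
    alloc.append_tokens(seq_id=0, num_tokens=10)  # prefill: 10 prompt tokens
    print(alloc.block_tables[0])
    alloc.append_tokens(seq_id=0, num_tokens=1)  # decode: one token per step
    print(alloc.block_tables[0])
    alloc.free(0)
    print(len(alloc.free_blocks))
```

Prefill reserves blocks for the whole prompt at once, while each decode step adds one token and only needs a new block when the last one is full. Because blocks need not be contiguous, memory freed by one sequence can be reused by any other.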

How to run it:

cd learning_system
pip install flask flask-cors
python app.py

Then open http://127.0.0.1:5050 in your browser. You can switch between English and Chinese in the top right corner!

📖 Detailed Usage

Basic Usage

Update the model path in qwen3_0_6B.py:

path = os.path.expanduser("~/path/to/your/qwen3model")

Run the script:

uv run qwen3_0_6B.py
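Before running, it can be useful to sanity-check the model directory. The sketch below assumes the usual Hugging Face layout (a `config.json` plus one or more `*.safetensors` weight files); whether qwen3_0_6B.py requires exactly these files is an assumption, and `check_model_dir` is a hypothetical helper, not part of the repository.

```python
import os


def check_model_dir(path: str) -> list[str]:
    """Return a list of problems found with a local model directory.

    Assumption: the directory follows the usual Hugging Face layout
    (config.json plus one or more *.safetensors weight files).
    """
    model_dir = os.path.expanduser(path)
    if not os.path.isdir(model_dir):
        return [f"not a directory: {model_dir}"]
    problems = []
    files = os.listdir(model_dir)
    if "config.json" not in files:
        problems.append("missing config.json")
    if not any(name.endswith(".safetensors") for name in files):
        problems.append("no *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    # Placeholder path from the step above; replace it with your own.
    for problem in check_model_dir("~/path/to/your/qwen3model"):
        print(problem)
```

An empty list means the directory looks plausible under these assumptions.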

Benchmark

See bench_cleanvllm.py for the benchmark script.

python bench_cleanvllm.py --engine all --model ~/model/qwen/Qwen3-0.6B --markdown

Test Configuration:

Hardware: RTX 4090 (24GB)
Model: Qwen3-0.6B
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: 1024 tokens

Performance Results:

Inference Engine	Output Tokens	Time (s)	Throughput (tokens/s)
Qwen3-0.6B (CleanvLLM)	262,144	52.52	4991.13
Qwen3-0.6B (vLLM)	262,144	52.07	5034.32
Qwen3-30B-A3B-Instruct-2507 (CleanvLLM)	262,144	133.59	1962.28
Qwen3-30B-A3B-Instruct-2507 (vLLM)	262,144	167.51	1564.93
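The throughput column is output tokens divided by wall-clock time, and the token count follows from the test configuration (256 requests times 1024 output tokens). The short script below recomputes both from the reported numbers. It is only a worked check and is not part of bench_cleanvllm.py; because the times are rounded to two decimals, the recomputed values can differ slightly from the table in the last digits.

```python
# (engine label, output tokens, wall-clock seconds) copied from the results table
RESULTS = [
    ("Qwen3-0.6B (CleanvLLM)", 262_144, 52.52),
    ("Qwen3-0.6B (vLLM)", 262_144, 52.07),
    ("Qwen3-30B-A3B-Instruct-2507 (CleanvLLM)", 262_144, 133.59),
    ("Qwen3-30B-A3B-Instruct-2507 (vLLM)", 262_144, 167.51),
]


def throughput(output_tokens: int, seconds: float) -> float:
    """Output tokens generated per second of wall-clock time."""
    return output_tokens / seconds


# 256 requests x 1024 output tokens each gives the token count in the table.
assert 256 * 1024 == 262_144

for label, tokens, seconds in RESULTS:
    print(f"{label}: {throughput(tokens, seconds):.0f} tokens/s")
```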

🚧 TODO List

Upcoming Features

More Model Variants: Support for additional Qwen model sizes and configurations
Performance Optimizations: Further kernel optimizations and memory efficiency improvements

Current Support

Multi-GPU Support: Enhanced tensor parallelism for distributed inference
qwen3_30B_A3B.py: Support for larger Qwen3-30B-A3B model
qwen3_0_6B.py: Complete implementation for Qwen3-0.6B model
Basic vLLM Features: PagedAttention, KV caching, continuous batching
Flash Attention: Auto-detection and fallback support
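To illustrate why continuous batching matters, the toy simulation below counts decode steps for the same requests under static and continuous batching. It is a sketch under simplifying assumptions (one token per running request per step, prefill ignored), and both function names are hypothetical rather than taken from the repository.

```python
from collections import deque


def simulate_static_batching(output_lens: list[int], max_batch: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lens), max_batch):
        steps += max(output_lens[start:start + max_batch])
    return steps


def simulate_continuous_batching(output_lens: list[int], max_batch: int) -> int:
    """Finished requests free their slot, which is refilled before the next step."""
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to generate per running request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lens = [4, 1, 1, 1]  # tokens each request must generate
    print(simulate_static_batching(lens, max_batch=2))      # 5
    print(simulate_continuous_batching(lens, max_batch=2))  # 4
```

In the static case the short request in the first batch leaves its slot idle for three steps. In the continuous case that slot is handed to the next waiting request immediately, so the work finishes one step sooner.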

🙏 Acknowledgments

This project is inspired by and based on the concepts from vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs. We are grateful to the vLLM team and community for their pioneering work in LLM inference optimization.

This project is also based on the excellent nano-vLLM project. Thanks to the original authors for their outstanding work!

