amulil/cleanvllm

A single-file educational implementation for understanding vLLM's core concepts and running LLM inference.

🎯 Purpose - Why This Matters

Learn AI Infrastructure Fundamentals: This project provides a clean, educational implementation of vLLM's core concepts in a single Python file, making it easy to understand how modern LLM inference engines work under the hood.

Perfect for Learning: Whether you're a student, researcher, or engineer wanting to understand vLLM internals, this simplified implementation helps you grasp the fundamental concepts without getting lost in production complexity.

🚀 Quick Start - Run vLLM Inference

# 1. Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Run vLLM inference (dependencies will be installed automatically)
uv run qwen3_0_6B.py

Optional: Enable Flash Attention

For better performance (especially on newer GPUs), you can install Flash Attention 2 (compatible with RTX 30/40 series and A100/H100):

# Important: Activate the environment first to install into .venv, or use 'uv run'
source .venv/bin/activate
uv pip install "flash-attn>=2.0.0" --no-build-isolation

Note: This requires CUDA compilation tools and may take a few minutes.
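After installing, it can be useful to confirm that the optional package is visible to the Python environment. The following is a minimal sketch using only the standard library; the function name `pick_attention_backend` and the `"fallback"` label are hypothetical and do not necessarily match how qwen3_0_6B.py performs its own detection.

```python
import importlib.util


def pick_attention_backend():
    """Return "flash_attn" if the optional package is installed, else "fallback".

    Hypothetical helper for illustration; it only checks that the module
    can be located, not that it works on the current GPU.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "fallback"


backend = pick_attention_backend()
print(f"attention backend: {backend}")
```

Run it inside the same environment as the script (for example after `source .venv/bin/activate`), otherwise the check looks at a different set of installed packages.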

That's it! You're now running vLLM-style inference with CleanvLLM!

🎓 Interactive Learning System

Want to understand the core concepts behind this codebase? We provide an interactive learning system built with Vue and Flask!

It includes:

  • 🎮 Challenge Mode: 6 levels of code-reading challenges.
  • 🗺️ Global Architecture: Interactive diagrams explaining how components interact.
  • ⚙️ Dynamic Inference Flow: Step-by-step visualization of the Prefill and Decode phases.
  • 💡 Core Concepts: Visual explanations of KV Cache, Paged Attention, and MoE.

How to run it:

cd learning_system
pip install flask flask-cors
python app.py

Then open http://127.0.0.1:5050 in your browser. You can switch between English and Chinese in the top right corner!

📖 Detailed Usage

Basic Usage

  1. Update the model path in qwen3_0_6B.py:
path = os.path.expanduser("~/path/to/your/qwen3model")
  2. Run the script:
uv run qwen3_0_6B.py

Benchmark

The benchmark script is bench_cleanvllm.py. For example:

python bench_cleanvllm.py --engine all --model ~/model/qwen/Qwen3-0.6B --markdown

Test Configuration:

  • Hardware: RTX 4090 (24GB)
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: 1024 tokens

Performance Results:

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| Qwen3-0.6B (CleanvLLM) | 262,144 | 52.52 | 4991.13 |
| Qwen3-0.6B (vLLM) | 262,144 | 52.07 | 5034.32 |
| Qwen3-30B-A3B-Instruct-2507 (CleanvLLM) | 262,144 | 133.59 | 1962.28 |
| Qwen3-30B-A3B-Instruct-2507 (vLLM) | 262,144 | 167.51 | 1564.93 |
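The throughput column follows from the test configuration: 256 requests times 1024 output tokens gives 262,144 output tokens, divided by the wall-clock time. The sketch below only redoes that arithmetic with the rounded times from the table, so its results can differ from the reported figures in the last digits; the variable and function names are illustrative, not taken from bench_cleanvllm.py.

```python
# Values copied from the test configuration and results above.
num_requests = 256
output_len = 1024
total_output_tokens = num_requests * output_len  # 262,144

# Wall-clock times in seconds, as reported (rounded to two decimals).
times = {
    "Qwen3-0.6B (CleanvLLM)": 52.52,
    "Qwen3-0.6B (vLLM)": 52.07,
    "Qwen3-30B-A3B-Instruct-2507 (CleanvLLM)": 133.59,
    "Qwen3-30B-A3B-Instruct-2507 (vLLM)": 167.51,
}


def throughput(tokens, seconds):
    """Output tokens generated per second."""
    return tokens / seconds


for name, seconds in times.items():
    print(f"{name}: {throughput(total_output_tokens, seconds):.0f} tokens/s")
```

Read this way, the two engines are within about 1% of each other on Qwen3-0.6B in this run, while on Qwen3-30B-A3B-Instruct-2507 CleanvLLM finishes roughly 1.25x faster (167.51 s vs 133.59 s).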

🚧 TODO List

Upcoming Features

  • More Model Variants: Support for additional Qwen model sizes and configurations
  • Performance Optimizations: Further kernel optimizations and memory efficiency improvements

Current Support

  • Multi-GPU Support: Enhanced tensor parallelism for distributed inference
  • qwen3_30B_A3B.py: Support for larger Qwen3-30B-A3B model
  • qwen3_0_6B.py: Complete implementation for Qwen3-0.6B model
  • Basic vLLM Features: PagedAttention, KV caching, continuous batching
  • Flash Attention: Auto-detection and fallback support
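To give a feel for the bookkeeping behind PagedAttention-style KV caching, here is a toy sketch of a block table: each sequence gets fixed-size cache blocks on demand instead of one contiguous buffer. It is a simplified illustration under assumed names (`BlockManager`, `append_token`, `free`), not the code used in qwen3_0_6B.py, and it tracks only block indices, not actual key/value tensors.

```python
from collections import deque


class BlockManager:
    """Toy allocator mapping each sequence to a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token.

        Returns (physical_block, offset_in_block).
        """
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise RuntimeError("out of KV cache blocks")
            table.append(self.free_blocks.popleft())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


manager = BlockManager(num_blocks=4, block_size=2)
slots = [manager.append_token(seq_id=0) for _ in range(3)]
other = manager.append_token(seq_id=1)
print(slots, other, manager.block_tables)
manager.free(0)
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, and blocks freed by finished sequences can be reused immediately by others, which is what makes continuous batching practical.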

🙏 Acknowledgments

This project is inspired by and based on the concepts from vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs. We are grateful to the vLLM team and community for their pioneering work in LLM inference optimization.

This project also builds on the excellent nano-vLLM project. Thanks to the original authors for their outstanding work!
