A tiny version of vllm inspired by nano-vllm.
This project aims to build a lightweight inference engine for LLMs, keeping all of the code in a single file, `tinyvllm.py`, for simplicity and educational purposes.
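At its core, any such engine runs a token-by-token generation loop: feed the current sequence to the model, pick the next token, append it, and repeat until an end-of-sequence token or a length limit. As a rough illustration of that loop (a toy sketch, not tinyvllm's actual API; `toy_model` is a stand-in for a transformer forward pass):

```python
# Toy greedy-decoding loop illustrating the core of an LLM inference engine.
# "toy_model" is a hypothetical stand-in that maps a token sequence to
# next-token scores; a real engine runs a transformer forward pass here.

def toy_model(tokens):
    # Deterministic toy scoring: favors the token (last + 1) mod vocab_size.
    vocab_size = 8
    nxt = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == nxt else 0.0 for t in range(vocab_size)]

def generate(prompt, max_new_tokens, eos=7):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)
        # Greedy decoding: take the argmax over next-token scores.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos:  # stop at end-of-sequence
            break
    return tokens

print(generate([0], max_new_tokens=10))  # [0, 1, 2, 3, 4, 5, 6, 7]
```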
You can install tinyvllm directly from GitHub:

```bash
pip install git+https://github.com/yaoshengzhe/tinyvllm.git
```

Or install dependencies for development:

```bash
pip install .
```

Before running the example, download the model from Hugging Face:

```bash
huggingface-cli download Qwen/Qwen3-0.6B --local-dir models/Qwen3-0.6B --local-dir-use-symlinks False
```

Then run the example:

```bash
python example.py
```

tinyvllm is designed to be compatible with macOS and CPU-only environments, although performance will be slower than on GPU.
Ensure you have Python 3.8+ installed. It is recommended to use a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install .
```

- Download the model (if not already done):

  ```bash
  huggingface-cli download Qwen/Qwen3-0.6B --local-dir models/Qwen3-0.6B --local-dir-use-symlinks False
  ```

- Run the example:

  ```bash
  python example.py
  ```
On a laptop, it might take a minute to load the model and generate text.
You can run the benchmark on your laptop with reduced parameters for testing:

```bash
python benchmark.py --backend tinyvllm --num-seqs 2 --max-input-len 10 --max-output-len 10
```

To benchmark vLLM performance (requires vllm installed):

```bash
python benchmark.py --backend vllm
```

To benchmark nano-vLLM performance (requires nanovllm installed):

```bash
python benchmark.py --backend nanovllm
```

To benchmark tinyvllm performance:

```bash
python benchmark.py --backend tinyvllm
```

Hardware: a2-ultragpu-1g (NVIDIA A100 80GB x 1) on Google Compute Engine (GCE).
| Metric | vLLM | nano-vLLM | tinyvllm |
|---|---|---|---|
| Total Sequences | 256 | 256 | TBD |
| Total Input Tokens | 142,827 | 142,827 | TBD |
| Total Gen Tokens | 133,966 | 133,966 | TBD |
| Total Duration | 16.47 s | 16.88 s | TBD |
| Gen Throughput | 8,135.63 tok/s | 7,938.19 tok/s | TBD |
| Total Throughput | 16,809.38 tok/s | 16,401.45 tok/s | TBD |
Observation: nano-vLLM achieves ~97.5% of vLLM's generation throughput in this test, demonstrating its efficiency as a lightweight alternative.
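The throughput figures follow directly from the token counts and durations in the table. Recomputing them (small gaps versus the reported values come from the durations being rounded to two decimals):

```python
# Recompute the vLLM throughput figures from the table values above.
total_input = 142_827   # Total Input Tokens
total_gen = 133_966     # Total Gen Tokens
duration_s = 16.47      # Total Duration (rounded in the table)

gen_throughput = total_gen / duration_s                     # ~8134 tok/s (table: 8135.63)
total_throughput = (total_input + total_gen) / duration_s   # ~16806 tok/s (table: 16809.38)

# Ratio behind the observation: nano-vLLM vs. vLLM generation throughput.
ratio = 7938.19 / 8135.63                                   # ~0.976

print(round(gen_throughput, 2), round(total_throughput, 2), round(ratio, 3))
```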
<details>
<summary>Click to see detailed vLLM Benchmark Logs</summary>

```
==================================================
Starting Benchmark...
==================================================
Adding requests: 100% 256/256 [00:00<00:00, 2912.32it/s]
Processed prompts: 100% 256/256 [00:16<00:00, 48.64it/s, est. speed input: 8721.47 toks/s, output: 8180.38 toks/s]
##################################################
BENCHMARK RESULTS
##################################################
Metric             | Value
-------------------------------------------
Total Sequences    | 256
Total Input Tokens | 142827
Total Gen Tokens   | 133966
Total Duration     | 16.47 s
-------------------------------------------
Gen Throughput     | 8135.63 tok/s
Total Throughput   | 16809.38 tok/s
##################################################
```
</details>
<details>
<summary>Click to see detailed nano-vLLM Benchmark Logs</summary>

```
==================================================
Initializing nanovllm Engine...
==================================================
...
##################################################
BENCHMARK RESULTS (nanovllm)
##################################################
Metric             | Value
-------------------------------------------
Total Sequences    | 256
Total Input Tokens | 142827
Total Gen Tokens   | 133966
Total Duration     | 16.88 s
-------------------------------------------
Gen Throughput     | 7938.19 tok/s
Total Throughput   | 16401.45 tok/s
##################################################
```
</details>
To run the unit tests:

```bash
python -m unittest discover tests
```
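Discovery picks up any `test*.py` module under `tests/` whose test cases subclass `unittest.TestCase`. A hypothetical test module might look like this (the `greedy_pick` helper is a stand-in for illustration, not tinyvllm's real API):

```python
# Hypothetical shape of a module under tests/ (e.g. tests/test_sampling.py).
import unittest

def greedy_pick(scores):
    # Stand-in for a sampling helper: returns the index of the highest score.
    return max(range(len(scores)), key=scores.__getitem__)

class TestGreedyPick(unittest.TestCase):
    def test_argmax(self):
        self.assertEqual(greedy_pick([0.1, 0.7, 0.2]), 1)

if __name__ == "__main__":
    # exit=False so the interpreter keeps running after the tests finish.
    unittest.main(exit=False, verbosity=0)
```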