A research fork of nano-vllm exploring how to make multimodal (vision-language) inference faster and smarter — without the complexity of a production engine like vLLM.
nano-vllm is ~1,200 lines of Python: paged KV cache, continuous batching, CUDA graphs. Small enough to actually understand. This fork asks: what breaks when you add vision, and what can you do about it?
One image introduces 667 extra prompt tokens into the prefill batch — a 26.6× multiplier over the same text-only prompt. Almost all of the latency gap between text and image inference is here.
Measured on H100 (Qwen2.5-VL-3B-Instruct):
text-only: 26 tokens → 51ms
image+text: 692 tokens → 281ms (5.5× slower, 26.6× more tokens)
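The ratios quoted above follow directly from the four measured numbers. The check below uses only those figures; the variable names are illustrative.

```python
# Figures copied from the measurements above (H100, Qwen2.5-VL-3B-Instruct).
text_tokens, text_ms = 26, 51
image_tokens, image_ms = 692, 281

token_ratio = image_tokens / text_tokens    # how many times more prompt tokens
latency_ratio = image_ms / text_ms          # how many times slower

# The image request is slower overall but cheaper per token,
# so latency grows more slowly than the token count.
text_ms_per_token = text_ms / text_tokens
image_ms_per_token = image_ms / image_tokens

print(f"tokens:   {token_ratio:.1f}x")
print(f"latency:  {latency_ratio:.1f}x")
print(f"ms/token: text {text_ms_per_token:.2f}, image {image_ms_per_token:.2f}")
```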
nano-vllm's FCFS scheduler treats all tokens equally. When an image request and a text request arrive together, they get batched — and the text request waits for image prefill to finish even though it only needs 51ms. This is head-of-line blocking.
Fix: VisionAwareScheduler — two-lane prefill. Text requests always get their own batch first. Images are admitted only when no text is waiting.
10 mixed requests, simultaneous arrival (H100):
Text P50 TTFT: 328ms → 63ms (-81%)
Image P50 TTFT: 656ms → 720ms (+10%)
Text latency drops 81%. Images pay one extra text-batch delay. The tradeoff is explicit and bounded.
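The text-first admission rule can be sketched in a few lines. This is a toy model of the policy described above, not the code in vlm_scheduler.py: the `Request` and `TwoLaneQueue` names, the `max_batch_tokens` budget, and its default value are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical stand-in for the fork's sequence object.
    req_id: str
    num_prompt_tokens: int
    is_multimodal: bool = False


class TwoLaneQueue:
    """Toy text-first admission policy (illustrative, not the fork's code)."""

    def __init__(self, max_batch_tokens: int = 4096):
        self.text_lane: deque = deque()
        self.image_lane: deque = deque()
        self.max_batch_tokens = max_batch_tokens

    def add(self, req: Request) -> None:
        lane = self.image_lane if req.is_multimodal else self.text_lane
        lane.append(req)

    def next_prefill_batch(self) -> list:
        # Text always wins: images are admitted only when no text is waiting.
        lane = self.text_lane if self.text_lane else self.image_lane
        batch, used = [], 0
        while lane:
            cost = lane[0].num_prompt_tokens
            # Always admit at least one request so an oversized prompt cannot stall.
            if batch and used + cost > self.max_batch_tokens:
                break
            batch.append(lane.popleft())
            used += cost
        return batch


q = TwoLaneQueue(max_batch_tokens=4096)
q.add(Request("img-0", 692, is_multimodal=True))
q.add(Request("txt-0", 26))
q.add(Request("txt-1", 26))

first = [r.req_id for r in q.next_prefill_batch()]
second = [r.req_id for r in q.next_prefill_batch()]
print(first, second)  # ['txt-0', 'txt-1'] ['img-0']
```

Even though the image request arrived first, both text requests are prefilled in the first batch and the image waits exactly one round, which matches the "one extra text-batch delay" described above. In this sketch a continuous stream of text would keep images waiting; a real scheduler would likely need an aging or fairness rule for that case.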
In a multi-turn conversation, the ViT runs on every turn — even when the image hasn't changed. Standard engines re-encode the same image on every request.
Fix: content-addressed embedding cache — SHA-1 keyed cache on encode_images(). Same image hits the cache; zero ViT compute.
6 encode_images() calls on the same image (H100):
Call 1: 188ms [COLD — ViT runs]
Call 2: 10ms [WARM — cache hit]
...
Call 6: 9ms [WARM — cache hit]
Speedup: 20x
5-turn conversation: 940ms → 225ms vision cost (-76%)
Inspired by SGLang's IPC handle cache (Modal blog) — same principle (cache expensive idempotent work in the hot path), applied to vision encoder outputs.
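A content-addressed cache of this kind might look roughly like the sketch below. It assumes the key is the SHA-1 of the raw image bytes (the fork may hash something else, such as the preprocessed pixel tensor), and `EmbeddingCache` and `fake_vit` are hypothetical names standing in for the real encode_images() path.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Toy content-addressed cache keyed by the SHA-1 of the raw image bytes."""

    def __init__(self, encoder: Callable[[bytes], List[float]]):
        self._encoder = encoder  # stand-in for the expensive ViT forward pass
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha1(image_bytes).hexdigest()
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        embedding = self._encoder(image_bytes)
        self._store[key] = embedding
        return embedding


def fake_vit(image_bytes: bytes) -> List[float]:
    # Placeholder encoder so the example runs without a GPU or model weights.
    return [b / 255.0 for b in image_bytes[:4]]


cache = EmbeddingCache(fake_vit)
image = b"\x89PNG-not-a-real-image"
for _ in range(6):
    cache.encode(image)
print(cache.hits, cache.misses)  # 5 1
```

Six calls on the same bytes produce one miss and five hits, mirroring the cold/warm pattern in the measurements. Keying on content rather than on a file path or request ID means the same image is recognized even when it arrives through a different message or conversation.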
nanovllm/
├── engine/
│ ├── scheduler.py # upstream FCFS scheduler
│ ├── vlm_scheduler.py # VisionAwareScheduler (two-lane prefill)
│ └── sequence.py # extended with is_multimodal, inputs_embeds
├── models/
│ ├── qwen3.py # upstream text decoder
│ └── qwen2_vl.py # Qwen2VLVisionBridge: ViT extraction + embed cache
├── multimodal/
│ ├── hf_engine.py # HF bridge: full model via transformers
│ ├── llm.py # VlmLLM: native engine + vision bridge
│ ├── processor.py # Qwen2-VL processor wrapper
│ └── messages.py # VlmMessage / ImageInput types
notes/
├── RESULTS.md # all benchmark results with raw numbers
├── VLM_SCHEDULER_INSIGHT.md # full write-up of the scheduling problem
└── token_composition.py # prefill tax analysis script
modal_bench.py # end-to-end benchmark (runs on Modal H100)
All results on H100 SXM5 via Modal, Qwen2.5-VL-3B-Instruct, CUDA 12.4 / PyTorch 2.4.0.
| Optimization | Mechanism | Result |
|---|---|---|
| VisionAwareScheduler | Text-first two-lane prefill | Text TTFT −81% under mixed load |
| Image embedding cache | SHA-1 keyed ViT output cache | 20× faster repeat encodes, 76% less vision time over a 5-turn conversation |
Full numbers in notes/RESULTS.md.
pip install -e ".[mm]"
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ~/models/Qwen2.5-VL-3B-Instruct
# HF bridge — validated, runs on any GPU
python examples/vlm_hf_generate.py \
--model ~/models/Qwen2.5-VL-3B-Instruct \
--image ./assets/logo.png \
--prompt "Describe this image."
# Text vs image latency comparison
python examples/vlm_bench_compare.py \
--model ~/models/Qwen2.5-VL-3B-Instruct \
--image ./assets/logo.png -n 5
# Full H100 benchmark (requires Modal account)
modal run modal_bench.py

- Chunked vision prefill: interleave image tokens across multiple prefill rounds instead of one large burst, capping the maximum latency spike per request
- LRU embed cache: VRAM-budget-aware eviction policy for the embedding cache
- Native end-to-end path: VlmLLM wires the vision bridge into nano-vllm's paged-KV engine; needs end-to-end benchmarking
- Shared image KV cache: cache the full KV activations from image tokens, not just ViT embeddings, to make even the first turn of repeated-image requests fast
- Resolution-adaptive batching: bin requests by vision token count to minimize padding waste across heterogeneous image sizes
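As one possible shape for the planned LRU embed cache, the sketch below evicts by total payload size rather than entry count. Nothing here exists in the fork yet: `LRUByteBudgetCache`, `budget_bytes`, and the explicit `size_bytes` argument are all assumptions. For a PyTorch tensor the size could be computed as `t.numel() * t.element_size()`.

```python
from collections import OrderedDict


class LRUByteBudgetCache:
    """Toy LRU cache that evicts by total payload size rather than entry count."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._entries: OrderedDict = OrderedDict()  # key -> (value, size_bytes)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key, value, size_bytes: int) -> None:
        if size_bytes > self.budget_bytes:
            return  # too large to ever fit; skip caching
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        while self.used_bytes + size_bytes > self.budget_bytes:
            # Drop the least recently used entry until the new one fits.
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self._entries[key] = (value, size_bytes)
        self.used_bytes += size_bytes


cache = LRUByteBudgetCache(budget_bytes=100)
cache.put("a", "emb-a", 40)
cache.put("b", "emb-b", 40)
cache.get("a")               # touch "a" so "b" becomes the eviction candidate
cache.put("c", "emb-c", 40)  # 120 bytes would exceed the budget, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))  # emb-a None emb-c
```

Budgeting in bytes rather than entries matters here because embeddings for high-resolution images can be many times larger than those for small ones, so a fixed entry count would not bound VRAM use.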
Upstream nano-vllm text inference is unchanged. All multimodal work is additive:
- nanovllm/engine/vlm_scheduler.py: new file, extends Scheduler
- nanovllm/engine/sequence.py: added is_multimodal, inputs_embeds, image_grid_thw fields (backwards-compatible pickle)
- nanovllm/models/qwen2_vl.py: new file, vision bridge
- nanovllm/multimodal/: new directory
See FORK.md for full delta from upstream.
