nanovllm-mm

A research fork of nano-vllm exploring how to make multimodal (vision-language) inference faster and smarter — without the complexity of a production engine like vLLM.

nano-vllm is ~1,200 lines of Python: paged KV cache, continuous batching, CUDA graphs. Small enough to actually understand. This fork asks: what breaks when you add vision, and what can you do about it?

What We Found

1. The Prefill Tax

In the measurement below, one image grows the prompt from 26 to 692 tokens, a 26.6× multiplier over the same text-only prompt. Almost all of the latency gap between text and image inference comes from this extra prefill work.

Measured on H100 (Qwen2.5-VL-3B-Instruct):
  text-only:   26 tokens →  51ms
  image+text: 692 tokens → 281ms   (5.5× slower, 26.6× more tokens)
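
The two ratios quoted above follow directly from the measurement. A quick check, using only the numbers in this section (the variable names are illustrative and not taken from the repo):

```python
# Figures copied from the H100 measurement above.
text_tokens, text_ms = 26, 51
image_tokens, image_ms = 692, 281

token_ratio = image_tokens / text_tokens    # prompt-size multiplier
latency_ratio = image_ms / text_ms          # prefill-latency multiplier

# Per-token prefill cost in milliseconds for each request type.
text_ms_per_token = text_ms / text_tokens
image_ms_per_token = image_ms / image_tokens

print(f"token ratio:   {token_ratio:.1f}x")
print(f"latency ratio: {latency_ratio:.1f}x")
print(f"ms/token: text {text_ms_per_token:.2f}, image {image_ms_per_token:.2f}")
```

Latency grows more slowly than token count, so on these numbers the tax comes from token volume rather than from image tokens being individually more expensive to prefill.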

2. Schedulers Are Blind to Token Type

nano-vllm's FCFS scheduler treats all tokens equally. When an image request and a text request arrive together, they are batched into the same prefill, and the text request waits for the image prefill to finish even though its own prefill needs only 51ms. This is head-of-line blocking.

Fix: VisionAwareScheduler — two-lane prefill. Text requests always get their own batch first. Images are admitted only when no text is waiting.

10 mixed requests, simultaneous arrival (H100):
  Text P50 TTFT:   328ms → 63ms   (-81%)
  Image P50 TTFT:  656ms → 720ms  (+10%)

Text latency drops 81%. Images pay one extra text-batch delay. The tradeoff is explicit and bounded.
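
The text-first admission rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in vlm_scheduler.py: `Request`, `TwoLanePrefillQueue`, and `max_batch_tokens` are hypothetical names, and the real scheduler also has to manage KV blocks and decode batches.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a sequence waiting for prefill."""
    req_id: str
    num_tokens: int
    is_multimodal: bool = False


class TwoLanePrefillQueue:
    """Text-first admission: image requests wait until no text is queued."""

    def __init__(self, max_batch_tokens: int):
        self.max_batch_tokens = max_batch_tokens
        self.text_lane = deque()
        self.image_lane = deque()

    def add(self, req: Request) -> None:
        lane = self.image_lane if req.is_multimodal else self.text_lane
        lane.append(req)

    def next_prefill_batch(self) -> list:
        # Serve the text lane whenever it is non-empty; otherwise serve images.
        lane = self.text_lane if self.text_lane else self.image_lane
        batch, budget = [], self.max_batch_tokens
        # FCFS within the chosen lane, bounded by the token budget.
        # A request larger than the whole budget is still admitted alone.
        while lane and (lane[0].num_tokens <= budget or not batch):
            req = lane.popleft()
            batch.append(req)
            budget -= req.num_tokens
        return batch


q = TwoLanePrefillQueue(max_batch_tokens=1024)
q.add(Request("img-0", 692, is_multimodal=True))
q.add(Request("txt-0", 26))
q.add(Request("txt-1", 26))
first = [r.req_id for r in q.next_prefill_batch()]   # text lane drains first
second = [r.req_id for r in q.next_prefill_batch()]  # then the image request
print(first, second)
```

The image request arrived first but is prefilled second, which is the one extra text-batch delay described above. This sketch has no guard against a continuous stream of text starving the image lane.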

3. Vision Encoders Run Redundantly

In a multi-turn conversation, the ViT runs on every turn, even when the image hasn't changed. An engine without an embedding cache re-encodes the same image on every request.

Fix: content-addressed embedding cache — SHA-1 keyed cache on encode_images(). Same image hits the cache; zero ViT compute.

6 encode_images() calls on the same image (H100):
  Call 1:  188ms  [COLD — ViT runs]
  Call 2:   10ms  [WARM — cache hit]
  ...
  Call 6:    9ms  [WARM — cache hit]

  Speedup: 20x
  5-turn conversation: 940ms → 225ms vision cost  (-76%)

Inspired by SGLang's IPC handle cache (Modal blog) — same principle (cache expensive idempotent work in the hot path), applied to vision encoder outputs.
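
A content-addressed cache of this kind can be sketched as follows. It is a simplified illustration, not the code in qwen2_vl.py: `EmbeddingCache` and `fake_vit` are hypothetical, and hashing the raw image bytes (rather than, say, the preprocessed pixel tensor) is an assumption.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize an expensive encoder by the SHA-1 of the raw image bytes."""

    def __init__(self, encode_fn: Callable[[bytes], List[float]]):
        self._encode_fn = encode_fn   # stand-in for the ViT forward pass
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha1(image_bytes).hexdigest()
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        embedding = self._encode_fn(image_bytes)
        self._store[key] = embedding
        return embedding


def fake_vit(image_bytes: bytes) -> List[float]:
    # Placeholder encoder: a real one would return patch embeddings on the GPU.
    return [b / 255.0 for b in image_bytes[:4]]


cache = EmbeddingCache(fake_vit)
image = b"\x10\x20\x30\x40 same pixels every turn"
for _ in range(6):
    cache.encode(image)
print(cache.misses, cache.hits)  # 1 miss (cold), 5 hits (warm)
```

Because the key is derived from content, the same image loaded from a different path still hits the cache. With the numbers above, five turns cost one cold call plus four warm ones, which is consistent with 940ms (5 × 188ms) dropping to about 225ms.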

Architecture

nanovllm/
├── engine/
│   ├── scheduler.py          # upstream FCFS scheduler
│   ├── vlm_scheduler.py      # VisionAwareScheduler (two-lane prefill)
│   └── sequence.py           # extended with is_multimodal, inputs_embeds
├── models/
│   ├── qwen3.py              # upstream text decoder
│   └── qwen2_vl.py           # Qwen2VLVisionBridge: ViT extraction + embed cache
├── multimodal/
│   ├── hf_engine.py          # HF bridge: full model via transformers
│   ├── llm.py                # VlmLLM: native engine + vision bridge
│   ├── processor.py          # Qwen2-VL processor wrapper
│   └── messages.py           # VlmMessage / ImageInput types
notes/
├── RESULTS.md                # all benchmark results with raw numbers
├── VLM_SCHEDULER_INSIGHT.md  # full write-up of the scheduling problem
└── token_composition.py      # prefill tax analysis script
modal_bench.py                # end-to-end benchmark (runs on Modal H100)

Benchmarks

All results on H100 SXM5 via Modal, Qwen2.5-VL-3B-Instruct, CUDA 12.4 / PyTorch 2.4.0.

Optimization           Mechanism                     Result
VisionAwareScheduler   Text-first two-lane prefill   Text TTFT −81% under mixed load
Image embedding cache  SHA-1 keyed ViT output cache  20× speedup on cache hits, 76% lower vision cost over a 5-turn conversation

Full numbers in notes/RESULTS.md.

Quick Start

pip install -e ".[mm]"
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ~/models/Qwen2.5-VL-3B-Instruct

# HF bridge — validated, runs on any GPU
python examples/vlm_hf_generate.py \
  --model ~/models/Qwen2.5-VL-3B-Instruct \
  --image ./assets/logo.png \
  --prompt "Describe this image."

# Text vs image latency comparison
python examples/vlm_bench_compare.py \
  --model ~/models/Qwen2.5-VL-3B-Instruct \
  --image ./assets/logo.png -n 5

# Full H100 benchmark (requires Modal account)
modal run modal_bench.py

What's Next

Chunked vision prefill — interleave image tokens across multiple prefill rounds instead of one large burst, capping the maximum latency spike per request
LRU embed cache — VRAM-budget-aware eviction policy for the embedding cache
Native end-to-end path — VlmLLM wires the vision bridge into nano-vllm's paged-KV engine; needs end-to-end benchmarking
Shared image KV cache — cache the full KV activations from image tokens, not just ViT embeddings, to make even the first turn of repeated-image requests fast
Resolution-adaptive batching — bin requests by vision token count to minimize padding waste across heterogeneous image sizes
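
As a rough idea of what the LRU embed cache item could look like, here is a minimal byte-budgeted LRU sketch. Nothing here exists in the repo yet: `BudgetedLRU` and `budget_bytes` are hypothetical names, and sizes are passed in explicitly so the example runs without a GPU.

```python
from collections import OrderedDict


class BudgetedLRU:
    """LRU cache that evicts least-recently-used entries past a byte budget."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()   # key -> (value, size_bytes)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key, value, size_bytes: int) -> None:
        if size_bytes > self.budget_bytes:
            return                      # can never fit, so skip caching it
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size_bytes)
        self.used_bytes += size_bytes
        # Evict from the least-recently-used end until back under budget.
        while self.used_bytes > self.budget_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size


cache = BudgetedLRU(budget_bytes=100)
cache.put("a", "emb-a", 40)
cache.put("b", "emb-b", 40)
cache.get("a")               # touch "a" so "b" becomes the eviction candidate
cache.put("c", "emb-c", 40)  # 120 > 100, so the least recently used entry goes
```

For real embeddings, the size of a PyTorch tensor `t` could be computed as `t.numel() * t.element_size()`.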

Relation to Upstream

Upstream nano-vllm text inference is unchanged. All multimodal work is additive:

nanovllm/engine/vlm_scheduler.py — new file, extends Scheduler
nanovllm/engine/sequence.py — added is_multimodal, inputs_embeds, image_grid_thw fields (backwards-compatible pickle)
nanovllm/models/qwen2_vl.py — new file, vision bridge
nanovllm/multimodal/ — new directory
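
One common way to keep pickles backwards-compatible when adding fields is to fill in defaults in `__setstate__`. The sketch below illustrates that pattern only. Apart from the three field names, everything in it is an assumption, and it is not a copy of how sequence.py does it.

```python
class Sequence:
    """Toy sequence; only the three new field names come from the fork."""

    # Defaults applied when an older pickle lacks the multimodal fields.
    _MM_DEFAULTS = {
        "is_multimodal": False,
        "inputs_embeds": None,
        "image_grid_thw": None,
    }

    def __init__(self, token_ids, is_multimodal=False,
                 inputs_embeds=None, image_grid_thw=None):
        self.token_ids = list(token_ids)
        self.is_multimodal = is_multimodal
        self.inputs_embeds = inputs_embeds
        self.image_grid_thw = image_grid_thw

    def __getstate__(self):
        return dict(self.__dict__)

    def __setstate__(self, state):
        # Start from defaults so state written before the fields existed loads.
        self.__dict__.update(self._MM_DEFAULTS)
        self.__dict__.update(state)


# pickle calls __setstate__ on load; simulate a state dict saved before the fork.
old_state = {"token_ids": [1, 2, 3]}
seq = Sequence.__new__(Sequence)
seq.__setstate__(old_state)
print(seq.is_multimodal, seq.inputs_embeds, seq.image_grid_thw)
```

Text-only sequences serialized before the change then load as `is_multimodal=False` with no embeddings attached.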

See FORK.md for the full delta from upstream.
