
krishs0404/nanovllm-mm

nanovllm-mm

A research fork of nano-vllm exploring how to make multimodal (vision-language) inference faster and smarter — without the complexity of a production engine like vLLM.

nano-vllm is ~1,200 lines of Python: paged KV cache, continuous batching, CUDA graphs. Small enough to actually understand. This fork asks: what breaks when you add vision, and what can you do about it?


What We Found

1. The Prefill Tax

One image adds roughly 667 extra prompt tokens to the prefill batch, which makes the prompt about 26.6× as long as the same text-only prompt. Almost all of the latency gap between text and image inference comes from this extra prefill work.

Measured on H100 (Qwen2.5-VL-3B-Instruct):
  text-only:   26 tokens →  51ms
  image+text: 692 tokens → 281ms   (5.5× slower, 26.6× more tokens)
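
The ratios quoted above can be re-derived from the two measurements. The snippet below is only a worked-arithmetic sketch using the numbers in this section; the variable names are illustrative and not taken from the repository.

```python
# Measurements quoted above (H100, Qwen2.5-VL-3B-Instruct).
text_tokens, text_ms = 26, 51
image_tokens, image_ms = 692, 281

token_ratio = image_tokens / text_tokens    # growth in prompt length
latency_ratio = image_ms / text_ms          # growth in prefill latency

# Per-token cost: the fixed overhead is amortized over more tokens in the image case.
ms_per_token_text = text_ms / text_tokens
ms_per_token_image = image_ms / image_tokens

print(f"token ratio:   {token_ratio:.1f}x")
print(f"latency ratio: {latency_ratio:.1f}x")
print(f"ms/token: text={ms_per_token_text:.2f}, image={ms_per_token_image:.2f}")
```

Latency grows far less than token count (about 5.5× versus 26.6×), which is what one would expect if prefill processes prompt tokens in parallel and the short text prompt is dominated by fixed overhead. That interpretation is an assumption, not something measured here.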

2. Schedulers Are Blind to Token Type

nano-vllm's FCFS scheduler treats all tokens equally. When an image request and a text request arrive together, they are batched into the same prefill, so the text request waits for the image prefill to finish even though it needs only about 51ms on its own. This is head-of-line blocking.

Fix: VisionAwareScheduler — two-lane prefill. Text requests always get their own batch first. Images are admitted only when no text is waiting.

10 mixed requests, simultaneous arrival (H100):
  Text P50 TTFT:   328ms → 63ms   (-81%)
  Image P50 TTFT:  656ms → 720ms  (+10%)

Text latency drops 81%. Images pay one extra text-batch delay. The tradeoff is explicit and bounded.
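
The two-lane idea can be illustrated with a small standalone sketch. This is not the code in vlm_scheduler.py: the `Request` and `TwoLanePrefillQueue` names, the `max_batch_tokens` budget, and the admission loop are all hypothetical and exist only to show the admission rule described above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; the real engine tracks far more state.
    req_id: int
    num_prompt_tokens: int
    is_multimodal: bool = False


class TwoLanePrefillQueue:
    """Toy admission policy: text prefill first, images only when no text waits."""

    def __init__(self, max_batch_tokens: int = 4096):
        self.text_lane = deque()
        self.image_lane = deque()
        self.max_batch_tokens = max_batch_tokens  # assumed per-batch token budget

    def add(self, req: Request) -> None:
        lane = self.image_lane if req.is_multimodal else self.text_lane
        lane.append(req)

    def next_prefill_batch(self) -> list[Request]:
        # Pick the text lane whenever it is non-empty; otherwise fall back to images.
        lane = self.text_lane if self.text_lane else self.image_lane
        batch, budget = [], self.max_batch_tokens
        # Always admit at least one request so an oversized prompt cannot stall the lane.
        while lane and (not batch or lane[0].num_prompt_tokens <= budget):
            req = lane.popleft()
            budget -= req.num_prompt_tokens
            batch.append(req)
        return batch


queue = TwoLanePrefillQueue(max_batch_tokens=4096)
for i in range(10):
    is_image = i % 2 == 0
    queue.add(Request(i, 692 if is_image else 26, is_multimodal=is_image))

first = queue.next_prefill_batch()   # all text requests
second = queue.next_prefill_batch()  # image requests, once no text is waiting
```

Note that this sketch has no starvation guard: under a sustained stream of text requests the image lane would never be served. A real scheduler would likely need an aging or maximum-wait rule to keep the image delay bounded.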

3. Vision Encoders Run Redundantly

In a multi-turn conversation, the ViT runs on every turn — even when the image hasn't changed. Standard engines re-encode the same image on every request.

Fix: content-addressed embedding cache — SHA-1 keyed cache on encode_images(). Same image hits the cache; zero ViT compute.

6 encode_images() calls on the same image (H100):
  Call 1:  188ms  [COLD — ViT runs]
  Call 2:   10ms  [WARM — cache hit]
  ...
  Call 6:    9ms  [WARM — cache hit]

  Speedup: 20x
  5-turn conversation: 940ms → 225ms vision cost  (-76%)

Inspired by SGLang's IPC handle cache (Modal blog) — same principle (cache expensive idempotent work in the hot path), applied to vision encoder outputs.
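
A content-addressed cache of this kind can be sketched in a few lines. The `EmbeddingCache` class and the stand-in `fake_vit` encoder below are hypothetical and only illustrate the keying scheme; the actual cache lives in the vision bridge and wraps encode_images().

```python
import hashlib
from typing import Callable, Dict


class EmbeddingCache:
    """Cache encoder outputs keyed by the SHA-1 digest of the raw image bytes."""

    def __init__(self, encode_fn: Callable[[bytes], object]):
        self._encode_fn = encode_fn
        self._store: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, image_bytes: bytes) -> object:
        # Identical bytes always hash to the same key, regardless of filename or turn.
        key = hashlib.sha1(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._encode_fn(image_bytes)  # the expensive ViT call in practice
        self._store[key] = embedding
        return embedding


vit_calls = 0


def fake_vit(image_bytes: bytes) -> list:
    # Stand-in for the vision encoder; counts how often it actually runs.
    global vit_calls
    vit_calls += 1
    return [len(image_bytes)]


cache = EmbeddingCache(fake_vit)
image = b"\x89PNG placeholder bytes"
for _ in range(6):
    cache.encode(image)

print(cache.misses, cache.hits, vit_calls)
```

With the timings above (188ms cold, roughly 9ms warm), five turns on one image cost about 188 + 4 × 9 ≈ 224ms instead of 5 × 188 = 940ms, which is consistent with the reported 225ms. Hashing the raw bytes means any re-encoding of the same picture (for example a resized copy) produces a different key and a cache miss.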


Architecture

nanovllm/
├── engine/
│   ├── scheduler.py          # upstream FCFS scheduler
│   ├── vlm_scheduler.py      # VisionAwareScheduler (two-lane prefill)
│   └── sequence.py           # extended with is_multimodal, inputs_embeds
├── models/
│   ├── qwen3.py              # upstream text decoder
│   └── qwen2_vl.py           # Qwen2VLVisionBridge: ViT extraction + embed cache
├── multimodal/
│   ├── hf_engine.py          # HF bridge: full model via transformers
│   ├── llm.py                # VlmLLM: native engine + vision bridge
│   ├── processor.py          # Qwen2-VL processor wrapper
│   └── messages.py           # VlmMessage / ImageInput types
notes/
├── RESULTS.md                # all benchmark results with raw numbers
├── VLM_SCHEDULER_INSIGHT.md  # full write-up of the scheduling problem
└── token_composition.py      # prefill tax analysis script
modal_bench.py                # end-to-end benchmark (runs on Modal H100)

Benchmarks

All results on H100 SXM5 via Modal, Qwen2.5-VL-3B-Instruct, CUDA 12.4 / PyTorch 2.4.0.

| Optimization | Mechanism | Result |
|---|---|---|
| VisionAwareScheduler | Text-first two-lane prefill | Text TTFT −81% under mixed load |
| Image embedding cache | SHA-1 keyed ViT output cache | 20× speedup, 76% less vision compute per conversation |

Full numbers in notes/RESULTS.md.


Quick Start

pip install -e ".[mm]"
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ~/models/Qwen2.5-VL-3B-Instruct

# HF bridge — validated, runs on any GPU
python examples/vlm_hf_generate.py \
  --model ~/models/Qwen2.5-VL-3B-Instruct \
  --image ./assets/logo.png \
  --prompt "Describe this image."

# Text vs image latency comparison
python examples/vlm_bench_compare.py \
  --model ~/models/Qwen2.5-VL-3B-Instruct \
  --image ./assets/logo.png -n 5

# Full H100 benchmark (requires Modal account)
modal run modal_bench.py

What's Next

  • Chunked vision prefill — interleave image tokens across multiple prefill rounds instead of one large burst, capping the maximum latency spike per request
  • LRU embed cache — VRAM-budget-aware eviction policy for the embedding cache
  • Native end-to-end path: VlmLLM wires the vision bridge into nano-vllm's paged-KV engine; needs end-to-end benchmarking
  • Shared image KV cache — cache the full KV activations from image tokens, not just ViT embeddings, to make even the first turn of repeated-image requests fast
  • Resolution-adaptive batching — bin requests by vision token count to minimize padding waste across heterogeneous image sizes
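
As one possible shape for the LRU embed cache item above, the sketch below evicts least-recently-used entries once a byte budget is exceeded. Nothing here exists in the repository yet: the `LRUByteBudgetCache` name, the `max_bytes` parameter, and the caller-supplied `nbytes` size are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUByteBudgetCache:
    """Least-recently-used cache bounded by total bytes rather than entry count."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, nbytes), oldest first

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key, value, nbytes: int) -> None:
        if nbytes > self.max_bytes:
            return  # can never fit within the budget; skip caching
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        # Evict from the cold end until the new entry fits.
        while self.used_bytes + nbytes > self.max_bytes:
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_bytes
        self._entries[key] = (value, nbytes)
        self.used_bytes += nbytes


cache = LRUByteBudgetCache(max_bytes=100)
cache.put("a", "emb_a", 40)
cache.put("b", "emb_b", 40)
cache.get("a")               # touch "a" so "b" becomes the eviction candidate
cache.put("c", "emb_c", 40)  # exceeds the budget, so "b" is evicted
```

For PyTorch tensors, the entry size could be computed as `tensor.numel() * tensor.element_size()`, and the budget would be chosen relative to the VRAM left over after model weights and the paged KV cache.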

Relation to Upstream

Upstream nano-vllm text inference is unchanged. All multimodal work is additive:

  • nanovllm/engine/vlm_scheduler.py — new file, extends Scheduler
  • nanovllm/engine/sequence.py — added is_multimodal, inputs_embeds, image_grid_thw fields (backwards-compatible pickle)
  • nanovllm/models/qwen2_vl.py — new file, vision bridge
  • nanovllm/multimodal/ — new directory

See FORK.md for full delta from upstream.

About

Multimodal inference experiments on nano-vllm: vision-aware scheduling, image embedding cache, H100 benchmarks
