
vLLM TPU

| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |


🀝 Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.

bug good first issue enhancement contribution-welcome auto-generated View All Issues

Latest News

Previous News πŸ”₯

About

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTorch under a single lowering path within the vLLM project. The backend provides a framework for developers to:

  • Push the limits of TPU hardware performance in open source.
  • Provide more flexibility to JAX and PyTorch users by running PyTorch model definitions performantly on TPU without any additional code changes, while also extending native support to JAX.
  • Retain vLLM standardization: keep the same user experience, telemetry, and interface.

Recommended models and features

Although vLLM TPU’s new unified backend makes out-of-the-box, high-performance serving possible for any model supported in vLLM, a few core components are still being implemented.

For this reason, we’ve provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.


Get started

Get started with vLLM on TPUs by following the quickstart guide.
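For orientation, serving on a TPU VM follows standard vLLM usage once the backend is installed per the quickstart. The model and flags below are illustrative (the model is one from the tested list), not prescriptive defaults:

```shell
# Serve a tested model behind vLLM's OpenAI-compatible API
# (see the quickstart guide for the exact install steps first).
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048

# From another shell, send a completion request to the default port:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```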

Visit our documentation to learn more.

Compatible TPU Generations

  • Recommended: v7x, v5e, v6e
  • Experimental: v3, v4, v5p

Recipes


TPU Support Matrix Dashboard

Below is the live status of our supported models, features, and kernels. Click any category to expand its support table. The data is updated automatically from our detailed Support Matrices.

Last Updated: 2026-04-14 07:31 PM UTC

🚦 Status Legend
  • βœ… Passing: Tested and works as expected. Ready for use.
  • ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
  • πŸ§ͺ Experimental: Works, but unoptimized or pending community validation.
  • πŸ“ Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: There is no benefit to adding this.
  • ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
πŸ“ View Matrix Aggregation Rules (v6e/v7x & C+P)
  • πŸ› οΈ Correctness + Performance (C + P)

    • ❌ Failing: If either check fails.
    • βœ… Passing: If BOTH checks pass successfully.
    • ❓ Untested: If any check is untested (and neither fails).
  • 🌐 Hardware Rollups (v6e + v7x)

    • ❌ Failing: If the feature fails on either v6e or v7x.
    • βœ… Passing: If the feature passes on BOTH v6e and v7x.
    • ❓ Untested: If either generation is untested (and neither fails).
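Read operationally, both aggregation rules are the same fold over a pair of statuses: failure dominates, then untested, and passing requires both sides. A hypothetical sketch (not code from this repo):

```python
# Sketch of the aggregation rules above (hypothetical helper, not part of
# the tpu-inference codebase): ❌ dominates, then ❓, and βœ… needs both.
PASS, FAIL, UNTESTED = "βœ…", "❌", "❓"

def rollup(a: str, b: str) -> str:
    """Combine two statuses: the C+P checks, or the v6e+v7x hardware results."""
    if FAIL in (a, b):        # ❌ if either side fails
        return FAIL
    if UNTESTED in (a, b):    # ❓ if any side is untested (and neither fails)
        return UNTESTED
    return PASS               # βœ… only when both sides pass
```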


Nightly Support Matrices

Click to expand support matrices

Support status for the latest nightly/main branch developments.

βœ… Tested Models
| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | βœ… | βœ… | βœ… |
| meta-llama/Llama-3.1-8B-Instruct | Text | βœ… | βœ… | βœ… |
| meta-llama/Llama-3.3-70B-Instruct | Text | βœ… | βœ… | βœ… |
| Qwen/Qwen3-30B-A3B | Text | βœ… | βœ… | βœ… |
| Qwen/Qwen3-32B | Text | βœ… | βœ… | βœ… |
| Qwen/Qwen3-4B | Text | βœ… | βœ… | βœ… |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | βœ… | βœ… | βœ… |
| Qwen/Qwen3.5-397B-A17B | Text | βœ… | βœ… | ❌ |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | βœ… | ❌ | ❓ |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2.5 | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3.5-9B | Multimodal | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-R1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ | ❓ | ❓ |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2-Thinking | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-120b | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-20b | Text | ❓ | ❓ | ❓ |
| zai-org/GLM-5 | Text | ❓ | ❓ | ❓ |
πŸš€ Advanced Capabilities
Core Features
| Feature | Flax | Torchax | Default |
|---|---|---|---|
| Chunked Prefill | βœ… | βœ… | βœ… |
| DCN-based P/D disaggregation | βœ… | βœ… | βœ… |
| LoRA_Torch | βœ… | βœ… | βœ… |
| Prefix Caching | βœ… | βœ… | βœ… |
| Single Program Multi Data | βœ… | βœ… | βœ… |
| Speculative Decoding: Ngram | βœ… | βœ… | ❌ |
| Speculative Decoding: Eagle3 | βœ… | ❌ | βœ… |
| async scheduler | ❌ | βœ… | ❌ |
| Multimodal Inputs | ❌ | ❌ | ❌ |
| Out-of-tree model support | ❌ | ❌ | ❌ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| KV cache host offloading | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| Single-Host-P-D-disaggregation | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |
Parallelism Techniques
| Feature | Flax Single-host | Flax Multi-host | Torchax Single-host | Torchax Multi-host |
|---|---|---|---|---|
| PP | βœ… | ❌ | βœ… | ❌ |
| DP | βœ… | ❓ | βœ… | ❓ |
| EP | βœ… | ❓ | βœ… | ❓ |
| TP | βœ… | ❓ | βœ… | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |
Quantization Methods
| Checkpoint dtype | Method | Supported Hardware | Acceleration (Flax) | Acceleration (Torchax) |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | ❌ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❌ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❌ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❌ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❌ | ❓ |

Note:

  • This table only tests checkpoint loading compatibility.
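As one illustrative example of what "checkpoint loading" means here, an INT4 AWQ checkpoint would be loaded through the standard vLLM quantization flag; the model name below is a placeholder, and whether the loaded model then runs well on a given TPU generation is what the hardware column tracks:

```shell
# Illustrative only: load an INT4 AWQ checkpoint via the standard vLLM flag.
# Replace <awq-quantized-model> with an actual AWQ checkpoint.
vllm serve <awq-quantized-model> --quantization awq
```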
πŸ”¬ Microbenchmark Kernel Support
| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| MoE | gmm | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Dense | All-gather matmul | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Generic Ragged Paged Attention V3* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | MLA | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Ragged Paged Attention V3 Head_Dim 64* | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |

Note:

  • For attention kernels, W[x]A[y] denotes KV cache as W, A as compute, and x, y as bit precision.
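The tags can be decoded mechanically; a small hypothetical helper (not part of the repo) makes the convention concrete:

```python
import re

def parse_waxy(tag: str) -> tuple[int, int]:
    """Decode a W[x]A[y] precision tag into (x, y) bit widths.

    For attention kernels, x is the KV-cache precision and y the compute
    precision; e.g. W8A16 means an 8-bit KV cache with 16-bit compute.
    """
    m = re.fullmatch(r"W(\d+)A(\d+)", tag)
    if m is None:
        raise ValueError(f"not a W[x]A[y] tag: {tag!r}")
    return int(m.group(1)), int(m.group(2))
```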

🀝 Contribute

bug good first issue enhancement contribution-welcome auto-generated View All Issues

We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.

Ways to Contribute:

  • 🐞 Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
  • πŸ‘€ Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
  • πŸ“š Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.

If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:

  • Tips for finding an issue to work on (we recommend starting with our good first issues!).

🌟 Contributors Wall

A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!

🌟 Contribution Type Legend & Ranking
Emoji Contribution Meaning
πŸ’» Code Submitted merged pull requests or code changes.
πŸ› Issues Opened valid issues or bug reports.
πŸ‘€ Reviews Reviewed pull requests and provided feedback.

πŸ† Ranking: Contributors are sorted from highest to lowest based on their total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). If there is a tie, contributors are displayed alphabetically.


xiangxu-google πŸ’»
jrplatin πŸ› πŸ‘€ πŸ’»
buildkite-bot πŸ’»
kyuyeunk πŸ› πŸ‘€ πŸ’»
py4 πŸ’»
fenghuizhang πŸ’»
lk-chen πŸ› πŸ‘€ πŸ’»
wenxindongwork πŸ‘€ πŸ’»
vanbasten23 πŸ‘€ πŸ’»
sixiang-google πŸ’»
lsy323 πŸ’»
Lumosis πŸ’»
QiliangCui πŸ‘€ πŸ’»
Chenyaaang πŸ‘€ πŸ’»
bzgoogle πŸ‘€ πŸ’»
gpolovets1 πŸ‘€ πŸ’»
mrjunwan-lang πŸ‘€ πŸ’»
yarongmu-google πŸ’»
wwl2755-google πŸ’»
yaochengji πŸ’»
patemotter πŸ‘€ πŸ’»

...and more! Click to view all contributors.
boe20211 πŸ’»
jcyang43 πŸ‘€ πŸ’»
kwang3939 πŸ‘€ πŸ’»
bythew3i πŸ’»
pv97 πŸ‘€ πŸ’»
karan πŸ› πŸ’»
dennisYehCienet πŸ‘€ πŸ’»
syhuang22 πŸ‘€ πŸ’»
helloworld1 πŸ› πŸ‘€ πŸ’»
ica-chao πŸ’»
richardsliu πŸ‘€ πŸ’»
catswe πŸ‘€ πŸ’»
RobMulla πŸ› πŸ’»
xingliu14 πŸ› πŸ’»
juncgu-google πŸ‘€
saltysoup πŸ›
weiyu0824 πŸ‘€ πŸ’»
andrewkvuong πŸ’»
rupengliu-meta πŸ› πŸ’»
bvrockwell πŸ› πŸ’»
sierraisland πŸ’»
wang2yn84 πŸ’»
wdhongtw πŸ’»
JiriesKaileh πŸ’»
ylangtsou πŸ’»
amacaskill πŸ’»
BirdsOfAFthr πŸ’»
patrickji2014 πŸ‘€ πŸ’»
qihqi πŸ› πŸ’»
yuanfz98 πŸ›
cychiuak πŸ’»
hosseinsarshar πŸ› πŸ’»
samos123 πŸ›
AlienKevin πŸ›
dgouju πŸ›
eitanporat πŸ›
ernie-chang πŸ’»
lepan-google πŸ› πŸ’»
muskansh-google πŸ›
saikat-royc πŸ‘€
abhinavclemson πŸ’»
aman2930 πŸ’»
BabyChouSr πŸ›
CienetStingLin πŸ’»
coolkp πŸ’»
functionstackx πŸ›
helloleah πŸ’»
mailvijayasingh πŸ’»
QiliangCui2023 πŸ‘€
shireen-bean πŸ›
utkarshsharma1 πŸ’»
A9isha πŸ’»
AahilA πŸ’»
amishacorns πŸ’»
carlesoctav πŸ›
dannikay πŸ’»
depksingh πŸ›
Dineshkumar-Anandan-ZS0367 πŸ›
dtrifiro πŸ›
erfanzar πŸ›
inho9606 πŸ’»
jk1333 πŸ›
jyj0w0 πŸ‘€
kuafou πŸ’»
kyle-google πŸ’»
Mhdaw πŸ›
mokeddembillel πŸ›
oindrila-b πŸ›
oliverdutton πŸ›
pathfinder-pf πŸ›
piotrfrankowski πŸ›
reeaz27-droid πŸ›
rupeng-liu πŸ’»
salmanmohammadi πŸ›
vlad-karp πŸ’»
XMaster96 πŸ›
yixinshi πŸ‘€
yuyanpeng-google πŸ’»
zixi-qi πŸ’»
zongweiz πŸ›
zzzwen πŸ’»

πŸ’¬ Contact us
