| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |
🤝 Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.
- Announcing Gemma 4 on vLLM: byte for byte, the most capable open models, available on TPUs on Day 0!
Previous News 🔥
- PyTorch Conference: Learn how Spotify uses vLLM with both GPUs and TPUs to drive down costs and improve user experience.
- Ray Summit, November 3-5 in San Francisco!
- JAX DevLab on November 18th in Sunnyvale!
- [2025/10] vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path within the vLLM project. The new backend provides a framework for developers to:
- Push the limits of TPU hardware performance in open source.
- Provide more flexibility to JAX and PyTorch users by running PyTorch model definitions performantly on TPU without any additional code changes, while also extending native support to JAX.
- Retain vLLM standardization: keep the same user experience, telemetry, and interface.
Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we're still in the process of implementing a few core components.
For this reason, we've provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.
Get started with vLLM on TPUs by following the quickstart guide.
Visit our documentation to learn more.
Compatible TPU Generations
- Recommended: v7x, v5e, v6e
- Experimental: v3, v4, v5p
Below is the live status of our supported models, features, and kernels. Click on any category to expand the detailed support table. It is automatically updated from our detailed Support Matrices.
Last Updated: 2026-04-14 07:31 PM UTC
📦 Status Legend
- ✅ Passing: Tested and works as expected. Ready for use.
- ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
- 🧪 Experimental: Works, but unoptimized or pending community validation.
- 📅 Planned: Not yet implemented, but on the official roadmap.
- ⛔️ Unplanned: There is no benefit to adding this.
- ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
View Matrix Aggregation Rules (v6e/v7x & C+P)
🛠️ Correctness + Performance (C + P)
- ❌ Failing: If either check fails.
- ✅ Passing: If BOTH checks pass successfully.
- ❓ Untested: If any check is untested (and neither fails).
Hardware Rollups (v6e + v7x)
- ❌ Failing: If the feature fails on either v6e or v7x.
- ✅ Passing: If the feature passes on BOTH v6e and v7x.
- ❓ Untested: If either generation is untested (and neither fails).
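Both rollups above reduce to the same pairwise rule. As an illustrative sketch only (the status tables are generated by the project's own tooling, not by this snippet):

```python
# Stand-in status values for the legend symbols.
PASS, FAIL, UNTESTED = "passing", "failing", "untested"

def rollup(a: str, b: str) -> str:
    """Combine two check results (C + P, or v6e + v7x) into one status."""
    if FAIL in (a, b):
        return FAIL      # Failing: either check fails
    if a == b == PASS:
        return PASS      # Passing: BOTH checks pass
    return UNTESTED      # Untested: some check untested, and neither fails

print(rollup(PASS, PASS))      # passing
print(rollup(PASS, UNTESTED))  # untested
print(rollup(FAIL, UNTESTED))  # failing
```

The same function covers both tables because the rules are identical; only the inputs differ (correctness/performance checks vs. v6e/v7x runs).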
Click to expand support matrices
Support status for the latest nightly/main branch developments.
Tested Models
Advanced Capabilities
Core Features
| Feature | Flax | Torchax | Default |
| --- | --- | --- | --- |
| Chunked Prefill | β | β | β |
| DCN-based P/D disaggregation | β | β | β |
| LoRA_Torch | β | β | β |
| Prefix Caching | β | β | β |
| Single Program Multi Data | β | β | β |
| Speculative Decoding: Ngram | β | β | β |
| Speculative Decoding: Eagle3 | β | β | β |
| async scheduler | β | β | β |
| Multimodal Inputs | β | β | β |
| Out-of-tree model support | β | β | β |
| hybrid kv cache | β | β | β |
| KV cache host offloading | β | β | β |
| multi-host | β | β | β |
| runai_model_streamer_loader | β | β | β |
| sampling_params | β | β | β |
| Single-Host-P-D-disaggregation | β | β | β |
| structured_decoding | β | β | β |
Parallelism Techniques
| Feature | Flax (single-host) | Flax (multi-host) | Torchax (single-host) | Torchax (multi-host) |
| --- | --- | --- | --- | --- |
| PP | β | β | β | β |
| DP | β | β | β | β |
| EP | β | β | β | β |
| TP | β | β | β | β |
| CP | β | β | β | β |
| SP (vote to prioritize) | β | β | β | β |
Quantization Methods
| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
| --- | --- | --- | --- | --- |
| FP4 W4A16 | mxfp4 | v7 | β | β |
| FP8 W8A16 | compressed-tensor | v7 | β | β |
| FP8 W8A8 | compressed-tensor | v7 | β | β |
| INT4 W4A16 | awq | v5, v6 | β | β |
| INT8 W8A8 | compressed-tensor | v5, v6 | β | β |
Note:
- This table covers checkpoint loading compatibility only.
🔬 Microbenchmark Kernel Support
| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoE | Fused MoE | β | β | β | β | β | β |
| MoE | gmm | β | β | β | β | β | β |
| Dense | All-gather matmul | β | β | β | β | β | β |
| Attention | Generic Ragged Paged Attention V3* | β | β | β | β | β | β |
| Attention | MLA | β | β | β | β | β | β |
| Attention | Ragged Paged Attention V3 Head_Dim 64* | β | β | β | β | β | β |
Note:
- For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are the bit widths.
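To make the notation concrete, here is a tiny hypothetical helper (not part of tpu-inference) that splits a W[x]A[y] label into its two bit widths:

```python
import re

def parse_wxay(label: str) -> tuple[int, int]:
    """Split a W[x]A[y] label such as 'W8A16' into (x, y) bit widths.

    For the attention kernels above, x is the KV-cache precision and
    y is the compute precision.
    """
    match = re.fullmatch(r"W(\d+)A(\d+)", label)
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return int(match.group(1)), int(match.group(2))

# e.g. W8A16 means an 8-bit KV cache with 16-bit compute:
print(parse_wxay("W8A16"))  # (8, 16)
```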
We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.
Ways to Contribute:
- Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
- Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
- Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.
If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:
- Tips for finding an issue to work on (we recommend starting with our good-first issues!).
A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!
Contribution Type Legend & Ranking
| Emoji | Contribution | Meaning |
| --- | --- | --- |
| 💻 | Code | Submitted merged pull requests or code changes. |
| 🐛 | Issues | Opened valid issues or bug reports. |
| 👀 | Reviews | Reviewed pull requests and provided feedback. |
Ranking: Contributors are sorted from highest to lowest by total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). Ties are broken alphabetically.
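The ranking rule can be sketched as follows (the names and field layout are invented for illustration; the real list is generated by project tooling):

```python
# Hypothetical contributor records, not real project data.
contributors = [
    {"name": "alice", "commits": 5, "issues": 2, "reviews": 1},
    {"name": "bob",   "commits": 4, "issues": 3, "reviews": 1},
    {"name": "carol", "commits": 9, "issues": 0, "reviews": 0},
]

def effort(c: dict) -> int:
    # Total effort score = commits + unique issues opened + PRs reviewed.
    return c["commits"] + c["issues"] + c["reviews"]

# Sort by score descending; equal scores fall back to alphabetical order.
ranked = sorted(contributors, key=lambda c: (-effort(c), c["name"]))
print([c["name"] for c in ranked])  # ['carol', 'alice', 'bob']
```

Here alice and bob tie at a score of 8, so alphabetical order places alice first.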
...and more! Click to view all contributors.
- For technical questions and feature requests, open a GitHub Issue
- To discuss with fellow users, use the TPU support topic in the vLLM Forum
- For coordinating contributions and development, use the Developer Slack
- For collaborations and partnerships, contact us at vllm-tpu@google.com

