You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
-:fire:**2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT**. With mtp=3, we observe decoding rates up to **590 tokens/s** under synthetic workloads.
23
+
-:fire:**2026-02-14 · [Try the Online Demo](https://www.tilert.ai/)**. Our online demo is now live! Experience ultra-low-latency inference with **GLM-5** and **DeepSeek-V3.2**. [Try it now !](https://www.tilert.ai)
24
+
25
+
- 🎉 **2026-02-14 · [v0.1.3](https://github.com/tile-ai/TileRT/releases/tag/v0.1.3) Released**. The v0.1.3 release introduces full support for the latest GLM-5 model, achieving up to 500 tokens/s on GLM-5-FP8 and up to 600 tokens/s on DeepSeek-V3.2.
26
+
27
+
- 🚀 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP)** is now available in TileRT! With mtp=3, we achieve decoding rates of up to **590 tokens/s** under synthetic workloads.
28
+
29
+
<details>
30
+
<summary>Key Milestones</summary>
24
31
25
32
- ⚡ **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved ~**35% further reduction** (3 ~ 4x speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.
26
33
27
34
- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
**TileRT** is a project designed to serve large language models (LLMs) in ultra-low-latency scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—enabling models with hundreds of billions of parameters to achieve millisecond-level time per output token (TPOT).
43
+
44
+
In our latest **v0.1.3** release, we tested **TileRT's** performance on the newest [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model, demonstrating the effectiveness of our approach in real-world applications. We were among the first to support this latest model, validating the power of the technology we've developed.
34
45
35
-
TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
46
+
Using the [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs, we evaluated TileRT’s preliminary performance. As shown in the benchmarks below, TileRT demonstrates substantial improvements over existing inference systems.
Figure 1. Evaluation setup. Batch size: 1; Input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; Output sequence length: 1K; Benchmark with <ahref="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed when MTP=3, so we set MTP=1 as <ahref="https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html">vLLM-GPT5-recipe</a>); TileRT v0.1.3 with MTP=3.
40
51
</p>
41
52
42
-
We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
@@ -117,36 +126,46 @@ You're now ready to use TileRT! Proceed to the [Getting Started](#getting-starte
117
126
118
127
## Getting Started
119
128
120
-
### Download Pre-Converted Weights from HuggingFace
129
+
### Step 1: Download Official Model Weights
130
+
131
+
Starting from release v0.1.3, TileRT no longer requires downloading pre-converted weights from Hugging Face. Instead, you can download the official model weights directly from the model's source (e.g., Hugging Face), and then convert them using the weight converter script included with the latest TileRT release.
121
132
122
-
TileRT requires preprocessing of the original DeepSeek-V3.2-Exp model weights before they can be used for ultra-low-latency inference.
123
-
To simplify this process, we provide **pre-converted weights** directly on HuggingFace so users do not need to run the preprocessing pipeline themselves.
133
+
### Step 2: Convert Weights Using `weight_converter.py`
124
134
125
-
You can download the weights using one of the recommended methods below:
135
+
After downloading the official model weights, you can use the following command to convert them into a format compatible with TileRT:
126
136
127
-
#### Option 1: Using `huggingface-cli` (recommended)
This will download all files into the `./tilert_weights`directory.
146
+
Replace `/path/to/DeepSeek-V3.2` with the directory where you've downloaded the model weights, and `/path/to/DeepSeek-V3.2-TileRT` with the directory where you'd like the converted weights to be saved.
For additional download methods or advanced usage, please refer to the official Hugging Face documentation.
157
+
Replace `/path/to/GLM-5-FP8` with the directory containing the downloaded GLM-5 model weights, and `/path/to/GLM-5-FP8-TileRT` with the desired location for saving the converted weights.
158
+
159
+
### Step 3: Set the Converted Weights Directory
143
160
144
-
After downloading the weights, point TileRT to the directory using:
161
+
Once the weights are converted, set the environment variable to point TileRT to the directory containing the converted weights:
145
162
146
163
```bash
147
-
export MODEL_WEIGHTS_DIR=/path/to/tilert_weights
164
+
export MODEL_WEIGHTS_DIR= ... # converted weights
148
165
```
149
166
167
+
Now you're ready to use TileRT with the converted weights!
168
+
150
169
### Running the Generation Example
151
170
152
171
After downloading the model weights, you can run the generation example within the Docker environment as follows:
@@ -203,11 +222,6 @@ This example demonstrates basic single-step autoregressive generation using the
203
222
204
223
### Running the Generation Example with Multi-Token Prediction (MTP)
205
224
206
-
> \[!IMPORTANT\]
207
-
> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
208
-
> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
209
-
> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.
210
-
211
225
TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.
212
226
213
227
To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
0 commit comments