Release TileRT v0.1.2-alpha.1 with initial support for Multi-Token Prediction (MTP). With mtp=3, decoding reaches up to 590 tokens/s on synthetic workloads and ~440 tokens/s on real generation tasks.
<a id="news"></a>
## 📰 News
- 🔥 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT.** With mtp=3, we observe decoding rates up to **590 tokens/s** under synthetic workloads.
- ⚡ **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved a further ~**35% reduction** (a 3-4x speedup over the baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.
- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
</p>
We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
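Throughput and time per output token (TPOT) are two views of the same quantity, which makes it easy to relate the decoding rates quoted above to per-token latency. A minimal sketch of the conversion (the helper functions below are ours for illustration, not part of TileRT):

```python
def tpot_ms(total_seconds: float, num_output_tokens: int) -> float:
    """Time per output token (TPOT), in milliseconds."""
    return total_seconds / num_output_tokens * 1000.0

def tokens_per_second(tpot_ms_value: float) -> float:
    """Inverse conversion: TPOT (ms) back to decode throughput."""
    return 1000.0 / tpot_ms_value

# At 590 tokens/s, TPOT works out to roughly 1.69 ms per token.
print(round(tpot_ms(1.0, 590), 2))  # ~1.69
```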
To achieve this, TileRT introduces a **tile-level runtime engine**.
The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.
prompt = (
    "Tell me three jokes:\n\n"
    "1. A dad joke,\n"
    "2. A programmer joke,\n"
    "3. A joke that only makes sense if you've ever tried "
    "to train a large language model.\n"
    "Keep each joke under 15 words."
)

print("Prompt:", prompt)
print("Completion:")
completion = generator.generate(prompt)
```
For example, TileRT may generate:
<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
1. I'm afraid for the calendar. Its days are numbered.
2. There are only 10 kinds of people: those who understand binary and those who don't.
3. My model's loss is low, but its answers are still nonsense. Overfitting.
```

</details>
This example demonstrates basic single-step autoregressive generation using the precompiled model.
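For intuition about what "single-step" means here, plain autoregressive decoding emits exactly one token per forward pass, so the sequential depth equals the output length. A self-contained toy sketch of that loop (the stand-in next-token function and all names below are ours, not TileRT API):

```python
def toy_next_token(context: list[int]) -> int:
    # Deterministic stand-in for a model's argmax over the vocabulary:
    # emits the end-of-sequence token (0) once the context reaches 10 tokens.
    return 0 if len(context) >= 10 else (sum(context) % 100) + 1

def greedy_decode(prompt_ids: list[int], eos_id: int = 0, max_new: int = 32) -> list[int]:
    """One token per forward pass; stops at EOS or the generation budget."""
    out = list(prompt_ids)
    for _ in range(max_new):
        tok = toy_next_token(out)
        out.append(tok)
        if tok == eos_id:
            break
    return out[len(prompt_ids):]

print(greedy_decode([1, 2, 3]))  # ends with the EOS token 0
```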
### Running the Generation Example with Multi-Token Prediction (MTP)
> [!IMPORTANT]
> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.
TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.
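Conceptually, each MTP step drafts several tokens and keeps only the prefix the base model agrees with, so one step can yield anywhere from one to k tokens. A toy sketch of that acceptance rule (our own illustration, not TileRT's implementation):

```python
def mtp_accept(draft: list[int], verify_next) -> int:
    """Count the drafted tokens accepted left to right: stop at the first
    token the verifier (standing in for the base model) would not emit."""
    accepted = 0
    context: list[int] = []
    for tok in draft:
        if verify_next(context) != tok:
            break
        context.append(tok)
        accepted += 1
    return accepted

# Toy verifier that always expects the token equal to the context length.
verify = lambda ctx: len(ctx)
print(mtp_accept([0, 1, 5], verify))  # 2: first two drafts match, third is rejected
```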
To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

prompt = "Tell me 10 jokes, keep them all under 100 words."

print("Prompt:", prompt)
print("Completion:")
completion = generator.generate(prompt)
```
When MTP is enabled, TileRT may report statistics similar to the following during generation:
```text
Accepted length: mean=2.77, min=1, max=4
```
This indicates that, on average, multiple tokens are accepted per decoding step under MTP.
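The reported mean maps directly to the reduction in sequential decoding steps: with k tokens accepted per step on average, decoding needs roughly 1/k as many steps. A small sketch of both the summary line and that conversion (the helpers are ours for illustration, not a TileRT API):

```python
from statistics import mean

def accepted_length_stats(accepted: list[int]) -> str:
    """Summarize accepted tokens per MTP decoding step, log-line style."""
    return (f"Accepted length: mean={mean(accepted):.2f}, "
            f"min={min(accepted)}, max={max(accepted)}")

def step_fraction(mean_accepted: float) -> float:
    """Fraction of one-token-per-step decoding steps still required."""
    return 1.0 / mean_accepted

print(accepted_length_stats([3, 2, 4, 1, 3]))
# With mean=2.77, MTP needs ~36% of the steps of one-token decoding.
print(f"{step_fraction(2.77):.2f}")  # 0.36
```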
<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
Of course! Here are 10 short jokes for you.

1. I told my wife she was drawing her eyebrows too high. She looked surprised.
2. I invented a new word: Plagiarism.
3. Why don't scientists trust atoms? Because they make up everything.
4. I'm reading a book on anti-gravity. It's impossible to put down.
5. What's the best thing about Switzerland? I don't know, but the flag is a big plus.
6. I told my computer I needed a break, and now it won't stop sending me vacation ads.
7. Why did the scarecrow win an award? He was outstanding in his field.
8. What do you call a fake noodle? An impasta.
9. I told my suitcase there's no vacation, and now it has a lot of baggage.
10. Why don't skeletons fight each other? They don't have the guts.
```

</details>
This example highlights how MTP enables TileRT to efficiently generate longer outputs by accepting multiple tokens per decoding step, while preserving the same Python API interface.
For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py).