
Commit deeedc3

lcy-se, sokdtree, yuxiaoguo, xiayuqing0622, jlxue authored
v0.1.3 release. GLM-5 lands! (tile-ai#19)
v0.1.3 release. GLM-5 lands.

Co-authored-by: Guojun Chen <gjchen@live.com>
Co-authored-by: Yuxiao Guo <yuxiao.guo@outlook.com>
Co-authored-by: Yuqing Xia <Xiayuqing0622@outlook.com>
Co-authored-by: Jilong Xue <xuejilong@gmail.com>
Co-authored-by: Lingxiao Ma <xysmlx@gmail.com>
Co-authored-by: Liu Heng <18821707235@163.com>
Co-authored-by: Zheng QiHang <zhengqihang0915@qq.com>
1 parent d18b3ef commit deeedc3


69 files changed: +12460 −2957 lines

README.md

Lines changed: 41 additions & 27 deletions
@@ -20,30 +20,39 @@ ______________________________________________________________________
 ## 📰 News

-- :fire: **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT**. With mtp=3, we observe decoding rates up to **590 tokens/s** under synthetic workloads.
+- :fire: **2026-02-14 · [Try the Online Demo](https://www.tilert.ai/)**. Our online demo is now live! Experience ultra-low-latency inference with **GLM-5** and **DeepSeek-V3.2**. [Try it now!](https://www.tilert.ai)
+
+- 🎉 **2026-02-14 · [v0.1.3](https://github.com/tile-ai/TileRT/releases/tag/v0.1.3) Released**. The v0.1.3 release introduces full support for the latest GLM-5 model, achieving up to 500 tokens/s on GLM-5-FP8 and up to 600 tokens/s on DeepSeek-V3.2.
+
+- 🚀 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP)** is now available in TileRT! With mtp=3, we achieve decoding rates of up to **590 tokens/s** under synthetic workloads.
+
+<details>
+<summary>Key Milestones</summary>

 - **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved ~**35% further reduction** (3~4x speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.

 - 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

+</details>
+
 ______________________________________________________________________

 <a id="overview"></a>

-## TileRT: Pushing LLM Latency to the Limit
+**TileRT** is a project designed to serve large language models (LLMs) in ultra-low-latency scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—enabling models with hundreds of billions of parameters to achieve millisecond-level time per output token (TPOT).
+
+In our latest **v0.1.3** release, we tested **TileRT's** performance on the newest [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model, demonstrating the effectiveness of our approach in real-world applications. We were among the first to support this latest model, validating the power of the technology we've developed.

-TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
+Using the [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs, we evaluated TileRT's preliminary performance. As shown in the benchmarks below, TileRT demonstrates substantial improvements over existing inference systems.

 <p align="center">
-<img src="assets/generate.gif" alt="TileRT Benchmark"><br>
-Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
+<img src="assets/glm5-mtp.png" alt="TileRT Benchmark" width="800"><br>
+Figure 1. Evaluation setup. Batch size: 1; input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; output sequence length: 1K; benchmarked with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed with MTP=3, so we set MTP=1 following the <a href="https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html">vLLM GLM-5 recipe</a>); TileRT v0.1.3 with MTP=3.
 </p>

-We evaluated TileRT's preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
-
 <p align="center">
-<img src="assets/perf.png" alt="TileRT Benchmark" width="500"><br>
-Figure 2. Evaluation setup. Batch size: 1, input/output sequence length: 1K/1K; SGLang v0.5.6, TensorRT-LLM v1.2.0-rc5, vLLM v0.13.0, TileRT v0.1.1 with CUDA 12.9.
+<img src="assets/glm5-without-mtp.png" alt="TileRT Benchmark" width="800"><br>
+Figure 2. Evaluation setup. Batch size: 1; input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; output sequence length: 1K; benchmarked with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0; vLLM v0.16.0rc2.dev173; TileRT v0.1.3.
 </p>

 Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
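As a quick sanity check on the headline rates above: decoding speed and time per output token (TPOT) are reciprocals, so the reported tokens/s translate directly into per-token latency. A small sketch (plain arithmetic, not TileRT code):

```python
def tpot_ms(tokens_per_s: float) -> float:
    """Time per output token (TPOT) in milliseconds for a given decode rate."""
    return 1000.0 / tokens_per_s


# 500 tok/s (GLM-5-FP8) and 600 tok/s (DeepSeek-V3.2) correspond to
# roughly 2.0 ms and 1.7 ms per output token.
print(tpot_ms(500), tpot_ms(600))
```

So "millisecond-level TPOT" here means each generated token costs on the order of 1–2 ms end to end.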
@@ -117,36 +126,46 @@ You're now ready to use TileRT! Proceed to the [Getting Started](#getting-starte

 ## Getting Started

-### Download Pre-Converted Weights from HuggingFace
+### Step 1: Download Official Model Weights
+
+Starting from release v0.1.3, TileRT no longer requires downloading pre-converted weights from Hugging Face. Instead, you can download the official model weights directly from the model's source (e.g., Hugging Face), and then convert them using the weight converter script included with the latest TileRT release.

-TileRT requires preprocessing of the original DeepSeek-V3.2-Exp model weights before they can be used for ultra-low-latency inference.
-To simplify this process, we provide **pre-converted weights** directly on HuggingFace so users do not need to run the preprocessing pipeline themselves.
+### Step 2: Convert Weights Using `weight_converter.py`

-You can download the weights using one of the recommended methods below:
+After downloading the official model weights, you can use the following command to convert them into a format compatible with TileRT:

-#### Option 1: Using `huggingface-cli` (recommended)
+For **DeepSeek-V3.2**, run:

 ```bash
-hf download Tile-AI/DeepSeek-V3.2-Exp-TileRT --local-dir ./tilert_weights
+python -m tilert.models.preprocess.weight_converter \
+    --model_type deepseek-v32 \
+    --model_dir "/path/to/DeepSeek-V3.2" \
+    --save_dir "/path/to/DeepSeek-V3.2-TileRT"
 ```

-This will download all files into the `./tilert_weights` directory.
+Replace `/path/to/DeepSeek-V3.2` with the directory where you've downloaded the model weights, and `/path/to/DeepSeek-V3.2-TileRT` with the directory where you'd like the converted weights to be saved.

-#### Option 2: Using Git + Git LFS
+Similarly, for **GLM-5**, run:

 ```bash
-git lfs install
-git clone https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT
+python -m tilert.models.preprocess.weight_converter \
+    --model_type glm-5 \
+    --model_dir "/path/to/GLM-5-FP8" \
+    --save_dir "/path/to/GLM-5-FP8-TileRT"
 ```

-For additional download methods or advanced usage, please refer to the official Hugging Face documentation.
+Replace `/path/to/GLM-5-FP8` with the directory containing the downloaded GLM-5 model weights, and `/path/to/GLM-5-FP8-TileRT` with the desired location for saving the converted weights.
+
+### Step 3: Set the Converted Weights Directory

-After downloading the weights, point TileRT to the directory using:
+Once the weights are converted, set the environment variable to point TileRT to the directory containing the converted weights:

 ```bash
-export MODEL_WEIGHTS_DIR=/path/to/tilert_weights
+export MODEL_WEIGHTS_DIR= ... # converted weights
 ```

+Now you're ready to use TileRT with the converted weights!
+
 ### Running the Generation Example

 After downloading the model weights, you can run the generation example within the Docker environment as follows:
@@ -203,11 +222,6 @@ This example demonstrates basic single-step autoregressive generation using the

 ### Running the Generation Example with Multi-Token Prediction (MTP)

-> \[!IMPORTANT\]
-> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
-> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
-> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.
-
 TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.

 To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
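To make "reduces sequential decoding depth" concrete: with mtp=3, each forward pass can emit up to four tokens (one regular token plus up to three accepted draft tokens), so the number of sequential passes shrinks by roughly the average acceptance count. A minimal sketch of that arithmetic (illustrative numbers, not TileRT API):

```python
import math


def decode_steps(n_tokens: int, avg_accepted: float) -> int:
    """Sequential forward passes needed to emit n_tokens when each
    pass yields avg_accepted tokens on average."""
    return math.ceil(n_tokens / avg_accepted)


# One token per pass needs 1024 passes for 1024 tokens; with mtp=3 and
# ~3.2 tokens accepted per pass on average, the depth drops to ~320.
print(decode_steps(1024, 1.0), decode_steps(1024, 3.2))
```

Since each pass costs roughly the same wall-clock time, cutting the pass count is what drives the higher tokens/s figures reported above.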

assets/generate.gif

-1.11 MB
Binary file not shown.

assets/glm5-mtp.png

235 KB

assets/glm5-without-mtp.png

244 KB

assets/logo.png

-268 KB

assets/perf.png

-42 KB
Binary file not shown.

python/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -50,7 +50,6 @@ def _load_library(filename: str) -> Any:


 from . import models  # noqa: E402
-from .generate import ShowHandsGenerator  # noqa: E402
 from .models import deepseek_v3_2  # noqa: E402
 from .tilert_init import tilert_init  # noqa: E402

@@ -59,6 +58,5 @@ def _load_library(filename: str) -> Any:
     "tilert_init",
     "models",
     "deepseek_v3_2",
-    "ShowHandsGenerator",
     "__version__",
 ]

python/benchmark/__init__.py

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@ (new file)

```python
"""Benchmark suite for TileRT generation."""

from dataclasses import dataclass
from typing import TypeAlias

from tilert.models.deepseek_v3_2.generator import DSAv32Generator
from tilert.models.glm_5.generator import GLM5Generator

Generator: TypeAlias = DSAv32Generator | GLM5Generator


@dataclass
class BenchMode:
    """Configuration for a single benchmark mode."""

    with_mtp: bool
    label: str
    # Sampling parameters; defaults keep the generator's current behavior (top-k=1 argmax).
    use_topp: bool = False
    top_p: float = 1.0
    top_k: int = 256
    temperature: float = 1.0


@dataclass
class CellStats:
    """Stats for a single table cell (one mode x one benchmark column)."""

    tok_s: float = 0.0
    ms: float = 0.0
    acc_rate: str = "-"


BenchStats = dict[str, dict[str, CellStats]]


def apply_mode(generator: Generator, mode: BenchMode) -> None:
    """Apply sampling parameters for a benchmark mode."""
    generator.update_sampling_params(
        temperature=mode.temperature,
        top_p=mode.top_p,
        top_k=mode.top_k,
        use_topp=mode.use_topp,
    )


def merge_stats(stats_list: list[BenchStats]) -> BenchStats:
    """Merge multiple benchmark stats dicts by mode label."""
    merged: BenchStats = {}
    for stats in stats_list:
        for mode, cols in stats.items():
            merged.setdefault(mode, {}).update(cols)
    return merged


def _fmt(number: float, suffix: str) -> str:
    return f"{number:.3f} {suffix}"


def print_summary_table(
    all_stats: BenchStats,
    model_name: str,
) -> None:
    """Print a markdown summary table from merged benchmark stats.

    Each mode occupies 3 rows: tok/s, ms, acc_rate.
    """
    if not all_stats:
        return

    # Collect column keys in insertion order (preserves benchmark ordering)
    col_keys: list[str] = []
    for cols in all_stats.values():
        for k in cols:
            if k not in col_keys:
                col_keys.append(k)

    ROW_LABELS = ["tok/s", "ms", "acc"]

    # Build formatted cell strings: {mode: {col: [row0, row1, row2]}}
    formatted: dict[str, dict[str, list[str]]] = {}
    for mode, cols in all_stats.items():
        formatted[mode] = {}
        for k in col_keys:
            cell = cols.get(k)
            if cell is None:
                formatted[mode][k] = ["-", "-", "-"]
            else:
                formatted[mode][k] = [
                    _fmt(cell.tok_s, "tok/s"),
                    _fmt(cell.ms, "ms"),
                    cell.acc_rate,
                ]

    # Compute column widths
    col_widths: dict[str, int] = {}
    for k in col_keys:
        w = len(k)
        for mode_cells in formatted.values():
            for row_str in mode_cells.get(k, ["-"]):
                w = max(w, len(row_str))
        col_widths[k] = w

    mode_width = max(len("Mode"), max(len(m) for m in all_stats))
    # Row label column shares the mode column; pick wider of mode names vs row labels
    mode_width = max(mode_width, max(len(r) for r in ROW_LABELS))

    print(f"\n## Benchmark Summary ({model_name})\n")

    # Header
    hdr = [f" {'Mode':<{mode_width}} "]
    hdr += [f" {k:<{col_widths[k]}} " for k in col_keys]
    print("|" + "|".join(hdr) + "|")

    # Separator
    sep = ["-" * (mode_width + 2)]
    sep += ["-" * (col_widths[k] + 2) for k in col_keys]
    print("|" + "|".join(sep) + "|")

    # Data rows: 3 rows per mode
    for mode in all_stats:
        for row_idx in range(len(ROW_LABELS)):
            label = mode if row_idx == 0 else ""
            cells = [f" {label:<{mode_width}} "]
            for k in col_keys:
                cell_text = formatted[mode][k][row_idx]
                cells.append(f" {cell_text:<{col_widths[k]}} ")
            print("|" + "|".join(cells) + "|")
```
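As a usage sketch of the helpers above: each benchmark module returns a `BenchStats` dict with one column, and `merge_stats` joins the columns side by side before `print_summary_table` renders them. The following condensed, self-contained copy of `CellStats`/`merge_stats` (with made-up numbers) shows the merge behavior:

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    """Stats for one (mode, benchmark column) cell."""

    tok_s: float = 0.0
    ms: float = 0.0
    acc_rate: str = "-"


BenchStats = dict[str, dict[str, CellStats]]


def merge_stats(stats_list: list[BenchStats]) -> BenchStats:
    """Merge per-benchmark stats dicts by mode label."""
    merged: BenchStats = {}
    for stats in stats_list:
        for mode, cols in stats.items():
            merged.setdefault(mode, {}).update(cols)
    return merged


# One column from the coding benchmark, one from the long-prompt benchmark:
coding = {"MTP": {"Coding": CellStats(tok_s=420.0, ms=9.5, acc_rate="3.10/1/4")}}
long_story = {"MTP": {"Long": CellStats(tok_s=390.0, ms=10.2, acc_rate="2.90/1/4")}}
merged = merge_stats([coding, long_story])
print(sorted(merged["MTP"]))  # → ['Coding', 'Long']
```

Merging by mode label means benchmarks can run independently and still land in one row of the summary table.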

python/benchmark/coding_prompt.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@ (new file)

```python
"""Coding-prompt benchmark: single generation, measures coding task throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you write a sort program in C for me?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
    """Run the coding-prompt benchmark for each mode.

    Returns stats with column: Coding.
    """
    stats: BenchStats = {}

    for mode in modes:
        apply_mode(generator, mode)
        print(f"\n--- Coding-prompt benchmark ({mode.label}) ---")
        print(f"Prompt: {PROMPT}")
        print("Completion:")

        _, time_list, accepted_counts = cast(
            tuple[str, list[float], list[int]],
            generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
        )

        mode_stats: dict[str, CellStats] = {}

        if mode.with_mtp and accepted_counts:
            total_tokens = sum(accepted_counts)
            total_time = sum(time_list)
            speed = total_tokens / total_time if total_time > 0 else 0
            avg_ms = total_time / len(time_list) * 1000
            avg_accepted = total_tokens / len(accepted_counts)
            acc_rate = f"{avg_accepted:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
            mode_stats["Coding"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
        elif time_list:
            mean_time = float(np.mean(time_list))
            speed = 1 / mean_time
            mode_stats["Coding"] = CellStats(tok_s=speed, ms=mean_time * 1000)

        stats[mode.label] = mode_stats

    return stats
```
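`coding_prompt.py` and `long_prompt.py` share the same per-cell accounting; condensed into one self-contained helper it looks like this (timings are illustrative, and the no-MTP branch assumes a non-empty `time_list`):

```python
def bench_cell(
    time_list: list[float], accepted_counts: list[int], with_mtp: bool
) -> tuple[float, float, str]:
    """Return (tok_s, ms_per_pass, acc_rate) as the benchmark modules do."""
    if with_mtp and accepted_counts:
        tokens, secs = sum(accepted_counts), sum(time_list)
        avg = tokens / len(accepted_counts)
        return (
            tokens / secs if secs > 0 else 0.0,
            secs / len(time_list) * 1000,
            f"{avg:.2f}/{min(accepted_counts)}/{max(accepted_counts)}",
        )
    # Without MTP each forward pass emits exactly one token, so throughput
    # is the reciprocal of the mean per-token latency.
    mean_time = sum(time_list) / len(time_list)
    return 1 / mean_time, mean_time * 1000, "-"


print(bench_cell([0.25] * 4, [], with_mtp=False))  # → (4.0, 250.0, '-')
```

The `acc_rate` string packs average/min/max accepted tokens per forward pass into one table cell, e.g. `3.25/2/4`.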

python/benchmark/long_prompt.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@ (new file)

```python
"""Long-prompt benchmark: single generation, measures long-form throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you tell me a very long story, with roughly 3000 words?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
    """Run the long-prompt benchmark for each mode.

    Returns stats with column: Long.
    """
    stats: BenchStats = {}

    for mode in modes:
        apply_mode(generator, mode)
        print(f"\n--- Long-prompt benchmark ({mode.label}) ---")
        print(f"Prompt: {PROMPT}")
        print("Completion:")

        _, time_list, accepted_counts = cast(
            tuple[str, list[float], list[int]],
            generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
        )

        mode_stats: dict[str, CellStats] = {}

        if mode.with_mtp and accepted_counts:
            total_tokens = sum(accepted_counts)
            total_time = sum(time_list)
            speed = total_tokens / total_time if total_time > 0 else 0
            avg_ms = total_time / len(time_list) * 1000
            avg_accepted = total_tokens / len(accepted_counts)
            acc_rate = f"{avg_accepted:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
            mode_stats["Long"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
        elif time_list:
            mean_time = float(np.mean(time_list))
            speed = 1 / mean_time
            mode_stats["Long"] = CellStats(tok_s=speed, ms=mean_time * 1000)

        stats[mode.label] = mode_stats

    return stats
```
