
Commit 4b66aaa

update(docs): Updated README.
1 parent f54cf01 commit 4b66aaa

1 file changed: +22 −37 lines changed


README.md

Lines changed: 22 additions & 37 deletions
````diff
@@ -11,44 +11,34 @@
 </p>
 </div>
 
-TileRT is an experimental project that explores core compiler techniques designed to serve large language models in ultra-low-latency scenarios. Unlike existing inference systems built for high-throughput batch processing, TileRT focuses on delivering extreme responsiveness—critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI coding, where users care more about the latency of a few requests or even a single request.
+## News
 
-The goal of the TileRT project is to push the latency boundaries of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to run at millisecond-level TPOT.
+- **\[2025/12\] v0.1.1 released** — end-to-end token generation speed improved by ~35% on a single node with 8× NVIDIA B200, from ~170 to ~230 tokens/s under ultra-low-latency settings.
+- **\[2025/11\]** 🚀 TileRT initial release for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference (available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT)).
+
+## About
+
+TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
 
 <p align="center">
 <img src="assets/generate.gif" alt="TileRT Benchmark"><br>
 Fig. Sequence generation using SGLang (left), vLLM (middle), and TileRT (right) with the DeepSeek-V3.2-Exp model.
 </p>
 
-TileRT addresses these challenges with a new tile-level runtime engine. It uses a compiler-driven approach to decompose LLM operators into fine-grained tile-level tasks, and a tile-level runtime that reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This allows TileRT to minimize idle time and maximize hardware utilization. These compiler techniques will be incorporated into TileLang and TileScale.
-
-We evaluated TileRT’s preliminary performance using the DeepSeek-V3.2-Exp model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT significantly outperforms existing inference systems:
+We evaluated TileRT’s preliminary performance using the **DeepSeek-V3.2-Exp** model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
 
 <p align="center">
 <img src="assets/perf.png" alt="TileRT Benchmark" width="400"><br>
 Fig. Evaluation setup: batch size: 1, input seqlen/output seqlen: 1K/1K, SGLang-0.5.5, vLLM-0.11.0, CUDA-12.9
 </p>
 
-TileRT is a continuously evolving project. Our ongoing plans include pursuing more aggressive optimizations, supporting various batch sizes, more model families and more hardware, and establishing a new foundation for low-latency AI inference. Stay tuned for updates!
-
-- [Installation](#installation)
-  - [Prerequisites](#prerequisites)
-    - [**Hardware**](#hardware)
-    - [**Operating System**](#operating-system)
-    - [**Python**](#python)
-    - [**PyTorch Build**](#pytorch-build)
-  - [Python Package Installation](#python-package-installation)
-  - [Docker Installation](#docker-installation)
-- [Getting Started](#getting-started)
-  - [Download Pre-Converted Weights from HuggingFace](#download-pre-converted-weights-from-huggingface)
-    - [Option 1: Using `huggingface-cli` (recommended)](#option-1-using-huggingface-cli-recommended)
-    - [Option 2: Using Git + Git LFS](#option-2-using-git--git-lfs)
-  - [Running the Generation Example](#running-the-generation-example)
-- [Status & Future Work](#status--future-work)
+Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
 
-## Installation
+To achieve this, TileRT introduces a **tile-level runtime engine**. Leveraging a compiler-driven approach, LLM operators are decomposed into fine-grained tile-level tasks, while the runtime dynamically reschedules computation, I/O, and communication across multiple devices in a highly overlapped manner. This design minimizes idle time and improves hardware utilization.
 
-### Prerequisites
+The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.
+
+## Installation
 
 Before installing the TileRT wheel package, please ensure your environment meets the following requirements:
 
````
````diff
@@ -147,38 +137,33 @@ docker run --gpus all -it \
 Once inside the container, you can run the following Python script:
 
 ```python
-import torch  # TileRT requires PyTorch runtime to be loaded first
-from tilert.generate import ShowHandsGenerator
-
-# Initialize the generator with desired settings
-generator = ShowHandsGenerator(
-    max_new_tokens=4000,
-    temperature=0.0,
-    model_weights_dir="xxx",  # Specify your model weights directory here
-)
+from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator
 
-# Load pre-trained weights
+generator: ShowHandsGenerator = ShowHandsGenerator(
+    max_new_tokens=1000,
+    model_weights_dir=MODEL_WEIGHTS_DIR,
+)
 generator.from_pretrained()
 
-# Example prompt to test the model's generation abilities
 prompt = """Tell me three jokes:
 
 1. A dad joke,
 2. A programmer joke,
 3. A joke that only makes sense if you've ever tried to train a large language model.
 Keep each joke under 15 words.
 """
+
 print("Prompt:", prompt)
 print("Completion:")
 completion = generator.generate(prompt)
 ```
 
 For instance, using the above prompt, TileRT might generate:
 
 ```text
 1. I'm afraid for the calendar. Its days are numbered.
-2. There are 10 types of people: those who understand binary and those who don't.
-3. My model just generated a coherent sentence. I think I'll go lie down.
+2. There are only 10 kinds of people: those who understand binary and those who don't.
+3. My model's loss is low, but its answers are still nonsense. Overfitting.
 ```
 
 This example gives you a quick idea of the type of output you can expect from the precompiled model.
````
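As a quick sanity check on the throughput figures quoted in the News entry (our arithmetic, not an official benchmark), decode throughput in tokens/s converts to TPOT like this:

```python
# Convert decode throughput (tokens/s) to time per output token (TPOT, ms).
def tpot_ms(tokens_per_s: float) -> float:
    return 1000.0 / tokens_per_s

print(round(tpot_ms(170), 1))  # 5.9 ms/token (pre-v0.1.1)
print(round(tpot_ms(230), 1))  # 4.3 ms/token (v0.1.1)
```

Both figures sit in the single-digit-millisecond TPOT range the About section targets, and 170 → 230 tokens/s is the ~35% speedup the release note claims.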
