
Commit 9ef13e4 (parent f54cf01)

update(docs): Updated README.

File changed: README.md (47 additions, 51 deletions)
</p>
</div>

## News

- **[2025-12-23]** **[v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)** — Achieved a significant improvement in end-to-end token generation, reducing latency by ~35% on a single node with 8× NVIDIA B200 GPUs.

- **[2025-11-20]** 🚀 **[v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)** — Initial release of TileRT for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

## TileRT: Pushing LLM Latency to the Limit

TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
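To make the metric concrete, here is a small illustration of how TPOT is commonly computed. This is our own sketch; the helper name and formula are not part of TileRT:

```python
def tpot_ms(total_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token (TPOT) in milliseconds.

    Uses the common definition: decode time (total latency minus
    time-to-first-token) averaged over the tokens after the first.
    """
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (output_tokens - 1) * 1000.0

# Example: 1,000 output tokens in 2.2 s total with a 0.2 s TTFT
# works out to roughly 2 ms per output token.
print(f"TPOT: {tpot_ms(2.2, 0.2, 1000):.2f} ms")
```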
<p align="center">
  <img src="assets/generate.gif" alt="TileRT Benchmark"><br>
  Figure: Sequence generation comparison (ISL/OSL 1K/1K) using the DeepSeek-V3.2-Exp model with three frameworks: SGLang (left), vLLM (middle), and TileRT (right).
</p>

We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.

<p align="center">
  <img src="assets/perf.png" alt="TileRT Benchmark" width="400"><br>
  Figure: Evaluation setup: batch size 1, input/output sequence length 1K/1K, SGLang 0.5.5, vLLM 0.11.0, CUDA 12.9.
</p>

Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.

To achieve this, TileRT introduces a **tile-level runtime engine**. Leveraging a compiler-driven approach, LLM operators are decomposed into fine-grained tile-level tasks, while the runtime dynamically reschedules computation, I/O, and communication across multiple devices in a highly overlapped manner. This design minimizes idle time and improves hardware utilization.

The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.
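A toy model shows why this overlap helps (our own sketch, not TileRT's scheduler): with a two-stage pipeline of per-tile compute and communication, serial execution pays both costs for every tile, while overlapped execution pays only the slower stage per tile after warm-up:

```python
def serial_time(tiles: int, compute_t: float, comm_t: float) -> float:
    """Makespan when each tile computes, then communicates, with no overlap."""
    return tiles * (compute_t + comm_t)

def overlapped_time(tiles: int, compute_t: float, comm_t: float) -> float:
    """Makespan when tile i+1's compute overlaps tile i's communication.

    Classic two-stage pipeline: fill the first stage once, then each
    remaining tile costs only the slower stage, then drain the final comm.
    """
    return compute_t + (tiles - 1) * max(compute_t, comm_t) + comm_t

# 8 tiles, 2 ms compute and 1 ms communication per tile:
print(serial_time(8, 2.0, 1.0))      # 24.0 (ms)
print(overlapped_time(8, 2.0, 1.0))  # 17.0 (ms)
```

The finer the tiling, the smaller the one-time fill/drain overhead relative to total work, which is the intuition behind decomposing operators into tile-level tasks.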
## Installation

- [Prerequisites](#prerequisites)
- [Python Package Installation](#python-package-installation)
### Prerequisites

Before installing TileRT, ensure your environment meets the following requirements:

**Hardware Requirements**

- 8× NVIDIA B200 GPUs

**Operating System**

- Linux x86_64 (Ubuntu 20.04 or later recommended)

**Python Version**

- Python 3.11 – 3.12
  *(The wheel package is built and tested against these versions.)*

**PyTorch Build**

- PyTorch wheels compiled for CUDA 12.8 or 12.9
  *(Must match the CUDA driver/runtime version required for B200 GPUs.)*
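The version and platform requirements above can be checked up front. The snippet below is our own sketch (TileRT does not ship such a helper); it simply encodes the rules from the list:

```python
import platform
import sys

SUPPORTED_PYTHON = {(3, 11), (3, 12)}  # per the wheel's build matrix above

def python_ok(version_info=None):
    """True if the (major, minor) Python version is supported."""
    vi = version_info if version_info is not None else sys.version_info
    return tuple(vi[:2]) in SUPPORTED_PYTHON

def platform_ok(system=None, machine=None):
    """True for Linux x86_64, the only supported platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Linux" and machine == "x86_64"

print(python_ok((3, 11, 4)))           # True
print(platform_ok("Darwin", "arm64"))  # False
```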

### Python Package Installation

> [!IMPORTANT]
> **Disclaimer**: TileRT is an experimental project. The current pre-built package supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image.

The recommended installation method is the pre-configured Docker image, which includes all necessary dependencies.

**Step 1: Pull the Docker image**

```bash
docker pull tileai/tilert:v0.1.0
```

**Step 2: Launch a Docker container**

```bash
IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="/path/to/your/workspace"  # Replace with your actual workspace path

docker run --gpus all -it \
    -v "$WORKSPACE_PATH":/workspace/ \
    "$IMAGE_NAME"
```

**Step 3: Install the TileRT package**

Once inside the container, install TileRT using pip:

```bash
pip install tilert
```

You're now ready to use TileRT! Proceed to the [Getting Started](#getting-started) section to download model weights and run your first inference.
## Getting Started

### Download Pre-Converted Weights from HuggingFace

[…]

Once inside the container, you can run the following Python script:
```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

# Path to the pre-converted weights downloaded in the previous step
MODEL_WEIGHTS_DIR = "/path/to/model/weights"  # Replace with your actual weights directory

generator: ShowHandsGenerator = ShowHandsGenerator(
    max_new_tokens=1000,
    model_weights_dir=MODEL_WEIGHTS_DIR,
)
generator.from_pretrained()

prompt = """Tell me three jokes:

1. A dad joke,
2. A programmer joke,
3. A joke that only makes sense if you've ever tried to train a large language model.
Keep each joke under 15 words.
"""

print("Prompt:", prompt)
print("Completion:")
completion = generator.generate(prompt)
```
For instance, using the above prompt, TileRT might generate:

```text
1. I'm afraid for the calendar. Its days are numbered.
2. There are only 10 kinds of people: those who understand binary and those who don't.
3. My model's loss is low, but its answers are still nonsense. Overfitting.
```

This example gives you a quick idea of the type of output you can expect from the precompiled model.

For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py).

## Status & Future Work