|
11 | 11 | </p> |
12 | 12 | </div> |
13 | 13 |
|
14 | | -TileRT is an experimental project that explores core compiler techniques designed to serve large language models in ultra-low-latency scenarios. Unlike existing inference systems built for high-throughput batch processing, TileRT focuses on delivering extreme responsiveness—critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI coding, where users care more about the latency of a few requests or even a single request. |
| 14 | +## News |
15 | 15 |
|
16 | | -The goal of the TileRT project is to push the latency boundaries of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to run at millisecond-level TPOT. |
| 16 | +- **\[2025-12-23\]** ⚡ **[v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)** — Achieved notable latency reduction for end-to-end token generation. Full performance results are available in our latest speed tests. |
| 17 | + |
| 18 | +- **\[2025-11-20\]** 🚀 **[v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)** — Initial release of TileRT for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT). |
| 19 | + |
| 20 | +## TileRT: Pushing LLM Latency to the Limit |
| 21 | + |
| 22 | +TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**. |
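For readers unfamiliar with the metric, TPOT is the average time to produce each output token after the first (whose latency is dominated by prefill). As a minimal illustration, not part of TileRT's API, it can be computed from generation timestamps like this:

```python
def time_per_output_token(first_token_s: float, last_token_s: float,
                          num_output_tokens: int) -> float:
    """Average time per output token (TPOT) in seconds, excluding the
    first token, whose latency reflects prefill rather than decode speed."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_s - first_token_s) / (num_output_tokens - 1)

# Example: 1000 output tokens, first at t=0.05 s, last at t=1.05 s
tpot = time_per_output_token(0.05, 1.05, 1000)
print(f"{tpot * 1000:.2f} ms/token")  # → 1.00 ms/token
```

Millisecond-level TPOT thus means the decode loop emits a new token roughly every millisecond, regardless of how long the first token took.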
17 | 23 |
|
18 | 24 | <p align="center"> |
19 | 25 | <img src="assets/generate.gif" alt="TileRT Benchmark"><br> |
20 | | -Fig. Sequence generation using SGLang (left), vLLM (middle), and TileRT (right) with the DeepSeek-V3.2-Exp model. |
| 26 | +Figure: Sequence generation comparison (input/output sequence length 1K/1K) using the DeepSeek-V3.2-Exp model with three frameworks: SGLang (left), vLLM (middle), and TileRT (right).
21 | 27 | </p> |
22 | 28 |
|
23 | | -TileRT addresses these challenges with a new tile-level runtime engine. It uses a compiler-driven approach to decompose LLM operators into fine-grained tile-level tasks, and a tile-level runtime that reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This allows TileRT to minimize idle time and maximize hardware utilization. These compiler techniques will be incorporated into TileLang and TileScale. |
24 | | - |
25 | | -We evaluated TileRT’s preliminary performance using the DeepSeek-V3.2-Exp model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT significantly outperforms existing inference systems: |
| 29 | +We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems. |
26 | 30 |
|
27 | 31 | <p align="center"> |
28 | 32 | <img src="assets/perf.png" alt="TileRT Benchmark" width="400"><br> |
29 | 33 | Fig. Evaluation setup: batch size: 1, input seqlen/output seqlen: 1K/1K, SGLang-0.5.5, vLLM-0.11.0, CUDA-12.9 |
30 | 34 | </p> |
31 | 35 |
|
32 | | -TileRT is a continuously evolving project. Our ongoing plans include pursuing more aggressive optimizations, supporting various batch sizes, more model families and more hardware, and establishing a new foundation for low-latency AI inference. Stay tuned for updates! |
33 | | - |
34 | | -- [Installation](#installation) |
35 | | - - [Prerequisites](#prerequisites) |
36 | | - - [**Hardware**](#hardware) |
37 | | - - [**Operating System**](#operating-system) |
38 | | - - [**Python**](#python) |
39 | | - - [**PyTorch Build**](#pytorch-build) |
40 | | - - [Python Package Installation](#python-package-installation) |
41 | | - - [Docker Installation](#docker-installation) |
42 | | -- [Getting Started](#getting-started) |
43 | | - - [Download Pre-Converted Weights from HuggingFace](#download-pre-converted-weights-from-huggingface) |
44 | | - - [Option 1: Using `huggingface-cli` (recommended)](#option-1-using-huggingface-cli-recommended) |
45 | | - - [Option 2: Using Git + Git LFS](#option-2-using-git--git-lfs) |
46 | | - - [Running the Generation Example](#running-the-generation-example) |
47 | | -- [Status & Future Work](#status--future-work) |
| 36 | +Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most. |
| 37 | + |
| 38 | +To achieve this, TileRT introduces a **tile-level runtime engine**. Using a compiler-driven approach, it decomposes LLM operators into fine-grained tile-level tasks, while the runtime dynamically reschedules computation, I/O, and communication across multiple devices in a highly overlapped manner. This design minimizes idle time and improves hardware utilization.
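The overlap idea can be sketched in plain Python. The sketch below is purely conceptual (the task names and two-stage pipeline are illustrative assumptions, not TileRT's actual scheduler): while tile *i* is being computed, the data for tile *i+1* is already being fetched, so the "compute" stage never waits on I/O that could have been started earlier.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two overlapped stages.
def load_tile(i: int) -> str:       # models an I/O / weight-fetch stage
    return f"tile{i}-data"

def compute_tile(data: str) -> str:  # models a tile-level compute kernel
    return data.upper()

def run_overlapped(num_tiles: int) -> list[str]:
    """Software pipeline: prefetch tile i+1 while tile i is computing."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_load = pool.submit(load_tile, 0)
        for i in range(num_tiles):
            data = next_load.result()            # wait for the prefetched tile
            if i + 1 < num_tiles:
                next_load = pool.submit(load_tile, i + 1)  # start next fetch now
            results.append(compute_tile(data))   # overlaps with that fetch
    return results

print(run_overlapped(3))  # → ['TILE0-DATA', 'TILE1-DATA', 'TILE2-DATA']
```

A real engine extends this pattern to many stages (compute, memory movement, inter-GPU communication) with dependencies tracked at tile granularity rather than a fixed two-stage loop.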
| 39 | + |
| 40 | +The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**. |
48 | 41 |
|
49 | 42 | ## Installation |
50 | 43 |
|
| 44 | +- [Prerequisites](#prerequisites) |
| 45 | +- [Python Package Installation](#python-package-installation) |
| 46 | + |
51 | 47 | ### Prerequisites |
52 | 48 |
|
53 | | -Before installing the TileRT wheel package, please ensure your environment meets the following requirements: |
| 49 | +Before installing TileRT, ensure your environment meets the following requirements: |
54 | 50 |
|
55 | | -#### **Hardware** |
| 51 | +**Hardware Requirements** |
56 | 52 |
|
57 | | -- 8 NVIDIA B200 GPUs |
| 53 | +- 8× NVIDIA B200 GPUs |
58 | 54 |
|
59 | | -#### **Operating System** |
| 55 | +**Operating System** |
60 | 56 |
|
61 | 57 | - Linux x86_64 (Ubuntu 20.04 or later recommended) |
62 | 58 |
|
63 | | -#### **Python** |
| 59 | +**Python Version** |
64 | 60 |
|
65 | 61 | - Python 3.11 – 3.12 |
66 | | - *(The wheel is built and tested against these versions.)* |
| 62 | + *(The wheel package is built and tested against these versions.)* |
67 | 63 |
|
68 | | -#### **PyTorch Build** |
| 64 | +**PyTorch Build** |
69 | 65 |
|
70 | | -- PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200) |
| 66 | +- PyTorch wheels compiled for CUDA 12.8 or 12.9 |
| 67 | + *(Must match the CUDA driver/runtime version required for B200 GPUs.)* |
71 | 68 |
|
72 | 69 | ### Python Package Installation |
73 | 70 |
|
74 | 71 | > \[!IMPORTANT\] |
75 | | -> ***Disclaimer***: TileRT is an experimental project. The current preview build supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image. |
| 72 | +> **Disclaimer**: TileRT is an experimental project. The current pre-built package supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image. |
76 | 73 |
|
77 | | -#### Docker Installation |
| 74 | +The recommended installation method is using the pre-configured Docker image, which includes all necessary dependencies. |
78 | 75 |
|
79 | | -To get started, pull the Docker image: |
| 76 | +**Step 1: Pull the Docker image** |
80 | 77 |
|
81 | 78 | ```bash |
82 | 79 | docker pull tileai/tilert:v0.1.0 |
83 | 80 | ``` |
84 | 81 |
|
85 | | -Then, launch a Docker container using the following command: |
| 82 | +**Step 2: Launch a Docker container** |
86 | 83 |
|
87 | 84 | ```bash |
88 | 85 | IMAGE_NAME="tileai/tilert:v0.1.0" |
89 | | -WORKSPACE_PATH="xxx" # Path to the workspace you want to mount |
| 86 | +WORKSPACE_PATH="/path/to/your/workspace" # Replace with your actual workspace path |
90 | 87 |
|
91 | 88 | docker run --gpus all -it \ |
92 | 89 | -v $WORKSPACE_PATH:/workspace/ \ |
93 | 90 | $IMAGE_NAME |
94 | 91 | ``` |
95 | 92 |
|
96 | | -After the container starts, install the TileRT package: |
| 93 | +**Step 3: Install the TileRT package** |
| 94 | + |
| 95 | +Once inside the container, install TileRT using pip: |
97 | 96 |
|
98 | 97 | ```bash |
99 | 98 | pip install tilert |
100 | 99 | ``` |
101 | 100 |
|
| 101 | +You're now ready to use TileRT! Proceed to the [Getting Started](#getting-started) section to download model weights and run your first inference. |
| 102 | + |
102 | 103 | ## Getting Started |
103 | 104 |
|
104 | 105 | ### Download Pre-Converted Weights from HuggingFace |
@@ -147,43 +148,38 @@ docker run --gpus all -it \ |
147 | 148 | Once inside the container, you can run the following Python script: |
148 | 149 |
|
149 | 150 | ```python |
150 | | -import torch # TileRT requires PyTorch runtime to be loaded first |
151 | | -from tilert.generate import ShowHandsGenerator |
152 | | - |
153 | | -# Initialize the generator with desired settings |
154 | | -generator = ShowHandsGenerator( |
155 | | - max_new_tokens=4000, |
156 | | - temperature=0.0, |
157 | | - model_weights_dir="xxx", # Specify your model weights directory here |
158 | | -) |
| 151 | +from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator |
159 | 152 |
|
160 | | -# Load pre-trained weights |
| 153 | +generator = ShowHandsGenerator(
| 154 | + max_new_tokens=1000, |
| 155 | +    model_weights_dir=MODEL_WEIGHTS_DIR,  # set MODEL_WEIGHTS_DIR to your downloaded weights path
| 156 | +) |
161 | 157 | generator.from_pretrained() |
162 | 158 |
|
163 | | -# Example prompt to test the model's generation abilities |
164 | 159 | prompt = """Tell me three jokes: |
165 | 160 |
|
166 | 161 | 1. A dad joke, |
167 | 162 | 2. A programmer joke, |
168 | 163 | 3. A joke that only makes sense if you've ever tried to train a large language model. |
169 | 164 | Keep each joke under 15 words. |
170 | 165 | """ |
| 166 | + |
171 | 167 | print("Prompt:", prompt) |
172 | 168 | print("Completion:") |
173 | | -completion = generator.generate(prompt) |
| 169 | +completion = generator.generate(prompt)
174 | 170 | ``` |
175 | 171 |
|
176 | 172 | For instance, using the above prompt, TileRT might generate: |
177 | 173 |
|
178 | 174 | ```text |
179 | 175 | 1. I'm afraid for the calendar. Its days are numbered. |
180 | | -2. There are 10 types of people: those who understand binary and those who don't. |
181 | | -3. My model just generated a coherent sentence. I think I'll go lie down. |
| 176 | +2. There are only 10 kinds of people: those who understand binary and those who don't. |
| 177 | +3. My model's loss is low, but its answers are still nonsense. Overfitting. |
182 | 178 | ``` |
183 | 179 |
|
184 | 180 | This example gives you a quick idea of the type of output you can expect from the precompiled model. |
185 | 181 |
|
186 | | -For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/tilert/generate.py). |
| 182 | +For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py). |
187 | 183 |
|
188 | 184 | ## Status & Future Work |
189 | 185 |
|
|