
Commit deeedc3

lcy-se, sokdtree, yuxiaoguo, xiayuqing0622, jlxue authored
v0.1.3 release. GLM-5 lands! (tile-ai#19)
v0.1.3 release. GLM-5 lands.

Co-authored-by: Guojun Chen <gjchen@live.com>
Co-authored-by: Yuxiao Guo <yuxiao.guo@outlook.com>
Co-authored-by: Yuqing Xia <Xiayuqing0622@outlook.com>
Co-authored-by: Jilong Xue <xuejilong@gmail.com>
Co-authored-by: Lingxiao Ma <xysmlx@gmail.com>
Co-authored-by: Liu Heng <18821707235@163.com>
Co-authored-by: Zheng QiHang <zhengqihang0915@qq.com>
1 parent d18b3ef commit deeedc3


69 files changed: +12460 −2957 lines

README.md

Lines changed: 41 additions & 27 deletions
@@ -20,30 +20,39 @@ ______________________________________________________________________
 ## 📰 News

-- :fire: **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT**. With mtp=3, we observe decoding rates up to **590 tokens/s** under synthetic workloads.
+- :fire: **2026-02-14 · [Try the Online Demo](https://www.tilert.ai/)**. Our online demo is now live! Experience ultra-low-latency inference with **GLM-5** and **DeepSeek-V3.2**. [Try it now!](https://www.tilert.ai)
+
+- 🎉 **2026-02-14 · [v0.1.3](https://github.com/tile-ai/TileRT/releases/tag/v0.1.3) Released**. The v0.1.3 release introduces full support for the latest GLM-5 model, achieving up to 500 tokens/s on GLM-5-FP8 and up to 600 tokens/s on DeepSeek-V3.2.
+
+- 🚀 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP)** is now available in TileRT! With mtp=3, we achieve decoding rates of up to **590 tokens/s** under synthetic workloads.
+
+<details>
+<summary>Key Milestones</summary>

 - **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved ~**35% further reduction** (3~4x speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.

 - 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

+</details>
+
 ______________________________________________________________________

 <a id="overview"></a>

-## TileRT: Pushing LLM Latency to the Limit
+**TileRT** is a project designed to serve large language models (LLMs) in ultra-low-latency scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—enabling models with hundreds of billions of parameters to achieve millisecond-level time per output token (TPOT).
+
+In our latest **v0.1.3** release, we tested **TileRT's** performance on the newest [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model, demonstrating the effectiveness of our approach in real-world applications. We were among the first to support this latest model, validating the power of the technology we've developed.

-TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
+Using the [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs, we evaluated TileRT's preliminary performance. As shown in the benchmarks below, TileRT demonstrates substantial improvements over existing inference systems.

 <p align="center">
-<img src="assets/generate.gif" alt="TileRT Benchmark"><br>
-Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
+<img src="assets/glm5-mtp.png" alt="TileRT Benchmark" width="800"><br>
+Figure 1. Evaluation setup. Batch size: 1; input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; output sequence length: 1K; benchmarked with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed with MTP=3, so we set MTP=1 following the <a href="https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html">vLLM GLM-5 recipe</a>); TileRT v0.1.3 with MTP=3.
 </p>

-We evaluated TileRT's preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
-
 <p align="center">
-<img src="assets/perf.png" alt="TileRT Benchmark" width="500"><br>
-Figure 2. Evaluation setup. Batch size: 1, input/output sequence length: 1K/1K; SGLang v0.5.6, TensorRT-LLM v1.2.0-rc5, vLLM v0.13.0, TileRT v0.1.1 with CUDA 12.9.
+<img src="assets/glm5-without-mtp.png" alt="TileRT Benchmark" width="800"><br>
+Figure 2. Evaluation setup. Batch size: 1; input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; output sequence length: 1K; benchmarked with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0; vLLM v0.16.0rc2.dev173; TileRT v0.1.3.
 </p>

 Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
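As a quick sanity check on the headline rates above: decoding speed and time per output token (TPOT) are reciprocals, so the reported tokens/s translate directly into per-token latency. A small sketch (plain arithmetic, not TileRT code):

```python
def tpot_ms(tokens_per_s: float) -> float:
    """Time per output token (TPOT) in milliseconds for a given decode rate."""
    return 1000.0 / tokens_per_s


# 500 tok/s (GLM-5-FP8) and 600 tok/s (DeepSeek-V3.2) correspond to
# roughly 2.0 ms and 1.7 ms per output token.
print(tpot_ms(500), tpot_ms(600))
```

So "millisecond-level TPOT" here means each generated token costs on the order of 1–2 ms end to end.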
@@ -117,36 +126,46 @@ You're now ready to use TileRT! Proceed to the [Getting Started](#getting-starte

 ## Getting Started

-### Download Pre-Converted Weights from HuggingFace
+### Step 1: Download Official Model Weights
+
+Starting from release v0.1.3, TileRT no longer requires downloading pre-converted weights from Hugging Face. Instead, you can download the official model weights directly from the model's source (e.g., Hugging Face), and then convert them using the weight converter script included with the latest TileRT release.

-TileRT requires preprocessing of the original DeepSeek-V3.2-Exp model weights before they can be used for ultra-low-latency inference.
-To simplify this process, we provide **pre-converted weights** directly on HuggingFace so users do not need to run the preprocessing pipeline themselves.
+### Step 2: Convert Weights Using `weight_converter.py`

-You can download the weights using one of the recommended methods below:
+After downloading the official model weights, you can use the following command to convert them into a format compatible with TileRT:

-#### Option 1: Using `huggingface-cli` (recommended)
+For **DeepSeek-V3.2**, run:

 ```bash
-hf download Tile-AI/DeepSeek-V3.2-Exp-TileRT --local-dir ./tilert_weights
+python -m tilert.models.preprocess.weight_converter \
+    --model_type deepseek-v32 \
+    --model_dir "/path/to/DeepSeek-V3.2" \
+    --save_dir "/path/to/DeepSeek-V3.2-TileRT"
 ```

-This will download all files into the `./tilert_weights` directory.
+Replace `/path/to/DeepSeek-V3.2` with the directory where you've downloaded the model weights, and `/path/to/DeepSeek-V3.2-TileRT` with the directory where you'd like the converted weights to be saved.

-#### Option 2: Using Git + Git LFS
+Similarly, for **GLM-5**, run:

 ```bash
-git lfs install
-git clone https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT
+python -m tilert.models.preprocess.weight_converter \
+    --model_type glm-5 \
+    --model_dir "/path/to/GLM-5-FP8" \
+    --save_dir "/path/to/GLM-5-FP8-TileRT"
 ```

-For additional download methods or advanced usage, please refer to the official Hugging Face documentation.
+Replace `/path/to/GLM-5-FP8` with the directory containing the downloaded GLM-5 model weights, and `/path/to/GLM-5-FP8-TileRT` with the desired location for saving the converted weights.
+
+### Step 3: Set the Converted Weights Directory

-After downloading the weights, point TileRT to the directory using:
+Once the weights are converted, set the environment variable to point TileRT to the directory containing the converted weights:

 ```bash
-export MODEL_WEIGHTS_DIR=/path/to/tilert_weights
+export MODEL_WEIGHTS_DIR= ... # converted weights
 ```

+Now you're ready to use TileRT with the converted weights!
+
 ### Running the Generation Example

 After downloading the model weights, you can run the generation example within the Docker environment as follows:
@@ -203,11 +222,6 @@ This example demonstrates basic single-step autoregressive generation using the

 ### Running the Generation Example with Multi-Token Prediction (MTP)

-> \[!IMPORTANT\]
-> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
-> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
-> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.
-
 TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.

 To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
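To make "reduces sequential decoding depth" concrete: with mtp=3, each forward pass can emit up to four tokens (one regular token plus up to three accepted draft tokens), so the number of sequential passes shrinks by roughly the average acceptance count. A minimal sketch of that arithmetic (illustrative numbers, not TileRT API):

```python
import math


def decode_steps(n_tokens: int, avg_accepted: float) -> int:
    """Sequential forward passes needed to emit n_tokens when each
    pass yields avg_accepted tokens on average."""
    return math.ceil(n_tokens / avg_accepted)


# One token per pass needs 1024 passes for 1024 tokens; with mtp=3 and
# ~3.2 tokens accepted per pass on average, the depth drops to ~320.
print(decode_steps(1024, 1.0), decode_steps(1024, 3.2))
```

Since each pass costs roughly the same wall-clock time, cutting the pass count is what drives the higher tokens/s figures reported above.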

assets/generate.gif

-1.11 MB
Binary file not shown.

assets/glm5-mtp.png

235 KB

assets/glm5-without-mtp.png

244 KB

assets/logo.png

-268 KB

assets/perf.png

-42 KB
Binary file not shown.

python/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -50,7 +50,6 @@ def _load_library(filename: str) -> Any:


 from . import models  # noqa: E402
-from .generate import ShowHandsGenerator  # noqa: E402
 from .models import deepseek_v3_2  # noqa: E402
 from .tilert_init import tilert_init  # noqa: E402

@@ -59,6 +58,5 @@ def _load_library(filename: str) -> Any:
     "tilert_init",
     "models",
     "deepseek_v3_2",
-    "ShowHandsGenerator",
     "__version__",
 ]

python/benchmark/__init__.py

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@ (new file)

```python
"""Benchmark suite for TileRT generation."""

from dataclasses import dataclass
from typing import TypeAlias

from tilert.models.deepseek_v3_2.generator import DSAv32Generator
from tilert.models.glm_5.generator import GLM5Generator

Generator: TypeAlias = DSAv32Generator | GLM5Generator


@dataclass
class BenchMode:
    """Configuration for a single benchmark mode."""

    with_mtp: bool
    label: str
    # Sampling parameters; defaults keep the generator's current behavior (top-k=1 argmax).
    use_topp: bool = False
    top_p: float = 1.0
    top_k: int = 256
    temperature: float = 1.0


@dataclass
class CellStats:
    """Stats for a single table cell (one mode x one benchmark column)."""

    tok_s: float = 0.0
    ms: float = 0.0
    acc_rate: str = "-"


BenchStats = dict[str, dict[str, CellStats]]


def apply_mode(generator: Generator, mode: BenchMode) -> None:
    """Apply sampling parameters for a benchmark mode."""
    generator.update_sampling_params(
        temperature=mode.temperature,
        top_p=mode.top_p,
        top_k=mode.top_k,
        use_topp=mode.use_topp,
    )


def merge_stats(stats_list: list[BenchStats]) -> BenchStats:
    """Merge multiple benchmark stats dicts by mode label."""
    merged: BenchStats = {}
    for stats in stats_list:
        for mode, cols in stats.items():
            merged.setdefault(mode, {}).update(cols)
    return merged


def _fmt(number: float, suffix: str) -> str:
    return f"{number:.3f} {suffix}"


def print_summary_table(
    all_stats: BenchStats,
    model_name: str,
) -> None:
    """Print a markdown summary table from merged benchmark stats.

    Each mode occupies 3 rows: tok/s, ms, acc_rate.
    """
    if not all_stats:
        return

    # Collect column keys in insertion order (preserves benchmark ordering)
    col_keys: list[str] = []
    for cols in all_stats.values():
        for k in cols:
            if k not in col_keys:
                col_keys.append(k)

    ROW_LABELS = ["tok/s", "ms", "acc"]

    # Build formatted cell strings: {mode: {col: [row0, row1, row2]}}
    formatted: dict[str, dict[str, list[str]]] = {}
    for mode, cols in all_stats.items():
        formatted[mode] = {}
        for k in col_keys:
            cell = cols.get(k)
            if cell is None:
                formatted[mode][k] = ["-", "-", "-"]
            else:
                formatted[mode][k] = [
                    _fmt(cell.tok_s, "tok/s"),
                    _fmt(cell.ms, "ms"),
                    cell.acc_rate,
                ]

    # Compute column widths
    col_widths: dict[str, int] = {}
    for k in col_keys:
        w = len(k)
        for mode_cells in formatted.values():
            for row_str in mode_cells.get(k, ["-"]):
                w = max(w, len(row_str))
        col_widths[k] = w

    mode_width = max(len("Mode"), max(len(m) for m in all_stats))
    # Row label column shares the mode column; pick wider of mode names vs row labels
    mode_width = max(mode_width, max(len(r) for r in ROW_LABELS))

    print(f"\n## Benchmark Summary ({model_name})\n")

    # Header
    hdr = [f" {'Mode':<{mode_width}} "]
    hdr += [f" {k:<{col_widths[k]}} " for k in col_keys]
    print("|" + "|".join(hdr) + "|")

    # Separator
    sep = ["-" * (mode_width + 2)]
    sep += ["-" * (col_widths[k] + 2) for k in col_keys]
    print("|" + "|".join(sep) + "|")

    # Data rows: 3 rows per mode
    for mode in all_stats:
        for row_idx in range(len(ROW_LABELS)):
            label = mode if row_idx == 0 else ""
            cells = [f" {label:<{mode_width}} "]
            for k in col_keys:
                cell_text = formatted[mode][k][row_idx]
                cells.append(f" {cell_text:<{col_widths[k]}} ")
            print("|" + "|".join(cells) + "|")
```
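As a usage sketch of the helpers above: each benchmark module returns a `BenchStats` dict with one column, and `merge_stats` joins the columns side by side before `print_summary_table` renders them. The following condensed, self-contained copy of `CellStats`/`merge_stats` (with made-up numbers) shows the merge behavior:

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    """Stats for one (mode, benchmark column) cell."""

    tok_s: float = 0.0
    ms: float = 0.0
    acc_rate: str = "-"


BenchStats = dict[str, dict[str, CellStats]]


def merge_stats(stats_list: list[BenchStats]) -> BenchStats:
    """Merge per-benchmark stats dicts by mode label."""
    merged: BenchStats = {}
    for stats in stats_list:
        for mode, cols in stats.items():
            merged.setdefault(mode, {}).update(cols)
    return merged


# One column from the coding benchmark, one from the long-prompt benchmark:
coding = {"MTP": {"Coding": CellStats(tok_s=420.0, ms=9.5, acc_rate="3.10/1/4")}}
long_story = {"MTP": {"Long": CellStats(tok_s=390.0, ms=10.2, acc_rate="2.90/1/4")}}
merged = merge_stats([coding, long_story])
print(sorted(merged["MTP"]))  # → ['Coding', 'Long']
```

Merging by mode label means benchmarks can run independently and still land in one row of the summary table.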

python/benchmark/coding_prompt.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@ (new file)

```python
"""Coding-prompt benchmark: single generation, measures coding task throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you write a sort program in C for me?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
    """Run the coding-prompt benchmark for each mode.

    Returns stats with column: Coding.
    """
    stats: BenchStats = {}

    for mode in modes:
        apply_mode(generator, mode)
        print(f"\n--- Coding-prompt benchmark ({mode.label}) ---")
        print(f"Prompt: {PROMPT}")
        print("Completion:")

        _, time_list, accepted_counts = cast(
            tuple[str, list[float], list[int]],
            generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
        )

        mode_stats: dict[str, CellStats] = {}

        if mode.with_mtp and accepted_counts:
            total_tokens = sum(accepted_counts)
            total_time = sum(time_list)
            speed = total_tokens / total_time if total_time > 0 else 0
            avg_ms = total_time / len(time_list) * 1000
            avg_accepted = total_tokens / len(accepted_counts)
            acc_rate = f"{avg_accepted:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
            mode_stats["Coding"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
        elif time_list:
            mean_time = float(np.mean(time_list))
            speed = 1 / mean_time
            mode_stats["Coding"] = CellStats(tok_s=speed, ms=mean_time * 1000)

        stats[mode.label] = mode_stats

    return stats
```
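`coding_prompt.py` and `long_prompt.py` share the same per-cell accounting; condensed into one self-contained helper it looks like this (timings are illustrative, and the no-MTP branch assumes a non-empty `time_list`):

```python
def bench_cell(
    time_list: list[float], accepted_counts: list[int], with_mtp: bool
) -> tuple[float, float, str]:
    """Return (tok_s, ms_per_pass, acc_rate) as the benchmark modules do."""
    if with_mtp and accepted_counts:
        tokens, secs = sum(accepted_counts), sum(time_list)
        avg = tokens / len(accepted_counts)
        return (
            tokens / secs if secs > 0 else 0.0,
            secs / len(time_list) * 1000,
            f"{avg:.2f}/{min(accepted_counts)}/{max(accepted_counts)}",
        )
    # Without MTP each forward pass emits exactly one token, so throughput
    # is the reciprocal of the mean per-token latency.
    mean_time = sum(time_list) / len(time_list)
    return 1 / mean_time, mean_time * 1000, "-"


print(bench_cell([0.25] * 4, [], with_mtp=False))  # → (4.0, 250.0, '-')
```

The `acc_rate` string packs average/min/max accepted tokens per forward pass into one table cell, e.g. `3.25/2/4`.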

python/benchmark/long_prompt.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@ (new file)

```python
"""Long-prompt benchmark: single generation, measures long-form throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you tell me a very long story, with roughly 3000 words?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
    """Run the long-prompt benchmark for each mode.

    Returns stats with column: Long.
    """
    stats: BenchStats = {}

    for mode in modes:
        apply_mode(generator, mode)
        print(f"\n--- Long-prompt benchmark ({mode.label}) ---")
        print(f"Prompt: {PROMPT}")
        print("Completion:")

        _, time_list, accepted_counts = cast(
            tuple[str, list[float], list[int]],
            generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
        )

        mode_stats: dict[str, CellStats] = {}

        if mode.with_mtp and accepted_counts:
            total_tokens = sum(accepted_counts)
            total_time = sum(time_list)
            speed = total_tokens / total_time if total_time > 0 else 0
            avg_ms = total_time / len(time_list) * 1000
            avg_accepted = total_tokens / len(accepted_counts)
            acc_rate = f"{avg_accepted:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
            mode_stats["Long"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
        elif time_list:
            mean_time = float(np.mean(time_list))
            speed = 1 / mean_time
            mode_stats["Long"] = CellStats(tok_s=speed, ms=mean_time * 1000)

        stats[mode.label] = mode_stats

    return stats
```
