Commit d18b3ef

v0.1.2-alpha.1 release, featuring MTP. (#16)
Release TileRT v0.1.2-alpha.1 with initial support for Multi-Token Prediction (MTP). With mtp=3, decoding reaches up to 590 tokens/s on synthetic workloads and ~440 tokens/s on real generation tasks.
1 parent 20a862c commit d18b3ef

14 files changed: +1592 / -715 lines

.github/workflows/lint.yml

Lines changed: 1 addition & 1 deletion
@@ -36,6 +36,6 @@ jobs:
       - name: Install lint dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install --no-cache-dir -r requirements-ci.txt
+          pip install --no-cache-dir -r requirements-dev.txt
       - name: Run all linting checks
         run: ./scripts/lint.sh

README.md

Lines changed: 105 additions & 17 deletions
@@ -6,24 +6,37 @@
     <a href="https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-1E90FF"></a>
   </p>
   <p>
-    <a href="#python-package-installation"><b>Installation</b></a> |
-    <a href="#getting-started"><b>Getting Started</b></a>
+    <a href="#overview"><b>Overview</b></a> ·
+    <a href="#running-the-generation-example"><b>Generation</b></a> ·
+    <a href="#running-the-generation-example-with-multi-token-prediction-mtp"><b>MTP Generation</b></a> ·
+    <a href="#installation"><b>Installation</b></a> ·
+    <a href="#news"><b>News</b></a>
   </p>
 </div>
 
-## News
+______________________________________________________________________
 
-- **\[2025-12-23\]** **[v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)** — Achieved ~35% reduction in end-to-end token generation latency on a single node with 8× NVIDIA B200. See our latest benchmarks for detailed measurements.
+<a id="news"></a>
 
-- **\[2025-11-20\]** 🚀 **[v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)** — Initial release of TileRT for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
+## 📰 News
+
+- 🔥 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT.** With mtp=3, we observe decoding rates of up to **590 tokens/s** under synthetic workloads.
+
+- **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved a further ~**35% reduction** (3–4× speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.
+
+- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
+
+______________________________________________________________________
+
+<a id="overview"></a>
 
 ## TileRT: Pushing LLM Latency to the Limit
 
 TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
 
 <p align="center">
   <img src="assets/generate.gif" alt="TileRT Benchmark"><br>
-  Figure 1. Sequence generation with TileRT.
+  Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
 </p>
 
 We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
@@ -39,6 +52,8 @@ To achieve this, TileRT introduces a **tile-level runtime engine**. Leveraging a
 
 The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.
 
+______________________________________________________________________
+
 ## Installation
 
 - [Prerequisites](#prerequisites)
@@ -145,39 +160,112 @@ docker run --gpus all -it \
     tilert:v0.1.0
 ```
 
-Once inside the container, you can run the following Python script:
+Once inside the container, run the following Python script to perform text generation:
 
 ```python
 from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator
 
 generator: ShowHandsGenerator = ShowHandsGenerator(
     max_new_tokens=1000,
     model_weights_dir=MODEL_WEIGHTS_DIR,
+    with_mtp=False,  # Disable MTP
 )
 generator.from_pretrained()
 
-prompt = """Tell me three jokes:
-
-1. A dad joke,
-2. A programmer joke,
-3. A joke that only makes sense if you've ever tried to train a large language model.
-Keep each joke under 15 words.
-"""
+prompt = (
+    "Tell me three jokes:\n\n"
+    "1. A dad joke,\n"
+    "2. A programmer joke,\n"
+    "3. A joke that only makes sense if you've ever tried "
+    "to train a large language model.\n"
+    "Keep each joke under 15 words."
+)
 
 print("Prompt:", prompt)
 print("Completion:")
-completion: generator.generate(prompt)
+completion = generator.generate(prompt)
 ```
 
-For instance, using the above prompt, TileRT might generate:
+For example, TileRT may generate:
+
+<details>
+<summary><b>Sample output (click to expand)</b></summary>
 
 ```text
 1. I'm afraid for the calendar. Its days are numbered.
 2. There are only 10 kinds of people: those who understand binary and those who don't.
 3. My model's loss is low, but its answers are still nonsense. Overfitting.
 ```
 
-This example gives you a quick idea of the type of output you can expect from the precompiled model.
+</details>
+
+This example demonstrates basic single-step autoregressive generation using the precompiled model.
+
+### Running the Generation Example with Multi-Token Prediction (MTP)
+
+> \[!IMPORTANT\]
+> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
+> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
+> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.
+
+TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.
+
+To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
+
+```python
+from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator
+
+generator: ShowHandsGenerator = ShowHandsGenerator(
+    max_new_tokens=1000,
+    model_weights_dir=MODEL_WEIGHTS_DIR,
+    with_mtp=True,  # Enable MTP
+)
+generator.from_pretrained()
+prompt = "Tell me 10 jokes, keep them all under 100 words."
+
+print("Prompt:", prompt)
+print("Completion:")
+completion = generator.generate(prompt)
+```
+
+When MTP is enabled, TileRT may report statistics similar to the following during generation:
+
+```text
+Accepted length: mean=2.77, min=1, max=4
+```
+
+This indicates that, on average, multiple tokens are accepted per decoding step under MTP.
+
+<details>
+<summary><b>Sample output (click to expand)</b></summary>
+
+```text
+Of course! Here are 10 short jokes for you.
+
+1. I told my wife she was drawing her eyebrows too high. She looked surprised.
+
+2. I invented a new word: Plagiarism.
+
+3. Why don't scientists trust atoms? Because they make up everything.
+
+4. I'm reading a book on anti-gravity. It's impossible to put down.
+
+5. What's the best thing about Switzerland? I don't know, but the flag is a big plus.
+
+6. I told my computer I needed a break, and now it won't stop sending me vacation ads.
+
+7. Why did the scarecrow win an award? He was outstanding in his field.
+
+8. What do you call a fake noodle? An impasta.
+
+9. I told my suitcase there's no vacation, and now it has a lot of baggage.
+
+10. Why don't skeletons fight each other? They don't have the guts.
+```
+
+</details>
+
+This example highlights how MTP enables TileRT to efficiently generate longer outputs by accepting multiple tokens per decoding step, while preserving the same Python API interface.
 
 For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py).
 
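For readers skimming the diff, the `Accepted length` statistics shown above can be reproduced with plain Python. This is a minimal sketch over hypothetical per-step accepted-token counts, not TileRT's internal implementation:

```python
# Sketch: summarizing MTP accepted-token counts per decoding step.
# `accepted_counts` is hypothetical sample data, not TileRT output.
accepted_counts = [3, 2, 4, 1, 3, 4, 2, 3]

mean_accepted = sum(accepted_counts) / len(accepted_counts)
print(
    f"Accepted length: mean={mean_accepted:.2f}, "
    f"min={min(accepted_counts)}, max={max(accepted_counts)}"
)
```

A mean above 1.0 means the MTP heads are, on average, contributing extra accepted tokens per forward pass.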

assets/generate.gif

-1.19 MB

python/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -40,7 +40,8 @@ def _load_library(filename: str) -> Any:
     lib_path = Path(__file__).parent / filename
 
     try:
-        return ctypes.CDLL(str(lib_path))
+        torch.ops.load_library(str(lib_path))
+        return lib_path
     except Exception as e:
         raise RuntimeError(f"Failed to load library from {lib_path}") from e
 

python/generate.py

Lines changed: 88 additions & 11 deletions
@@ -1,6 +1,9 @@
 """Text generation script for TileRT."""
 
 from argparse import ArgumentParser
+from typing import cast
+
+import numpy as np
 
 from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator
 
@@ -16,7 +19,16 @@ def parse_args():  # type: ignore
     parser.add_argument("--max-new-tokens", type=int, default=4000, help="Max tokens to generate")
     parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
     parser.add_argument("--interactive", action="store_true")
-    parser.add_argument("--fp8", action="store_true")
+    parser.add_argument(
+        "--with-mtp",
+        action="store_true",
+        help="Enable MTP (Multi-Token Prediction) for speculative decoding",
+    )
+    parser.add_argument(
+        "--use-random-weights",
+        action="store_true",
+        help="Use random weights instead of pretrained (for testing MTP without real weights)",
+    )
     return parser.parse_args()
 
 
@@ -25,22 +37,31 @@ def parse_args():  # type: ignore
     usage:
     execute below command under tilert root directory:
 
+    # Standard generation with pretrained weights:
     python python/generate.py --model-weights-dir "xxxx" 2>&1 | tee test.log
+
+    # MTP generation with random weights (for testing):
+    python python/generate.py --model-weights-dir "xxxx" --with-mtp \
+        --use-random-weights 2>&1 | tee test.log
+
+    # MTP generation with pretrained weights (when available):
+    python python/generate.py --model-weights-dir "xxxx" --with-mtp 2>&1 | tee test.log
     """
     args = parse_args()
 
     generator: ShowHandsGenerator = ShowHandsGenerator(
         max_new_tokens=args.max_new_tokens,
         temperature=args.temperature,
         model_weights_dir=args.model_weights_dir,
-        enable_fp8_ops=args.fp8,
+        with_mtp=args.with_mtp,
     )
 
-    # uncomment to use random weights
-    # generator.init_random_weights()
-
-    # use pretrained weights
-    generator.from_pretrained()
+    if args.use_random_weights:
+        print("Initializing with random weights...")
+        generator.init_random_weights()
+    else:
+        print("Loading pretrained weights...")
+        generator.from_pretrained()
 
     # simple memoryless interactive mode
     if args.interactive:
@@ -53,14 +74,70 @@ def parse_args():  # type: ignore
     else:
         # This prompt is to test the model’s ability to follow instructions
        # (in terms of quantity, type, and length) while keeping it fun.
+        print("==== Performance ====")
         prompt = "Tell me 10 jokes, keep them all under 100 words."
-
         print("Prompt:", prompt)
-        print("Completion:")
-        completion: str = generator.generate(prompt)  # type: ignore[has-type]
+        all_times = []
+        all_accepted = []
+        for _iter in range(20):
+            if _iter % 5 == 0:
+                print(f"Executing iter {_iter}...")
+            results, time_list, accepted_counts = cast(
+                tuple[str, list[float], list[int]],
+                generator.generate(prompt, False),  # type: ignore[has-type]
+            )
+            all_times.append(time_list)
+            all_accepted.append(accepted_counts)
+
+        if args.with_mtp:
+            for token_num in range(100, 300, 100):
+                times_to_token_num = []
+                for time_list, accepted_list in zip(all_times, all_accepted):
+                    if len(time_list) > 5 and len(accepted_list) > 5:
+                        times = time_list[5:]
+                        accepted = accepted_list[5:]
+                        cumsum_tokens = np.cumsum(accepted)
+                        cumsum_times = np.cumsum(times)
+                        # Find index where we reach token_num tokens
+                        idx = np.searchsorted(cumsum_tokens, token_num)
+                        if idx < len(cumsum_times):
+                            times_to_token_num.append(cumsum_times[idx])
+                if times_to_token_num:
+                    mean_total_time = np.mean(times_to_token_num)
+                    mean_time = mean_total_time / token_num
+                    speed = 1 / mean_time
+                    out_str = (
+                        f"**Perf@{token_num}: {speed:.3f} tokens/s & "
+                        f"{(mean_time * 1000):.3f} ms**"
+                    )
+                    print(out_str)
+
+            # Print accepted tokens statistics
+            flat_accepted = [a for accepted_list in all_accepted for a in accepted_list]
+            if flat_accepted:
+                avg_accepted = sum(flat_accepted) / len(flat_accepted)
+                min_accepted = min(flat_accepted)
+                max_accepted = max(flat_accepted)
+                print(
+                    f"**Accepted length: mean={avg_accepted:.2f}, "
+                    f"min={min_accepted}, max={max_accepted}**"
+                )
+        else:
+            all_times_np = np.array(all_times)
+            for token_num in range(100, 300, 100):
+                mean_time = np.mean(all_times_np[..., 5:token_num])
+                speed = 1 / mean_time
+                out_str = (
+                    f"**Perf@{token_num}: {speed:.3f} tokens/s & {(mean_time * 1000):.3f} ms**"
+                )
+                print(out_str)
+        print(results)
 
     # This prompt is used to test long sequence generation
     prompt = "Hi, can you tell me a very long story, with roughly 3000 words?"
     print("Prompt:", prompt)
     print("Completion:")
-    completion = generator.generate(prompt)  # type: ignore[has-type]
+    completion, _, _ = generator.generate(prompt)  # type: ignore[has-type]
+
+    print("Cleaning up...")
+    generator.cleanup()
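The `Perf@{token_num}` logic added above (cumulative accepted tokens vs. cumulative step times) can be sketched in isolation. This pure-Python rendition mirrors what the committed code does with `np.cumsum` and `np.searchsorted`; the function name and the synthetic timing data are illustrative assumptions, not TileRT API:

```python
from bisect import bisect_left
from itertools import accumulate

def time_to_n_tokens(times, accepted, n, warmup=5):
    """Seconds to reach n generated tokens, skipping warmup steps."""
    cum_tokens = list(accumulate(accepted[warmup:]))  # tokens produced so far
    cum_times = list(accumulate(times[warmup:]))      # seconds elapsed so far
    idx = bisect_left(cum_tokens, n)  # first step where cum_tokens >= n
    if idx >= len(cum_times):
        return None  # run too short to reach n tokens
    return cum_times[idx]

# Synthetic data (an assumption, not a measurement): 60 decoding steps,
# 10 ms per step, 2 tokens accepted per step.
times = [0.010] * 60
accepted = [2] * 60

total = time_to_n_tokens(times, accepted, 100)
print(f"Perf@100: {100 / total:.1f} tokens/s")
```

Skipping the first five steps excludes warmup noise, so the reported tokens/s reflects steady-state decoding throughput.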

python/models/base.py

Lines changed: 5 additions & 3 deletions
@@ -9,6 +9,7 @@
 
 from tilert import logger
 from tilert.models.deepseek_config import get_rank, get_world_size
+from tilert.models.deepseek_v3_2.params import BaseParams
 from tilert.models.preprocess import WeightLoader
 from tilert.utils import get_profile_log_tensor
 
@@ -52,9 +53,10 @@ def __init__(
 
         self.flag_enable_tilert = False
 
-        if compute_kernel_type not in ["bf16", "fp8"]:
+        if compute_kernel_type not in ["bf16", "fp8", "fp8mma"]:
             raise ValueError(
-                f"Invalid compute kernel type: {compute_kernel_type}, must be one of bf16, fp8."
+                f"Invalid compute kernel type: {compute_kernel_type}, \
+                    must be one of bf16, fp8, fp8mma."
             )
         self.compute_kernel_type = compute_kernel_type
 
@@ -215,7 +217,7 @@ def tilert_forward(self, *args: Any, **kwargs: Any) -> Any:  # noqa: U100
         raise NotImplementedError("Tilert forward not implemented")
 
     @abstractmethod
-    def to_tilert_weights(self, *args: Any, **kwargs: Any) -> None:
+    def to_tilert_weights(self, *args: Any, **kwargs: Any) -> BaseParams | None:
         """Convert weights to tilert.
 
         Args:
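The extended kernel-type check above can be sketched standalone. Note that the committed f-string uses a backslash line continuation, which embeds the continuation line's leading spaces in the error message; the sketch below (an illustration, not the committed code) uses implicit string concatenation to avoid that:

```python
# Sketch of the compute-kernel-type validation extended in this commit
# to accept "fp8mma"; the real check lives in python/models/base.py.
VALID_KERNEL_TYPES = ("bf16", "fp8", "fp8mma")

def check_kernel_type(compute_kernel_type: str) -> str:
    """Validate a kernel type string, returning it unchanged if valid."""
    if compute_kernel_type not in VALID_KERNEL_TYPES:
        raise ValueError(
            f"Invalid compute kernel type: {compute_kernel_type}, "
            f"must be one of {', '.join(VALID_KERNEL_TYPES)}."
        )
    return compute_kernel_type

print(check_kernel_type("fp8mma"))
```

Keeping the valid set in one tuple means the error message and the membership test can never drift apart.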
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""DeepSeek v3.2 model package."""
