Spec-driven GPU operator library for LLMs — designed for AI agents to build, evaluate, and optimize
Built on TileLang
Status: TileOPs is under active development. APIs may change.
TileOPs is a GPU operator library for LLM training and inference, built on TileLang. Beyond providing a growing collection of production-quality operators, TileOPs explores a spec-driven development model where AI agents can read declarative operator specifications, generate kernel implementations, and evaluate them against hardware-theoretical performance bounds — with minimal human scaffolding.
Every operator is split into two layers with a strict boundary:
- Op (L2) — stateless Python entry point. Handles validation, dtype casting, and memory layout. Compatible with CUDA Graphs and `torch.compile`.
- Kernel (L1) — TileLang GPU implementation with hardware-specific optimizations (Ampere, Hopper).
This separation keeps user-facing behavior independent of GPU strategy, allowing agents and developers to modify either layer without side effects on the other.
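The two-layer boundary can be sketched as follows. This is a minimal illustration only — the class and function names are hypothetical, not the actual TileOPs API, and a pure-Python matmul stands in for a TileLang kernel so the sketch runs anywhere:

```python
def gemm_kernel(a, b):
    # L1 stand-in: in TileOPs this layer is a TileLang kernel tuned per
    # GPU architecture; a naive Python matmul keeps the sketch runnable.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

class GemmOpSketch:
    # L2 stand-in: stateless entry point that owns shape validation and
    # delegates the compute strategy entirely to the kernel layer.
    def __init__(self, m, n, k):
        self.m, self.n, self.k = m, n, k

    def __call__(self, a, b):
        assert len(a) == self.m and len(a[0]) == self.k, "bad A shape"
        assert len(b) == self.k and len(b[0]) == self.n, "bad B shape"
        return gemm_kernel(a, b)  # swapping the kernel never changes L2

op = GemmOpSketch(2, 2, 2)
C = op([[1, 0], [0, 1]], [[3, 4], [5, 6]])
print(C)  # [[3, 4], [5, 6]]
```

Because L2 only validates and dispatches, an agent can replace `gemm_kernel` with a faster variant without touching any user-facing behavior.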
- Spec-driven — each operator is declared in a machine-readable manifest (`ops_manifest.yaml`) that specifies signatures, workloads, and roofline formulas, serving as the entry point for both agent code generation and automated validation
- Roofline-evaluated — kernel performance is measured against Speed-of-Light hardware bounds, not relative baselines
- Auto-tuning — built-in search over tile sizes, pipelines, and scheduling parameters
- Lightweight — depends only on TileLang, PyTorch, and einops
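To make the Speed-of-Light idea concrete, here is how such a bound can be computed for the GEMM workload used in the quickstart below. The peak throughput figures are illustrative H100-class assumptions, not values taken from TileOPs:

```python
def speed_of_light_s(flops, bytes_moved, peak_tflops, peak_tbps):
    # A kernel can go no faster than the slower of its two limits:
    t_compute = flops / (peak_tflops * 1e12)     # compute-bound time (s)
    t_memory = bytes_moved / (peak_tbps * 1e12)  # memory-bound time (s)
    return max(t_compute, t_memory)

M, N, K = 1024, 1024, 512
flops = 2 * M * N * K                      # one multiply-add per output MAC
bytes_moved = 2 * (M * K + K * N + M * N)  # fp16: 2 bytes per element

# Illustrative H100-class peaks: 989 TFLOPS fp16, 3.35 TB/s HBM.
sol = speed_of_light_s(flops, bytes_moved, peak_tflops=989, peak_tbps=3.35)
# At this size the GEMM is memory-bound, so the bound equals t_memory:
assert sol == bytes_moved / 3.35e12
print(f"SoL: {sol * 1e6:.2f} us")
```

Measured kernel time divided by this bound gives a fraction-of-SoL score, which is what makes the evaluation absolute rather than relative to some baseline kernel.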
TileOPs can be installed from PyPI or built from source. A CUDA-capable GPU is required.
- Python >= 3.10
- PyTorch >= 2.1
- CUDA Toolkit
- NVIDIA GPU: Hopper (SM_90)
- TileLang == 0.1.8
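Before installing, the version requirements above can be pre-checked with a small script. This is a convenience sketch, not part of TileOPs:

```python
import sys

def meets_min(version: str, minimum: str) -> bool:
    # Compare dotted release strings numerically, so that "2.10" >= "2.1"
    # (a plain string comparison would get this wrong).
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

py_ok = sys.version_info[:2] >= (3, 10)   # Python >= 3.10 requirement
print(meets_min("2.2.0", "2.1"))          # True: PyTorch 2.2 satisfies >= 2.1
```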
From PyPI:

```bash
pip install tileops
```

From source:

```bash
git clone https://github.com/tile-ai/TileOPs
cd TileOPs
make install  # dev dependencies + pre-commit hooks
```

Note: if CUDA and TileLang are already installed system-wide and you encounter build issues:

```bash
PIP_NO_BUILD_ISOLATION=1 pip install -e '.[dev]' -v && pre-commit install
```
Verify the installation:

```bash
python -m pytest tests/ -q  # requires a CUDA GPU
```

```python
import torch
from tileops.ops import GemmOp

M, N, K = 1024, 1024, 512
dtype = torch.float16

gemm = GemmOp(M, N, K, dtype=dtype)
A = torch.randn(M, K, device="cuda", dtype=dtype)
B = torch.randn(K, N, device="cuda", dtype=dtype)
C = gemm(A, B)
```

Design docs and development guides are in `docs/`. The full API reference and performance tables are published at TileOPs.github.io.
See `workflow.md` for branch naming, commit conventions, and the PR process.
TileOPs is released under the MIT License.