Cottus Runtime is a custom inference engine built from scratch for Llama-family architectures, prioritizing low latency and strict memory management. It implements its own Transformer execution pipeline, KV cache management, and attention kernels.
- Core: Custom C++20 Transformer implementation.
- Memory: PagedAttention with a BlockAllocator for efficient KV cache management (a conceptual sketch follows this list).
- Compute: CUDA-accelerated kernels for Attention, RoPE, and GEMM (cuBLAS).
- Parity: Exact token matching with HuggingFace Transformers (verified).
- Interface: Clean Python API via PyBind11.
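To make the memory model concrete, here is a minimal Python sketch of the paged-KV-cache idea: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions onto physical blocks, so a sequence can grow without a contiguous reservation. The real BlockAllocator lives in the C++ core; the names and block size below are illustrative assumptions, not the engine's API.

```python
# Conceptual sketch of paged KV cache allocation (illustrative only;
# the actual BlockAllocator is implemented in the C++ core).
BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


class BlockAllocator:
    """Hands out fixed-size KV cache blocks from a preallocated pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise RuntimeError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a sequence's logical token positions onto non-contiguous blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```

Because blocks are fixed-size and returned to the pool when a sequence finishes, fragmentation stays bounded and many sequences can share a single preallocated cache.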
Cottus Runtime currently supports the following architectures, with exact token-level parity verified against HuggingFace Transformers (an illustrative parity check is sketched after the table).
| Architecture | Models | Precision | Status |
|---|---|---|---|
| Llama | Llama 2 (7B/13B/70B), Llama 3 (8B/70B) | FP16 / FP32 | IN PRODUCTION |
| TinyLlama | TinyLlama-1.1B | FP16 / FP32 | IN PRODUCTION |
| Mistral | Mistral 7B | FP16 | IN PROGRESS |
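Parity here means exact greedy-token matching: for the same prompt, Cottus and HuggingFace Transformers must emit identical token IDs. Below is a hedged sketch of such a check; the comparison harness is illustrative, not the project's actual test suite, and it assumes the quickstart Engine API shown later in this README.

```python
# Illustrative parity check: greedy decoding with HuggingFace Transformers,
# to be compared token-for-token against Cottus output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
prompt = "Hello!"
max_new_tokens = 20

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id).cuda().eval()

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
hf_output = hf_model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
hf_tokens = hf_output[0, input_ids.shape[1]:].tolist()  # strip the prompt tokens

# cottus_tokens would come from engine.generate(...) as in the quickstart below;
# exact parity means the two token lists are identical.
# assert cottus_tokens == hf_tokens
```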
Here is what I plan to add in later iterations. Send a PR if you want to contribute.
- Multi‑GPU & Distributed Execution – Enable scaling across multiple GPUs and clusters for larger models.
- Expanded Model Support – Add native support for Mistral, Falcon, and other non‑LLaMA families.
- Optimized CPU Backend – Introduce a high‑performance CPU path (vectorized kernels, OpenMP) and enable CPU‑only inference.
- Quantization & INT8 – Provide post‑training quantization pipelines and INT8 kernels for reduced memory and faster inference.
- FlashAttention‑style Kernels – Integrate memory‑efficient, block‑sparse attention kernels to cut latency and improve throughput.
- Plugin System – Allow community‑contributed extensions (custom ops, alternative KV‑cache strategies).
- Better Tooling – CLI utilities for model conversion, benchmarking, and profiling.
- NVIDIA GPU with CUDA 11/12 (Recommended)
- C++ Compiler (GCC 10+ or Clang 12+)
- CMake 3.18+
- Python 3.8+ (a quick environment sanity check is sketched below)
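Before building, the Python and CUDA side of these requirements can be sanity-checked from Python. This is a minimal sketch under the assumption that PyTorch is installed; it is used here only to probe CUDA visibility.

```python
# Quick environment sanity check (illustrative; PyTorch is assumed only for probing CUDA).
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is required"

try:
    import torch

    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
        print("CUDA runtime:", torch.version.cuda)
    else:
        print("No CUDA device visible; only the CPU path would be available")
except ImportError:
    print("PyTorch not installed; cannot probe CUDA from Python")
```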
Using pip:
pip install cottus

Using uv (recommended for speed):

uv add cottus

Run the basic inference example on a CUDA device:

python examples/1_basic_inference.py --device cuda

There is a CPU fallback, though it is not recommended:

python examples/2_cpu_inference.py

This example requires a ~2.2GB model download:

python examples/3_tinyllama_real.py

The best way to get started is to look at the examples/ directory, which contains complete scripts for various use cases.
from cottus import Engine, EngineConfig
from cottus.model import load_hf_model
# 1. Load Model Weights
weights, _, _, tokenizer, _ = load_hf_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")
# 2. Engine configuration (values match TinyLlama-1.1B's architecture)
config = EngineConfig()
config.model_type = "llama"
config.hidden_dim = 2048
config.num_layers = 22
config.num_heads = 32
config.num_kv_heads = 4
config.head_dim = 64
config.intermediate_dim = 5632
config.device = "cuda"
# 3. Build the engine and tokenize the prompt
engine = Engine(config, weights)
input_ids = tokenizer.encode("Hello!")
# 4. Generate
output_ids = engine.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids))

Cottus Runtime is licensed under the Apache License, Version 2.0. By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms.