This file provides guidance to Coding Agent when working with code in this repository.
Pure Rust + CUDA LLM inference engine (~83K Rust, ~11K CUDA). No PyTorch, no frameworks. OpenAI-compatible /v1/completions API.
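As a rough illustration of the OpenAI-compatible endpoint, the sketch below builds a request body and posts it to /v1/completions. The base URL (localhost:8000), the helper names completion_body and send_completion, and the assumption that the server accepts the standard OpenAI model, prompt, and max_tokens fields are all hypothetical; check the server's launch options before relying on any of them.

```shell
# Assumption: the server listens on localhost:8000 (hypothetical default).
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Build the JSON body for a completion request.
# This sketch does no JSON escaping, so keep quotes and backslashes out of the prompt.
completion_body() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# Post the request. Nothing is sent until this function is called explicitly.
send_completion() {
    curl -sS "$BASE_URL/v1/completions" \
        -H 'Content-Type: application/json' \
        -d "$(completion_body "$@")"
}
```

With a server running, send_completion Qwen3-4B "Hello" 16 would issue the request; the model name here is only an example value.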
Supported models:
| Model | Crate | Feature flag | Architecture |
|---|---|---|---|
| Qwen3-4B / 8B | pegainfer-qwen3-4b | always built | Full attention, TP support |
| Qwen3.5-4B | pegainfer-qwen35-4b | always built | 24 linear + 8 full attention |
| DeepSeek-V4 | pegainfer-deepseek-v4 | --features deepseek-v4 | MoE + compressor + indexer, 8-GPU |
| DeepSeek-V2-Lite | pegainfer-deepseek-v2-lite | --features deepseek-v2-lite | MoE + EP, 2-GPU |
| Kimi-K2 | pegainfer-kimi-k2 | --features kimi-k2 | MLA + MoE + Marlin INT4, 8-GPU EP |
Always use --release — debug builds are extremely slow for GPU/CUDA and will timeout.
# Qwen models (default, no feature flags needed)
cargo run --release -- --model-path models/Qwen3.5-4B
# Feature-gated models
cargo run --release --features kimi-k2 -- --model-path models/Kimi-K2
cargo run --release --features deepseek-v4 -- --model-path models/DeepSeek-V4
Key env vars:
- PEGAINFER_CUDA_SM: GPU SM target override when nvidia-smi is unavailable (e.g. 120 or 120,80)
- PEGAINFER_TRITON_PYTHON: Python with Triton for build-time AOT kernel generation
- PEGAINFER_TEST_MODEL_PATH: override test model path (default: models/Qwen3-4B)
- PEGAINFER_BUILD_TIMING=1: print per-phase build timings (nvcc, Triton AOT, etc.)
- PEGAINFER_NVCC_JOBS: override parallel nvcc job count
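The build-related variables above can be scoped to a single invocation instead of being exported for the whole session. The wrapper below is a minimal sketch: build_with_overrides is a hypothetical helper name, and the SM list you pass is whatever matches your hardware.

```shell
# Run a release build with an explicit SM target list and per-phase timings.
# $1: comma-separated SM targets (e.g. 120 or 120,80), for hosts without nvidia-smi.
build_with_overrides() {
    PEGAINFER_CUDA_SM="${1:?usage: build_with_overrides <sm-list>}" \
    PEGAINFER_BUILD_TIMING=1 \
        cargo build --release
}
```

Calling build_with_overrides 120,80 leaves the surrounding shell environment unchanged in bash, since the assignments only prefix the cargo command.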
# Unit tests (~9s)
cargo test --release --workspace --lib
# E2E greedy regression — requires GPU + model weights
PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p pegainfer-qwen3-4b --test e2e
PEGAINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p pegainfer-qwen35-4b --test e2e
# Single test
cargo test --release embedding_variants -- --nocapture
E2E tests compare against JSON baselines in test_data/. Regenerate baselines after any change that affects numerical output.
HTTP Request → vLLM frontend → EngineHandle → per-model scheduler/executor → TokenEvent
                                      │
     ┌──────────────┬─────────────────┼─────────────────┬──────────────┐
     │              │                 │                 │              │
pegainfer-     pegainfer-        pegainfer-        pegainfer-     pegainfer-
qwen3-4b       qwen35-4b         deepseek-v4       deepseek-v2-   kimi-k2
(full attn)    (linear+full)     (MoE+indexer)     lite (MoE+EP)  (MLA+MoE)
     │              │                 │                 │              │
     └──────────────┴─────────────────┼─────────────────┴──────────────┘
                                      │
                 pegainfer-core runtime + pegainfer-kernels
                                      │
                     ┌────────────────┼────────────────┐
                     │                │                │
               CUDA / cuBLAS     Triton AOT       FlashInfer
                                             (sampling, attention,
                                              norm, MLA decode)
Key abstractions:
- pegainfer-core::engine: shared request/event contract (EngineHandle, GenerateRequest, TokenEvent) used by the server and model crates.
- Per-model crates: each model owns config, weights, prefill/decode execution, scheduler, tests, and benches.
- pegainfer-core::ops: shared GPU operator wrappers used by model crates.
- pegainfer-kernels: tensor/FFI/kernel build owner for CUDA, cuBLAS, FlashInfer, and Triton AOT. Model-specific kernels live in feature-gated submodules (kimi_k2, deepseek_v4).
- pegainfer-comm: EP all-to-all communication (GDR, NCCL, IB verbs). Requires CUDA + RDMA hardware to compile.
- CUDA Graph: decode path captured inside model executors with pre-allocated buffers to preserve pointer stability.
- KV state: model schedulers own request state; shared paged-KV primitives live in pegainfer-core.
Build system: the virtual workspace root has no package build script. pegainfer-kernels/build.rs owns CUDA/Triton compilation:
- Compiles pegainfer-kernels/csrc/*.cu with nvcc (auto-detects GPU SM targets)
- Runs Triton AOT via pegainfer-kernels/tools/triton/gen_triton_aot.py for Qwen3.5 kernels
- Feature-gated: deepseek-v4 triggers TileLang + CuTe DSL codegen; kimi-k2 adds MLA/MoE/Marlin CUDA
Collaboration is centered on the docs/ directory.
Docs are organized by what they're about, not by lifecycle stage. A doc's freshness lives in its TL;DR (and Last touched: for active areas) — not by which directory it sits in. Completed work stays co-located with its domain. There is no archives/ directory — if a doc no longer earns its keep, delete it; if a lasting lesson hides inside it, lift that lesson into lessons/ first, then delete.
docs/
├── index.md # Routing table — every doc must be listed here
├── roadmap/ # Strategic plans, quarterly direction, milestones
├── models/<line>/ # Per-model living docs (qwen3, qwen35, deepseek-v4, ...)
│ # — design, accuracy, perf, refactor records, gotchas
├── subsystems/<area>/ # Cross-cutting components (runtime, scheduler, frontend, kernels)
├── playbooks/ # Reusable how-to: benching, profiling, accuracy, onboarding
├── lessons/ # Tribal knowledge from research / other projects
├── benchmarks/ # Standalone benchmark snapshots and eval reports
├── conventions/ # Ongoing standards (bench regression, coding style)
└── private/ # Local-only notes (gitignored)
Classification rule at capture time:
- Is it tied to a specific model? → models/<line>/
- A specific subsystem? → subsystems/<area>/
- Reusable how-to applicable across models? → playbooks/
- Lasting lesson from elsewhere (other repo, research, postmortem)? → lessons/
- Snapshot of measurement, not a doc that evolves? → benchmarks/
- Strategic / cross-cutting plan? → roadmap/
If you can't pick one, the doc probably needs splitting.
- Docs cover what --help and code can't: pitfalls, diagnostic paths, decision context. Don't restate CLI reference.
- Every command in a doc must be run and verified before committing. Unverified commands are technical debt.
- The only required header is a one-line TL;DR. Keep it true; that's the contract.
- For models/<line>/ and subsystems/<area>/ docs, add Last touched: YYYY-MM and bump it when you do meaningful work on the doc (not for typo fixes). The date is a fact, not a judgement: readers infer freshness themselves.
- playbooks/, lessons/, conventions/, roadmap/, benchmarks/ don't need a freshness stamp. They're either timeless until disproven or self-dated.
- No Status: enum. Enum fields go stale exactly when you need them most.
index.md is a routing table with a scanning-friendly TL;DR column. It is allowed to drift from the TL;DR inside each doc — the doc body is authoritative. Update index.md when you create or delete a doc, or when the existing TL;DR is so wrong it actively misleads. Don't churn it on every doc edit.
Documentation exists to advance work, not to hoard information. Four steps when handling information:
- Capture: Only record what materially advances the project. When in doubt, leave it out.
- Organize: Action-oriented. Resist the urge to organize for organization's sake — structure should be just enough.
- Distill: Refactor over append. When you learn something new or hit a pitfall, integrate it into the document body — don't pile a changelog at the bottom.
- Express: Every document must point to a next step. Split unwieldy documents proactively. Active documents must note the current blocker or next action.
Sync
At the start of each session, you must read index.md and load the documents needed for the task at hand.
Execute
- Update relevant documents as you go. When a new problem or idea arises, create a document in the appropriate domain directory (see classification rule above).
- Record why a decision was made, not just what was done.
Commit
When a session wraps up:
- Update the TL;DR (and Last touched, where applicable) at the top of each modified document.
- Update index.md only when you created or deleted a doc, or when its TL;DR row is now misleading (see Drift Policy above).
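Because every doc must be listed in index.md, a quick scan for unlisted files can help at wrap-up time. This is a sketch under two assumptions: index.md refers to docs by their path relative to docs/, and unlisted_docs is a hypothetical helper rather than an existing script.

```shell
# Print docs (paths relative to the docs root) that index.md never mentions.
# private/ is skipped because it is local-only.
unlisted_docs() {
    local root="${1:-docs}" path rel
    find "$root" -type f -name '*.md' ! -name index.md ! -path "$root/private/*" | sort |
    while IFS= read -r path; do
        rel="${path#"$root"/}"
        # Fixed-string match so dots and slashes in paths are taken literally.
        grep -qF -- "$rel" "$root/index.md" || printf '%s\n' "$rel"
    done
}
```

An empty result means every doc has a routing entry; it says nothing about whether the TL;DR column is still accurate.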
Commit messages use Commitizen format: <type>(<scope>): <subject>. Never commit directly to main; create a feat/, fix/, chore/, … branch first.
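The two rules can be checked before committing. The helpers below are a sketch with hypothetical names (is_valid_commit_msg, is_valid_branch); they accept any lowercase type rather than a fixed list, since the allowed set of types is not spelled out here.

```shell
# Return 0 when the message matches <type>(<scope>): <subject>.
is_valid_commit_msg() {
    printf '%s\n' "$1" | grep -Eq '^[a-z]+\([a-z0-9._-]+\): .+'
}

# Return 0 when the branch is not main and carries a <type>/ prefix.
is_valid_branch() {
    case "$1" in
        main) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^[a-z]+/.+'
}
```

For example, is_valid_branch "$(git rev-parse --abbrev-ref HEAD)" tests the current branch, and a message such as "fix(scheduler): release KV pages on cancel" passes the message check (the scope is only an example).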
Module files use the flat layout (src/ops.rs + src/ops/) — no mod.rs.
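A simple way to catch layout regressions is to search for stray mod.rs files. The function below is a sketch; check_flat_layout is a hypothetical name, and it should be pointed at a crate's src/ directory so that build output or vendored code is not scanned.

```shell
# Print any mod.rs found under the given root and return 1; return 0 if the layout is flat.
check_flat_layout() {
    local hits
    hits="$(find "${1:-.}" -type f -name mod.rs)"
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 1
    fi
}
```

A tree containing src/ops.rs and src/ops/norm.rs passes, while adding src/ops/mod.rs makes the check fail (norm.rs is an illustrative file name).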