Commit 714bcd8

Merge pull request #6 from AbdelStark/claude/sync-documentation-27Yio
docs: sync documentation with current project state
2 parents 3708792 + ddd2785 commit 714bcd8

File tree

4 files changed: +188 −6 lines changed

AGENTS.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@

# AGENTS.md — AI Agent Technical Context

## Project Overview

**attnres-rs** is the first Rust implementation of Attention Residuals (MoonshotAI/Kimi paper) using the [burn](https://github.com/tracel-ai/burn) deep learning framework. It provides a drop-in replacement for standard residual connections in Transformers.

## Tech Stack

| Component    | Technology                        | Version / Notes                   |
|--------------|-----------------------------------|-----------------------------------|
| Language     | Rust                              | 2021 edition (1.80+)              |
| ML Framework | burn                              | 0.20                              |
| Test Backend | NdArray                           | CPU, deterministic                |
| Testing      | cargo test + proptest + criterion |                                   |
| Linting      | clippy + rustfmt                  |                                   |
| CI           | GitHub Actions                    | test, clippy, fmt, build-examples |

## Project Structure

```
src/
├── lib.rs            # Public API re-exports + module declarations
├── config.rs         # AttnResConfig — validated builder pattern
├── attn_res_op.rs    # Core AttnRes operation (depth-wise softmax attention)
├── block_state.rs    # BlockState — cumulative block representation tracking
├── layer.rs          # AttnResLayer — transformer layer with dual AttnRes
├── model.rs          # AttnResTransformer — full model (embed → layers → LM head)
├── rms_norm.rs       # RMSNorm implementation
├── two_phase.rs      # Two-phase inference optimization
├── attention.rs      # Multi-head self-attention
├── feed_forward.rs   # SwiGLU-style MLP
└── utils.rs          # Causal mask generation helpers

tests/
├── unit_tests.rs         # Core algorithm correctness tests
├── differential_tests.rs # PyTorch reference comparison tests
├── property_tests.rs     # proptest property-based tests
└── integration_tests.rs  # Full model training loop tests

examples/
├── train_tiny.rs         # Train a small model on synthetic data
├── compare_residuals.rs  # Compare AttnRes vs standard residuals
└── visualize_weights.rs  # Visualize depth attention patterns

benches/
└── attn_res_benchmark.rs # Criterion benchmarks

fixtures/                 # Reference outputs from PyTorch
├── attn_res_forward.json
└── block_state_tracking.json
```

## Commands

```bash
cargo build                            # Build the project
cargo test --all-features              # Run all 57 tests
cargo test test_name                   # Run a specific test
cargo clippy -- -D warnings            # Lint (warnings = errors)
cargo fmt                              # Format code
cargo fmt -- --check                   # Check formatting without modifying
cargo bench                            # Run Criterion benchmarks
cargo run --example train_tiny         # Training example
cargo run --example compare_residuals  # Comparison example
cargo run --example visualize_weights  # Visualization example
```

## Architecture Essentials

### Core Algorithm (AttnRes)

Standard residual: `x_{l+1} = x_l + f_l(x_l)` (fixed unit weights)

AttnRes: `x_{l+1} = Σ α_i · v_i` where `α = softmax(w_l · RMSNorm(V))` over the depth dimension

Key invariants:
1. **Zero-init pseudo-queries** → starts as uniform averaging (standard residual behavior)
2. **Two AttnRes per transformer layer** — one before self-attention, one before the MLP
3. **Softmax over depth** (the block/layer dimension), NOT over sequence tokens
4. **RMSNorm on keys** to prevent magnitude domination
5. **Block boundaries** at every `block_size/2` sublayers
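The depth-wise update and invariants above can be sketched in plain Rust. This is an illustrative sketch only, not the crate's burn-based implementation; the function names (`attn_res`, `rms_norm`, `softmax`) are chosen here for clarity:

```rust
// Sketch of one AttnRes step: mix all previous sublayer outputs `values`
// with attention over the DEPTH dimension, driven by pseudo-query `w`.
// Illustrative only — the real crate operates on burn tensors.

fn rms_norm(v: &[f64], eps: f64) -> Vec<f64> {
    let ms = v.iter().map(|x| x * x).sum::<f64>() / v.len() as f64;
    let scale = 1.0 / (ms + eps).sqrt();
    v.iter().map(|x| x * scale).collect()
}

fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn attn_res(values: &[Vec<f64>], w: &[f64]) -> Vec<f64> {
    // Keys are RMS-normalized values (invariant 4: no magnitude domination).
    let scores: Vec<f64> = values
        .iter()
        .map(|v| {
            let k = rms_norm(v, 1e-6);
            w.iter().zip(&k).map(|(a, b)| a * b).sum()
        })
        .collect();
    // Softmax over depth, NOT over sequence tokens (invariant 3).
    let alpha = softmax(&scores);
    // x_{l+1} = Σ α_i · v_i
    let d = values[0].len();
    let mut out = vec![0.0; d];
    for (a, v) in alpha.iter().zip(values) {
        for j in 0..d {
            out[j] += a * v[j];
        }
    }
    out
}

fn main() {
    let values = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    // Zero-initialized pseudo-query (invariant 1): all scores are 0, so the
    // softmax is uniform and AttnRes reduces to plain depth averaging.
    let out = attn_res(&values, &[0.0, 0.0]);
    println!("{:?}", out); // → [2.0, 3.0]
}
```

With a nonzero pseudo-query the weights become data-dependent and the mix shifts toward sublayers whose normalized keys align with `w`, which is the behavior training is meant to learn.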
### Data Flow

```
Input IDs → Embedding → [AttnResLayer × N] → RMSNorm → LM Head → Logits

AttnResOp(pre-attn) → RMSNorm → MultiHeadAttention
AttnResOp(pre-mlp)  → RMSNorm → FeedForward
```

### Configuration

`AttnResConfig::new(d_model, num_layers, num_blocks)` where:
- `d_model`: Hidden dimension
- `num_layers`: Number of **sublayers** (transformer layers × 2)
- `num_blocks`: Number of blocks for Block AttnRes (set equal to `num_layers` for Full AttnRes)
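A hypothetical sketch of what such a validated config could look like. The constructor signature follows the doc above, but the field names and validation rules here are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical config sketch in the spirit of
// AttnResConfig::new(d_model, num_layers, num_blocks).
// Validation rules below are illustrative assumptions.

#[derive(Debug)]
struct AttnResConfig {
    d_model: usize,
    num_layers: usize, // sublayers = transformer layers × 2
    num_blocks: usize, // = num_layers for Full AttnRes
}

impl AttnResConfig {
    fn new(d_model: usize, num_layers: usize, num_blocks: usize) -> Result<Self, String> {
        if d_model == 0 || num_layers == 0 || num_blocks == 0 {
            return Err("all dimensions must be non-zero".into());
        }
        // Each transformer layer contributes two sublayers (pre-attn + pre-mlp),
        // so an odd sublayer count is rejected under this sketch's assumption.
        if num_layers % 2 != 0 {
            return Err("num_layers counts sublayers and must be even".into());
        }
        if num_blocks > num_layers {
            return Err("num_blocks cannot exceed num_layers".into());
        }
        Ok(Self { d_model, num_layers, num_blocks })
    }
}

fn main() {
    // Full AttnRes: one block per sublayer.
    let cfg = AttnResConfig::new(256, 8, 8);
    assert!(cfg.is_ok());
    println!("{:?}", cfg.unwrap());
}
```

Returning `Result` from the constructor matches the "validated builder pattern" noted for `config.rs`: invalid shapes fail fast at construction rather than deep inside a forward pass.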
## Boundaries

### Read-Only (never modify)
- `spec.md`, `paper.md`, `research_report.md`, `implementation_plan.md`, `LICENSE`

### Gated (requires approval)
- `Cargo.toml` (dependency changes)
- `.github/workflows/` (CI changes)
- `cargo publish`

## Source of Truth

`spec.md` is the authoritative specification. All algorithm implementations must match the pseudocode and equations defined there.

## Known Gaps

- No safetensors serialization
- Two-phase inference not integrated into the main forward path
- GPU backends (wgpu, CUDA, Metal) untested
- No distributed training support

CLAUDE.md

Lines changed: 9 additions & 4 deletions
@@ -6,18 +6,18 @@ attnres-rs: First Rust implementation of Attention Residuals (MoonshotAI/Kimi pa
 | Layer | Technology | Version | Notes |
 |-------------|---------------|----------|------------------------------------------|
 | Language | Rust | 1.80+ | Nightly recommended for some burn features |
-| ML Framework| burn | latest | tracel-ai/burn — multi-backend DL framework |
+| ML Framework| burn | 0.20 | tracel-ai/burn — multi-backend DL framework |
 | Backends | CUDA, Metal, wgpu, NdArray || NdArray for CPU testing, wgpu for cross-platform GPU |
 | Testing | cargo test || + proptest (property-based), criterion (benchmarks) |
 | Serialization | safetensors || For weight loading/saving |
 | Linting | clippy + rustfmt || Enforced in CI |
-| CI | GitHub Actions || cargo test, clippy, fmt |
+| CI | GitHub Actions || cargo test, clippy, fmt, build-examples |
 </stack>

 <status>
 PROJECT PHASE: Alpha (v0.1.0 — core algorithm implemented, tests passing).
-All source modules implemented. 52 tests passing (unit, differential, property-based, integration).
-CI configured (test, clippy, fmt, build-examples). Examples and benchmarks functional.
+All source modules implemented. 57 tests passing (28 inline unit + 18 external unit + 3 differential + 2 property + 5 integration + 1 doctest).
+CI configured (test, clippy, fmt, build-examples). Examples and benchmarks functional. burn upgraded to 0.20.
 Known gaps: no safetensors serialization, two-phase inference not integrated into main forward path, GPU backends untested.
 </status>

@@ -28,6 +28,8 @@ Current directory layout:
 attnres-rs/
 ├── Cargo.toml    # Package manifest [agent: CREATE]
 ├── CLAUDE.md     # This file
+├── AGENTS.md     # AI agent technical context [agent: MODIFY]
+├── ROADMAP.md    # Feature roadmap and progress [agent: MODIFY]
 ├── README.md     # Project README [agent: MODIFY]
 ├── LICENSE       # MIT [agent: READ ONLY]
 ├── spec.md       # Technical specification [agent: READ ONLY — source of truth]
@@ -257,5 +259,8 @@ Available skills:

 <lessons_learned>
 [Initial setup] This is a greenfield project. All implementation follows spec.md as the source of truth.
+[burn 0.16→0.20] Breaking API changes required updates to activation functions, loss computation, and tensor operations. Always check the burn changelog when upgrading.
+[Testing] The NdArray backend is deterministic and fast for small tensors. All tests use it. GPU backends remain untested.
+[Quality audit] Doc comments, config validation, and test coverage were hardened in a dedicated audit pass. Maintain this standard.
 </lessons_learned>
 </memory>

README.md

Lines changed: 4 additions & 2 deletions
@@ -83,14 +83,16 @@ cargo bench # Benchmarks

 ## Current Status

-**Alpha** (v0.1.0). Core algorithm implemented and tested. Suitable for research and experimentation. Not yet suitable for production training at scale.
+**Alpha** (v0.1.0). Core algorithm implemented and tested with 57 passing tests (unit, differential, property-based, integration). Built on burn 0.20. Suitable for research and experimentation. Not yet suitable for production training at scale.

 Known limitations:
 - No weight serialization/loading (safetensors support planned)
 - Two-phase inference optimization is implemented but not integrated into the main forward pass
-- NdArray backend only tested; GPU backends untested
+- Only the NdArray backend is tested; GPU backends (wgpu, CUDA, Metal) are untested
 - No distributed training support

+See [ROADMAP.md](ROADMAP.md) for planned features and progress.
+
 ## Paper

 > **Attention Residuals** -- Kimi Team (MoonshotAI), 2026

ROADMAP.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@

# attnres-rs Roadmap

## Current Phase: Alpha (v0.1.0)

Core algorithm implemented and tested. Suitable for research and experimentation.

---

## v0.1.0 — Core Implementation ✅

- [x] AttnResOp: Block AttnRes forward pass with depth-wise softmax
- [x] BlockState: Cumulative block representation tracking
- [x] RMSNorm: Custom implementation for key normalization
- [x] AttnResLayer: Transformer layer with dual AttnRes (pre-attention + pre-MLP)
- [x] AttnResTransformer: Full model with embedding, LM head, causal masking
- [x] MultiHeadAttention: Standard multi-head self-attention
- [x] FeedForward: SwiGLU-style MLP
- [x] TwoPhase: Two-phase inference optimization (standalone)
- [x] Config: Validated configuration with builder pattern
- [x] Zero initialization of pseudo-query vectors
- [x] 57 tests passing (unit, differential, property-based, integration, doctest)
- [x] CI pipeline (test, clippy, fmt, build-examples)
- [x] 3 examples (train_tiny, compare_residuals, visualize_weights)
- [x] Criterion benchmarks
- [x] Upgrade to burn 0.20

## v0.2.0 — Serialization & Inference (Planned)

- [ ] Safetensors weight save/load
- [ ] Integrate two-phase inference into the main forward path
- [ ] Pre-trained weight loading from PyTorch checkpoints
- [ ] Model export utilities

## v0.3.0 — GPU & Performance (Planned)

- [ ] Test and validate the wgpu backend
- [ ] Test and validate the CUDA backend (via burn-cuda)
- [ ] Test and validate the Metal backend (via burn-tch)
- [ ] GPU-specific benchmarks
- [ ] Memory optimization for large models
- [ ] KV-cache support for autoregressive generation

## v0.4.0 — Production Readiness (Planned)

- [ ] Distributed training support
- [ ] Mixed precision (fp16/bf16) training
- [ ] Gradient checkpointing for memory efficiency
- [ ] Comprehensive documentation with examples
- [ ] Publish to crates.io

## Future Ideas

- Full AttnRes mode (per-layer, not per-block) benchmarks at scale
- Integration examples with popular Rust inference frameworks
- ONNX export
- Quantization support (INT8/INT4)
- Streaming/chunked inference for long sequences
