# AGENTS.md — AI Agent Technical Context

## Project Overview

**attnres-rs** is the first Rust implementation of Attention Residuals (from the MoonshotAI/Kimi paper), built on the [burn](https://github.com/tracel-ai/burn) deep learning framework. It provides a drop-in replacement for standard residual connections in Transformers.

## Tech Stack

| Component    | Technology                        | Version / Notes                    |
|--------------|-----------------------------------|------------------------------------|
| Language     | Rust                              | 2021 edition (1.80+)               |
| ML Framework | burn                              | 0.20                               |
| Test Backend | NdArray                           | CPU, deterministic                 |
| Testing      | cargo test + proptest + criterion | —                                  |
| Linting      | clippy + rustfmt                  | —                                  |
| CI           | GitHub Actions                    | test, clippy, fmt, build-examples  |

## Project Structure

```
src/
├── lib.rs             # Public API re-exports + module declarations
├── config.rs          # AttnResConfig — validated builder pattern
├── attn_res_op.rs     # Core AttnRes operation (depth-wise softmax attention)
├── block_state.rs     # BlockState — cumulative block representation tracking
├── layer.rs           # AttnResLayer — transformer layer with dual AttnRes
├── model.rs           # AttnResTransformer — full model (embed → layers → LM head)
├── rms_norm.rs        # RMSNorm implementation
├── two_phase.rs       # Two-phase inference optimization
├── attention.rs       # Multi-head self-attention
├── feed_forward.rs    # SwiGLU-style MLP
└── utils.rs           # Causal mask generation helpers

tests/
├── unit_tests.rs          # Core algorithm correctness tests
├── differential_tests.rs  # PyTorch reference comparison tests
├── property_tests.rs      # proptest property-based tests
└── integration_tests.rs   # Full model training loop tests

examples/
├── train_tiny.rs          # Train a small model on synthetic data
├── compare_residuals.rs   # Compare AttnRes vs standard residuals
└── visualize_weights.rs   # Visualize depth attention patterns

benches/
└── attn_res_benchmark.rs  # Criterion benchmarks

fixtures/                  # Reference outputs from PyTorch
├── attn_res_forward.json
└── block_state_tracking.json
```

## Commands

```bash
cargo build                            # Build the project
cargo test --all-features              # Run all 57 tests
cargo test test_name                   # Run a specific test
cargo clippy -- -D warnings            # Lint (warnings = errors)
cargo fmt                              # Format code
cargo fmt -- --check                   # Check formatting without modifying
cargo bench                            # Run Criterion benchmarks
cargo run --example train_tiny         # Train example
cargo run --example compare_residuals  # Comparison example
cargo run --example visualize_weights  # Visualization example
```

## Architecture Essentials

### Core Algorithm (AttnRes)

Standard residual: `x_{l+1} = x_l + f_l(x_l)` (fixed unit weights)

AttnRes: `x_{l+1} = Σ_i α_i · v_i`, where `α = softmax(w_l · RMSNorm(V))` is computed over the depth dimension

Key invariants (a minimal sketch follows this list):
1. **Zero-init pseudo-queries** → starts as uniform averaging (standard residual behavior)
2. **Two AttnRes per transformer layer** — one before self-attention, one before the MLP
3. **Softmax over depth** (the block/layer dimension), NOT over sequence tokens
4. **RMSNorm on keys** to prevent magnitude domination
5. **Block boundaries** at every `block_size/2` sublayers
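To make invariants 1, 3, and 4 concrete, here is a minimal, framework-free sketch of the combination step for a single position. The function name, the flat `Vec<f32>` layout, and the epsilon are illustrative assumptions, not the crate's API; the real operation lives in `attn_res_op.rs` and must match `spec.md`.

```rust
/// Illustrative single-position AttnRes combination (not the crate API).
/// `values[i]` holds v_i, the output of sublayer i collected so far;
/// `query` is the current sublayer's pseudo-query w_l.
fn attn_res_combine(values: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    let d_model = query.len();

    // Score each depth entry as dot(query, RMSNorm(v_i)); normalizing
    // the keys keeps large-magnitude values from dominating (invariant 4).
    let scores: Vec<f32> = values
        .iter()
        .map(|v| {
            let rms = (v.iter().map(|x| x * x).sum::<f32>() / d_model as f32)
                .sqrt()
                .max(1e-6);
            v.iter().zip(query).map(|(x, q)| (x / rms) * q).sum()
        })
        .collect();

    // Softmax over the DEPTH dimension, never over sequence tokens
    // (invariant 3).
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let z: f32 = exps.iter().sum();

    // Weighted sum of the values. A zero-initialized query makes every
    // score 0, so the weights start uniform, matching the standard
    // residual behavior described above (invariant 1).
    let mut out = vec![0.0f32; d_model];
    for (v, e) in values.iter().zip(&exps) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (e / z) * x;
        }
    }
    out
}
```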

### Data Flow

```
Input IDs → Embedding → [AttnResLayer × N] → RMSNorm → LM Head → Logits
                                ↓
           AttnResOp(pre-attn) → RMSNorm → MultiHeadAttention
           AttnResOp(pre-mlp)  → RMSNorm → FeedForward
```
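At layer granularity, each `AttnResLayer` runs this depth-wise mix twice, once in front of each sublayer, and appends each sublayer output to the running value list. Below is a framework-free sketch of that control flow with closures standing in for the real modules; the names are hypothetical, and the exact residual bookkeeping is defined by `spec.md`, not this sketch.

```rust
/// Illustrative control flow for one AttnResLayer (not the crate API).
/// `V` stands in for a hidden-state tensor; `values` accumulates every
/// sublayer output so later AttnRes ops can attend over the full depth.
fn layer_step<V>(
    values: &mut Vec<V>,
    attn_res: impl Fn(&[V]) -> V,   // depth-wise Σ α_i · v_i
    attn_sublayer: impl Fn(V) -> V, // RMSNorm → MultiHeadAttention
    mlp_sublayer: impl Fn(V) -> V,  // RMSNorm → FeedForward
) {
    // Sublayer 1: AttnRes replaces the fixed residual add before attention.
    let x = attn_res(values);
    values.push(attn_sublayer(x));

    // Sublayer 2: the same pattern in front of the MLP.
    let x = attn_res(values);
    values.push(mlp_sublayer(x));
}
```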

### Configuration

`AttnResConfig::new(d_model, num_layers, num_blocks)` (see the example after this list), where:
- `d_model`: Hidden dimension
- `num_layers`: Number of **sublayers** (transformer layers × 2)
- `num_blocks`: Number of blocks for Block AttnRes (set equal to `num_layers` for Full AttnRes)
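A hypothetical configuration, assuming `lib.rs` re-exports `AttnResConfig` and using illustrative sizes:

```rust
use attnres_rs::AttnResConfig;

// Illustrative sizes: d_model = 512, 24 sublayers (12 transformer layers),
// 6 blocks for Block AttnRes. Setting num_blocks equal to num_layers
// (24 here) would select Full AttnRes instead (per the list above).
let config = AttnResConfig::new(512, 24, 6);
```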

## Boundaries

### Read-Only (never modify)
- `spec.md`, `paper.md`, `research_report.md`, `implementation_plan.md`, `LICENSE`

### Gated (requires approval)
- `Cargo.toml` (dependency changes)
- `.github/workflows/` (CI changes)
- `cargo publish`

## Source of Truth

`spec.md` is the authoritative specification. All algorithm implementations must match the pseudocode and equations defined there.

## Known Gaps

- No safetensors serialization
- Two-phase inference not integrated into main forward path
- GPU backends (wgpu, CUDA, Metal) untested
- No distributed training support