Umbrella guidance: the workspace-root AGENTS.md is the source of truth for cross-repo thesis, boundaries, and rules. This file is the repo-specific authority for kin-infer.
Universal transformer inference engine in pure Rust. Custom GPU compute shaders, no external ML frameworks.
cargo build # CPU only
cargo build --features metal # macOS: Apple Metal GPU
cargo build --features cuda # Linux/Windows: NVIDIA CUDA GPU
cargo build --features accelerate # macOS: Accelerate BLAS (CPU)
cargo test --features metal # run all tests including Metal GPU (Warning: can hit stale binary bugs)
./scripts/run-tests.sh # RECOMMENDED: runs tests with clean environment (cleans stale binaries + GPU sweep)

src/lib.rs - Core engine: model loading, forward pass, SIMD primitives (~2100 lines)
src/gpu.rs - GPU abstraction: GpuCompute trait, device discovery, CPU fallback
src/metal_backend.rs - Apple Metal: custom MSL compute shaders for matmul, softmax, norms, activations
src/cuda_backend.rs - NVIDIA CUDA: PTX kernels via driver API FFI (no toolkit needed at build time)

metal - Apple Metal GPU (macOS, M1/M2/M3). Deps: metal, objc2
cuda - NVIDIA CUDA via driver API (Linux/Windows). No build-time deps, just needs NVIDIA driver
accelerate - Apple Accelerate BLAS for CPU matmul
mkl - Intel MKL BLAS
openblas - OpenBLAS fallback
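
The feature flags above are usually consumed through compile-time `cfg` checks. The following is a minimal sketch of that pattern, not kin-infer's actual code: the `Backend` enum, the `select_backend` function, and the Metal-before-CUDA priority order are all hypothetical and exist only for illustration.

```rust
/// Hypothetical backend tag for illustration; not a kin-infer type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Metal,
    Cuda,
    Cpu,
}

/// Pick a backend from the Cargo features enabled at compile time.
/// `cfg!` expands to a constant boolean, so the choice is fixed when
/// the crate is built. The priority order here is an assumption.
pub fn select_backend() -> Backend {
    if cfg!(all(feature = "metal", target_os = "macos")) {
        Backend::Metal
    } else if cfg!(feature = "cuda") {
        Backend::Cuda
    } else {
        // Plain `cargo build` with no GPU feature lands here.
        Backend::Cpu
    }
}
```

Under these assumptions, a build with no GPU feature enabled always resolves to the CPU path, which matches the "CPU only" default build listed earlier.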

BertConfig / ModelArchitecture - model configuration and auto-detection
BertModel - loaded model with weights
KvCache - decoder-only KV cache for generation
SamplingParams - temperature, top-k, top-p, repetition penalty
gpu::GpuCompute - trait for GPU-accelerated tensor ops
gpu::GpuDeviceInfo - discovered GPU device information
gpu::discover_devices() - enumerate available GPUs
gpu::create_compute() - get best available compute backend
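
To make the four sampling knobs concrete, here is a self-contained sketch of how temperature, top-k, top-p, and a repetition penalty can combine into a token distribution. It does not reproduce kin-infer's implementation: the struct below is a hypothetical stand-in with assumed field names and types, `token_probs` is an illustrative helper, and the order in which the steps are applied is an assumption.

```rust
/// Hypothetical stand-in for the sampling parameters; field names and
/// types are assumptions, not kin-infer's actual definition.
#[derive(Debug, Clone)]
pub struct SamplingParams {
    pub temperature: f32,        // must be > 0 in this sketch
    pub top_k: usize,            // 0 means "no top-k limit"
    pub top_p: f32,              // nucleus mass in (0, 1]
    pub repetition_penalty: f32, // 1.0 means "no penalty"
}

/// Convert raw logits into probabilities over the vocabulary.
/// Tokens removed by top-k or top-p get probability 0.
pub fn token_probs(logits: &[f32], previous: &[usize], p: &SamplingParams) -> Vec<f32> {
    let mut scaled: Vec<f32> = logits.to_vec();
    // Repetition penalty: push already-generated tokens toward lower scores.
    for &t in previous {
        if let Some(l) = scaled.get_mut(t) {
            *l = if *l > 0.0 { *l / p.repetition_penalty } else { *l * p.repetition_penalty };
        }
    }
    // Temperature: values below 1 sharpen, values above 1 flatten.
    for l in scaled.iter_mut() {
        *l /= p.temperature;
    }
    // Rank token ids by descending scaled logit.
    let mut order: Vec<usize> = (0..scaled.len()).collect();
    order.sort_by(|&a, &b| scaled[b].total_cmp(&scaled[a]));
    // Top-k: keep only the k best-ranked tokens.
    let k = if p.top_k == 0 { order.len() } else { p.top_k.min(order.len()) };
    order.truncate(k);
    // Numerically stable softmax terms over the survivors.
    let max = order.first().map(|&i| scaled[i]).unwrap_or(0.0);
    let exps: Vec<f32> = order.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Top-p: keep the shortest ranked prefix whose mass reaches top_p.
    let mut keep = 0;
    let mut cum = 0.0;
    for e in &exps {
        keep += 1;
        cum += e / sum;
        if cum >= p.top_p {
            break;
        }
    }
    // Renormalize over the kept tokens and scatter back to vocabulary order.
    let kept_sum: f32 = exps[..keep].iter().sum();
    let mut probs = vec![0.0; logits.len()];
    for (rank, &i) in order.iter().take(keep).enumerate() {
        probs[i] = exps[rank] / kept_sum;
    }
    probs
}
```

Drawing a token from the returned distribution is left out so the sketch stays deterministic and free of external crates.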