This document tracks detailed progress toward forge-cute-py v0.1 completion.
Target: Harness infrastructure + Week 0-2 kernel implementations aligned to KernelHeim curriculum.
Status: Harness infrastructure complete. Week 0 kernel complete. Week 1-2 kernels pending.
Goal: End-to-end correctness + benchmark + profile scripts
- Package scaffolding and build system
- Three-layer architecture (ops/kernels/ref)
- PyTorch ops registration via `torch.library`
- Test infrastructure with pytest
- Benchmark framework with suite system
- Profiling documentation and helper scripts
- CI/CD with ruff linting and formatting
- Pre-commit hooks configuration
- Documentation (README, CONTRIBUTING, DEVELOPMENT)
- CuTe DSL kernel implementation (`forge_cute_py/kernels/copy_transpose.py`)
- Ops layer with compilation caching (`forge_cute_py/ops/copy_transpose.py`)
- PyTorch reference implementation (`forge_cute_py/ref/copy_transpose.py`)
- Correctness tests with exact tolerance (atol=0, rtol=0)
- Support for float16, bfloat16, float32 dtypes
- Support for tile_size=16 and tile_size=32 variants
- Benchmark integration in `bench/run.py`
- Profiling examples in README
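The exact-tolerance check boils down to a comparison like the following (a sketch; the helper names are illustrative, not the actual test code):

```python
import torch

def ref_copy_transpose(x: torch.Tensor) -> torch.Tensor:
    # PyTorch reference: materialize the transpose as a contiguous copy.
    return x.t().contiguous()

def check_exact(out: torch.Tensor, ref: torch.Tensor) -> None:
    # atol=0, rtol=0: a pure data-movement kernel must match bit-for-bit.
    torch.testing.assert_close(out, ref, atol=0, rtol=0)

# The suite covers all three supported dtypes (and both tile sizes).
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    x = torch.randn(64, 128).to(dtype)
    check_exact(ref_copy_transpose(x), x.t().contiguous())
```

Zero tolerance is appropriate here because the kernel only moves bytes; any mismatch indicates an indexing bug rather than floating-point noise.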
Status: ✅ Complete
Goal: Multiple variants (naive → improved → shuffle) with correctness + benchmark coverage
- Test infrastructure (using PyTorch reference)
- Reference implementation (`forge_cute_py/ref/reduce_sum.py`)
- Ops registration and API design
- Benchmark integration
- **Naive variant**: Simple reduction without optimizations
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark baseline
- **Improved variant**: Optimized reduction with shared memory
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark comparison vs naive
- **Shuffle variant**: Warp-level shuffle reduction
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark comparison vs improved
- **Documentation**: Profiling notes and performance analysis
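The shuffle variant's warp-level tree reduction can be sketched in plain Python (an emulation for intuition only; the real kernel uses CuTe DSL warp shuffles, not Python lists):

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes: list[float]) -> float:
    """Emulate a shuffle-down tree reduction across one warp."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by lane + offset (shfl_down),
        # halving the stride of active partial sums every step.
        vals = [
            vals[lane] + vals[lane + offset]
            if lane + offset < WARP_SIZE
            else vals[lane]
            for lane in range(WARP_SIZE)
        ]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the full warp sum
```

The point of the shuffle variant is that these log2(32) = 5 steps exchange values through registers, avoiding the shared-memory round-trips the improved variant relies on.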
Status: ⏳ Test infrastructure ready, kernels pending
Goal: Single-pass online softmax with correctness + benchmark coverage + profiling notes
- Test infrastructure (using PyTorch reference)
- Reference implementation (`forge_cute_py/ref/softmax_online.py`)
- Ops registration and API design
- Benchmark integration
- **Single-pass online softmax kernel**
  - CuTe DSL kernel implementation
  - Numerical stability handling (max subtraction)
  - Correctness tests vs PyTorch reference
  - Support for float16, bfloat16, float32
  - Benchmark integration
- **Documentation**:
  - Profiling notes
  - Performance characteristics
  - Comparison with PyTorch softmax
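The online-softmax recurrence behind the single-pass kernel, in scalar form (illustrative; the kernel applies this per row over vectorized tiles): maintain a running max `m` and a rescaled exponential sum `s`, folding in each element exactly once.

```python
import math

def online_softmax(xs: list[float]) -> list[float]:
    # Single pass: running max m and rescaled exp-sum s.
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the accumulated sum when the max grows; the max
        # subtraction keeps every exp() argument <= 0, so nothing overflows.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalization reuses the same m and s.
    return [math.exp(x - m) / s for x in xs]
```

Compared with the textbook two-pass formulation (one pass for the max, one for the sum), the online form reads each input element only once before the normalization step, which is what makes a single-pass kernel possible.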
Status: ⏳ Test infrastructure ready, kernel pending
- Ruff linting with GitHub Actions
- Ruff formatting checks
- Pre-commit hooks for local development
- Manual workflow dispatch support
- Configure GPU runners for correctness tests
- Run full test suite on GPU CI
- Optional: Performance smoke checks
- Optional: Nightly profiling runs
Status: ⏳ Local CI complete, GPU CI pending
Not currently in scope but may be added later:
- FlashAttention kernels (FA1, FA2)
- Decode/KV-cache operations
- FP8 support
- Distributed operations (NCCL)
- C++ extension builds
- Additional optimization variants
- Multi-GPU support
Active issues and milestones are tracked on GitHub.
For detailed change history, see CHANGELOG.md.