This document tracks detailed progress toward forge-cute-py v0.1 completion.
Target: Harness infrastructure + Week 0-2 kernel implementations aligned to KernelHeim curriculum.
Status: Harness infrastructure complete. Week 0 kernel complete. Week 1-2 kernels pending.
Goal: End-to-end correctness + benchmark + profile scripts
- Package scaffolding and build system
- Three-layer architecture (ops/kernels/ref)
- PyTorch ops registration via `torch.library`
- Test infrastructure with pytest
- Benchmark framework with suite system
- Profiling documentation and helper scripts
- CI/CD with ruff linting and formatting
- Pre-commit hooks configuration
- Documentation (README, CONTRIBUTING, DEVELOPMENT)
- CuTe DSL kernel implementation (`forge_cute_py/kernels/copy_transpose.py`)
- Ops layer with compilation caching (`forge_cute_py/ops/copy_transpose.py`)
- PyTorch reference implementation (`forge_cute_py/ref/copy_transpose.py`)
- Correctness tests with exact tolerance (atol=0, rtol=0)
- Support for float16, bfloat16, float32 dtypes
- Support for tile_size=16 and tile_size=32 variants
- Benchmark integration in `bench/run.py`
- Profiling examples in README
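The exact-tolerance check boils down to a comparison like the following (a sketch; the helper names are illustrative, not the actual test code):

```python
import torch

def ref_copy_transpose(x: torch.Tensor) -> torch.Tensor:
    # PyTorch reference: materialize the transpose as a contiguous copy.
    return x.t().contiguous()

def check_exact(out: torch.Tensor, ref: torch.Tensor) -> None:
    # atol=0, rtol=0: a pure data-movement kernel must match bit-for-bit.
    torch.testing.assert_close(out, ref, atol=0, rtol=0)

# The suite covers all three supported dtypes (and both tile sizes).
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    x = torch.randn(64, 128).to(dtype)
    check_exact(ref_copy_transpose(x), x.t().contiguous())
```

Zero tolerance is appropriate here because the kernel only moves bytes; any mismatch indicates an indexing bug rather than floating-point noise.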
Status: ✅ Complete
Goal: Multiple variants (naive → improved → shuffle) with correctness + benchmark coverage
- Test infrastructure (using PyTorch reference)
- Reference implementation (`forge_cute_py/ref/reduce_sum.py`)
- Ops registration and API design
- Benchmark integration
- **Naive variant**: Simple reduction without optimizations
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark baseline
- **Improved variant**: Optimized reduction with shared memory
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark comparison vs naive
- **Shuffle variant**: Warp-level shuffle reduction
  - CuTe DSL kernel implementation
  - Correctness tests vs PyTorch reference
  - Benchmark comparison vs improved
- **Documentation**: Profiling notes and performance analysis
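The shuffle variant's warp-level tree reduction can be sketched in plain Python (an emulation for intuition only; the real kernel uses CuTe DSL warp shuffles, not Python lists):

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes: list[float]) -> float:
    """Emulate a shuffle-down tree reduction across one warp."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by lane + offset (shfl_down),
        # halving the stride of active partial sums every step.
        vals = [
            vals[lane] + vals[lane + offset]
            if lane + offset < WARP_SIZE
            else vals[lane]
            for lane in range(WARP_SIZE)
        ]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the full warp sum
```

The point of the shuffle variant is that these log2(32) = 5 steps exchange values through registers, avoiding the shared-memory round-trips the improved variant relies on.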
Status: ⏳ Test infrastructure ready, kernels pending
Goal: Single-pass online softmax with correctness + benchmark coverage + profiling notes
- Test infrastructure (using PyTorch reference)
- Reference implementation (`forge_cute_py/ref/softmax_online.py`)
- Ops registration and API design
- Benchmark integration
- **Single-pass online softmax kernel**
  - CuTe DSL kernel implementation
  - Numerical stability handling (max subtraction)
  - Correctness tests vs PyTorch reference
  - Support for float16, bfloat16, float32
  - Benchmark integration
- **Documentation**:
  - Profiling notes
  - Performance characteristics
  - Comparison with PyTorch softmax
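The online-softmax recurrence behind the single-pass kernel, in scalar form (illustrative; the kernel applies this per row over vectorized tiles): maintain a running max `m` and a rescaled exponential sum `s`, folding in each element exactly once.

```python
import math

def online_softmax(xs: list[float]) -> list[float]:
    # Single pass: running max m and rescaled exp-sum s.
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the accumulated sum when the max grows; the max
        # subtraction keeps every exp() argument <= 0, so nothing overflows.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalization reuses the same m and s.
    return [math.exp(x - m) / s for x in xs]
```

Compared with the textbook two-pass formulation (one pass for the max, one for the sum), the online form reads each input element only once before the normalization step, which is what makes a single-pass kernel possible.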
Status: ⏳ Test infrastructure ready, kernel pending
- Ruff linting with GitHub Actions
- Ruff formatting checks
- Pre-commit hooks for local development
- Manual workflow dispatch support
- Configure GPU runners for correctness tests
- Run full test suite on GPU CI
- Optional: Performance smoke checks
- Optional: Nightly profiling runs
Status: ⏳ Local CI complete, GPU CI pending
Not currently in scope but may be added later:
- FlashAttention kernels (FA1, FA2)
- Decode/KV-cache operations
- FP8 support
- Distributed operations (NCCL)
- C++ extension builds
- Additional optimization variants
- Multi-GPU support
Active issues and milestones are tracked on GitHub.
For detailed change history, see CHANGELOG.md.