cuTile v0.3.0
Breaking changes: Replace ct.launch with @cuda backend=cuTile
Merged pull requests:
- Include (c - a*b) pattern in FMA rewrite pipeline. (#190) (@maleadt)
- Support for array slicing (#191) (@maleadt)
- Add generic dataflow framework; port constant & alias analyses (#192) (@maleadt)
- Add kernel and host RNG (#193) (@maleadt)
- Stop special-casing TileArray in codegen; add permutedims/transpose/reshape (#194) (@maleadt)
- Document intrinsics. (#198) (@maleadt)
- Add transform-side control-flow helpers. (#199) (@maleadt)
- Add KernelState plumbing for per-launch ambient state. (#200) (@maleadt)
- Various small fixes (#201) (@maleadt)
- Test BFloat16 broadcast subtraction. (#202) (@maleadt)
- Minor fixes given the Tile IR spec (#203) (@maleadt)
- Improve Random.jl coverage: randn and randexp (#204) (@maleadt)
- Extend assumption analysis: divisibility, bounds, no-wrap (#205) (@maleadt)
- Add lightweight CSE on StructuredIRCode. (#207) (@maleadt)
- Update benchmarks (#208) (@maleadt)
- Benchmark harness improvements (#210) (@maleadt)
- Recompute IR flags from efunc when the rewriter changes opcodes. (#211) (@maleadt)
- Fold contiguous-axis stride in gather/scatter offset chains (#212) (@maleadt)
- Add rewrite rule to drop contiguous-axis stride in scatter/gather offsets + unified AssumeOp injection (#213) (@maleadt)
- Integrate with CUDA.jl + reduce launch overhead (#214) (@maleadt)
- Fix UndefVarError in README Quick Start (#215) (@AntonOresten)
- TTFX improvements (#216) (@maleadt)
- Backport fixes from cuTile Python (#218) (@maleadt)
- Suppress divby assume on tile-of-pointers from offset. (#219) (@maleadt)
- Run normalization rewrites to fixpoint before FMA fusion. (#220) (@maleadt)
- Update benchmark timings. (#221) (@maleadt)
Closed issues: