Commit 9851785

Update benchmarks.
1 parent 19117eb commit 9851785

1 file changed

Lines changed: 11 additions & 11 deletions

README.md

@@ -98,17 +98,17 @@ Benchmarks comparing cuTile.jl against cuTile Python on an RTX 5080 (`tileiras`
 
 | Kernel | Size | Julia | Python | Status |
 |--------|------|-------|--------|--------|
-| Vector Addition | 2^27 f32 | 841 GB/s | 845 GB/s | OK (=) |
-| Matrix Transpose | 8192² f32 | 805 GB/s | 811 GB/s | OK (-1%) |
-| Layer Norm fwd | 4096² f32 | 925 GB/s | 722 GB/s | +28%* |
-| Layer Norm bwd | 4096² f32 | 243 GB/s | 251 GB/s | -3% |
-| Matrix Multiplication | 4096³ f32 | 46.9 TFLOPS | 43.4 TFLOPS | +8%** |
-| Batch Matrix Multiply | 1024×512×2048 ×8 f32 | 33.6 TFLOPS | 30.9 TFLOPS | +9%** |
-| FFT (3-stage Cooley-Tukey) | 1024-pt ×64 c64 | 3263 μs | 3127 μs | -4% |
-| Mixture of Experts | 256tok 1024h 32e 2048i f16 | 19.3 TFLOPS | 20.3 TFLOPS | -5% |
-| Attention (FMHA) | 8×16×1024² ×64 f16 causal | 88.5 TFLOPS | 61.6 TFLOPS | +44%*** |
-
-\* The pow(x, 2) → mulf(x, x) strength reduction eliminates the expensive
+| Vector Addition | 2^27 f32 | 842 GB/s | 847 GB/s | OK (=) |
+| Matrix Transpose | 8192² f32 | 813 GB/s | 812 GB/s | OK (=) |
+| Layer Norm fwd | 4096² f32 | 931 GB/s | 716 GB/s | +30%* |
+| Layer Norm bwd | 4096² f32 | 245 GB/s | 250 GB/s | OK (-2%) |
+| Matrix Multiplication | 4096³ f32 | 47.0 TFLOPS | 43.3 TFLOPS | +9%** |
+| Batch Matrix Multiply | 1024×512×2048 ×8 f32 | 33.4 TFLOPS | 30.7 TFLOPS | +9%** |
+| FFT (3-stage Cooley-Tukey) | 512-pt ×64 c64 | 592 μs | 562 μs | OK (+5%) |
+| Mixture of Experts | 256tok 1024h 32e 2048i f16 | 18.8 TFLOPS | 20.3 TFLOPS | -7% |
+| Attention (FMHA) | 8×16×1024² ×64 f16 causal | 89.3 TFLOPS | 63.9 TFLOPS | +40%*** |
+
+\* The `pow(x, 2)` → `mulf(x, x)` strength reduction eliminates the expensive
 transcendental in the variance computation. Python still emits `pow`.
 
 \*\* Likely because Julia's `for` loop guards give `tileiras` a guarantee that the
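
The percentages in the Status column of the table above can be reproduced from the Julia and Python figures. The sketch below is one plausible way to do it, not the project's actual tooling: the helper name `relative_diff_percent` and its `higher_is_better` flag are hypothetical. It assumes throughput rows (GB/s, TFLOPS) compare Julia to Python directly. The README does not state how timing rows such as the FFT one (μs, lower is better) are signed, so the inverted ratio used for them here is an assumption.

```python
def relative_diff_percent(julia: float, python: float,
                          higher_is_better: bool = True) -> int:
    """Rounded percentage by which the Julia figure leads (+) or trails (-) Python.

    Hypothetical helper for illustration; not part of cuTile.jl.
    """
    # Throughput (GB/s, TFLOPS): a larger number is better, so use julia / python.
    # Timings (μs): a smaller number is better, so the ratio is inverted (assumption).
    ratio = julia / python if higher_is_better else python / julia
    return round((ratio - 1.0) * 100)


# Throughput figures copied from the updated table rows.
throughput_rows = {
    "Layer Norm fwd": (931.0, 716.0),
    "Matrix Multiplication": (47.0, 43.3),
    "Mixture of Experts": (18.8, 20.3),
    "Attention (FMHA)": (89.3, 63.9),
}

for kernel, (julia, python) in throughput_rows.items():
    print(f"{kernel}: {relative_diff_percent(julia, python):+d}%")
```

For the four throughput rows listed, this yields +30, +9, -7 and +40, which matches the updated Status column.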
