| rank | exp | latency (s) | PSNR (dB) | max_abs_diff | description |
|---:|---:|---:|---:|---:|---|
| 1 | 18 | 4.791 | 61.9 | 0.0610 | break-free, no coordinate_descent |
| 2 | 16 | 4.796 | 61.9 | 0.0688 | fuse unpatchify into compiled decoder |
| 3 | 14 | 4.816 | 61.9 | 0.0593 | break-free decoder (per-module cache, full fusion) |
| 4 | 13 | 6.641 | 61.3 | 0.0911 | inductor coordinate_descent_tuning |
| 5 | 10 | 6.643 | 61.3 | 0.0892 | compile max-autotune + graph |
| 6 | 9 | 6.658 | 61.2 | 0.0771 | compile decoder for elementwise fusion + graph |
| 7 | 8 | 7.419 | 61.1 | 0.0667 | native-spatial-pad conv (avoid F.pad copy) |
| 8 | 7 | 8.002 | 61.1 | 0.0667 | full-decode CUDA graph |
| 9 | 6 | 8.021 | 61.1 | 0.0667 | bf16 + channels_last + bf16-native upsample |
| 10 | 4 | 8.105 | 61.3 | 0.0922 | bf16 + compile max-autotune-no-cudagraphs |
| 11 | 3 | 8.155 | 61.1 | 0.0667 | bf16 + channels_last_3d eager |
| 12 | 2 | 8.257 | 61.3 | 0.0844 | bf16 + compile decoder |
| 13 | 1 | 8.423 | 61.1 | 0.0708 | bf16 eager |
| 14 | 0 | 14.467 | 126.0 | 0.0000 | baseline fp32 eager (Wan 2.2 VAE) |
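
To make the latency column easier to compare, the following sketch computes each experiment's speedup relative to the fp32 eager baseline (exp 0). The latencies are copied from the table; the dictionary layout and the `speedup_vs_baseline` helper are illustrative names, not part of the original benchmark code.

```python
# Latencies in seconds, copied from the table above and keyed by experiment id.
latency_s = {
    18: 4.791, 16: 4.796, 14: 4.816, 13: 6.641, 10: 6.643,
    9: 6.658, 8: 7.419, 7: 8.002, 6: 8.021, 4: 8.105,
    3: 8.155, 2: 8.257, 1: 8.423, 0: 14.467,
}


def speedup_vs_baseline(latencies, baseline_exp=0):
    """Return {exp: baseline_latency / latency} for every experiment."""
    base = latencies[baseline_exp]
    return {exp: base / t for exp, t in latencies.items()}


speedups = speedup_vs_baseline(latency_s)

# Print fastest first, two decimal places.
for exp, s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"exp {exp:>2}: {s:.2f}x")
```

Read this way, the top three configurations (exp 18, 16, 14) are each roughly 3x faster than the baseline, while plain bf16 eager (exp 1) is about 1.7x faster.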