**@slckl** (Contributor) commented Oct 17, 2025

## Changes

- new `cpu_backend/conv2d` module with all the conv2d CPU code in there
- new general conv2d impl, tentatively called `TiledIm2Col`
- conv2d entry point with a specialization -> default general fallback kernel-choice flow
- specialized 1x1 fast path as the first case, with decent perf
- makes `TiledIm2Col` + the specialized 1x1 path the default
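For intuition on why 1x1 gets its own fast path: with a 1x1 kernel and unit stride, conv2d degenerates into a plain channel matmul over contiguous spatial rows, so no im2col buffer is needed at all. A minimal sketch (illustrative only, not the actual candle kernel; the layout/indexing here is assumed, NCHW with stride-1):

```rust
/// Sketch of a 1x1 conv: out[co][p] = sum_ci w[co][ci] * input[ci][p],
/// where p runs over the h*w spatial positions of one image.
/// This is just a [c_out, c_in] x [c_in, h*w] matmul with contiguous rows.
fn conv2d_1x1(input: &[f32], w: &[f32], c_in: usize, c_out: usize, hw: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; c_out * hw];
    for co in 0..c_out {
        for ci in 0..c_in {
            let k = w[co * c_in + ci];
            let src = &input[ci * hw..(ci + 1) * hw];
            let dst = &mut out[co * hw..(co + 1) * hw];
            // contiguous axpy over the spatial dimension: dst += k * src
            for (d, s) in dst.iter_mut().zip(src) {
                *d += k * s;
            }
        }
    }
    out
}
```

The inner loop touches memory strictly sequentially, which is where the win over the generic path comes from.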

## What's missing?

- any kind of ARM/NEON benchmark numbers
- looking back at this, there seems to be an opening for a parallelized im2col -> full im2col matrix -> gemm impl, but that would still suffer from the slow strided copy at the end
- other small kernels not hitting the fast 1x1 path are probably still fastest on the full im2col - we could add a "specialization" here, but that needs more benchmarks...
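For reference, the core of the full-im2col approach discussed above, as a minimal stride-1, no-padding sketch (not the candle code; names and layout are assumptions): it materializes a `[c_in*k*k, out_h*out_w]` patch matrix so the conv becomes a single gemm against the `[c_out, c_in*k*k]` weight matrix.

```rust
/// Sketch of im2col for a single NCHW image, square kxk kernel,
/// stride 1, no padding. Each (ci, ky, kx) triple becomes one row of
/// the patch matrix; each output position becomes one column.
fn im2col(input: &[f32], c_in: usize, h: usize, w: usize, k: usize) -> Vec<f32> {
    let (out_h, out_w) = (h - k + 1, w - k + 1);
    let mut cols = vec![0.0f32; c_in * k * k * out_h * out_w];
    let mut row = 0;
    for ci in 0..c_in {
        for ky in 0..k {
            for kx in 0..k {
                for oy in 0..out_h {
                    for ox in 0..out_w {
                        // gather the pixel this kernel tap sees at (oy, ox)
                        cols[row * out_h * out_w + oy * out_w + ox] =
                            input[ci * h * w + (oy + ky) * w + (ox + kx)];
                    }
                }
                row += 1;
            }
        }
    }
    cols
}
```

The gather loops here are the part that parallelizes naturally; the strided write-back after the gemm is the part that doesn't.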

## Benchmarks

Benchmarks obtained using `cargo bench conv2d -p candle-nn`, switching the `DEFAULT` inside `conv2d.rs` between runs.
I bolded (well, Sonnet did) the winners in each bench.
Note that the conv2d + bias benches are heavily dominated by the `+bias` term, which obscures the conv diff (as evidenced by the huge delta vs. the non-bias benches).

### i7-12700h

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Direct | [5.1890 ms 5.2280 ms 5.2659 ms] |
| | full im2col | **[5.0741 ms 5.1176 ms 5.1603 ms]** |
| | tiled im2col | [5.1679 ms 5.2891 ms 5.4236 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Direct | [64.180 µs 64.335 µs 64.531 µs] |
| | full im2col | [48.858 µs 49.143 µs 49.510 µs] |
| | tiled im2col | [103.65 µs 104.83 µs 106.87 µs] |
| | Specialized | **[31.947 µs 32.132 µs 32.384 µs]** |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Direct | [1.1330 ms 1.1351 ms 1.1382 ms] |
| | full im2col | [372.25 µs 373.10 µs 374.08 µs] |
| | tiled im2col | **[254.32 µs 258.43 µs 264.46 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Direct | [6.9583 ms 6.9752 ms 6.9957 ms] |
| | full im2col | [3.2091 ms 3.2245 ms 3.2447 ms] |
| | tiled im2col | **[2.0145 ms 2.0691 ms 2.1208 ms]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Direct | [11.063 ms 11.104 ms 11.155 ms] |
| | full im2col | **[10.438 ms 10.468 ms 10.506 ms]** |
| | tiled im2col | [10.721 ms 10.798 ms 10.881 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Direct | [83.230 µs 83.447 µs 83.782 µs] |
| | full im2col | [57.435 µs 57.924 µs 58.345 µs] |
| | tiled im2col | [103.74 µs 105.38 µs 107.76 µs] |
| | Specialized | **[36.860 µs 36.939 µs 37.039 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Direct | [1.5233 ms 1.5355 ms 1.5485 ms] |
| | full im2col | [372.31 µs 373.41 µs 374.72 µs] |
| | tiled im2col | **[231.27 µs 234.71 µs 239.23 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Direct | [26.366 ms 26.632 ms 26.935 ms] |
| | full im2col | [6.0294 ms 6.0748 ms 6.1248 ms] |
| | tiled im2col | **[2.4195 ms 2.4403 ms 2.4688 ms]** |

### Ryzen 5900x

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Direct | [5.6169 ms 5.6314 ms 5.6476 ms] |
| | full im2col | **[5.3950 ms 5.4146 ms 5.4362 ms]** |
| | tiled im2col | [5.6461 ms 5.6780 ms 5.7103 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Direct | [70.219 µs 70.350 µs 70.512 µs] |
| | full im2col | [56.717 µs 56.859 µs 57.021 µs] |
| | tiled im2col | [114.58 µs 115.28 µs 115.99 µs] |
| | Specialized | **[37.189 µs 37.242 µs 37.295 µs]** |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Direct | [1.2667 ms 1.2691 ms 1.2717 ms] |
| | full im2col | [325.88 µs 326.46 µs 327.14 µs] |
| | tiled im2col | **[236.90 µs 238.75 µs 240.91 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Direct | [7.8799 ms 7.8934 ms 7.9071 ms] |
| | full im2col | [2.8233 ms 2.8263 ms 2.8292 ms] |
| | tiled im2col | **[1.1494 ms 1.1558 ms 1.1627 ms]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Direct | [13.148 ms 13.183 ms 13.222 ms] |
| | full im2col | **[11.018 ms 11.035 ms 11.055 ms]** |
| | tiled im2col | [12.330 ms 12.376 ms 12.423 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Direct | [161.07 µs 161.67 µs 162.37 µs] |
| | full im2col | [123.15 µs 123.48 µs 123.85 µs] |
| | tiled im2col | [110.57 µs 111.63 µs 112.73 µs] |
| | Specialized | **[102.03 µs 102.20 µs 102.38 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Direct | [3.4117 ms 3.4208 ms 3.4310 ms] |
| | full im2col | [342.73 µs 343.63 µs 344.72 µs] |
| | tiled im2col | **[223.55 µs 225.02 µs 226.67 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Direct | [56.797 ms 56.956 ms 57.134 ms] |
| | full im2col | [5.8133 ms 5.8373 ms 5.8630 ms] |
| | tiled im2col | **[1.9412 ms 1.9492 ms 1.9588 ms]** |

**@slckl** (Contributor, Author) commented Oct 18, 2025

Fixed clippy issues.

**@slckl** (Contributor, Author) commented Oct 18, 2025

Added Ryzen 5900x numbers.

**@ivarflakstad** (Member) commented
### Apple M4 Max
(Added main results as well for comparison)

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Main | **[5.7246 ms 5.7402 ms 5.7573 ms]** |
| | Direct | [5.9733 ms 5.9832 ms 5.9933 ms] |
| | full im2col | [5.7958 ms 5.8065 ms 5.8178 ms] |
| | tiled im2col | [6.2416 ms 6.2627 ms 6.2863 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Main | **[30.528 µs 30.733 µs 30.973 µs]** |
| | Direct | [56.829 µs 58.201 µs 59.665 µs] |
| | full im2col | [30.804 µs 31.151 µs 31.553 µs] |
| | tiled im2col | [113.31 µs 113.95 µs 114.59 µs] |
| | Specialized | [31.947 µs 32.132 µs 32.384 µs] |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Main | [240.50 µs 241.23 µs 242.10 µs] |
| | Direct | [768.42 µs 769.72 µs 770.96 µs] |
| | full im2col | [241.59 µs 242.55 µs 243.60 µs] |
| | tiled im2col | **[209.30 µs 213.43 µs 217.21 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Main | [1.6429 ms 1.6440 ms 1.6453 ms] |
| | Direct | [4.6420 ms 4.6595 ms 4.6806 ms] |
| | full im2col | [1.6465 ms 1.6481 ms 1.6502 ms] |
| | tiled im2col | **[703.71 µs 705.35 µs 706.98 µs]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Main | [5.8473 ms 5.8845 ms 5.9251 ms] |
| | Direct | [5.9108 ms 5.9375 ms 5.9685 ms] |
| | full im2col | **[5.8168 ms 5.8274 ms 5.8407 ms]** |
| | tiled im2col | [6.0945 ms 6.1098 ms 6.1265 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Main | [26.561 µs 26.587 µs 26.616 µs] |
| | Direct | [32.756 µs 32.887 µs 33.015 µs] |
| | full im2col | [27.047 µs 27.115 µs 27.211 µs] |
| | tiled im2col | [77.565 µs 77.760 µs 77.950 µs] |
| | Specialized | **[22.566 µs 22.592 µs 22.619 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Main | [233.41 µs 234.23 µs 235.00 µs] |
| | Direct | [523.16 µs 523.45 µs 523.86 µs] |
| | full im2col | [228.31 µs 228.84 µs 229.44 µs] |
| | tiled im2col | **[122.89 µs 123.23 µs 123.56 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Main | [3.8959 ms 3.9107 ms 3.9273 ms] |
| | Direct | [8.6383 ms 8.6492 ms 8.6617 ms] |
| | full im2col | [3.7789 ms 3.7843 ms 3.7908 ms] |
| | tiled im2col | **[1.1103 ms 1.1136 ms 1.1176 ms]** |

**@ivarflakstad** (Member) commented
Some of these numbers are really great 👏
Most of them are further improved by the shape/layout/stride changes we've been looking into (as you know).

From what I can tell, we have fairly consistent winners for each benchmark, which means we can assume that if a variant is considerably faster on your CPU, it likely behaves similarly on other CPUs as well.
