**@slckl** (Contributor) commented Oct 17, 2025

## Changes

- new `cpu_backend/conv2d` module with all the conv2d CPU code in there
- new general conv2d impl, tentatively called `TiledIm2Col`
- conv2d entry point with a specialization -> default general fallback kernel-choice flow
- specialized 1x1 fast path as the first case, with decent perf
- makes `TiledIm2Col` + the specialized 1x1 path the default
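For intuition on why 1x1 gets its own fast path: with a 1x1 kernel and unit stride, conv2d degenerates into a plain channel matmul over contiguous spatial rows, so no im2col buffer is needed at all. A minimal sketch (illustrative only, not the actual candle kernel; the layout/indexing here is assumed, NCHW with stride-1):

```rust
/// Sketch of a 1x1 conv: out[co][p] = sum_ci w[co][ci] * input[ci][p],
/// where p runs over the h*w spatial positions of one image.
/// This is just a [c_out, c_in] x [c_in, h*w] matmul with contiguous rows.
fn conv2d_1x1(input: &[f32], w: &[f32], c_in: usize, c_out: usize, hw: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; c_out * hw];
    for co in 0..c_out {
        for ci in 0..c_in {
            let k = w[co * c_in + ci];
            let src = &input[ci * hw..(ci + 1) * hw];
            let dst = &mut out[co * hw..(co + 1) * hw];
            // contiguous axpy over the spatial dimension: dst += k * src
            for (d, s) in dst.iter_mut().zip(src) {
                *d += k * s;
            }
        }
    }
    out
}
```

The inner loop touches memory strictly sequentially, which is where the win over the generic path comes from.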

## What's missing?

- any kind of ARM/NEON benchmark numbers
- looking back at this, there seems to be an opening for a parallelized im2col -> full im2col matrix -> gemm impl, but that would still suffer from the slow strided copy at the end
- other small kernels not hitting the fast 1x1 path are probably still fastest on the full im2col - we could add a "specialization" here, but that needs more benchmarks...
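For reference, the core of the full-im2col approach discussed above, as a minimal stride-1, no-padding sketch (not the candle code; names and layout are assumptions): it materializes a `[c_in*k*k, out_h*out_w]` patch matrix so the conv becomes a single gemm against the `[c_out, c_in*k*k]` weight matrix.

```rust
/// Sketch of im2col for a single NCHW image, square kxk kernel,
/// stride 1, no padding. Each (ci, ky, kx) triple becomes one row of
/// the patch matrix; each output position becomes one column.
fn im2col(input: &[f32], c_in: usize, h: usize, w: usize, k: usize) -> Vec<f32> {
    let (out_h, out_w) = (h - k + 1, w - k + 1);
    let mut cols = vec![0.0f32; c_in * k * k * out_h * out_w];
    let mut row = 0;
    for ci in 0..c_in {
        for ky in 0..k {
            for kx in 0..k {
                for oy in 0..out_h {
                    for ox in 0..out_w {
                        // gather the pixel this kernel tap sees at (oy, ox)
                        cols[row * out_h * out_w + oy * out_w + ox] =
                            input[ci * h * w + (oy + ky) * w + (ox + kx)];
                    }
                }
                row += 1;
            }
        }
    }
    cols
}
```

The gather loops here are the part that parallelizes naturally; the strided write-back after the gemm is the part that doesn't.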

## Benchmarks

Benchmarks obtained using `cargo bench conv2d -p candle-nn`, switching the `DEFAULT` inside `conv2d.rs` between runs.
I bolded (well, Sonnet did) the winners in each bench.
Note that the conv2d + bias benches are heavily dominated by the `+bias` term, which obscures the conv diff (as evidenced by the huge delta vs. the non-bias benches).

### i7-12700h

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Direct | [5.1890 ms 5.2280 ms 5.2659 ms] |
| | full im2col | **[5.0741 ms 5.1176 ms 5.1603 ms]** |
| | tiled im2col | [5.1679 ms 5.2891 ms 5.4236 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Direct | [64.180 µs 64.335 µs 64.531 µs] |
| | full im2col | [48.858 µs 49.143 µs 49.510 µs] |
| | tiled im2col | [103.65 µs 104.83 µs 106.87 µs] |
| | Specialized | **[31.947 µs 32.132 µs 32.384 µs]** |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Direct | [1.1330 ms 1.1351 ms 1.1382 ms] |
| | full im2col | [372.25 µs 373.10 µs 374.08 µs] |
| | tiled im2col | **[254.32 µs 258.43 µs 264.46 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Direct | [6.9583 ms 6.9752 ms 6.9957 ms] |
| | full im2col | [3.2091 ms 3.2245 ms 3.2447 ms] |
| | tiled im2col | **[2.0145 ms 2.0691 ms 2.1208 ms]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Direct | [11.063 ms 11.104 ms 11.155 ms] |
| | full im2col | **[10.438 ms 10.468 ms 10.506 ms]** |
| | tiled im2col | [10.721 ms 10.798 ms 10.881 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Direct | [83.230 µs 83.447 µs 83.782 µs] |
| | full im2col | [57.435 µs 57.924 µs 58.345 µs] |
| | tiled im2col | [103.74 µs 105.38 µs 107.76 µs] |
| | Specialized | **[36.860 µs 36.939 µs 37.039 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Direct | [1.5233 ms 1.5355 ms 1.5485 ms] |
| | full im2col | [372.31 µs 373.41 µs 374.72 µs] |
| | tiled im2col | **[231.27 µs 234.71 µs 239.23 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Direct | [26.366 ms 26.632 ms 26.935 ms] |
| | full im2col | [6.0294 ms 6.0748 ms 6.1248 ms] |
| | tiled im2col | **[2.4195 ms 2.4403 ms 2.4688 ms]** |

### Ryzen 5900x

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Direct | [5.6169 ms 5.6314 ms 5.6476 ms] |
| | full im2col | **[5.3950 ms 5.4146 ms 5.4362 ms]** |
| | tiled im2col | [5.6461 ms 5.6780 ms 5.7103 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Direct | [70.219 µs 70.350 µs 70.512 µs] |
| | full im2col | [56.717 µs 56.859 µs 57.021 µs] |
| | tiled im2col | [114.58 µs 115.28 µs 115.99 µs] |
| | Specialized | **[37.189 µs 37.242 µs 37.295 µs]** |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Direct | [1.2667 ms 1.2691 ms 1.2717 ms] |
| | full im2col | [325.88 µs 326.46 µs 327.14 µs] |
| | tiled im2col | **[236.90 µs 238.75 µs 240.91 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Direct | [7.8799 ms 7.8934 ms 7.9071 ms] |
| | full im2col | [2.8233 ms 2.8263 ms 2.8292 ms] |
| | tiled im2col | **[1.1494 ms 1.1558 ms 1.1627 ms]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Direct | [13.148 ms 13.183 ms 13.222 ms] |
| | full im2col | **[11.018 ms 11.035 ms 11.055 ms]** |
| | tiled im2col | [12.330 ms 12.376 ms 12.423 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Direct | [161.07 µs 161.67 µs 162.37 µs] |
| | full im2col | [123.15 µs 123.48 µs 123.85 µs] |
| | tiled im2col | [110.57 µs 111.63 µs 112.73 µs] |
| | Specialized | **[102.03 µs 102.20 µs 102.38 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Direct | [3.4117 ms 3.4208 ms 3.4310 ms] |
| | full im2col | [342.73 µs 343.63 µs 344.72 µs] |
| | tiled im2col | **[223.55 µs 225.02 µs 226.67 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Direct | [56.797 ms 56.956 ms 57.134 ms] |
| | full im2col | [5.8133 ms 5.8373 ms 5.8630 ms] |
| | tiled im2col | **[1.9412 ms 1.9492 ms 1.9588 ms]** |

**@slckl** (Contributor, Author) commented Oct 18, 2025

Fixed clippy issues.

**@slckl** (Contributor, Author) commented Oct 18, 2025

Added Ryzen 5900x numbers.

**@ivarflakstad** (Member) commented
### Apple M4 Max
(Added main results as well for comparison)

| Benchmark | Method | Time |
| --- | --- | --- |
| cpu_conv2d_F32_i128_k3x3_b/iter | Main | **[5.7246 ms 5.7402 ms 5.7573 ms]** |
| | Direct | [5.9733 ms 5.9832 ms 5.9933 ms] |
| | full im2col | [5.7958 ms 5.8065 ms 5.8178 ms] |
| | tiled im2col | [6.2416 ms 6.2627 ms 6.2863 ms] |
| cpu_conv2d_F32_i128_k1x1_nb/iter | Main | **[30.528 µs 30.733 µs 30.973 µs]** |
| | Direct | [56.829 µs 58.201 µs 59.665 µs] |
| | full im2col | [30.804 µs 31.151 µs 31.553 µs] |
| | tiled im2col | [113.31 µs 113.95 µs 114.59 µs] |
| | Specialized | [31.947 µs 32.132 µs 32.384 µs] |
| cpu_conv2d_F32_i128_k5x5_nb/iter | Main | [240.50 µs 241.23 µs 242.10 µs] |
| | Direct | [768.42 µs 769.72 µs 770.96 µs] |
| | full im2col | [241.59 µs 242.55 µs 243.60 µs] |
| | tiled im2col | **[209.30 µs 213.43 µs 217.21 µs]** |
| cpu_conv2d_F32_i512_k3x3_nb/iter | Main | [1.6429 ms 1.6440 ms 1.6453 ms] |
| | Direct | [4.6420 ms 4.6595 ms 4.6806 ms] |
| | full im2col | [1.6465 ms 1.6481 ms 1.6502 ms] |
| | tiled im2col | **[703.71 µs 705.35 µs 706.98 µs]** |
| cpu_conv2d_F16_i128_k3x3_b/iter | Main | [5.8473 ms 5.8845 ms 5.9251 ms] |
| | Direct | [5.9108 ms 5.9375 ms 5.9685 ms] |
| | full im2col | **[5.8168 ms 5.8274 ms 5.8407 ms]** |
| | tiled im2col | [6.0945 ms 6.1098 ms 6.1265 ms] |
| cpu_conv2d_F16_i128_k1x1_nb/iter | Main | [26.561 µs 26.587 µs 26.616 µs] |
| | Direct | [32.756 µs 32.887 µs 33.015 µs] |
| | full im2col | [27.047 µs 27.115 µs 27.211 µs] |
| | tiled im2col | [77.565 µs 77.760 µs 77.950 µs] |
| | Specialized | **[22.566 µs 22.592 µs 22.619 µs]** |
| cpu_conv2d_F16_i128_k5x5_nb/iter | Main | [233.41 µs 234.23 µs 235.00 µs] |
| | Direct | [523.16 µs 523.45 µs 523.86 µs] |
| | full im2col | [228.31 µs 228.84 µs 229.44 µs] |
| | tiled im2col | **[122.89 µs 123.23 µs 123.56 µs]** |
| cpu_conv2d_F16_i512_k5x5_nb/iter | Main | [3.8959 ms 3.9107 ms 3.9273 ms] |
| | Direct | [8.6383 ms 8.6492 ms 8.6617 ms] |
| | full im2col | [3.7789 ms 3.7843 ms 3.7908 ms] |
| | tiled im2col | **[1.1103 ms 1.1136 ms 1.1176 ms]** |

**@ivarflakstad** (Member) commented
Some of these numbers are really great 👏
Most of them are further improved by the shape/layout/stride changes we've been looking into (as you know).

From what I can tell, we have fairly consistent winners for each benchmark, which means we can assume that if a variant is considerably faster on your CPU, it likely behaves similarly on other CPUs as well.
