While working on a fix for test_large_max_pool_contig (from issue #2366), it was discovered that the memory layout of the input tensor was implicitly changed from Contiguous to ChannelsLast. This made the output wrong, because the input and output tensors were then indexed with mismatched layouts. A fix was proposed in #2763. As a follow-up task, the performance of the different MaxPool2d kernel variants should be compared, in particular whether the channels-last kernel is faster than the general one.
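
For readers unfamiliar with this class of bug, here is a minimal pure-Python sketch of how a layout mismatch leads to wrong elements being read. It is an illustration only, not the kernel code from #2763: the helper names (`contiguous_strides`, `channels_last_strides`, `offset`) are hypothetical, and it assumes the mismatch amounts to addressing a channels-last buffer with contiguous (NCHW) strides.

```python
# Illustration only: hypothetical helpers, not code from the actual kernels.

def contiguous_strides(n, c, h, w):
    """Element strides for a logical NCHW tensor stored contiguously."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides for the same logical NCHW shape stored as NHWC."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat buffer offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (1, 2, 2, 2)  # N, C, H, W
N, C, H, W = shape
cl = channels_last_strides(*shape)
ct = contiguous_strides(*shape)

# Fill a flat buffer in channels-last order; each slot stores the
# logical index of the element that lives there.
buf = [None] * (N * C * H * W)
for n in range(N):
    for c in range(C):
        for h in range(H):
            for w in range(W):
                buf[offset((n, c, h, w), cl)] = (n, c, h, w)

idx = (0, 1, 0, 0)
right = buf[offset(idx, cl)]  # strides match the buffer layout
wrong = buf[offset(idx, ct)]  # contiguous strides on a channels-last buffer

print("requested:", idx)
print("matching strides read:  ", right)
print("mismatched strides read:", wrong)
```

With matching strides the lookup returns the requested element. With mismatched strides it silently returns a different one, which is the kind of input/output indexing mismatch described above.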
CC: @EikanWang