I tried to reproduce the code for this paper and found that the architecture as described does not appear to be implementable. The Encoder uses convolutional downsampling, which changes both the channel and temporal dimensions, and in particular shrinks the temporal dimension. However, none of the operations in the Decoder can change the temporal dimension, so when the Decoder's multi-head attention is computed from the Encoder output together with the output of the masked attention, the dimensions cannot be made to match.
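To make the mismatch concrete, here is a minimal sketch (hypothetical sizes and layer settings, not taken from the paper) of how a stride-2 convolutional Encoder shortens the temporal dimension while the Decoder side keeps its original length:

```python
import torch
import torch.nn as nn

# hypothetical sizes, not taken from the paper
B, T, d_model = 2, 16, 64
src = torch.randn(B, d_model, T)      # encoder input: (batch, channels, time)

# hypothetical encoder: two stride-2 convolutions, so T -> T/4
encoder = nn.Sequential(
    nn.Conv1d(d_model, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv1d(128, d_model, kernel_size=3, stride=2, padding=1),
)
memory = encoder(src)
print(memory.shape)                   # torch.Size([2, 64, 4]) -- temporal dim reduced

# the decoder's masked self-attention never changes the length, so its
# output stays at the original T
tgt = torch.randn(B, T, d_model)
print(tgt.shape)                      # torch.Size([2, 16, 64]) -- still length 16

# the two tensors the decoder's attention is supposed to combine now have
# different temporal lengths (4 vs. 16), which is the mismatch described above
```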
I also attempted to add padding during the Encoder's downsampling so that the temporal size stays unchanged, but that already defeats the purpose of downsampling. Moreover, whenever the stride is greater than 1, the temporal dimension still shrinks no matter how the padding and kernel size are chosen.
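For reference, a short sketch of why padding alone cannot help (sizes are hypothetical): with PyTorch's Conv1d, "same"-style padding preserves the length only when the stride is 1; once the stride exceeds 1, the output length is roughly L/stride regardless of the padding.

```python
import torch
import torch.nn as nn

C, T = 64, 16                          # hypothetical channel count / length
x = torch.randn(1, C, T)

# stride 1 with padding = (kernel_size - 1) // 2 preserves the length
same = nn.Conv1d(C, C, kernel_size=3, stride=1, padding=1)
print(same(x).shape)                   # torch.Size([1, 64, 16]) -- T unchanged

# with stride 2 the length shrinks regardless of the padding:
# L_out = floor((L_in + 2*padding - kernel_size) / stride) + 1
down = nn.Conv1d(C, C, kernel_size=3, stride=2, padding=1)
print(down(x).shape)                   # torch.Size([1, 64, 8]) -- T halved
```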
I have had to stop my reproduction attempt at this point.