I tried to reproduce the code of this paper and found that the original description presents an unfeasible process. #7

@BUG423

Description

I tried to reproduce the code from this paper and found that the described process is infeasible. The Encoder uses convolutional downsampling, which changes both the channel and temporal dimensions — in particular, it shrinks the temporal dimension. However, none of the operations in the Decoder can modify the temporal dimension. As a result, when the Decoder's multi-head attention combines the Encoder output with the output of the masked attention, the temporal dimensions do not match.
I also tried adding padding during the Encoder's downsampling so that the size stays unchanged, but this contradicts the very idea of downsampling, and whenever the convolution kernel and stride do not match, the temporal dimension still changes.
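To make the length arithmetic concrete, here is a minimal sketch of the standard 1-D convolution output-length formula. The sequence length, kernel sizes, and strides below are hypothetical examples, not values from the paper:

```python
def conv1d_out_len(L_in, kernel, stride, padding=0):
    """Temporal length after a 1-D convolution (standard formula)."""
    return (L_in + 2 * padding - kernel) // stride + 1

L = 128  # hypothetical encoder input sequence length

# Strided downsampling halves the temporal dimension:
print(conv1d_out_len(L, kernel=3, stride=2, padding=1))  # 64

# "Same" padding preserves the length only when stride == 1:
print(conv1d_out_len(L, kernel=3, stride=1, padding=1))  # 128

# With stride > 1, padding cannot restore the original length:
print(conv1d_out_len(L, kernel=4, stride=2, padding=1))  # 64
```

This illustrates the point above: once the encoder applies a stride greater than 1, its output is shorter than the decoder's sequence, and no integer padding choice recovers the original length.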
I had to stop here and could not go any further.
