I tried to reproduce the code of this paper and found that the original description presents an unfeasible process. #7

@BUG423

Description

I tried to reproduce the code from this paper and found that the described process is infeasible. The Encoder uses convolutional downsampling, which changes both the channel and temporal dimensions — in particular, it shrinks the temporal dimension. However, none of the operations in the Decoder can modify the temporal dimension. As a result, when the Decoder's multi-head attention combines the Encoder output with the output of the masked attention, the temporal dimensions do not match.
I also tried adding padding during the Encoder's downsampling so that the size stays unchanged, but this contradicts the very idea of downsampling, and whenever the convolution kernel and stride do not match, the temporal dimension still changes.
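To make the length arithmetic concrete, here is a minimal sketch of the standard 1-D convolution output-length formula. The sequence length, kernel sizes, and strides below are hypothetical examples, not values from the paper:

```python
def conv1d_out_len(L_in, kernel, stride, padding=0):
    """Temporal length after a 1-D convolution (standard formula)."""
    return (L_in + 2 * padding - kernel) // stride + 1

L = 128  # hypothetical encoder input sequence length

# Strided downsampling halves the temporal dimension:
print(conv1d_out_len(L, kernel=3, stride=2, padding=1))  # 64

# "Same" padding preserves the length only when stride == 1:
print(conv1d_out_len(L, kernel=3, stride=1, padding=1))  # 128

# With stride > 1, padding cannot restore the original length:
print(conv1d_out_len(L, kernel=4, stride=2, padding=1))  # 64
```

This illustrates the point above: once the encoder applies a stride greater than 1, its output is shorter than the decoder's sequence, and no integer padding choice recovers the original length.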
I had to stop here and could not go any further.
