Description
Here:
pytorch-image-models/timm/models/swin_transformer_v2.py, lines 380 to 383 at a49b020

The attention mask at these lines is generated based on shift_size. However, the applied padding would offset these values such that the generated mask does not account for the shifted values. In other words, patches are being included in the attention calculation when they should not be.
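To make the described ordering concrete, here is a minimal sketch (with a hypothetical helper name, not the actual timm code) of shift-then-pad, where anything derived afterwards can only distinguish real tokens from zero padding:

```python
import torch
import torch.nn.functional as F

def shift_then_pad(x, window_size: int, shift_size: int):
    """Hypothetical sketch of the ordering described above; not the timm code."""
    B, H, W, C = x.shape

    # 1. cyclic shift for SW-MSA
    if shift_size > 0:
        x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

    # 2. pad bottom/right so H and W become multiples of window_size
    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))

    # 3. a mask derived from x at this point can only tell real tokens from
    #    zero padding; the wrapped (shifted) rows/columns look like ordinary
    #    interior tokens, so such a mask cannot exclude them.
    is_pad = torch.ones(H + pad_h, W + pad_w, dtype=torch.bool)
    is_pad[:H, :W] = False
    return x, is_pad
```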
As a concrete example, consider the following x:
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
After shifting the windows (a cyclic shift of 1 along the width), you get:
x = [[2, 3, 1], [5, 6, 4], [8, 9, 7]]
If the window size is 2, padding is then applied to make the dimensions divisible by the window size:
x = [[2, 3, 1, 0], [5, 6, 4, 0], [8, 9, 7, 0], [0, 0, 0, 0]]
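For concreteness, the shift and padding in this toy example can be reproduced with torch.roll and F.pad (this is only an illustration, not the timm code):

```python
import torch
import torch.nn.functional as F

# 1 x H x W x C tensor of token ids for the 3x3 example
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]).view(1, 3, 3, 1)

shift_size, window_size = 1, 2

# cyclic shift along the width, matching the example above
x = torch.roll(x, shifts=-shift_size, dims=2)

# pad bottom/right so H and W become multiples of window_size (3 -> 4)
_, H, W, _ = x.shape
pad_h = (window_size - H % window_size) % window_size
pad_w = (window_size - W % window_size) % window_size
x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))

print(x[0, :, :, 0])
# tensor([[2, 3, 1, 0],
#         [5, 6, 4, 0],
#         [8, 9, 7, 0],
#         [0, 0, 0, 0]])
```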
Because the shifted window attention mask is calculated from x at this point, the calculated attn mask would only mask out the added padding tokens, not the shifted values. In this particular example the shifted values do not attend to each other inappropriately, but in the case of a larger grid (e.g. 3x3) you would see cases where a token such as 7 might attend to a token such as 1.
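Window-partitioning the padded toy tensor shows which tokens end up in the same window; again this is only an illustration, not the timm code:

```python
import torch

# the padded, shifted 4x4 grid from the example above
x = torch.tensor([[2, 3, 1, 0],
                  [5, 6, 4, 0],
                  [8, 9, 7, 0],
                  [0, 0, 0, 0]])

window_size = 2
# partition into non-overlapping 2x2 windows; each row below is one window
windows = (x.view(2, window_size, 2, window_size)
             .permute(0, 2, 1, 3)
             .reshape(-1, window_size * window_size))
print(windows)
# tensor([[2, 3, 5, 6],
#         [1, 0, 4, 0],
#         [8, 9, 0, 0],
#         [7, 0, 0, 0]])
```

A mask derived from x at this point can only flag the zero entries, so all non-zero tokens within a window remain free to attend to each other. In this small case that happens to be harmless (1 and 4 are neighbours in the original image, and 7 is alone), but nothing in such a mask records which positions were wrapped around by the shift.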