fix: Scale QK in vision/attention #1154

changlan · 2025-05-04T04:22:30Z

The WindowAttention layer in vision/attention.py overrides the forward call directly, so we need to scale Q and K explicitly before computing the logits.
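For readers without the diff in front of them, here is a minimal JAX sketch of what "scaling before computing logits" means. It is not the code from this PR. The function names, the tensor layout `[batch, seq_len, num_heads, per_head_dim]`, and the `1/sqrt(per_head_dim)` factor applied to the query are all assumptions chosen for illustration; the actual layer may scale the query, the key, or both, and may use a different factor.

```python
import jax
import jax.numpy as jnp


def unscaled_attention_logits(q, k):
    """Dot-product logits with no scaling (the kind of behavior a scaling fix addresses).

    q, k: [batch, seq_len, num_heads, per_head_dim]
    returns: [batch, num_heads, target_len, source_len]
    """
    return jnp.einsum("btnh,bsnh->bnts", q, k)


def scaled_attention_logits(q, k):
    """Dot-product logits with the query scaled by 1/sqrt(per_head_dim).

    The scale factor is the conventional choice and is assumed here,
    not taken from the PR.
    """
    per_head_dim = q.shape[-1]
    q = q * per_head_dim**-0.5  # Scale before the matmul, not after.
    return jnp.einsum("btnh,bsnh->bnts", q, k)


# Hypothetical shapes: batch=2, seq_len=16, num_heads=4, per_head_dim=64.
key_q, key_k = jax.random.split(jax.random.PRNGKey(0))
q = jax.random.normal(key_q, (2, 16, 4, 64))
k = jax.random.normal(key_k, (2, 16, 4, 64))

scaled = scaled_attention_logits(q, k)
unscaled = unscaled_attention_logits(q, k)
```

With unit-variance inputs, the unscaled logits have a standard deviation near `sqrt(per_head_dim)`, which can saturate the softmax. A layer that bypasses the shared forward path has to apply the scaling itself to avoid that.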

ruomingp

Thanks! Please also see my comments on the internal PR.

axlearn/vision/attention.py

changlan · 2025-05-08T04:51:16Z

Gentle ping @ruomingp

changlan requested review from ruomingp, markblee and a team as code owners May 4, 2025 04:22

changlan enabled auto-merge May 4, 2025 05:03

ruomingp reviewed May 4, 2025


axlearn/vision/attention.py (outdated)

Scale QK in vision/attention

0cf34e4

changlan force-pushed the vision-scaleqk branch from baf4e8f to 0cf34e4 Compare May 5, 2025 00:13

changlan requested a review from ruomingp May 5, 2025 00:14

changlan assigned ruomingp May 8, 2025

ruomingp approved these changes May 9, 2025


changlan added this pull request to the merge queue May 9, 2025

Merged via the queue into apple:main with commit f72d420 May 9, 2025
6 checks passed

changlan deleted the vision-scaleqk branch May 9, 2025 21:25

Steboss pushed a commit to Steboss/axlearn that referenced this pull request May 15, 2025

Scale QK in vision/attention (apple#1154)

d940725
