
Conversation


@shenxiangzhuang commented on Oct 24, 2025

Fix the attention pad mask: with .unsqueeze(1).unsqueeze(3) we get a wrong broadcast result. Currently the code works despite this bug only because we use -10000 rather than the usual -inf:

score = score.masked_fill(mask == 0, -10000)
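
A quick toy check of that claim (the tensors below are illustrative, not the repository's code): when every key in a row is masked out, -10000 leaves a finite, uniform softmax, while -inf produces nan.

import torch
import torch.nn.functional as F

# One attention row in which every key position is masked out.
score = torch.tensor([1.0, 2.0, 3.0, 4.0])
mask = torch.tensor([False, False, False, False])

# -10000 keeps the row finite: softmax degenerates to a uniform distribution.
print(F.softmax(score.masked_fill(mask == 0, -10000.0), dim=-1))
# tensor([0.2500, 0.2500, 0.2500, 0.2500])

# -inf makes the whole row -inf and softmax returns nan.
print(F.softmax(score.masked_fill(mask == 0, float("-inf")), dim=-1))
# tensor([nan, nan, nan, nan])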

If we use -inf as usual, we get errors. Here is the analysis:

pad_mask_bad = (input_ids != padding_idx).unsqueeze(1).unsqueeze(3)

which yields shape (batch, 1, seq_len, 1). For a toy batch with padding at the end:

input_ids = torch.tensor([[5, 7, 0, 0]])
pad_mask_bad[0, 0]
# tensor([[ True],
#         [ True],
#         [False],
#         [False]])

During attention this mask must broadcast to (batch, heads, seq_q, seq_k). The third query row (index 2) carries only a single False, so broadcasting replicates it across every key position:

row_2_after_broadcast = [False, False, False, False]

After applying the causal mask everything in that row stays False, so the logits become -inf, and softmax turns the row into nan. The failure only appeared when an entire suffix was padding, which explains why it slipped through basic smoke tests.

Step-by-step view.

  1. Build the naïve mask: pad_mask_bad.shape == (1, 1, 4, 1).
  2. Combine with the lower-triangular causal mask:
     [[ True, False, False, False],
      [ True,  True, False, False],
      [False, False, False, False],
      [False, False, False, False]]
  3. Apply to the logits → rows 2 and 3 contain only -inf, so softmax turns them into nan attention weights (reproduced in the sketch below).
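
A minimal sketch reproducing these three steps on the toy batch (the names and the causal-mask construction below are illustrative, not taken from the repository; only the pad-mask line mirrors the code under discussion):

import torch
import torch.nn.functional as F

padding_idx = 0
input_ids = torch.tensor([[5, 7, 0, 0]])   # (batch=1, seq_len=4)
seq_len = input_ids.size(1)

# Step 1: the naive pad mask, shape (1, 1, 4, 1).
pad_mask_bad = (input_ids != padding_idx).unsqueeze(1).unsqueeze(3)

# Step 2: combine with a lower-triangular causal mask of shape (1, 1, 4, 4).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).view(1, 1, seq_len, seq_len)
combined = pad_mask_bad & causal           # rows 2 and 3 end up all False
print(combined[0, 0].int())

# Step 3: apply to logits with -inf; softmax turns rows 2 and 3 into nan.
score = torch.randn(1, 1, seq_len, seq_len)
score = score.masked_fill(combined == 0, float("-inf"))
print(F.softmax(score, dim=-1)[0, 0])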

The fix. Keep the key axis explicit:

pad_mask_good = (input_ids != padding_idx).unsqueeze(1).unsqueeze(2)

Now the mask starts at (batch, 1, 1, seq_len) and broadcasting preserves the column-wise padding information:

[[ True, False, False, False],
 [ True,  True, False, False],
 [ True,  True, False, False],
 [ True,  True, False, False]]

Rows 2 and 3 still attend to the earlier valid tokens, so the logits stay finite and the model trains normally.
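
Repeating the sketch with the corrected mask (same illustrative toy setup as above) reproduces the matrix shown and confirms that no nan appears:

import torch
import torch.nn.functional as F

padding_idx = 0
input_ids = torch.tensor([[5, 7, 0, 0]])
seq_len = input_ids.size(1)

# Fixed pad mask: the key axis stays explicit, shape (batch, 1, 1, seq_len).
pad_mask_good = (input_ids != padding_idx).unsqueeze(1).unsqueeze(2)

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).view(1, 1, seq_len, seq_len)
combined = pad_mask_good & causal
print(combined[0, 0].int())                # matches the matrix above

score = torch.randn(1, 1, seq_len, seq_len)
score = score.masked_fill(combined == 0, float("-inf"))
print(torch.isnan(F.softmax(score, dim=-1)).any())   # tensor(False): every row stays finite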

@shenxiangzhuang (Author) commented:

Hi @hyunwoongko, could you review this when you have spare time? I want to make sure that I don't misunderstand the purpose of the original implementation.
