The way to mask attention matrices in Flax #2915
Unanswered · kenkenpa2126 asked this question in Q&A
Replies: 2 comments · 4 replies
-
Hey @kenkenpa2126, can you explain why this would happen? The way I see it ...
-
Hey @kenkenpa2126, I did a simple experiment and can confirm that the gradients are indeed masked (i.e. zero) when the masked positions are replaced by any constant:

```python
import jax
import jax.numpy as jnp

weights = jnp.full((3, 3), 2.0)
mask = jnp.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])

def fn(weights, mask):
    # Replace masked-out positions with a large negative constant.
    big_neg = jnp.finfo(jnp.float32).min
    weights = jnp.where(mask, weights, big_neg)
    return jnp.sum(weights)

# Gradient w.r.t. `weights`: 1.0 where mask == 1, 0.0 where mask == 0.
grads = jax.grad(fn)(weights, mask)
print(grads)
```
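To connect this with the attention case, here is a minimal sketch of my own (the `values` matrix is arbitrary, not anything from Flax) showing that the same `jnp.where` masking keeps the gradient at zero for masked positions even when a softmax is applied on top:

```python
import jax
import jax.numpy as jnp

def masked_softmax_loss(weights, mask):
    # Same masking pattern as above, followed by a softmax over each row.
    big_neg = jnp.finfo(jnp.float32).min
    masked = jnp.where(mask, weights, big_neg)
    probs = jax.nn.softmax(masked, axis=-1)
    # Arbitrary downstream values so the loss actually depends on the probabilities.
    values = jnp.arange(9.0).reshape(3, 3)
    return jnp.sum(probs * values)

weights = jnp.full((3, 3), 2.0)
mask = jnp.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])

# Every position where mask == 0 still receives a zero gradient,
# because jnp.where cuts the dependence on `weights` there.
grads = jax.grad(masked_softmax_loss)(weights, mask)
print(grads)
```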
-
It seems that, in the Flax implementation, attention matrices are masked for both queries and keys simultaneously, and masked positions are set to jnp.finfo(dtype).min rather than -jnp.inf:

flax/flax/linen/attention.py, line 104 in 10a2123

If both queries and keys are masked at the same time and the masked positions are set to -jnp.inf, a query with masked positions produces rows consisting entirely of -jnp.inf, which breaks the softmax because the denominator becomes zero. I guess that is why masked positions are set to jnp.finfo(dtype).min instead of -jnp.inf. However, this lets gradients flow into the masked positions, which is generally considered undesirable. Also, we could set the masked positions to -jnp.inf if attention masks were built only for keys. I wonder what led Flax to adopt this way of masking attention matrices for both queries and keys at the same time.
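For reference, the softmax behaviour described above is easy to reproduce in isolation. This is a small check of my own (not taken from the Flax source) of what happens to a fully masked row under the two fill values:

```python
import jax
import jax.numpy as jnp

# A row in which every position is masked, e.g. a query that is entirely padding.
all_inf = jnp.full((4,), -jnp.inf)
print(jax.nn.softmax(all_inf))  # [nan nan nan nan]: the normalizer degenerates (0/0)

all_min = jnp.full((4,), jnp.finfo(jnp.float32).min)
print(jax.nn.softmax(all_min))  # [0.25 0.25 0.25 0.25]: finite, but uniform attention over the row
```

So filling with jnp.finfo(dtype).min keeps fully masked query rows numerically well defined, at the cost of giving them a uniform attention distribution, which presumably has to be ignored downstream.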