Help me understand how to use the flax.linen.make_attention_mask function #3163
Replies: 2 comments
-
Hi @davidshen84,
The masking function masks the attention weights in QK^T depending on which tokens in the input sequence are masked (see https://lukesalamone.github.io/posts/what-are-attention-masks/ for a brief review). The following code (also available as a Colab notebook: https://colab.research.google.com/drive/1TS6A7y2ALgeqDLWnlGtDKKHY-t-13DKW?usp=sharing) may provide clarification based on the example you provided.
Code example:
import chex
import jax
import jax.numpy as jnp
import flax.linen as nn
# check masking function
# here the last element of the sequence is masked
# resulting mask outlines masking of attention weights QK^T
q = jnp.array([1.0, 1.0, 1.0, 0.0])
mask = nn.make_attention_mask(q>0, q>0) # note we are using self attention
print(mask)
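# (my reading of make_attention_mask, worth verifying: it adds a leading
# broadcast axis for the attention heads, so the printed mask should have
# shape (1, 4, 4), with the last row and column zeroed out because the
# fourth token is masked)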
### example demonstrating the difference between masking and not masking ###
# initialise random key and dummy data
key = jax.random.PRNGKey(0)
dummy_data = jax.random.randint(key, (5, 10, 10), 0, 100)
# initialise attention layer
attention_layer = nn.SelfAttention(num_heads=1)
params = attention_layer.init(key, dummy_data)
# create attention mask
q = jnp.ones([5, 10]).at[:, 1:].set(0) # mask all tokens except the first in each sequence
mask = nn.make_attention_mask(q>0, q>0)
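# (assumption to verify: this mask should have shape (5, 1, 10, 10), i.e.
# (batch, broadcast-over-heads, query_length, key_length))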
# check equality
chex.assert_trees_all_close(
    attention_layer.apply(params, inputs_q=dummy_data, mask=mask),
    attention_layer.apply(params, inputs_q=dummy_data, mask=None),
)
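To make the effect of the mask concrete, here is a rough sketch of how a mask of shape (batch, 1, q_len, kv_len) is typically applied inside scaled dot-product attention: masked positions have their attention logits replaced by a very large negative number before the softmax. This is only an illustration of the general mechanism, not Flax's exact implementation.
import jax
import jax.numpy as jnp

def masked_attention_weights(query, key, mask):
    """Sketch of masked scaled dot-product attention weights.

    query, key: (batch, seq_len, features)
    mask: (batch, 1, seq_len, seq_len), 1.0 where attention is allowed, 0.0 otherwise.
    """
    depth = query.shape[-1]
    # attention logits QK^T, one (seq_len, seq_len) matrix per batch element
    logits = jnp.einsum('bqd,bkd->bqk', query, key) / jnp.sqrt(depth)
    # add a head axis so the mask broadcasts over it, then replace masked
    # logits with a very large negative number before the softmax
    logits = logits[:, None, :, :]
    big_neg = jnp.finfo(logits.dtype).min
    logits = jnp.where(mask > 0, logits, big_neg)
    # softmax over the key dimension; masked positions get ~zero weight
    return jax.nn.softmax(logits, axis=-1)
Flax's attention modules do the equivalent masking internally when you pass mask=..., so in practice you only need to build the mask with make_attention_mask and hand it over.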
In your code there are some errors:
m.apply(params, inputs_q, inputs_q, mask, mask)
The SelfAttention module is a special case of MultiHeadDotProductAttention and hence takes just one inputs_q parameter, so the call above isn't executable as written. In addition, your input embeddings and the inputs you pass to make_attention_mask have the same dimensions; these need to be revised, as the dimensions should be different. In particular, make_attention_mask doesn't expect an embedding, it expects token-level arrays of shape (batch, sequence_length). If I can provide further explanation, let me know. Hope this helps.
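For illustration, a corrected version of that call might look roughly like the sketch below; the names embeddings and padding are hypothetical placeholders for your own data, so treat this as a sketch rather than a drop-in fix.
import jax
import jax.numpy as jnp
import flax.linen as nn

key = jax.random.PRNGKey(0)

# hypothetical inputs: embeddings are (batch, seq_len, features);
# padding marks real tokens with 1 and padded tokens with 0, shape (batch, seq_len)
embeddings = jax.random.normal(key, (5, 10, 16))
padding = jnp.ones((5, 10)).at[:, 7:].set(0)

m = nn.SelfAttention(num_heads=1)
params = m.init(key, embeddings)

# the mask is built from the (batch, seq_len) padding indicator, not from the embeddings
mask = nn.make_attention_mask(padding > 0, padding > 0)  # (batch, 1, seq_len, seq_len)

# SelfAttention takes a single inputs_q argument plus the mask
out = m.apply(params, inputs_q=embeddings, mask=mask)
print(out.shape)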
-
Thanks a lot!
-
Hi,
I am trying to build a SelfAttention layer with a mask parameter. I think I should use the make_attention_mask function to create the mask variable. I created a trivial example to help me understand the matrix shapes, but the output is not what I expected. No matter how I set the mask in the q variable, the outputs of the last two lines are always the same. It looks like the mask parameter has no effect. I was expecting one column of the SelfAttention layer's output to be set to negative infinity.