Attention Plots when n_layers != n_heads

Hi,

I was using the ViT-Huge model (`google/vit-huge-patch14-224-in21k`).

After running the following code:
```
model = HookedViT.from_pretrained(model_name='google/vit-huge-patch14-224-in21k', is_timm=False)
```

and
```
output, cache = model.run_with_cache(images)
all_attentions = [cache['pattern', head_idx] for head_idx in range(num_heads)]
BATCH_IDX = 0
all_attentions = [attn[BATCH_IDX] for attn in all_attentions]
all_attentions = torch.cat(all_attentions, dim=0)
all_attentions.shape # n_heads x n_patches x n_patches
```

This gives `all_attentions.shape[0] == 256`, but the model has `n_heads = 16` and `n_layers = 32`. If we follow the plot in the Interactive Attention Head Tour tutorial, shouldn't the first dimension be 16 * 32 = 512 rather than 16 * 16 = 256?

This breaks the code in `plot_attn_heads`, because it iterates over
```
for i in range(n_layers*n_heads):
    data = total_activations[i, :, :]
```
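
To make the shape arithmetic concrete, here is a minimal sketch that uses random tensors instead of the real model. It assumes that `cache['pattern', i]` is indexed by layer and returns a tensor of shape `(batch, n_heads, n_tokens, n_tokens)`. I have not confirmed that against the library. `fake_cache` and `collect` are hypothetical names used only for illustration.

```python
import torch

n_layers, n_heads, n_tokens = 32, 16, 5  # n_tokens kept tiny for the demo

# Hypothetical stand-in for the cache (assumption: one pattern tensor per
# layer, each of shape (batch, n_heads, n_tokens, n_tokens)).
fake_cache = {
    ("pattern", layer): torch.rand(1, n_heads, n_tokens, n_tokens)
    for layer in range(n_layers)
}

def collect(num_entries, batch_idx=0):
    """Concatenate the patterns of the first `num_entries` layers along dim 0."""
    per_layer = [fake_cache["pattern", i][batch_idx] for i in range(num_entries)]
    return torch.cat(per_layer, dim=0)

looped_over_heads = collect(n_heads)    # same loop bound as my snippet above
looped_over_layers = collect(n_layers)  # loop bound that covers every layer

print(looped_over_heads.shape[0])   # 256 = 16 * 16
print(looped_over_layers.shape[0])  # 512 = 32 * 16
```

Under that assumption, looping over `range(num_heads)` only picks up the first 16 layers, which would explain the 256, while looping over `range(n_layers)` gives the 512 entries that `plot_attn_heads` indexes into.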

I also wanted to know how the `run_with_cache` approach differs from just using something like:
```
output = model(pixel_values=inputs, output_attentions=True)
att_mat = output.attentions
att_mat = torch.stack(att_mat).squeeze(1)
```
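
For reference, this is how I understand the shapes in that second snippet, sketched with random tensors in place of a real forward pass. In Hugging Face models, `output.attentions` is a tuple with one tensor per layer, each of shape `(batch, n_heads, n_tokens, n_tokens)`. `fake_attentions` and `flat` are hypothetical names, and the final reshape is my own addition to get the layout that a loop over `n_layers * n_heads` would need.

```python
import torch

n_layers, n_heads, n_tokens = 32, 16, 5  # n_tokens kept tiny for the demo

# Stand-in for `output.attentions`: one tensor per layer, batch size 1.
fake_attentions = tuple(
    torch.rand(1, n_heads, n_tokens, n_tokens) for _ in range(n_layers)
)

att_mat = torch.stack(fake_attentions)  # (n_layers, 1, n_heads, n_tokens, n_tokens)
att_mat = att_mat.squeeze(1)            # drops the batch axis only when batch == 1

# Flatten layers and heads into one axis: index = layer * n_heads + head.
flat = att_mat.reshape(n_layers * n_heads, n_tokens, n_tokens)

print(att_mat.shape)  # torch.Size([32, 16, 5, 5])
print(flat.shape)     # torch.Size([512, 5, 5])
```

Note that `squeeze(1)` silently leaves the batch axis in place when more than one image is passed, so indexing a specific batch element may be safer than squeezing.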
