Hi,
I was using the ViT-Huge model (google/vit-huge-patch14-224-in21k). After loading it with the following code:

```python
model = HookedViT.from_pretrained(model_name='google/vit-huge-patch14-224-in21k', is_timm=False)
```

and then running:

```python
output, cache = model.run_with_cache(images)
all_attentions = [cache['pattern', head_idx] for head_idx in range(num_heads)]
BATCH_IDX = 0
all_attentions = [attn[BATCH_IDX] for attn in all_attentions]
all_attentions = torch.cat(all_attentions, dim=0)
all_attentions.shape  # n_heads x n_patches x n_patches
```
this makes `all_attentions` have `shape[0] == 256`. But n_heads=16 and n_layers=32, so shouldn't the first dimension be 16 * 32 = 512 rather than 16 * 16 = 256, if we follow the plot in the Interactive Attention Head Tour tutorial?
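For reference, printing the shape of a single cache entry makes the mismatch easier to see. This is a minimal sketch reusing `cache` and `BATCH_IDX` from the snippet above, assuming the same `('pattern', idx)` indexing:

```python
# Minimal sketch: inspect one cached attention-pattern entry directly,
# reusing `cache` and `BATCH_IDX` from the snippet above.
single = cache['pattern', 0]
print(single.shape)             # full cached tensor for index 0
print(single[BATCH_IDX].shape)  # one element of all_attentions before torch.cat
```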
The resulting 256-row tensor breaks the code in plot_attn_heads, because it iterates over:

```python
for i in range(n_layers * n_heads):
    data = total_activations[i, :, :]
```
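Concretely, with only 256 rows the loop goes out of range partway through. Here is a hypothetical standalone repro (placeholder sizes, not the library's code):

```python
import torch

# Hypothetical standalone repro of the failure: a 256-row tensor indexed
# as if it had n_layers * n_heads = 32 * 16 = 512 rows.
n_layers, n_heads, n_patches = 32, 16, 4  # n_patches is just a placeholder here
total_activations = torch.zeros(256, n_patches, n_patches)  # 256 rows, matching what I get
for i in range(n_layers * n_heads):
    data = total_activations[i, :, :]  # IndexError once i reaches 256
```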
I also wanted to know how the above differs from just using something like:

```python
output = model(pixel_values=inputs, output_attentions=True)
att_mat = output.attentions
att_mat = torch.stack(att_mat).squeeze(1)
```
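For context, this is the comparison I have in mind. It is a minimal sketch assuming the second approach uses the plain `transformers` `ViTModel` for the same checkpoint, with a dummy input standing in for my real batch:

```python
import torch
from transformers import ViTModel

# Sketch for comparison, using the plain Hugging Face model for the same checkpoint.
# outputs.attentions is a tuple of n_layers tensors, each [batch, n_heads, seq_len, seq_len].
hf_model = ViTModel.from_pretrained('google/vit-huge-patch14-224-in21k')
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image, stand-in for my real inputs
outputs = hf_model(pixel_values=pixel_values, output_attentions=True)
att_mat = torch.stack(outputs.attentions).squeeze(1)  # [n_layers, n_heads, seq_len, seq_len] when batch size is 1
print(att_mat.shape)
```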