
Suspected incorrect pixel_tokens_with_pos_embed reshaping in dinov2_with_windowed_attn.py with non-standard input sizes #398

@mattias-pp

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

I am fine-tuning RF-DETR with custom dataset transforms since I have some very specific requirements regarding image sizes. I ran into the following error:

│                                                                                                  │
│    312 │   │   │   num_w_patches_per_window = num_w_patches // self.config.num_windows           │
│    313 │   │   │   num_h_patches_per_window = num_h_patches // self.config.num_windows           │
│    314 │   │   │   num_windows = self.config.num_windows                                         │
│ ❱  315 │   │   │   windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(batch_size * num  │
│    316 │   │   │   windowed_pixel_tokens = windowed_pixel_tokens.permute(0, 2, 1, 3, 4)          │
│    317 │   │   │   windowed_pixel_tokens = windowed_pixel_tokens.reshape(batch_size * num_windo  │
│    318 │   │   │   windowed_cls_token_with_pos_embed = cls_token_with_pos_embed.repeat(num_wind  │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │                           _ = 3                                                              │ │
│ │                  batch_size = 4                                                              │ │
│ │             bool_masked_pos = None                                                           │ │
│ │    cls_token_with_pos_embed = <Tensor shape=(4, 1, 384), dtype=torch.float32, device=cuda:0> │ │
│ │                  cls_tokens = <Tensor shape=(4, 1, 384), dtype=torch.float32, device=cuda:0> │ │
│ │                  embeddings = <Tensor shape=(4, 4161, 384), dtype=torch.float32,             │ │
│ │                               device=cuda:0>                                                 │ │
│ │                      height = 832                                                            │ │
│ │               num_h_patches = 52                                                             │ │
│ │    num_h_patches_per_window = 26                                                             │ │
│ │               num_w_patches = 80                                                             │ │
│ │    num_w_patches_per_window = 40                                                             │ │
│ │                 num_windows = 2                                                              │ │
│ │ pixel_tokens_with_pos_embed = <Tensor shape=(4, 52, 80, 384), dtype=torch.float32,           │ │
│ │                               device=cuda:0>                                                 │ │
│ │                pixel_values = <Tensor shape=(4, 3, 832, 1280), dtype=torch.float32,          │ │
│ │                               device=cuda:0>                                                 │ │
│ │                        self = WindowedDinov2WithRegistersEmbeddings(                         │ │
│ │                                 (patch_embeddings): Dinov2WithRegistersPatchEmbeddings(      │ │
│ │                               │   (projection): Conv2d(3, 384, kernel_size=(16, 16),         │ │
│ │                               stride=(16, 16))                                               │ │
│ │                                 )                                                            │ │
│ │                                 (dropout): Dropout(p=0.0, inplace=False)                     │ │
│ │                               )                                                              │ │
│ │                target_dtype = torch.float32                                                  │ │
│ │                       width = 1280                                                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[8, 26, 2, 26, -1]' is invalid for input of size 6389760

I am not 100% sure this is actually a bug, since my input sizes might be unsupported. I don't think that is the case, though: if line 315 used both num_h_patches_per_window and num_w_patches_per_window, instead of num_h_patches_per_window twice, this reshape would succeed.

To sum up, my suspicion is that line 315 should be:

windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(batch_size * num_windows, num_h_patches_per_window, num_windows, num_w_patches_per_window, -1)

This effectively fixes the issue as far as I can tell.
Let me know if I'm wrong!
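
For reference, here is a minimal standalone sketch of the arithmetic, using the shapes from the traceback locals above (the variable names mirror the locals, not the actual module code):

import torch

# Shapes taken from the traceback locals above.
batch_size, num_h_patches, num_w_patches, dim = 4, 52, 80, 384
num_windows = 2
num_h_patches_per_window = num_h_patches // num_windows  # 26
num_w_patches_per_window = num_w_patches // num_windows  # 40

pixel_tokens_with_pos_embed = torch.randn(batch_size, num_h_patches, num_w_patches, dim)

# Current line 315 uses num_h_patches_per_window for both spatial axes.
# The known dimensions multiply to 8 * 26 * 2 * 26 = 10816, which does not
# divide 4 * 52 * 80 * 384 = 6389760, hence the RuntimeError on non-square inputs.
try:
    pixel_tokens_with_pos_embed.reshape(
        batch_size * num_windows, num_h_patches_per_window, num_windows, num_h_patches_per_window, -1
    )
except RuntimeError as e:
    print(e)  # shape '[8, 26, 2, 26, -1]' is invalid for input of size 6389760

# Proposed fix: use num_w_patches_per_window on the width axis.
# 8 * 26 * 2 * 40 = 16640 divides 6389760 evenly, leaving the channel dim 384.
windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(
    batch_size * num_windows, num_h_patches_per_window, num_windows, num_w_patches_per_window, -1
)
print(windowed_pixel_tokens.shape)  # torch.Size([8, 26, 2, 40, 384])

For square inputs the two per-window counts coincide, which is presumably why the current code works with the default resolutions.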

Environment

  • RF-DETR: 1.3.0
  • Ubuntu 24.04.3 LTS
  • RTX 3090
  • CUDA Version: 12.9
  • Python 3.12.7

Minimal Reproducible Example

import torch
from rfdetr import RFDETRNano

model = RFDETRNano().model.model.to("cpu")
# Non-square input: 832 x 1280 yields 52 x 80 patches, which triggers the windowed reshape error.
ins = torch.randn(4, 3, 832, 1280)
outs = model(ins)

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
