
Suspected incorrect pixel_tokens_with_pos_embed reshaping in dinov2_with_windowed_attn.py with non-standard input sizes #398

@mattias-pp

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

I am fine-tuning RF-DETR with custom dataset transforms since I have some very specific requirements regarding image sizes. I ran into the following error:

│                                                                                                  │
│    312 │   │   │   num_w_patches_per_window = num_w_patches // self.config.num_windows           │
│    313 │   │   │   num_h_patches_per_window = num_h_patches // self.config.num_windows           │
│    314 │   │   │   num_windows = self.config.num_windows                                         │
│ ❱  315 │   │   │   windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(batch_size * num  │
│    316 │   │   │   windowed_pixel_tokens = windowed_pixel_tokens.permute(0, 2, 1, 3, 4)          │
│    317 │   │   │   windowed_pixel_tokens = windowed_pixel_tokens.reshape(batch_size * num_windo  │
│    318 │   │   │   windowed_cls_token_with_pos_embed = cls_token_with_pos_embed.repeat(num_wind  │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │                           _ = 3                                                              │ │
│ │                  batch_size = 4                                                              │ │
│ │             bool_masked_pos = None                                                           │ │
│ │    cls_token_with_pos_embed = <Tensor shape=(4, 1, 384), dtype=torch.float32, device=cuda:0> │ │
│ │                  cls_tokens = <Tensor shape=(4, 1, 384), dtype=torch.float32, device=cuda:0> │ │
│ │                  embeddings = <Tensor shape=(4, 4161, 384), dtype=torch.float32,             │ │
│ │                               device=cuda:0>                                                 │ │
│ │                      height = 832                                                            │ │
│ │               num_h_patches = 52                                                             │ │
│ │    num_h_patches_per_window = 26                                                             │ │
│ │               num_w_patches = 80                                                             │ │
│ │    num_w_patches_per_window = 40                                                             │ │
│ │                 num_windows = 2                                                              │ │
│ │ pixel_tokens_with_pos_embed = <Tensor shape=(4, 52, 80, 384), dtype=torch.float32,           │ │
│ │                               device=cuda:0>                                                 │ │
│ │                pixel_values = <Tensor shape=(4, 3, 832, 1280), dtype=torch.float32,          │ │
│ │                               device=cuda:0>                                                 │ │
│ │                        self = WindowedDinov2WithRegistersEmbeddings(                         │ │
│ │                                 (patch_embeddings): Dinov2WithRegistersPatchEmbeddings(      │ │
│ │                               │   (projection): Conv2d(3, 384, kernel_size=(16, 16),         │ │
│ │                               stride=(16, 16))                                               │ │
│ │                                 )                                                            │ │
│ │                                 (dropout): Dropout(p=0.0, inplace=False)                     │ │
│ │                               )                                                              │ │
│ │                target_dtype = torch.float32                                                  │ │
│ │                       width = 1280                                                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[8, 26, 2, 26, -1]' is invalid for input of size 6389760

I am not 100% sure this is actually a bug, since my input sizes might be unsupported. I don't think that is the case, though: if line 315 used both num_h_patches_per_window and num_w_patches_per_window, instead of num_h_patches_per_window twice, this reshape would succeed.

To sum up, my suspicion is that line 315 should be:

windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(batch_size * num_windows, num_h_patches_per_window, num_windows, num_w_patches_per_window, -1)

This effectively fixes the issue as far as I can tell.
Let me know if I'm wrong!
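
For reference, here is a minimal standalone sketch of the arithmetic, using the shapes from the traceback locals above (the variable names mirror the locals, not the actual module code):

import torch

# Shapes taken from the traceback locals above.
batch_size, num_h_patches, num_w_patches, dim = 4, 52, 80, 384
num_windows = 2
num_h_patches_per_window = num_h_patches // num_windows  # 26
num_w_patches_per_window = num_w_patches // num_windows  # 40

pixel_tokens_with_pos_embed = torch.randn(batch_size, num_h_patches, num_w_patches, dim)

# Current line 315 uses num_h_patches_per_window for both spatial axes.
# The known dimensions multiply to 8 * 26 * 2 * 26 = 10816, which does not
# divide 4 * 52 * 80 * 384 = 6389760, hence the RuntimeError on non-square inputs.
try:
    pixel_tokens_with_pos_embed.reshape(
        batch_size * num_windows, num_h_patches_per_window, num_windows, num_h_patches_per_window, -1
    )
except RuntimeError as e:
    print(e)  # shape '[8, 26, 2, 26, -1]' is invalid for input of size 6389760

# Proposed fix: use num_w_patches_per_window on the width axis.
# 8 * 26 * 2 * 40 = 16640 divides 6389760 evenly, leaving the channel dim 384.
windowed_pixel_tokens = pixel_tokens_with_pos_embed.reshape(
    batch_size * num_windows, num_h_patches_per_window, num_windows, num_w_patches_per_window, -1
)
print(windowed_pixel_tokens.shape)  # torch.Size([8, 26, 2, 40, 384])

For square inputs the two per-window counts coincide, which is presumably why the current code works with the default resolutions.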

Environment

  • RF-DETR: 1.3.0
  • Ubuntu 24.04.3 LTS
  • RTX 3090
  • CUDA Version: 12.9
  • Python 3.12.7

Minimal Reproducible Example

import torch
from rfdetr import RFDETRNano

model = RFDETRNano().model.model.to("cpu")
# Non-square input: 832 x 1280 yields 52 x 80 patches, which triggers the windowed reshape error.
ins = torch.randn(4, 3, 832, 1280)
outs = model(ins)

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
