Using MaxViT as a backbone for object detection #2430

osvaldoNavarroFaro · 2025-01-30T11:17:24Z


Hello, I am trying to use MaxViT ('maxvit_base_tf_512.in1k') as a backbone for object detection with a FasterRCNN detector. I'm basically following the approach shown here: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#modifying-the-model-to-add-a-different-backbone.

However, when I test the model's inference on some example images, I get the error message: AssertionError: height (200) must be divisible by window (16). It comes from the MaxViT forward function, but I don't understand where height=200 comes from, since my images are 512x512.

Can anybody shed some light on this issue? Where is this height=200 coming from? More generally, can someone point me to an example of using MaxViT as a backbone for object detection?

Here's my code:

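import timm
import torch
import torchvision  # needed for torchvision.ops.MultiScaleRoIAlign below
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# features_only=True makes the timm model return its intermediate feature maps
backbone = timm.create_model('maxvit_base_tf_512.in1k', features_only=True, pretrained=True)
# FasterRCNN reads the channel count from backbone.out_channels
backbone.out_channels = backbone.feature_info.channels()[-1]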

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),
    aspect_ratios=((0.5, 1.0, 2.0),)
)

roi_pooler = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0'],
    output_size=7,
    sampling_ratio=2
)

model = FasterRCNN(
    backbone,
    num_classes=91,
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler
)

model.eval()
with torch.no_grad():
    predictions = model([image.to(device) for image in images]) # 512x512 images coming from PennFudanPed dataset

rwightman · 2025-01-30T15:59:39Z

Maintainer

I don't know what's going on inside your combined model, but the backbone itself works fine when it gets a 512x512 image, so it seems the images reaching it aren't actually 512. A height of 200 would suggest an input of around 800 if the check fails after the first stage, or around 400 if it fails right after the stem. Otherwise, something within FasterRCNN is altering them.

import torch
import timm

# Run the backbone on its own with a dummy batch of two 512x512 images
backbone = timm.create_model('maxvit_base_tf_512.in1k', features_only=True, pretrained=True)
out = backbone(torch.randn(2, 3, 512, 512))
for o in out:
    print(o.shape)

Output:
torch.Size([2, 64, 256, 256])
torch.Size([2, 96, 128, 128])
torch.Size([2, 192, 64, 64])
torch.Size([2, 384, 32, 32])
torch.Size([2, 768, 16, 16])
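
To make the stride reasoning above concrete, here is a minimal sketch in plain Python. The strides are simply read off the printed shapes for a 512-pixel input; the helper name `candidate_input_sizes` is hypothetical and not part of timm.

```python
# Strides inferred from the shapes printed above: 512 / 256 = 2, 512 / 128 = 4, ...
INPUT_SIZE = 512
FEATURE_SIZES = [256, 128, 64, 32, 16]
STRIDES = [INPUT_SIZE // size for size in FEATURE_SIZES]  # [2, 4, 8, 16, 32]


def candidate_input_sizes(feature_height, strides):
    """Return the input height that would yield `feature_height` at each stride."""
    return {stride: feature_height * stride for stride in strides}


# The assertion reported height 200. The two shallowest strides give the
# ~400 (right after the stem) and ~800 (after the first stage) figures.
candidates = candidate_input_sizes(200, STRIDES[:2])
print(candidates)  # {2: 400, 4: 800}
```

Neither candidate is 512, which is consistent with the images being resized somewhere before they reach the backbone.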

2 replies

rwightman Jan 30, 2025
Maintainer

I took a quick look while waiting for a result. The FasterRCNN wrapper applies its own transforms, so you have to change those settings: https://github.com/pytorch/vision/blob/0d68c7df8640abff43355afd57c494cf5d74f4a9/torchvision/models/detection/faster_rcnn.py#L171-L175

Answer selected by osvaldoNavarroFaro

osvaldoNavarroFaro Feb 6, 2025
Author

Hi @rwightman, thank you for the quick reply. You were right: after changing the min_size parameter of FasterRCNN to 512, it no longer throws the error.
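
For anyone landing here later, the arithmetic behind the fix can be sketched in plain Python. This assumes torchvision's FasterRCNN defaults of min_size=800 and max_size=1333, and that the resize step scales the image so its shorter side reaches min_size without the longer side exceeding max_size. `resized_hw` and `first_stage_height` are hypothetical helpers that only approximate that behaviour (padding to a multiple of 32 during batching is ignored); they are not torchvision code. The stride of 4 and the window of 16 are taken from the shapes and the error message earlier in the thread.

```python
import math


def resized_hw(height, width, min_size=800, max_size=1333):
    """Approximate the (height, width) after the detector's resize step."""
    scale = min(min_size / min(height, width), max_size / max(height, width))
    return math.floor(height * scale), math.floor(width * scale)


def first_stage_height(height, width, stride=4, **resize_kwargs):
    """Height of the stride-4 feature map computed from the resized image."""
    new_h, _ = resized_hw(height, width, **resize_kwargs)
    return new_h // stride


WINDOW = 16  # window size quoted in the assertion message

default_h = first_stage_height(512, 512)              # resized to 800, then 800 // 4
fixed_h = first_stage_height(512, 512, min_size=512)  # stays 512, then 512 // 4

print(default_h, default_h % WINDOW == 0)  # 200 False
print(fixed_h, fixed_h % WINDOW == 0)      # 128 True
```

Under the same assumptions, non-square inputs may still be resized to shapes that are not divisible by the window, so this sketch only covers the square 512x512 case discussed here.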
