Add BEiT3 #2489

Open: brianhou0208 wants to merge 4 commits into main

Conversation

brianhou0208 (Contributor) commented:

BEiT-3 (CVPR 2023) is a multimodal model. Although it does not stand out on ImageNet, it achieves impressive results in other domains, and thanks to its large-scale pretraining it delivers strong performance on downstream tasks.


Model Issue & Request

Results (ImageNet)

https://github.com/microsoft/unilm/tree/master/beit3#fine-tuning-on-imagenet-1k-image-classification

| Model | Weight | Acc@1 | Acc@5 | FLOPs (G) | MACs (G) | Params (M) |
|---|---|---|---|---|---|---|
| beit3_base_patch16_224 | in22k_ft_in1k | 85.370 | 97.640 | 35.13 | 15.57 | 86.66 |
| beit3_base_patch16_224 | in22k_indomain_ft_in1k | 85.446 | 97.616 | | | |
| beit3_large_patch16_224 | in22k_ft_in1k | 87.624 | 98.332 | 123.12 | 61.56 | 304.57 |
| beit3_large_patch16_224 | in22k_indomain_ft_in1k | 87.538 | 98.362 | | | |
| beit3_giant_patch14_224 | | | | 534.09 | 267.05 | 1000.1 |
| beit3_giant_patch14_336 | | | | 1240.70 | 620.35 | 1000.1 |

Note

The performance reported in the paper is based on the Giant model, and the authors do not plan to release its weights (see microsoft/unilm#1031, microsoft/unilm#1382, microsoft/unilm#1435).
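
Once the models are registered in timm, a quick smoke test of the checkpoints could look like the minimal sketch below (model names taken from the table above); the full evaluation script follows.

```python
# Minimal sketch: load one BEiT3 classifier through timm and check the output shape.
# Assumes the beit3_* names from the table above are registered once this PR is merged.
import timm
import torch

model = timm.create_model('beit3_base_patch16_224', pretrained=True).eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))

print(logits.shape)  # expected: torch.Size([1, 1000]) for the ImageNet-1k head
```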


Test code:

```python
from typing import Any, Dict, Union, List
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import timm
from timm.utils.metrics import AverageMeter, accuracy

# run evaluation on Apple Silicon GPU via the MPS backend
device = torch.device('mps')
torch.mps.empty_cache()

def auto_unit(x: float, unit: str = '') -> str:
    """Format a raw count with an auto-scaled G/M/K suffix."""
    if x >= 1e9:
        return f"{x / 1e9:.2f}G {unit}"
    elif x >= 1e6:
        return f"{x / 1e6:.2f}M {unit}"
    elif x >= 1e3:
        return f"{x / 1e3:.2f}K {unit}"
    else:
        return f"{x:.2f} {unit}"
 
 
def get_model_info(model: torch.nn.Module, imgsz: Union[int, List[int]] = 224) -> Dict[str, str]:
    """
    Compute model FLOPs, MACs, and Params using torch profiler.

    Args:
        model (nn.Module): The model to calculate for.
        imgsz (int | List[int], optional): Input image size. Defaults to 224.

    Returns:
        dict: Dictionary containing FLOPs, MACs, and Params with auto units.
    """
    p = next(model.parameters())
    if not isinstance(imgsz, list):
        imgsz = [imgsz, imgsz]

    im = torch.empty((1, 3, *imgsz), device=p.device)

    with torch.profiler.profile(with_flops=True) as prof:
        model(im)

    flops = sum(e.flops for e in prof.key_averages())
    macs = flops / 2  # the profiler counts one multiply-accumulate as two FLOPs
    params = sum(p.numel() for p in model.parameters())

    return {
        "FLOPs": auto_unit(flops, ""),
        "MACs": auto_unit(macs, ""),
        "Params": auto_unit(params, ""),
    }


def get_model_acc(model: torch.nn.Module):
    """Evaluate top-1 / top-5 ImageNet accuracy using the model's default_cfg preprocessing."""
    cfg: Dict[str, Any] = model.default_cfg
    _, height, width = cfg['input_size'] if 'test_input_size' not in cfg else cfg['test_input_size']
    imgsz = height if height == width else (height, width)

    # map timm interpolation names to torchvision/PIL resampling codes
    interp_mode = {
        "nearest": 0,
        "bilinear": 2,
        "bicubic": 3,
    }

    val_dataset = datasets.ImageFolder(
        '/Users/ryanhou/Downloads/imagenet/val',  # local path to the ImageNet-1k validation split
        transforms.Compose([
            transforms.Resize(int(imgsz / cfg['crop_pct']), interpolation=interp_mode[cfg['interpolation']]),
            transforms.CenterCrop(imgsz),
            transforms.ToTensor(),
            transforms.Normalize(cfg['mean'], cfg['std'])])
    )
    val_loader = DataLoader(
        val_dataset, batch_size=64, shuffle=False, pin_memory=False, prefetch_factor=4, num_workers=4,
        persistent_workers=True#, pin_memory_device='mps'
    )

    top1 = AverageMeter()
    top5 = AverageMeter()

    model.eval()
    model.to(device)
    torch.mps.synchronize()
    with torch.no_grad():
        for images, target in tqdm(val_loader):
            images = images.to(device)
            target = target.to(device)
            output = model(images)
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1, images.size(0))
            top5.update(acc5, images.size(0))
    torch.mps.synchronize()
    return {"ACC@1": round(top1.avg.item(), 4), "ACC@5": round(top5.avg.item(), 4)}
 
 
if __name__ == "__main__":

    for name in timm.list_models('beit3*', pretrained=True):
        model = timm.create_model(name, pretrained=True).eval()
        result = get_model_acc(model)
        print(name, result)
```
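
Since `get_model_info` is defined above but never called in `__main__`, here is a short usage sketch for reproducing the FLOPs / MACs / Params columns of the table (assuming the function above is in scope and the `beit3_*` models from this PR are registered):

```python
# Sketch: report FLOPs / MACs / Params for each BEiT3 variant at its default input size.
# Assumes get_model_info from the script above is in scope.
import timm

for name in timm.list_models('beit3*'):
    model = timm.create_model(name, pretrained=False).eval()
    imgsz = model.default_cfg['input_size'][-1]  # 224 or 336 depending on the variant
    print(name, get_model_info(model, imgsz=imgsz))
```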

Reference

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
