BEIT-3-Large - Layer fusion

Hi, thanks for your great work exploring BEIT as an alternative to CLIP.

I find the approach very well motivated in the paper, but I am struggling to reproduce the BEIT3 results in my independent training codebase.
So far I can match or surpass the CLIP results, and adding CLIP_Image in Late Concat is beneficial.

However, BEIT3 still underperforms CLIP in my runs, so I'm wondering if I am missing something.

For your BEIT experiments, what do you mean by Late Concat, Early(L1-L12), and Early(L1-L24)? I can't find any reference to these in the code, nor in the beit repo or the torchscale repo. If you could share a code sample, it would really help clarify what each variant does.
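
To make the question concrete, here is a toy sketch of how I currently read the two terms. This is only my guess and is not taken from the paper or from any repo: I assume "Late Concat" means each modality is encoded separately and the pooled features are concatenated at the end, while "Early(L1-Lk)" means the image and text tokens are processed as one joint sequence in layers 1 to k. The names `toy_layer`, `late_concat`, and `early_fusion` are hypothetical, and `toy_layer` is just a stand-in for a real transformer layer.

```python
from typing import List

Vector = List[float]
Tokens = List[Vector]  # a sequence of token embeddings


def mean_pool(tokens: Tokens) -> Vector:
    """Average the token embeddings into a single feature vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def toy_layer(tokens: Tokens) -> Tokens:
    """Stand-in for a transformer layer: mix each token with the sequence mean."""
    pooled = mean_pool(tokens)
    return [[0.5 * x + 0.5 * p for x, p in zip(tok, pooled)] for tok in tokens]


def encode(tokens: Tokens, num_layers: int) -> Tokens:
    """Apply the toy layer num_layers times."""
    for _ in range(num_layers):
        tokens = toy_layer(tokens)
    return tokens


def late_concat(image_tokens: Tokens, text_tokens: Tokens, num_layers: int) -> Vector:
    """Assumed 'Late Concat': modalities never interact until the final concatenation."""
    img = mean_pool(encode(image_tokens, num_layers))
    txt = mean_pool(encode(text_tokens, num_layers))
    return img + txt  # list concatenation, so the output has 2 * dim entries


def early_fusion(image_tokens: Tokens, text_tokens: Tokens,
                 num_layers: int, fused_layers: int) -> Vector:
    """Assumed 'Early(L1-Lk)': the first fused_layers layers see the joint sequence."""
    if not 0 <= fused_layers <= num_layers:
        raise ValueError("fused_layers must be between 0 and num_layers")
    joint = encode(image_tokens + text_tokens, fused_layers)
    n_img = len(image_tokens)
    # The remaining layers run per modality, as in the late variant.
    img = encode(joint[:n_img], num_layers - fused_layers)
    txt = encode(joint[n_img:], num_layers - fused_layers)
    return mean_pool(img) + mean_pool(txt)


image = [[1.0, 0.0], [1.0, 0.0]]
text = [[0.0, 1.0]]
late = late_concat(image, text, num_layers=24)
early_half = early_fusion(image, text, num_layers=24, fused_layers=12)
early_full = early_fusion(image, text, num_layers=24, fused_layers=24)
```

Under this reading, Early(L1-L12) and Early(L1-L24) on a 24-layer model would correspond to `fused_layers=12` and `fused_layers=24`. Is that close to what you did, or is the fusion applied differently?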

Thank you for your time.
