Questions about TokLIP Reconstruction and Generative Capabilities

Hi,
The reconstruction part of TokLIP (codebook and decoder) wasn't trained, so why do VQGAN and TokLIP have different FID scores in Table 4? If it were trained, what would the rFID score be? Also, why weren't generative capabilities included in the multimodal model? Could you explain the reasoning behind this?

<img width="363" alt="Image" src="https://github.com/user-attachments/assets/439b87b6-1d10-4f69-ab12-e58416879a4c" />
