Hi,
The reconstruction part of TokLIP (codebook and decoder) wasn't trained, so why do VQGAN and TokLIP have different FID scores in Table 4? If it were trained, what would the rFID score be? Also, why doesn't the multimodal model include generative capabilities? Could you explain the reasoning behind this?
