
The code implementation does not match the description in the paper. #24

@chenxn2020

Description


@zhangshaolei1998
Hi there, thanks for the great work. While reading the code to reproduce your results, I found that the implementation differs from the description in the paper.
First, the paper describes the pre-fusion module as taking the original visual input $H^{V}$ (as shown in Figure 6 of the paper). However, in llavamini/model/llavamini_arch.py (lines 369 and 410), the compressed visual tokens (named compressed_image_features) are also fed into the pre-fusion module, as shown below:

# line 369
x = torch.cat([global_image_features, compressed_image_features, text_embedding], dim=1)

# line 410
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x, attention_mask=attention_mask, position_ids=position_ids)[0]

Second, the compressed visual tokens (compressed_image_features in your code) are ultimately sliced out of the pre-fusion output $x$, which also does not align with the description in the paper. See llavamini/model/llavamini_arch.py, lines 410-417:

# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x, attention_mask=attention_mask, position_ids=position_ids)[0]

# the last input_ids.size(1) positions of x are taken as the fused text features
fusion_text_features = x[:, -1*input_ids.size(1):, :]
# the compressed image features are sliced from the pre-fusion output x
compressed_image_features = x[:, -1*input_ids.size(1)-1*compressed_image_features.size(1):-1*input_ids.size(1), :]
# restore the original embeddings at padded text positions
fusion_text_features = fusion_text_features*(~padding_mask).unsqueeze(-1).int() + all_text_embedding*padding_mask.unsqueeze(-1)

return compressed_image_features, fusion_text_features
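To make the discrepancy concrete, here is a toy sketch of the two dataflows as I understand them. This is not the actual model code: token sequences are plain lists of labels, and prefusion / compress are hypothetical stand-ins for the real modules, just to show which tensors flow where.

```python
# Hypothetical sketch: paper's described dataflow vs. the repository's.
# Sequences are lists of string labels; prefusion/compress are placeholders.

def prefusion(tokens):
    # Stand-in for the pre-fusion layers: tags each token as "fused".
    return [f"fused({t})" for t in tokens]

def compress(vision_tokens):
    # Stand-in for compression: keep every other visual token.
    return vision_tokens[::2]

H_V = ["v0", "v1", "v2", "v3"]   # original visual tokens H^V
text = ["t0", "t1"]              # text tokens

# Paper (my reading of Figure 6): pre-fusion sees the ORIGINAL visual
# tokens H^V plus text, and compression operates on H^V separately.
paper_fused = prefusion(H_V + text)
paper_compressed = compress(H_V)          # untouched by pre-fusion

# Repository (llavamini_arch.py, lines 369 and 410-417): the compressed
# tokens are concatenated INTO the pre-fusion input, and the final
# compressed features are sliced back out of the pre-fusion output x.
code_compressed_in = compress(H_V)
x = prefusion(H_V + code_compressed_in + text)
n_text, n_comp = len(text), len(code_compressed_in)
code_fused_text = x[-n_text:]
code_compressed_out = x[-n_text - n_comp:-n_text]  # from pre-fusion output
```

Under this sketch, paper_compressed contains raw visual tokens, while code_compressed_out contains tokens that have passed through the pre-fusion layers, which is exactly the mismatch I am asking about.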

Could anyone explain the reason for this inconsistency? Or did I misunderstand something?
