The code implementation does not match the description in the paper. #24
Description
@zhangshaolei1998
Hi there, thanks for the great work. I attempted to reproduce your work. While reading your code, I found that the implementation differs from the description in the paper.
Specifically, first, the paper describes the pre-fusion module as taking the original visual tokens (together with the text tokens) as input. However, in llavamini/model/llavamini_arch.py, at lines 369 and 410, the compressed visual tokens (named compressed_image_features) are also fed into the pre-fusion module, as shown below:
#line 369
x=torch.cat([global_image_features,compressed_image_features,text_embedding],dim=1)
#line 410
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]
Secondly, the compressed visual tokens (named compressed_image_features in your code) ultimately come from the output of the pre-fusion module rather than directly from the compression module; see llavamini/model/llavamini_arch.py, lines 410-417:
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]
fusion_text_features=x[:,-1*input_ids.size(1):,:]
compressed_image_features=x[:,-1*input_ids.size(1)-1*compressed_image_features.size(1):-1*input_ids.size(1),:]
fusion_text_features=fusion_text_features*(~padding_mask).unsqueeze(-1).int()+all_text_embedding*padding_mask.unsqueeze(-1)
return compressed_image_features,fusion_text_features
Could someone explain the reason for this inconsistency? Or did I misunderstand something?
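To make the slicing at lines 413-414 concrete, here is a minimal index-arithmetic sketch in plain Python (the token counts are hypothetical, not taken from the repo). It assumes the sequence layout produced by the torch.cat at line 369, i.e. [global_image_features | compressed_image_features | text_embedding] along the token dimension:

```python
# Hypothetical token counts (not from the repo): 4 global, 2 compressed, 3 text.
G, C, T = 4, 2, 3

# Label each position of the concatenated pre-fusion sequence:
# [global | compressed | text], matching torch.cat(..., dim=1) at line 369.
x = ["global"] * G + ["compressed"] * C + ["text"] * T

# Mirrors fusion_text_features = x[:, -input_ids.size(1):, :]
fusion_text_features = x[-T:]

# Mirrors compressed_image_features = x[:, -T-C:-T, :]
compressed_image_features = x[-T - C:-T]

print(fusion_text_features)       # ['text', 'text', 'text']
print(compressed_image_features)  # ['compressed', 'compressed']
```

If this reading is right, the compressed_image_features returned at line 417 are the middle segment of the pre-fusion output, i.e. the compressed tokens after they have passed through the pre-fusion layers, which is the inconsistency with the paper being asked about.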