-
Notifications
You must be signed in to change notification settings - Fork 26.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LlavaForConditionalGeneration._merge_input_ids_with_image_features is incorrect #33976
Comments
Hey! Thanks for the report. I think this piece of code has been changed quite a lot since initial porting so @zucchini-nlp will have a better idea of what is going on! Note that this is the |
Ah, I hadn't noticed! Good point, and it looks like this code should disappear soon based on the comment |
Hey! Yes, it is worth fixing as I still didn't update our files on the hub and we fall to the legacy logic. I am planning to gradually update files and in the meanwhile we need to be sure the legacy branch is working correctly Interesting that the padding="right" is not correct, as it should have been covered when adding the model. I'll get back to this on Monday/Tuesday, thanks |
Yeah I remember that we had some tests for this! |
Okey, verified that indeed the merging is not correct when we do right padding, and it is really weird noone noticed that earlier. We added handling for padding side only in llava-next because the number of image tokens can be different and needs unpadding. And the other models it was assumed that the following line handles it correctly if left_padding:
new_token_positions += nb_image_pad[:, None] # offset for left padding Also related to #33662 which I didn't had time to look due to absence of clear reproducer. @fpgaminer would you like to open a PR and fix this in all llava models? We might also need a test with dummy inputs and tiny model, to make sure it doesn't happen in the future :) |
System Info
commit: 38f9f10
Who can help?
@amyeroberts @qubvel
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
The function
_merge_input_ids_with_image_features
of classLlavaForConditionalGeneration
gives incorrect results if it has to pad the result (which arises when the batch has different amounts of image tokens in each row).transformers/src/transformers/models/llava/modeling_llava.py
Line 297 in 38f9f10
I created the following reproduction example to show the failure mode simply. Note that it uses a copy of the function to add print statements for debugging, but the code is otherwise an identical copy at HEAD.
Running this gives:
The last set of prints shows the difference between what the function outputs and what is expected, by comparing the different pieces of the result to the original inputs. As can be seen here, the result is almost correct but has misplaced the first image.
The reason is this line:
transformers/src/transformers/models/llava/modeling_llava.py
Line 352 in 38f9f10
It looks like this line is trying to account for padding in the outputs, but assumes the outputs are going to be left padded, whereas the rest of the function dynamically switches between left and right padding depending on the
left_padding
variable. If the outputs are being right padded, the comparison on this line is incorrect. That can be seen in the debug output above whereimage_to_overwrite
shows the image being scattered around the tensor.Importantly, this incorrect behavior can occur for any padding, because
left_padding
is "automatically" detected from the input. Even if the model is setup for left padding, if a particular batch doesn't have any padding going into this function, the function will assume right padding and thus mangle the output.One solution is to add an if statement on that line based on
left_padding
and use a different condition to handle right padding. I think this code (for the right padding case) would work, but I haven't tested it extensively:Side note:
I think the current implementation for handling the left padding case could also be replaced with arange based logic. To me it would be more clear, as it indicates the intention of that line explicitly (set all padded indexes to False). Whereas I found the use of
image_to_overwrite.cumsum(-1)
opaque and difficult to decipher what it was trying to calculate.Expected behavior
See above
The text was updated successfully, but these errors were encountered: