Hi Authors,
In these lines, the function build_labels masks all the label positions up to the sequence length of the embeddings. What difference would it make if one just used the caption?
To be more specific, the code currently builds a label whose first part (with the same sequence length as the image) is set entirely to -100, and whose second part holds the actual text labels. Why do we need all the -100s? Why couldn't we just use the text label ids?
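To make the question concrete, here is a minimal sketch of how I understand the current behavior. It is not the repository's code: the name build_labels_sketch, the number of image positions, and the token ids are all made up for illustration.

```python
IGNORE_INDEX = -100  # value placed in the masked positions, as described above


def build_labels_sketch(num_image_positions, caption_ids):
    """Hypothetical stand-in for build_labels (not the actual implementation).

    Masks the first num_image_positions positions with -100,
    then appends the caption token ids unchanged.
    """
    return [IGNORE_INDEX] * num_image_positions + list(caption_ids)


caption_ids = [11, 12, 13]  # made-up caption token ids
labels = build_labels_sketch(4, caption_ids)  # 4 is an assumed image sequence length

print(labels)                         # masked image part followed by the caption ids
print(len(labels), len(caption_ids))  # the two candidate label sequences differ in length
```

The alternative I am asking about would pass only caption_ids as the labels, without the leading -100 block.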
Thanks a lot!