Integration of Resnet features with Transformer #2

@yfeng997

Description

I am curious about the details of how you integrate the transformer into the captioning framework. As I understand it now, the whole picture works as follows:

  1. For each image, use a ResNet to extract features in C x W x H format, where C is usually 512 and W and H are 7.

  2. Each spatial feature vector (C x 1 x 1) becomes one input to the decoder's dec_enc_att module, so there are W*H inputs in total. By analogy with a text-to-text translation task, each input is like a single word embedding, and each image contributes W*H 'words'.

  3. Each spatial feature (C x 1 x 1) undergoes a linear transformation before being fed to the dec_enc_att module. The dec_enc_att module treats these W*H projected spatial features as both Keys and Values (as defined in the Transformer architecture), and uses the decoder's self-attention outputs as the Queries.
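To make my reading of the steps above concrete, here is a minimal single-head NumPy sketch of that flow. The C=512 and 7x7 dimensions come from step 1; `d_model`, the projection matrix, and the decoder-side queries are hypothetical placeholders, not taken from this repo:

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 512, 7, 7   # ResNet feature map dims, as in step 1
d_model = 256         # hypothetical transformer model dim

# Step 1: stand-in for ResNet backbone output (C x H x W)
feat = rng.standard_normal((C, H, W))

# Step 2: flatten the spatial grid into W*H "word"-like tokens of dim C
tokens = feat.reshape(C, H * W).T          # shape (49, 512)

# Step 3: linear projection; projected tokens serve as both Keys and Values
W_proj = rng.standard_normal((C, d_model)) * 0.02
kv = tokens @ W_proj                       # shape (49, d_model)

# Hypothetical decoder-side queries (e.g. partial-caption embeddings)
T = 5
queries = rng.standard_normal((T, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head dec_enc_att: Q from decoder, K = V = projected spatial tokens
attn = softmax(queries @ kv.T / np.sqrt(d_model))   # shape (T, 49)
out = attn @ kv                                     # shape (T, d_model)

print(tokens.shape, out.shape)  # (49, 512) (5, 256)
```

So each caption position attends over the 49 spatial locations, exactly as a decoder would attend over source words in translation.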

Is the understanding above correct? If not, would you kindly point out any misunderstandings? Thanks @njchoma!
