Integration of Resnet features with Transformer #2

@yfeng997

Description

I am curious about the details of how you integrate the transformer into the captioning framework. As I understand it now, the whole picture works as follows:

  1. For each image, use a ResNet to extract features in C x W x H format, where C is usually 512 and W and H are 7.

  2. Each spatial feature vector (C x 1 x 1) becomes one input to the decoder's dec_enc_att module, so there are W*H inputs in total. By analogy with a text-to-text translation task, each input is like a single word embedding, and each image contributes W*H 'words'.

  3. Each spatial feature (C x 1 x 1) undergoes a linear transformation before being fed to the dec_enc_att module. The dec_enc_att module treats these W*H projected spatial features as both Keys and Values (as defined in the Transformer architecture), and uses the decoder's self-attention outputs as the Queries.
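To make my reading of the steps above concrete, here is a minimal single-head NumPy sketch of that flow. The C=512 and 7x7 dimensions come from step 1; `d_model`, the projection matrix, and the decoder-side queries are hypothetical placeholders, not taken from this repo:

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 512, 7, 7   # ResNet feature map dims, as in step 1
d_model = 256         # hypothetical transformer model dim

# Step 1: stand-in for ResNet backbone output (C x H x W)
feat = rng.standard_normal((C, H, W))

# Step 2: flatten the spatial grid into W*H "word"-like tokens of dim C
tokens = feat.reshape(C, H * W).T          # shape (49, 512)

# Step 3: linear projection; projected tokens serve as both Keys and Values
W_proj = rng.standard_normal((C, d_model)) * 0.02
kv = tokens @ W_proj                       # shape (49, d_model)

# Hypothetical decoder-side queries (e.g. partial-caption embeddings)
T = 5
queries = rng.standard_normal((T, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head dec_enc_att: Q from decoder, K = V = projected spatial tokens
attn = softmax(queries @ kv.T / np.sqrt(d_model))   # shape (T, 49)
out = attn @ kv                                     # shape (T, d_model)

print(tokens.shape, out.shape)  # (49, 512) (5, 256)
```

So each caption position attends over the 49 spatial locations, exactly as a decoder would attend over source words in translation.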

Is the understanding above correct? If not, would you kindly point out any misunderstandings? Thanks @njchoma!
