Dear authors,
Regarding the experimental results in Section 4.2, I noticed that you compare models using SWA with models using the SelfExtend method on the passkey retrieval task. Although SWA limits the attention window at each layer, the LLM has many layers. Even if the last token does not directly attend to the tokens where the passkey is located in the first layer, the information at the passkey position can be picked up by tokens within the SWA window, and this propagation can continue layer by layer until the final layer. Why, then, can't the passkey information reach the token that needs to be generated? I am really curious about this question and look forward to your response!
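
For concreteness, here is a minimal sketch of the receptive-field argument I have in mind (the layer count, window size, and passkey distance below are made-up numbers, not values from the paper):

```python
def swa_receptive_field(num_layers: int, window_size: int) -> int:
    """Theoretical maximum look-back of the last token after stacking
    sliding-window attention layers: each layer lets a token pull in
    information from up to (window_size - 1) earlier positions."""
    return num_layers * (window_size - 1)


def passkey_reachable(passkey_pos: int, last_pos: int,
                      num_layers: int, window_size: int) -> bool:
    """True if, in principle, information at passkey_pos can propagate
    layer by layer into the representation of the token at last_pos."""
    return last_pos - passkey_pos <= swa_receptive_field(num_layers, window_size)


# Hypothetical example: 32 layers, window 4096, passkey 100k tokens back.
print(passkey_reachable(passkey_pos=0, last_pos=100_000,
                        num_layers=32, window_size=4096))  # True
```

Under this counting, the stacked windows seem wide enough in principle to carry the passkey to the generated token, which is why the observed failure of SWA on this task surprised me.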