Dear authors,
Regarding the experimental results in Section 4.2, I noticed that you compare models using SWA with models using the SelfExtend method on the passkey retrieval task. Although SWA limits the attention window at each layer, the LLM has many layers. Even if the last token does not directly attend to the tokens where the passkey is located in the first layer, the information at the passkey position can be picked up by tokens within the SWA window, and this propagation can continue layer by layer until the final layer. Why, then, can't the passkey information reach the token that needs to be generated? I am really curious about this question and look forward to your response!
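
For concreteness, here is a minimal sketch of the receptive-field argument I have in mind (the layer count, window size, and passkey distance below are made-up numbers, not values from the paper):

```python
def swa_receptive_field(num_layers: int, window_size: int) -> int:
    """Theoretical maximum look-back of the last token after stacking
    sliding-window attention layers: each layer lets a token pull in
    information from up to (window_size - 1) earlier positions."""
    return num_layers * (window_size - 1)


def passkey_reachable(passkey_pos: int, last_pos: int,
                      num_layers: int, window_size: int) -> bool:
    """True if, in principle, information at passkey_pos can propagate
    layer by layer into the representation of the token at last_pos."""
    return last_pos - passkey_pos <= swa_receptive_field(num_layers, window_size)


# Hypothetical example: 32 layers, window 4096, passkey 100k tokens back.
print(passkey_reachable(passkey_pos=0, last_pos=100_000,
                        num_layers=32, window_size=4096))  # True
```

Under this counting, the stacked windows seem wide enough in principle to carry the passkey to the generated token, which is why the observed failure of SWA on this task surprised me.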