I'm having some trouble understanding a part of the _normalize_attentions function. Specifically, I'm unsure about the following line of code:
mean_centered = (self.attentions - self.post_ln_mean[:, :, np.newaxis, np.newaxis] / (len_intermediates * normalization_term))
In this context, len_intermediates is set to 47 when _normalize_attentions is called. Could someone explain in detail what this code is doing? In particular, I'm unclear on why we divide by len_intermediates.
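To make the question concrete, here is a minimal sketch of how I read the line. The shapes and the scalar normalization_term are my own placeholders, not values from the actual code. Since division binds tighter than subtraction in Python, only the broadcast post_ln_mean term appears to be divided by len_intermediates * normalization_term, not the whole difference:

```python
import numpy as np

# Hypothetical shapes, chosen only for illustration (not from the real code).
layers, heads, seq = 2, 3, 4
rng = np.random.default_rng(0)
attentions = rng.random((layers, heads, seq, seq))
post_ln_mean = rng.random((layers, heads))
len_intermediates = 47
normalization_term = 1.0  # placeholder; the real value may differ or be an array

# np.newaxis turns the (layers, heads) mean into shape (layers, heads, 1, 1)
# so it broadcasts across the last two axes of attentions.
mean_4d = post_ln_mean[:, :, np.newaxis, np.newaxis]

# The line as written: "/" is evaluated before "-".
as_written = attentions - mean_4d / (len_intermediates * normalization_term)

# Same thing with the precedence made explicit.
explicit = attentions - (mean_4d / (len_intermediates * normalization_term))

# What the line would compute if the whole difference were divided instead.
alternative = (attentions - mean_4d) / (len_intermediates * normalization_term)

print(np.allclose(as_written, explicit))
print(np.allclose(as_written, alternative))
```

If this reading is right, each attention entry has 1/47 of the (scaled) mean subtracted from it, which is the part I would like explained.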