Hi, can you explain how your ATP dynamically scales the transformer's width, i.e. the number of tokens, as stated in your paper? From your source code in core/model/transformer, I think it can only scale the number of transformer layers, while the number of tokens in each sequence stays the same.