Hi.
In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we
set it to 10^3 divided by number of elements in attention map and batch size for each layer. "
But I am still confused. What does 10^3 mean, and how was the value 0.1 obtained?
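
To make the question concrete, here is how I currently read that sentence. This is only my guess, not something the paper confirms: 10^3 is a fixed base constant, and the effective β for a layer is that constant divided by (number of elements in the attention map × batch size). The function name `beta_per_layer` and the sizes below (an 8x8 attention map, batch size 128) are made up by me for illustration.

```python
def beta_per_layer(map_height, map_width, batch_size, base=1e3):
    """Effective beta under my reading: base / (map elements * batch size)."""
    num_elements = map_height * map_width  # elements in one attention map
    return base / (num_elements * batch_size)


# Hypothetical sizes, not taken from the paper: 8x8 attention map, batch of 128.
beta = beta_per_layer(8, 8, 128)
print(round(beta, 3))  # 0.122
```

If this reading is right, it would explain why β comes out around 0.1 for typical sizes, but I would like to confirm that this is what the authors meant.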