Thanks for this great project!
I noticed that no padding is used for pre-training, while `padding=2` is set for all fine-tuning tasks. IIUC, the image size used in fine-tuning (1024x768) is evenly divisible by the patch size (16). Why intentionally pad the input and make it no longer divisible by the patch size?
```python
patch_cfg=dict(padding=2),
```
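To make the arithmetic concrete, here is a minimal sketch of the patch-grid calculation, assuming the padding is applied symmetrically on both sides before the patch-embedding convolution (as `torch.nn.Conv2d` does; the hypothetical `num_patches` helper below is only for illustration):

```python
def num_patches(size: int, patch: int = 16, padding: int = 0) -> int:
    """Patches along one axis, using the Conv2d output-size formula:
    floor((size + 2*padding - patch) / patch) + 1."""
    return (size + 2 * padding - patch) // patch + 1

# Without padding, 1024x768 tiles exactly into 64x48 patches of 16px.
print(num_patches(1024), num_patches(768))                        # 64 48

# With padding=2 the padded size is 1028x772, which is NOT divisible
# by 16, so the convolution's floor silently drops the trailing pixels.
print((1024 + 2 * 2) % 16, (768 + 2 * 2) % 16)                    # 4 4
print(num_patches(1024, padding=2), num_patches(768, padding=2))  # 64 48
```

Interestingly, the patch count stays 64x48 either way; the padding only shifts the grid by 2 pixels and lets the floor discard the remainder, which is part of why the choice looks puzzling.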