Has anyone successfully fine-tuned this model with a small dataset (e.g., a few thousand images) and managed to reduce or avoid tail distribution issues at lower resolutions (≈224×224)?
If so, which adjustments were effective (e.g., patch size, learning rate, or EMA decay)?