Hi, thanks for your great work on MatchAnything!
I saw that your model builds on RoMa, which uses DINOv2. Since DINOv2 expects RGB images, I’m curious how you handle single-channel inputs (e.g., depth or infrared).
Do you replicate the single channel three times, or use some projection/adaptation layer before the encoder?
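
To make the question concrete, here is a minimal NumPy sketch of the two options I have in mind. This is only my guess at what the preprocessing could look like, not something taken from your code: the function names `replicate_channels` and `project_channels` are hypothetical, and the projection weights are placeholders for parameters that would presumably be learned.

```python
import numpy as np

def replicate_channels(gray):
    """Option 1: copy a single-channel (H, W) image into 3 identical channels."""
    return np.repeat(gray[..., None], 3, axis=-1)  # shape (H, W, 3)

def project_channels(gray, weight, bias):
    """Option 2: per-pixel linear map from 1 to 3 channels (like a 1x1 conv).

    weight and bias have shape (3,); here they are placeholders (assumption),
    in a real model they would be trained.
    """
    return gray[..., None] * weight + bias  # shape (H, W, 3)

# Stand-in for a depth or infrared map.
depth = np.random.rand(4, 5).astype(np.float32)

rgb_like = replicate_channels(depth)
projected = project_channels(
    depth,
    weight=np.ones(3, dtype=np.float32),
    bias=np.zeros(3, dtype=np.float32),
)
```

With unit weights and zero bias the projection reduces to plain replication, so I'm mainly wondering whether anything beyond option 1 was needed in practice.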
Thanks a lot for your time!