Hello, thank you very much for your excellent work. I have two questions to ask.
- I noticed that your paper mentions that the Mixture of vision encoders improves performance on 12-14 benchmarks. However, I haven't seen an ablation experiment for the Mixture of vision encoders. Are there specific numbers you can share? How much improvement does it actually bring?
- I noticed that the tasks where the Mixture of vision encoders brings the greatest improvement are OCR and Chart/Document VQA, which seem to depend on recognizing and understanding small targets. Is this because tiling is used for the input, which retains more image detail?