Evaluation metrics

Hi, I'm trying to reproduce the results from your paper. For accuracy on the general VLM benchmarks (MME and POPE), how did you get 100% accuracy with both the llava-phi and llama-vision models? Did you fine-tune the VLMs on these datasets in addition to the FIUBench dataset?