A bilingual benchmark for evaluating Vision-Language Models (VLMs) on biomedical imaging tasks in English and Modern Standard Arabic.
Install LLaVA-pp by following the installation guide.
Set your OpenAI API key, then run the evaluation scripts:

```bash
export OPENAI_API_KEY="your-api-key-here"

./eval.sh      # Evaluates the MBZUAI/BiMediX2-8B model on the English test set
./eval_ara.sh  # Evaluates the MBZUAI/BiMediX2-8B-BI bilingual model on the Arabic test set
```
The evaluation pipeline has three steps:
1. Generate answers:

```bash
python gen_ans.py <model_path> <language>
```

- `language`: `eng` or `ara`
- Output: `./data/eval_out_files/{model_name}/{language}_ans.jsonl`
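The `{language}_ans.jsonl` output is a JSON Lines file: one JSON object per line, one line per test question. A minimal sketch of writing and reading such a file, assuming LLaVA-style field names (`question_id`, `text`) — the exact keys emitted by `gen_ans.py` may differ:

```python
import json
import os
import tempfile

# Hypothetical answer records; "question_id" and "text" follow common
# LLaVA-style conventions and are assumptions, not the confirmed schema.
answers = [
    {"question_id": 1, "text": "The X-ray shows a left lower lobe opacity."},
    {"question_id": 2, "text": "No acute abnormality is seen."},
]

path = os.path.join(tempfile.gettempdir(), "eng_ans.jsonl")

# Write one JSON object per line (JSONL). ensure_ascii=False keeps
# Arabic text readable in the Arabic variant of the file.
with open(path, "w", encoding="utf-8") as f:
    for rec in answers:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back line by line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```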
2. GPT scoring:

English:

```bash
python eval/eval_multimodal_chat_gpt_score.py \
    --answers-file data/eval_out_files/{model_name}/eng_ans.jsonl \
    --question-file data/test_sets/bimed-mbench_eng.jsonl \
    --scores-file data/eval_out_files/{model_name}/eng_score.jsonl
```

Arabic:

```bash
python eval/eval_multimodal_chat_gpt_score_ara.py \
    --answers-file data/eval_out_files/{model_name}/ara_ans.jsonl \
    --question-file data/test_sets/bimed-mbench_ara.jsonl \
    --scores-file data/eval_out_files/{model_name}/ara_score.jsonl
```

3. Summarize results:

```bash
python eval/summarize_gpt_review.py \
    --scores-file data/eval_out_files/{model_name}/{language}_score.jsonl
```

- Output: `./data/eval_out_files/{model_name}/{language}_results.txt`
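The summarize step aggregates the per-question GPT scores into the final numbers written to `{language}_results.txt`. A minimal sketch of that aggregation, assuming LLaVA-Med-style score records where each line carries a (reference, candidate) score pair under a `scores` key — the field names used by `summarize_gpt_review.py` may differ:

```python
import json
from statistics import mean

# Hypothetical score lines as they might appear in {language}_score.jsonl;
# the first score is for the reference answer, the second for the model.
lines = [
    '{"question_id": 1, "scores": [8, 6]}',
    '{"question_id": 2, "scores": [9, 9]}',
]

records = [json.loads(line) for line in lines]
ref_scores = [r["scores"][0] for r in records]
model_scores = [r["scores"][1] for r in records]

# Relative score: the model's average as a percentage of the
# GPT reference answers' average.
relative = 100.0 * mean(model_scores) / mean(ref_scores)
print(f"relative score: {relative:.1f}")
```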
To evaluate a custom model, replace the `bimedix_inference.Inference` class in `gen_ans.py` with your own model's inference code.
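A minimal sketch of what such a drop-in class could look like. The constructor and method signatures below are assumptions — check how `gen_ans.py` actually instantiates and calls `bimedix_inference.Inference` and match that interface:

```python
# Hypothetical replacement for bimedix_inference.Inference.
# The (image_path, prompt) -> str interface is an assumption.
class Inference:
    def __init__(self, model_path: str):
        self.model_path = model_path
        # Load your model and processor here, e.g. via transformers.

    def infer(self, image_path: str, prompt: str) -> str:
        # Replace this stub with a real forward pass over (image, prompt).
        return f"[{self.model_path}] answer for: {prompt}"


model = Inference("my-org/my-vlm")
print(model.infer("chest_xray.png", "Describe the findings."))
```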