<p>Data construction process of OmniBrainBench, consisting of (a) data collection, (b) QA standardization, and (c) data filtering. Finally, we implement (d) model evaluation on OmniBrainBench. </p>
239
+
<p>Construction process of OmniBrainBench with (a) data collection, (b) question augmentation, and (c) data filtering. </p>
<h2 class="title is-4">Results of different MLLMs on 12 clinical tasks.</h2>
263
+
<h2 class="title is-4">Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench.</h2>
<p>Table 1: Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench. The best-performing model in each category is shown in bold, and the second best is underlined.</p>
@@ -271,7 +271,7 @@ <h2 class="title is-4">Results of different MLLMs on 12 clinical tasks.</h2>
<p>Table 2: Performance of different MLLMs on open-ended VQA of OmniBrainBench. Higher values indicate better performance in generation quality, semantic similarity, and fluency.</p>
@@ -282,7 +282,7 @@ <h2 class="title is-4">Results of different MLLMs on 4 different endoscopy scena
<p><b>Observation 1: Endoscopy remains a challenging domain for MLLMs, with significant gaps between models and human expertise.</b> Human experts achieve an average accuracy of 74.12% in endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%—a gap of roughly 25%. This highlights the inherent difficulty of endoscopy, which demands both precise visual interpretation and specialized medical knowledge. Proprietary models consistently outperform open-source models overall, yet open-source models show a surprising edge in surgical scenarios, where their accuracy improves markedly compared to random baselines. In contrast, for non-surgical tasks like landmark and organ identification, open-source models perform no better than random guessing. This disparity suggests that while open-source models can leverage structured contexts, they falter in knowledge-intensive tasks, pointing to a need for enhanced domain-specific capabilities.</p>
351
+
<p><b>Observation 1: Brain imaging analysis is challenging for MLLMs, with significant gaps between MLLMs and physicians.</b> Physicians achieve an average accuracy of 91.35% across all tasks, whereas the highest-performing model, Gemini-2.5-Pro, attains only 66.58%, a gap of 24.77 percentage points. This disparity underscores the intrinsic complexity of brain imaging analysis, which requires both precise visual interpretation and specialized clinical expertise. It also indicates that, while open-source models benefit from structured contextual inputs, they remain limited in knowledge-intensive and reasoning-dependent domains, highlighting the need for domain-specific pretraining and stronger reasoning capabilities.
352
+
</p>
352
353
</div>
353
354
354
355
<!-- Observation 2 -->
355
356
<div class="content has-text-justified">
356
-
<p><b>Observation 2: Medical domain-specific Supervised Fine-Tuning markedly boosts model performance.</b> Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well in tasks like landmark identification and organ recognition, even outperforming all proprietary models. This indicates that domain pretraining effectively equips models with essential medical knowledge, enhancing their competitiveness in specialized tasks. However, some medical models exhibit limitations in instruction-following capabilities and suffer from overfitting, which restricts their performance in broader application scenarios. This suggests that while conducting domain-specific training, greater attention should be paid to balancing model generalization and task adaptability.</p>
357
+
<p><b>Observation 2: Medical MLLMs exhibit heterogeneous performance.</b> The highest-performing medical MLLM, HuatuoGPT-V-34B, achieves a mean accuracy of 63.56%, making it competitive with leading proprietary MLLMs, and it demonstrates superior performance in the clinical phases of IMI (69.55%) and RS (40.84%). In contrast, other medical MLLMs, e.g., MedGemma-4B (48.04%) and Llava-Med-7B (38.84%), show markedly lower aggregate scores, consistent with the observed general performance deficit. This suggests that domain-specific training should pay greater attention to balancing model generalization and task adaptability.
<p class="mt-3">Figure 5: The influence of visual prompt in lesion quantification task among different MLLMs.</p>
363
-
</div>
365
+
</div> -->
364
366
365
367
<!-- Observation 3 -->
366
368
<div class="content has-text-justified">
367
-
<p><b>Observation 3: Model performance varies with visual prompt formats, exposing a gap between visual perception and medical comprehension.</b> The ability of models to understand spatial information varies significantly based on how visual prompts are formatted, rather than being consistently robust across different scenarios. To explore this, we test the same images across 3 tasks with different visual prompts, as shown in Figure \ref{fig:visual_prompt}. The results in Table \ref{tab:comparison} and Table \ref{tab:comparison_scene} reveal that most models, especially proprietary ones, excelled in the ROI Selection task, indicating strong visual comprehension in distinguishing between regions. However, they struggled to accurately classify lesion types within those regions, pointing to a lack of medical knowledge as the main source of errors rather than poor visual processing. This suggests that while models can spatially differentiate areas, their interpretation hinges on both the prompt format and their limited medical expertise. Ultimately, models’ spatial understanding is not broadly applicable but depends heavily on prompt structure, with insufficient medical knowledge acting as a key limitation.</p>
369
+
<p><b>Observation 3: Task difficulty varies widely, exposing a gap between visual perception and medical comprehension.</b> MLLMs and physicians consistently achieve high scores in tasks such as prognostic factor analysis, clinical sign prediction, drug response prediction, and postoperative outcome assessment, where perfect scores of 100.00% are observed. Conversely, tasks such as risk stratification and preoperative assessment are much harder, with significantly lower scores across all MLLMs (e.g., the highest-performing MLLM scores 40.84% in risk stratification). These findings highlight the importance of integrating medical knowledge and clinical reasoning beyond visual perception to close the performance gap in complex diagnostic and decision-making tasks.
370
+
</p>
368
371
</div>
369
372
370
373
<!-- Observation 4 -->
371
374
<div class="content has-text-justified">
372
-
<p><b>Observation 4: Polyp counting exposes dual challenges in lesion segmentation and numerical reasoning.</b> Polyp counting, a task that requires both spatial localization of lesions and numerical reasoning, remains challenging for most models, with all models achieving accuracy below 30%. To further analyze the sources of model errors, we introduced a new visual prompt format (Figure \ref{fig:polyp_case}), which led to modest improvements in accuracy across models. Notably, Gemini-2.5-Pro achieved a remarkable accuracy of 92% under this new prompting strategy. This significant improvement suggests that Gemini possesses strong capabilities in spatial recognition and counting, indicating that the primary limitation across models lies not in computational or spatial reasoning but rather in lesion identification. This finding underscores the critical need to enhance the integration of domain-specific medical knowledge in vision-language models to better address tasks that combine visual analysis with clinical understanding.</p>
375
+
<p><b>Observation 4: Open-source MLLMs exhibit far greater performance variance than their proprietary counterparts.</b> While top performers such as Lingshu lead on ROUGE-1, ROUGE-L, and BERTScore, many others, especially medical variants, rank near the bottom. This indicates that the open ecosystem's rapid, decentralized innovation produces both notable advances and pronounced instability in model quality.
In this section, we present a case study analyzing the performance of multiple Multimodal Large Language Models (MLLMs) on OmniBrainBench across various endoscopic scenarios. In addition to showcasing correct responses, we categorize errors into four distinct types: <strong>Perceptual Errors</strong>, <strong>Lack of Knowledge</strong>, <strong>Irrelevant Response</strong>, and <strong>Refusal to Answer</strong>. The following figures illustrate these case studies: correct samples are presented in Figures 7 through 12, while error samples are shown in Figures 13 through 20.
396
+
In this section, we present a case study analysis of multiple MLLMs on OmniBrainBench under various scenarios. The evaluation is structured into two tracks, closed-ended VQA and open-ended VQA, allowing a nuanced assessment of model capabilities across task formats.
393
397
</p>
394
-
<p>
398
+
<!-- <p>
395
399
<strong>Correct Samples (Figures 7–12):</strong> These figures highlight exemplary performances by leading models such as Gemini-2.5-Pro and GPT-4o. These models demonstrate robust capabilities in accurately interpreting endoscopic images and providing clinically relevant responses, underscoring their potential for assisting in real-world endoscopic analysis.
These case studies emphasize the need for improved medical knowledge integration and enhanced perceptual capabilities to bridge the gap between current MLLM performance and clinical requirements.
417
-
</p>
421
+
</p> -->
418
422
</div>
419
-
<div class="columns is-multiline">
423
+
<!-- <div class="columns is-multiline">
420
424
<div class="column is-half">
421
425
<figure class="image">
422
426
<img src="static/images/Case Study_01.jpg" alt="Case Study 01" class="center">