1. We introduce OmniBrainBench, the first comprehensive benchmark specifically designed to evaluate MLLMs across the complete spectrum of endoscopy, covering 4 endoscopic scenarios, 12 specialized tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities.
+
1. We introduce OmniBrainBench, the first comprehensive multimodal benchmark specifically designed to evaluate MLLMs across the complete spectrum of brain imaging analysis with closed- and open-ended evaluations, covering <b>9,527</b> clinically verified VQA pairs, <b>31,706</b> images, and <b>15</b> modalities.
</p>
<p>
2. We develop the multi-dimensional evaluation framework that mirrors the clinical workflow progression from basic anatomical recognition to advanced surgical intervention, assessing MLLMs' capabilities across the full spectrum of endoscopic analysis skills.
Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic modalities and the full range of skills needed in clinical workflows. To address these issues, we introduce OmniBrainBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. OmniBrainBench encompasses 4 distinct endoscopic modalities, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow (spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations) to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. OmniBrainBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.
<p> Data construction process of OmniBrainBench, consisting of (a) data collection, (b) QA standardization, and (c) data filtering. Finally, we implement (d) model evaluation on OmniBrainBench. </p>
<p>Table 1: Results of different MLLMs on 12 clinical tasks in OmniBrainBench. The best-performing model in each category is in bold, and the second best is underlined.</p>
<p>Table 1: Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench. The best-performing model in each category is highlighted in bold, and the second best is underlined.</p>
</div>
</div>
</div>
@@ -273,8 +273,8 @@ <h2 class="title is-4">Results of different MLLMs on 12 clinical tasks.</h2>
<div class="column is-four-fifths">
<h2 class="title is-4">Results of different MLLMs on 4 different endoscopy scenarios and 4 different visual prompts.</h2>
<p>Table 2: Results of different MLLMs on 4 different endoscopy scenarios and 4 different visual prompts in OmniBrainBench. The best-performing model in each category is in bold, and the second best is underlined.</p>
<p>Table 2: Performance of different MLLMs on open-ended VQA of OmniBrainBench. Higher values indicate better performance in generation quality, semantic similarity, and fluency.</p>
</div>
</div>
</div>
@@ -284,8 +284,8 @@ <h2 class="title is-4">Results of different MLLMs on 4 different endoscopy scena
<div class="column is-four-fifths">
<h2 class="title is-4">Results of different MLLMs on 12 subtasks in OmniBrainBench.</h2>