Commit 4b622e9

Update index.html
1 parent 47aa506 commit 4b622e9

1 file changed

Lines changed: 29 additions & 25 deletions

index.html

Lines changed: 29 additions & 25 deletions
@@ -76,7 +76,7 @@ <h2 class="subtitle is-3 publication-subtitle">
7676
<div class="is-size-5 publication-authors">
7777
<span class="author-block"><sup style="color:#ffac33;">1</sup>Department of Electronic Engineering, The Chinese University of Hong Kong</span>
7878
<span class="author-block"><sup style="color:#6fbf73;">2</sup>Sun Yat-sen Memorial Hospital, Sun Yat-sen University</span>
79-
<span class="author-block"><sup style="color:#ff00f2;">3</sup>School of Biomedical Engineering, Southern Medical University</span>
79+
<span class="author-block"><sup style="color:#ff00f2;">3</sup>School of Biomedical Engineering, Southern Medical University</span> <br>
8080
<span class="author-block"><sup style="color:#9b51e0;">4</sup>Zhongshan Hospital, Fudan University</span>
8181
<span class="author-block"><sup style="color:#ed4b82;">5</sup>Department of Neurosurgery, Prince of Wales Hospital</span><br>
8282
</div>
@@ -236,7 +236,7 @@ <h2 class="title is-3">Construction Process</h2>
236236
</p> -->
237237
<div class="content has-text-centered">
238238
<img src="static/images/OmniBrainBench_construction.png" alt="algebraic reasoning" width="100%" class="center">
239-
<p> Data construction process of OmniBrainBench, consisting of (a) data collection, (b) QA standardization, and (c) data filtering. Finally, we implement (d) model evaluation on OmniBrainBench. </p>
239+
<p> Construction process of OmniBrainBench with (a) data collection, (b) question augmentation, and (c) data filtering. </p>
240240
</div>
241241
</div>
242242
</div>
@@ -260,7 +260,7 @@ <h1 class="title is-3 mmmu">Experiment Results</h1>
260260
<!-- First chart -->
261261
<div class="columns is-centered has-text-centered">
262262
<div class="column is-four-fifths">
263-
<h2 class="title is-4">Results of different MLLMs on 12 clinical tasks.</h2>
263+
<h2 class="title is-4">Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench.</h2>
264264
<div class="content has-text-centered">
265265
<img src="static/images/OmniBrainBench_closedVQA.png" alt="algebraic reasoning" width="80%" class="center">
266266
<p>Table 1: Performance of different MLLMs on five specialized clinical phases with 15 secondary subtasks on closed-ended VQA of OmniBrainBench. The best-performing model in each category is highlighted in bold, and the second best is underlined.</p>
@@ -271,7 +271,7 @@ <h2 class="title is-4">Results of different MLLMs on 12 clinical tasks.</h2>
271271
<!-- Second chart -->
272272
<div class="columns is-centered has-text-centered">
273273
<div class="column is-four-fifths">
274-
<h2 class="title is-4">Results of different MLLMs on 4 different endoscopy scenarios and 4 different visual prompts.</h2>
274+
<h2 class="title is-4">Performance of different MLLMs on open-ended VQA of OmniBrainBench.</h2>
275275
<div class="content has-text-centered">
276276
<img src="static/images/OmniBrainBench_openVQA.png" alt="algebraic reasoning" width="75%" class="center">
277277
<p>Table 2: Performance of different MLLMs on open-ended VQA of OmniBrainBench. Higher values indicate better performance in generation quality, semantic similarity, and fluency.</p>
@@ -282,7 +282,7 @@ <h2 class="title is-4">Results of different MLLMs on 4 different endoscopy scena
282282
<!-- Third chart -->
283283
<div class="columns is-centered has-text-centered">
284284
<div class="column is-four-fifths">
285-
<h2 class="title is-4">Results of different MLLMs on 12 subtasks in OmniBrainBench.</h2>
285+
<h2 class="title is-4">Diverse Modality Evaluation.</h2>
286286
<div class="content has-text-centered">
287287
<img src="static/images/OmniBrainBench_analysis_DiffModality.png" alt="algebraic reasoning" width="75%" class="center">
288288
<p>Table 3: Diverse Modality Evaluation.</p>
@@ -293,7 +293,7 @@ <h2 class="title is-4">Results of different MLLMs on 12 subtasks in OmniBrainBen
293293
<!-- Third figure -->
294294
<div class="columns is-centered has-text-centered">
295295
<div class="column is-four-fifths">
296-
<h2 class="title is-4">Performance comparison of several leading MLLMs and Clinicians.</h2>
296+
<h2 class="title is-4">Performance of models on different numbers of images.</h2>
297297
<div class="content has-text-centered">
298298
<img src="static/images/OmniBrainBench_analysis_DiffImages.png" alt="algebraic reasoning" width="80%" class="center">
299299
<p>Figure 1: Performance of models on different numbers of images.</p>
@@ -302,36 +302,36 @@ <h2 class="title is-4">Performance comparison of several leading MLLMs and Clini
302302
</div>
303303

304304
<!-- <!-- Fourth figure -->
305-
<div class="columns is-centered has-text-centered">
305+
<!-- <div class="columns is-centered has-text-centered">
306306
<div class="column is-four-fifths">
307307
<h2 class="title is-4">Performance comparison across four major categories.</h2>
308308
<div class="content has-text-centered">
309309
<img src="static/images/figure1.jpg" alt="algebraic reasoning" width="80%" class="center">
310310
<p>Figure 2: Performance comparison across 4 major categories in OmniBrainBench among existing MLLMs.</p>
311311
</div>
312-
</div>
312+
</div> -->
313313
</div> -->
314314

315315
<!-- <!-- Fifth figure -->
316-
<div class="columns is-centered has-text-centered">
316+
<!-- <div class="columns is-centered has-text-centered">
317317
<div class="column is-four-fifths">
318318
<h2 class="title is-4">Performance comparison across four endoscopic scenarios.</h2>
319319
<div class="content has-text-centered">
320320
<img src="static/images/sup_fig1.jpg" alt="algebraic reasoning" width="80%" class="center">
321321
<p>Figure 3: Performance comparison across 4 endoscopic scenarios in OmniBrainBench among existing MLLMs.</p>
322322
</div>
323-
</div>
323+
</div> -->
324324
</div> -->
325325

326326
<!-- <!-- Sixth figure -->
327-
<div class="columns is-centered has-text-centered">
327+
<!-- <div class="columns is-centered has-text-centered">
328328
<div class="column is-four-fifths">
329329
<h2 class="title is-4">Performance comparison across five different visual prompts.</h2>
330330
<div class="content has-text-centered">
331331
<img src="static/images/sup_fig2.jpg" alt="algebraic reasoning" width="80%" class="center">
332332
<p>Figure 4: Performance comparison across 5 different visual prompts in OmniBrainBench among existing MLLMs.</p>
333333
</div>
334-
</div>
334+
</div> -->
335335
</div> -->
336336

337337
</div>
@@ -348,35 +348,39 @@ <h2 class="title is-3 has-text-centered">Analysis</h2>
348348

349349
<!-- Observation 1 -->
350350
<div class="content has-text-justified">
351-
<p><b>Observation 1: Endoscopy remains a challenging domain for MLLMs, with significant gaps between models and human expertise.</b> Human experts achieve an average accuracy of 74.12% in endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%—a gap of roughly 25%. This highlights the inherent difficulty of endoscopy, which demands both precise visual interpretation and specialized medical knowledge. Proprietary models consistently outperform open-source models overall, yet open-source models show a surprising edge in surgical scenarios, where their accuracy improves markedly compared to random baselines. In contrast, for non-surgical tasks like landmark and organ identification, open-source models perform no better than random guessing. This disparity suggests that while open-source models can leverage structured contexts, they falter in knowledge-intensive tasks, pointing to a need for enhanced domain-specific capabilities.</p>
351+
<p><b>Observation 1: Brain imaging analysis is challenging for MLLMs, with significant gaps between MLLMs and physicians.</b> Physicians achieve an average accuracy of 91.35% across all tasks, whereas the highest-performing model, Gemini-2.5-Pro, attains only 66.58%, a gap of 24.77 percentage points. This disparity underscores the intrinsic complexity of brain imaging analysis, which requires both precise visual interpretation and specialized clinical expertise. It also indicates that, while open-source models benefit from structured contextual inputs, they remain limited in knowledge-intensive and reasoning-dependent domains, highlighting the need for domain-specific pretraining and stronger reasoning capabilities.
352+
</p>
352353
</div>
353354

354355
<!-- Observation 2 -->
355356
<div class="content has-text-justified">
356-
<p><b>Observation 2: Medical domain-specific Supervised Fine-Tuning markedly boosts model performance.</b> Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well in tasks like landmark identification and organ recognition, even outperforming all proprietary models. This indicates that domain pretraining effectively equips models with essential medical knowledge, enhancing their competitiveness in specialized tasks. However, some medical models exhibit limitations in instruction-following capabilities and suffer from overfitting, which restricts their performance in broader application scenarios. This suggests that while conducting domain-specific training, greater attention should be paid to balancing model generalization and task adaptability.</p>
357+
<p><b>Observation 2: Medical MLLMs exhibit heterogeneous performance.</b> The highest-performing medical MLLM, HuatuoGPT-V-34B, achieves a mean accuracy of 63.56%, making it competitive with leading proprietary MLLMs, and it demonstrates superior performance in the clinical phases of IMI (69.55%) and RS (40.84%). In contrast, other medical MLLMs, e.g., MedGemma-4B (48.04%) and Llava-Med-7B (38.84%), show markedly lower aggregate scores, consistent with the observed general performance deficit. This suggests that domain-specific training should pay greater attention to balancing model generalization and task adaptability.
358+
</p>
357359
</div>
358360

359361
<!-- Chart display -->
360-
<div class="content has-text-centered my-6">
362+
<!-- <div class="content has-text-centered my-6">
361363
<img src="static/images/figure3.jpg" alt="error distribution" width="100%">
362364
<p class="mt-3">Figure 5: The influence of visual prompt in lesion quantification task among different MLLMs.</p>
363-
</div>
365+
</div> -->
364366

365367
<!-- Observation 3 -->
366368
<div class="content has-text-justified">
367-
<p><b>Observation 3: Model performance varies with visual prompt formats, exposing a gap between visual perception and medical comprehension.</b> The ability of models to understand spatial information varies significantly based on how visual prompts are formatted, rather than being consistently robust across different scenarios. To explore this, we test the same images across 3 tasks with different visual prompts, as shown in Figure \ref{fig:visual_prompt}. The results in Table \ref{tab:comparison} and Table \ref{tab:comparison_scene} reveal that most models, especially proprietary ones, excelled in the ROI Selection task, indicating strong visual comprehension in distinguishing between regions. However, they struggled to accurately classify lesion types within those regions, pointing to a lack of medical knowledge as the main source of errors rather than poor visual processing. This suggests that while models can spatially differentiate areas, their interpretation hinges on both the prompt format and their limited medical expertise. Ultimately, models’ spatial understanding is not broadly applicable but depends heavily on prompt structure, with insufficient medical knowledge acting as a key limitation.</p>
369+
<p><b>Observation 3: Task difficulty varies widely, exposing a gap between visual perception and medical comprehension.</b> MLLMs and physicians consistently achieve high scores in tasks such as prognostic factor analysis, clinical sign prediction, drug response prediction, and postoperative outcome assessment, where perfect scores of 100.00% occur. Conversely, tasks such as risk stratification and preoperative assessment are much more difficult, with significantly lower scores across all MLLMs (e.g., the highest-performing MLLM scores 40.84% in risk stratification). These findings highlight the importance of integrating medical knowledge and clinical reasoning beyond visual perception to close the performance gap in complex diagnostic and decision-making tasks.
370+
</p>
368371
</div>
369372

370373
<!-- Observation 4 -->
371374
<div class="content has-text-justified">
372-
<p><b>Observation 4: Polyp counting exposes dual challenges in lesion segmentation and numerical reasoning.</b> Polyp counting, a task that requires both spatial localization of lesions and numerical reasoning, remains challenging for most models, with all models achieving accuracy below 30%. To further analyze the sources of model errors, we introduced a new visual prompt format (Figure \ref{fig:polyp_case}), which led to modest improvements in accuracy across models. Notably, Gemini-2.5-Pro achieved a remarkable accuracy of 92% under this new prompting strategy. This significant improvement suggests that Gemini possesses strong capabilities in spatial recognition and counting, indicating that the primary limitation across models lies not in computational or spatial reasoning but rather in lesion identification. This finding underscores the critical need to enhance the integration of domain-specific medical knowledge in vision-language models to better address tasks that combine visual analysis with clinical understanding.</p>
375+
<p><b>Observation 4: Open-source MLLMs exhibit far greater performance variance than their proprietary counterparts.</b> While leading models such as Lingshu take the top spots on ROUGE-1, ROUGE-L, and BERTScore, many others, especially medical variants, rank near the bottom. This indicates that the open ecosystem's rapid, decentralized innovation produces both notable advances and pronounced instability in model quality.
376+
</p>
373377
</div>
374378

375379
<!-- Second chart -->
376-
<div class="content has-text-centered my-6">
380+
<!-- <div class="content has-text-centered my-6">
377381
<img src="static/images/figure4.jpg" alt="polyp counting results" width="100%">
378382
<p class="mt-3">Figure 6: The influence of visual prompt in lesion quantification task among different MLLMs.</p>
379-
</div>
383+
</div> -->
380384
</div>
381385
</div>
382386
</div>
@@ -389,9 +393,9 @@ <h2 class="title is-3 has-text-centered">Analysis</h2>
389393
<h2 class="title is-3">Case Study</h2>
390394
<div class="content has-text-justified">
391395
<p>
392-
In this section, we present a case study analyzing the performance of multiple Multimodal Large Language Models (MLLMs) on OmniBrainBench across various endoscopic scenarios. In addition to showcasing correct responses, we categorize errors into three distinct types: <strong>Perceptual Errors</strong>, <strong>Lack of Knowledge</strong>, <strong>Irrelevant Response</strong>, and <strong>Refusal to Answer</strong>. The following figures illustrate these case studies: correct samples are presented in Figures 7 through 12, while error samples are shown in Figures 13 through 20.
396+
In this section, we present a case study of multiple MLLMs on OmniBrainBench across various scenarios. The evaluation is structured into two tracks, closed-ended VQA and open-ended VQA, so that model capabilities can be assessed across different task formats.
393397
</p>
394-
<p>
398+
<!-- <p>
395399
<strong>Correct Samples (Figures 7–12):</strong> These figures highlight exemplary performances by leading models such as Gemini-2.5-Pro and GPT-4o. These models demonstrate robust capabilities in accurately interpreting endoscopic images and providing clinically relevant responses, underscoring their potential for assisting in real-world endoscopic analysis.
396400
</p>
397401
<p>
@@ -414,9 +418,9 @@ <h2 class="title is-3">Case Study</h2>
414418
</ul>
415419
<p>
416420
These case studies emphasize the need for improved medical knowledge integration and enhanced perceptual capabilities to bridge the gap between current MLLM performance and clinical requirements.
417-
</p>
421+
</p> -->
418422
</div>
419-
<div class="columns is-multiline">
423+
<!-- <div class="columns is-multiline">
420424
<div class="column is-half">
421425
<figure class="image">
422426
<img src="static/images/Case Study_01.jpg" alt="Case Study 01" class="center">
@@ -501,7 +505,7 @@ <h2 class="title is-3">Case Study</h2>
501505
<figcaption>Figure 20: Error sample demonstrating Refusal to Answer (Grok-3).</figcaption>
502506
</figure>
503507
</div>
504-
</div>
508+
</div> -->
505509
</div>
506510
</div>
507511
</div>
