Commit 5c27d4a: Update README.md
1 parent 103a108


mlcd/README.md

Lines changed: 26 additions & 0 deletions
@@ -9,6 +9,8 @@
To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model. We paired this with the Qwen2.5-7B language model. For reproducibility, we utilized the LLaVA-Pretrain dataset for pre-training and the LLaVA-NeXT-Data for structured fine-tuning. The evaluation results confirm that the MLCD model performs exceptionally well across multiple benchmarks, underscoring its effectiveness in MLLMs.
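For orientation, the snippet below sketches what the vision-tower swap could look like with Hugging Face Transformers: load the MLCD ViT-L/14-336 encoder in place of CLIP and expose the patch features a LLaVA-NeXT-style projector consumes. The checkpoint ID and the use of the CLIP model classes are assumptions for illustration, not the actual training code behind the numbers below.

```python
# Minimal sketch: load an MLCD ViT-L/14-336 checkpoint in place of the usual
# CLIP vision tower and extract the patch features a LLaVA-NeXT-style
# projector would consume. The Hub ID is an assumption, not a verified path.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

MLCD_ID = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # assumed checkpoint ID

processor = CLIPImageProcessor.from_pretrained(MLCD_ID)
vision_tower = CLIPVisionModel.from_pretrained(MLCD_ID).eval()

image = Image.new("RGB", (336, 336))  # stand-in for a real input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)
    # LLaVA-style models typically take the second-to-last layer and drop the
    # CLS token, leaving one feature per image patch: (1, 576, 1024) here.
    patch_features = outputs.hidden_states[-2][:, 1:]

print(patch_features.shape)
```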

| Vision Tower          | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU  |
| :-------------------- | :----: | :------ | :----- | :------ | :------- | :---- |
| CLIP (ViT-L-14-336px) | ×      | 66.52   | 75.21  | 38.88   | 525.00   | 44.20 |
@@ -20,6 +22,30 @@ To evaluate MLCD’s performance within multimodal large language models (MLLMs)

| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :-------------- | :-------------------- | :-------------------- |
| LLM             | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D            | **76.98**             | 73.15                 |
| GQA             | **64.17**             | 63.31                 |
| ScienceQA-Img   | **78.09**             | 76.35                 |
| InfoVQA-Val     | **43.48**             | 38.88                 |
| MMBenchCN-Dev   | **74.83**             | 72.51                 |
| MMBenchEN-Dev   | **76.37**             | 74.57                 |
| SeedBench       | **68.20**             | 66.80                 |
| SeedBench-Img   | **73.75**             | 72.72                 |
| MMStar          | **50.98**             | 48.98                 |
| MMMU            | **44.30**             | 44.20                 |
| POPE            | 88.69                 | **88.83**             |
| ChartQA         | **67.84**             | 66.52                 |
| DocVQA-Val      | **76.46**             | 75.21                 |
| TextVQA-Val     | 61.69                 | **62.47**             |
| OCRBench        | **531**               | 525                   |
| MME(cognition)  | **432**               | 384                   |
| MME(perception) | **1598**              | 1512                  |

#### B. Linear Probe Evaluation Results
This table presents the results of linear probe evaluations comparing CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks.
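The sketch below illustrates the linear-probe recipe itself: the backbone weights stay frozen, features are extracted once per image, and only a linear classifier is fit on top. The checkpoint ID, the CLIP loading path, and the toy dataset splits are placeholders for illustration, not the harness used to produce the reported numbers.

```python
# Minimal linear-probe sketch: freeze a pre-trained vision backbone, extract
# one feature vector per image, and train only a linear classifier on top.
# The checkpoint ID and the toy dataset splits are placeholders.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # assumed checkpoint ID
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
backbone = CLIPVisionModel.from_pretrained(MODEL_ID).eval()  # weights stay frozen

@torch.no_grad()
def extract_features(dataset):
    """Encode an iterable of (PIL image, int label) pairs with the frozen backbone."""
    feats, labels = [], []
    for image, label in dataset:
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        feats.append(backbone(pixel_values).pooler_output.squeeze(0).numpy())
        labels.append(label)
    return np.stack(feats), np.array(labels)

# Stand-in data so the sketch runs end to end; replace with a real labeled dataset.
train_split = [(Image.new("RGB", (336, 336), "black"), 0),
               (Image.new("RGB", (336, 336), "white"), 1)]
test_split = train_split

X_train, y_train = extract_features(train_split)
X_test, y_test = extract_features(test_split)

# Only this linear model is trained; the backbone never receives gradients.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```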