Update README.md

anxiangsir · web-flow · commit 5c27d4a78a08 · 2025-04-06T20:46:17.000+08:00
diff --git a/mlcd/README.md b/mlcd/README.md
@@ -9,6 +9,8 @@
 To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model. We paired this with the Qwen2.5-7B language model. For reproducibility, we utilized the LLaVA-Pretrain dataset for pre-training and the LLaVA-NeXT-Data for structured fine-tuning. The evaluation results confirm that the MLCD model performs exceptionally well across multiple benchmarks, underscoring its effectiveness in MLLMs.
 
 
+
+
 | Vision Tower                                                                                  | RoPE2D | ChartQA   | DocVQA    | InfoVQA   | OCRBench   | MMMU      |
 | :-------------------------------------------------------------------------------------------- | :----: | :-------- | :-------- | :-------- | :--------- | :-------- |
 | CLIP (ViT-L-14-336px)                                                                         |   ×    | 66.52     | 75.21     | 38.88     | 525.00     | 44.20     |
@@ -20,6 +22,30 @@ To evaluate MLCD’s performance within multimodal large language models (MLLMs)
 
 
 
+
+| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
+| :-------------- | :-------------------------------------------------------------------------------------- | :-------------------- |
+| LLM             | Qwen2.5-7B                                                                              | Qwen2.5-7B            |
+| AI2D            | **76.98**                                                                               | 73.15                 |
+| GQA             | **64.17**                                                                               | 63.31                 |
+| ScienceQA-Img   | **78.09**                                                                               | 76.35                 |
+| InfoVQA-Val     | **43.48**                                                                               | 38.88                 |
+| MMBenchCN-Dev   | **74.83**                                                                               | 72.51                 |
+| MMBenchEN-Dev   | **76.37**                                                                               | 74.57                 |
+| SeedBench       | **68.20**                                                                               | 66.80                 |
+| SeedBench-Img   | **73.75**                                                                               | 72.72                 |
+| MMStar          | **50.98**                                                                               | 48.98                 |
+| MMMU            | **44.30**                                                                               | 44.20                 |
+| POPE            | 88.69                                                                                   | **88.83**             |
+| ChartQA         | **67.84**                                                                               | 66.52                 |
+| DocVQA-Val      | **76.46**                                                                               | 75.21                 |
+| TextVQA-Val     | 61.69                                                                                   | **62.47**             |
+| OCRBench        | **531**                                                                                 | 525                   |
+| MME(cognition)  | **432**                                                                                 | 384                   |
+| MME(perception) | **1598**                                                                                | 1512                  |
+
+
+
 #### B. Linear Probe Evaluation Results
 This table presents the results of linear probe evaluations comparing CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks.
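
The linear probe procedure described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind the reported numbers. `frozen_encoder` is a hypothetical stand-in for a pre-trained vision tower (it is neither MLCD nor CLIP), the data is synthetic, and the probe is a plain logistic-regression classifier trained with full-batch gradient descent. The point it shows is that only the linear weights `w` are updated, while the encoder stays fixed.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical stand-in for a frozen backbone: a fixed feature map with no trainable parameters."""
    a, b = x
    return [a, b, a * b, 1.0]  # the trailing 1.0 acts as a bias feature


def make_data(n, seed):
    """Synthetic binary task (assumption for illustration): label is 1 when a * b > 0."""
    rng = random.Random(seed)
    samples, labels = [], []
    while len(samples) < n:
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if abs(a * b) < 0.05:  # skip points very close to the decision boundary
            continue
        samples.append((a, b))
        labels.append(1 if a * b > 0 else 0)
    return samples, labels


def train_linear_probe(samples, labels, epochs=200, lr=0.5):
    """Train a linear classifier on top of frozen features."""
    # Features are computed once; the encoder is never updated.
    feats = [frozen_encoder(x) for x in samples]
    w = [0.0] * len(feats[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability of class 1
            for i, fi in enumerate(f):
                grad[i] += (p - y) * fi  # gradient of the logistic loss
        w = [wi - lr * g / len(feats) for wi, g in zip(w, grad)]
    return w


def probe_accuracy(w, samples, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = 0
    for x, y in zip(samples, labels):
        z = sum(wi * fi for wi, fi in zip(w, frozen_encoder(x)))
        correct += int((z > 0) == bool(y))
    return correct / len(samples)


if __name__ == "__main__":
    train_x, train_y = make_data(200, seed=0)
    test_x, test_y = make_data(100, seed=1)
    weights = train_linear_probe(train_x, train_y)
    print("probe accuracy:", probe_accuracy(weights, test_x, test_y))
```

In a real evaluation, `frozen_encoder` would be replaced by the pre-trained model's image embedding, and the probe accuracy on each dataset would serve as the reported metric. Because the backbone is frozen, differences in accuracy reflect the quality of the representations rather than the capacity of the classifier.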