# MLCD-LLaVA-NeXT: A Multimodal Model with Enhanced Vision Capabilities
## Overview
MLCD-LLaVA-NeXT is our implementation that integrates the powerful MLCD vision encoder with the LLaVA-NeXT architecture. Our model leverages [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model and introduces several variants of the MLCD vision tower to achieve superior performance across multiple vision-language benchmarks.
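To make the architecture concrete, the snippet below is a minimal sketch of loading the two building blocks with Hugging Face `transformers`. The MLCD checkpoint name `DeepGlint-AI/mlcd-vit-large-patch14-336` and the generic `AutoModel`/`trust_remote_code` loading path are assumptions for illustration only; the exact vision tower and loading code used in this repository may differ.

```python
# Hypothetical sketch: load the vision tower and language model separately.
# The MLCD checkpoint name below is an assumption, not necessarily the one
# used in this repo; Qwen/Qwen2.5-7B is the LLM named in this README.
import torch
from transformers import (
    AutoImageProcessor,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
)

MLCD_CKPT = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # assumption
LLM_CKPT = "Qwen/Qwen2.5-7B"

# MLCD vision encoder, loaded generically via AutoModel.
vision_tower = AutoModel.from_pretrained(
    MLCD_CKPT, torch_dtype=torch.float16, trust_remote_code=True
)
image_processor = AutoImageProcessor.from_pretrained(MLCD_CKPT, trust_remote_code=True)

# Qwen2.5-7B language model and tokenizer.
llm = AutoModelForCausalLM.from_pretrained(LLM_CKPT, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(LLM_CKPT)

# In a LLaVA-NeXT-style model, patch features from the vision tower are
# projected into the LLM embedding space by a small MLP connector and
# concatenated with the text token embeddings before the LLM forward pass.
```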
We built upon the [official LLaVA-NeXT framework](https://github.com/LLaVA-VL/LLaVA-NeXT) and trained using the [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) dataset to ensure a fair comparison with other vision-language models.
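For reference, here is a minimal sketch of pulling that training data from the Hugging Face Hub with the `datasets` library; the `train` split name and the record fields mentioned in the comments are assumptions, so consult the dataset card for the exact configuration.

```python
# Hypothetical sketch: download the LLaVA-NeXT-Data mixture from the Hub.
from datasets import load_dataset

# Assumption: the dataset exposes a "train" split; adjust per the dataset card.
data = load_dataset("lmms-lab/LLaVA-NeXT-Data", split="train")
print(len(data))       # number of training samples
print(data[0].keys())  # e.g. conversation turns and the associated image
```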
## Performance Comparison
Our MLCD vision encoders demonstrate significant improvements across various vision-language benchmarks when compared to other vision encoders: