Commit b720209

MLCD_VL (#136)
* mlcd_vl
1 parent c06c4b6 commit b720209

File tree: 3 files changed, +67 / -12 lines

- README.md
- mlcd_vl/README.md
- mlcd_vl/dockerfile
README.md

Lines changed: 1 addition & 1 deletion

@@ -48,7 +48,7 @@ The results of the ImageNet linear probe are as follows:
<a name="mlcd-embodied"></a>
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/MLCD-Embodied-7B)

-More details about MLCD-Embodied can be found in the [MLCD-Embodied.md](mlcd/MLCD_Embodied.md) file.
+More details about MLCD-Embodied can be found in the [MLCD-Embodied.md](mlcd_vl/README.md) file.


### 1. General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

mlcd_vl/README.md

Lines changed: 61 additions & 5 deletions

@@ -1,7 +1,42 @@
+# MLCD-LLaVA-NeXT: A Multimodal Model with Enhanced Vision Capabilities

-## Train MLCD-LLaVA-NeXT
+## Overview
+
+MLCD-LLaVA-NeXT is our implementation that integrates the MLCD vision encoder with the LLaVA-NeXT architecture. The model uses [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model and offers several variants of the MLCD vision tower, achieving strong performance across multiple vision-language benchmarks.
+
+We built upon the [official LLaVA-NeXT framework](https://github.com/LLaVA-VL/LLaVA-NeXT) and trained on the [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) dataset to ensure a fair comparison with other vision-language models.
+
+## Performance Comparison
+
+Our MLCD vision encoders demonstrate significant improvements over other vision encoders across a range of vision-language benchmarks:
+
+| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
+|:------------|:------:|:--------|:--------|:--------|:---------|:------|
+| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
+| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
+| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | **48.00** |
+| **[MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)** | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
+| **[MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)** | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
+| **[MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)** | ✓ | **73.80** | **83.34** | **46.59** | **582.00** | 46.00 |
+
+*Note: Bold values indicate the best performance for each benchmark.*
+
+### Key Highlights
+
+- **Best Performance**: MLCD (ViT-bigG-14-448px) achieves the best results on 4 of the 5 benchmarks among the compared vision towers
+- **RoPE2D Integration**: The larger ViT-bigG variants use 2D Rotary Position Embedding (RoPE2D) for improved spatial understanding (see the sketch below)
+
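For readers unfamiliar with RoPE2D, the sketch below gives the usual axial two-dimensional formulation. This is the standard construction; the exact variant implemented inside the MLCD vision towers is not spelled out in this commit and may differ.

```latex
% Standard axial RoPE2D (an assumption about the construction, not taken from this repo):
% split each query/key vector q in R^d at patch position (x, y) into two halves and
% apply the usual 1D rotary map with p = x on one half and p = y on the other.
\[
\tilde{q}_{(x,y)} \;=\; \bigl[\, R_{\Theta}(x)\, q^{(1)} \,;\; R_{\Theta}(y)\, q^{(2)} \,\bigr],
\qquad q = [\, q^{(1)}; q^{(2)} \,],\quad q^{(1)}, q^{(2)} \in \mathbb{R}^{d/2},
\]
\[
\bigl(R_{\Theta}(p)\, u\bigr)_{(2i,\,2i+1)} =
\begin{pmatrix} \cos p\theta_i & -\sin p\theta_i \\ \sin p\theta_i & \cos p\theta_i \end{pmatrix}
\begin{pmatrix} u_{2i} \\ u_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-4i/d} \ \text{(common default base)}.
\]
% As in 1D RoPE, attention scores then depend only on the relative offset (x_1 - x_2, y_1 - y_2).
```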
+## Getting Started
+
+
+### Prerequisites
+
+- NVIDIA GPUs with at least 80GB VRAM (recommended: A100 or H100)
+
+
+### Installation
+We provide a Docker environment to ensure reproducibility and ease of use:

-### 1. Installation

Clone this repository and navigate to the LLaVA folder:

@@ -20,7 +55,8 @@ docker run --gpus all \
--shm-size=64g -it train_mlcd_llava bash
```
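Once inside the container, a quick check that the GPUs are actually visible can save a failed training launch. This is a generic sanity check, not part of the repository's instructions:

```bash
# Generic GPU sanity check inside the container (not from the repo's docs).
nvidia-smi
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
```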

-### 2. Training
+
+### Training

**Stage 1: MLCD-LLaVA-NeXT Pretraining**
```bash
@@ -33,14 +69,34 @@ bash scripts/finetune_mlcd.sh
```


-### 3. Evaluation
-Install the evaluation tool and execute the evaluation script:
+### Evaluation
+We evaluate MLCD-LLaVA-NeXT with the `lmms-eval` framework to ensure a fair and comprehensive assessment:
+
```bash
pip install lmms-eval==0.2.0
bash eval.sh
```
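`lmms-eval` can also be invoked directly instead of through `eval.sh`. The sketch below is a hedged guess at such a call; the checkpoint path and task list are illustrative placeholders (task names vary between `lmms-eval` versions), not values taken from this repository:

```bash
# Hedged sketch of a direct lmms-eval call; the checkpoint path and task list are
# placeholders, not values confirmed by this repository or its eval.sh script.
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained=/path/to/mlcd-llava-next-checkpoint \
    --tasks chartqa,ocrbench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```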
---

+### Model Variants
+
+All MLCD vision tower variants are available on the Hugging Face Hub:
+
+- [MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336): Our base model with 336px resolution (no RoPE2D)
+- [MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336): Larger ViT-bigG architecture with RoPE2D at 336px
+- [MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448): Our flagship model, ViT-bigG with RoPE2D at 448px resolution
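The checkpoints above can be prefetched with the Hugging Face CLI. A minimal example, assuming a recent `huggingface_hub` is installed; the chosen variant is arbitrary:

```bash
# Prefetch one of the MLCD vision towers listed above into the local Hugging Face cache.
pip install -U huggingface_hub
huggingface-cli download DeepGlint-AI/mlcd-vit-bigG-patch14-448
```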
+
+
+### Citing MLCD
+```latex
+@inproceedings{anxiang_2024_mlcd,
+  title={Multi-label Cluster Discrimination for Visual Representation Learning},
+  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
+  booktitle={ECCV},
+  year={2024}
+}
+```
+

## MLCD-Embodied-7B 🤖

mlcd_vl/dockerfile

Lines changed: 5 additions & 6 deletions

@@ -20,15 +20,14 @@ RUN apt-get update && \
COPY requirements.txt .

# Install the packages
-# RUN pip install --no-cache-dir --upgrade pip && \
-#     pip install --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt

-# RUN pip install flash-attn --no-build-isolation

# Install the packages using Tencent Cloud mirror
-RUN pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com --upgrade pip && \
-    pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com -r requirements.txt
-#     pip install -i flash-attn --no-build-isolation
+# RUN pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com --upgrade pip && \
+#     pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com -r requirements.txt
+

# Install the packages using Alibaba Cloud mirror
# RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com --upgrade pip && \
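With this change the image installs requirements from the default PyPI index, with the Tencent and Alibaba mirror variants left commented out. Below is a hedged sketch of building and launching the container, assuming `mlcd_vl` as the build context and reusing the `train_mlcd_llava` tag from the `docker run` command shown earlier; the repository's installation section (elided in this diff) may prescribe a different command:

```bash
# Assumption: the dockerfile lives at mlcd_vl/dockerfile and requirements.txt is
# resolvable from this build context; adjust paths if the repository layout differs.
cd mlcd_vl
docker build -t train_mlcd_llava -f dockerfile .

# Launch an interactive training container (mirrors the run command shown in the diff).
docker run --gpus all --shm-size=64g -it train_mlcd_llava bash
```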
