Commit b720209

MLCD_VL (#136)
* mlcd_vl
1 parent c06c4b6 commit b720209

File tree: 3 files changed, +67 / -12 lines

- README.md
- mlcd_vl/README.md
- mlcd_vl/dockerfile
README.md

Lines changed: 1 addition & 1 deletion

@@ -48,7 +48,7 @@ The results of the ImageNet linear probe are as follows:
<a name="mlcd-embodied"></a>
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/MLCD-Embodied-7B)

-More details about MLCD-Embodied can be found in the [MLCD-Embodied.md](mlcd/MLCD_Embodied.md) file.
+More details about MLCD-Embodied can be found in the [MLCD-Embodied.md](mlcd_vl/README.md) file.


### 1. General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

mlcd_vl/README.md

Lines changed: 61 additions & 5 deletions

@@ -1,7 +1,42 @@
+# MLCD-LLaVA-NeXT: A Multimodal Model with Enhanced Vision Capabilities

-## Train MLCD-LLaVA-NeXT
+## Overview
+
+MLCD-LLaVA-NeXT is our implementation that integrates the MLCD vision encoder with the LLaVA-NeXT architecture. The model uses [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model and offers several variants of the MLCD vision tower, achieving strong performance across multiple vision-language benchmarks.
+
+We built upon the [official LLaVA-NeXT framework](https://github.com/LLaVA-VL/LLaVA-NeXT) and trained on the [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) dataset to ensure a fair comparison with other vision-language models.
+
+## Performance Comparison
+
+Our MLCD vision encoders demonstrate significant improvements over other vision encoders across a range of vision-language benchmarks:
+
+| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
+|:------------|:------:|:--------|:--------|:--------|:---------|:------|
+| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
+| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
+| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | **48.00** |
+| **[MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)** | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
+| **[MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)** | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
+| **[MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)** | ✓ | **73.80** | **83.34** | **46.59** | **582.00** | 46.00 |
+
+*Note: Bold values indicate the best performance for each benchmark.*
+
+### Key Highlights
+
+- **Best Performance**: MLCD (ViT-bigG-14-448px) achieves the best results on 4 of the 5 benchmarks among the compared vision towers
+- **RoPE2D Integration**: The larger ViT-bigG variants use 2D Rotary Position Embedding (RoPE2D) for improved spatial understanding (see the sketch below)
+
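For readers unfamiliar with RoPE2D, the sketch below gives the usual axial two-dimensional formulation. This is the standard construction; the exact variant implemented inside the MLCD vision towers is not spelled out in this commit and may differ.

```latex
% Standard axial RoPE2D (an assumption about the construction, not taken from this repo):
% split each query/key vector q in R^d at patch position (x, y) into two halves and
% apply the usual 1D rotary map with p = x on one half and p = y on the other.
\[
\tilde{q}_{(x,y)} \;=\; \bigl[\, R_{\Theta}(x)\, q^{(1)} \,;\; R_{\Theta}(y)\, q^{(2)} \,\bigr],
\qquad q = [\, q^{(1)}; q^{(2)} \,],\quad q^{(1)}, q^{(2)} \in \mathbb{R}^{d/2},
\]
\[
\bigl(R_{\Theta}(p)\, u\bigr)_{(2i,\,2i+1)} =
\begin{pmatrix} \cos p\theta_i & -\sin p\theta_i \\ \sin p\theta_i & \cos p\theta_i \end{pmatrix}
\begin{pmatrix} u_{2i} \\ u_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-4i/d} \ \text{(common default base)}.
\]
% As in 1D RoPE, attention scores then depend only on the relative offset (x_1 - x_2, y_1 - y_2).
```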
+## Getting Started
+
+
+### Prerequisites
+
+- NVIDIA GPUs with at least 80GB VRAM (recommended: A100 or H100)
+
+
+### Installation
+We provide a Docker environment to ensure reproducibility and ease of use:

-### 1. Installation

Clone this repository and navigate to the LLaVA folder:

@@ -20,7 +55,8 @@ docker run --gpus all \
--shm-size=64g -it train_mlcd_llava bash
```
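Once inside the container, a quick check that the GPUs are actually visible can save a failed training launch. This is a generic sanity check, not part of the repository's instructions:

```bash
# Generic GPU sanity check inside the container (not from the repo's docs).
nvidia-smi
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
```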

-### 2. Training
+
+### Training

**Stage 1: MLCD-LLaVA-NeXT Pretraining**
```bash
@@ -33,14 +69,34 @@ bash scripts/finetune_mlcd.sh
```


-### 3. Evaluation
-Install the evaluation tool and execute the evaluation script:
+### Evaluation
+We evaluate MLCD-LLaVA-NeXT with the `lmms-eval` framework to ensure a fair and comprehensive assessment:
+
```bash
pip install lmms-eval==0.2.0
bash eval.sh
```
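`lmms-eval` can also be invoked directly instead of through `eval.sh`. The sketch below is a hedged guess at such a call; the checkpoint path and task list are illustrative placeholders (task names vary between `lmms-eval` versions), not values taken from this repository:

```bash
# Hedged sketch of a direct lmms-eval call; the checkpoint path and task list are
# placeholders, not values confirmed by this repository or its eval.sh script.
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained=/path/to/mlcd-llava-next-checkpoint \
    --tasks chartqa,ocrbench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```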
---

+### Model Variants
+
+All MLCD vision tower variants are available on the Hugging Face Hub:
+
+- [MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336): Our base model with 336px resolution (no RoPE2D)
+- [MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336): Larger ViT-bigG architecture with RoPE2D at 336px
+- [MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448): Our flagship model, ViT-bigG with RoPE2D at 448px resolution
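The checkpoints above can be prefetched with the Hugging Face CLI. A minimal example, assuming a recent `huggingface_hub` is installed; the chosen variant is arbitrary:

```bash
# Prefetch one of the MLCD vision towers listed above into the local Hugging Face cache.
pip install -U huggingface_hub
huggingface-cli download DeepGlint-AI/mlcd-vit-bigG-patch14-448
```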
+
+
+### Citing MLCD
+```latex
+@inproceedings{anxiang_2024_mlcd,
+  title={Multi-label Cluster Discrimination for Visual Representation Learning},
+  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
+  booktitle={ECCV},
+  year={2024}
+}
+```
+

## MLCD-Embodied-7B 🤖

mlcd_vl/dockerfile

Lines changed: 5 additions & 6 deletions

@@ -20,15 +20,14 @@ RUN apt-get update && \
COPY requirements.txt .

# Install the packages
-# RUN pip install --no-cache-dir --upgrade pip && \
-#     pip install --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt

-# RUN pip install flash-attn --no-build-isolation

# Install the packages using Tencent Cloud mirror
-RUN pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com --upgrade pip && \
-    pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com -r requirements.txt
-#     pip install -i flash-attn --no-build-isolation
+# RUN pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com --upgrade pip && \
+#     pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple/ --trusted-host mirrors.cloud.tencent.com -r requirements.txt
+

# Install the packages using Alibaba Cloud mirror
# RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com --upgrade pip && \
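With this change the image installs requirements from the default PyPI index, with the Tencent and Alibaba mirror variants left commented out. Below is a hedged sketch of building and launching the container, assuming `mlcd_vl` as the build context and reusing the `train_mlcd_llava` tag from the `docker run` command shown earlier; the repository's installation section (elided in this diff) may prescribe a different command:

```bash
# Assumption: the dockerfile lives at mlcd_vl/dockerfile and requirements.txt is
# resolvable from this build context; adjust paths if the repository layout differs.
cd mlcd_vl
docker build -t train_mlcd_llava -f dockerfile .

# Launch an interactive training container (mirrors the run command shown in the diff).
docker run --gpus all --shm-size=64g -it train_mlcd_llava bash
```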
