From eedb3a337b46a2bc794c08bdc80fdddb6d16c92b Mon Sep 17 00:00:00 2001
From: ZhangShaolei <2512857469@qq.com>
Date: Fri, 10 Jan 2025 13:09:27 +0800
Subject: [PATCH] Add LLaVA-Mini

LLaVA-Mini is a unified large multimodal model that supports efficient
understanding of images, high-resolution images, and videos.

Paper: https://arxiv.org/abs/2501.03895
Code & Demo: https://github.com/ictnlp/LLaVA-Mini
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 2b6d237b..ae3f0ea2 100644
--- a/README.md
+++ b/README.md
@@ -94,6 +94,7 @@ A speech-to-speech dialogue model with both low-latency and high intelligence wh
 ## Multimodal Instruction Tuning
 | Title | Venue | Date | Code | Demo |
 |:--------|:--------:|:--------:|:--------:|:--------:|
+| ![Star](https://img.shields.io/github/stars/ICTNLP/LLaVA-Mini.svg?style=social&label=Star)<br>[**LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token**](https://arxiv.org/pdf/2501.03895)<br> | arXiv | 2025-01-03 | [Github](https://github.com/ictnlp/LLaVA-Mini) | Local Demo |
 | ![Star](https://img.shields.io/github/stars/VITA-MLLM/VITA.svg?style=social&label=Star)<br>[**VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**](https://arxiv.org/pdf/2501.01957)<br> | arXiv | 2025-01-03 | [Github](https://github.com/VITA-MLLM/VITA) | - |
 | ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2-VL.svg?style=social&label=Star)<br>[**QVQ: To See the World with Wisdom**](https://qwenlm.github.io/blog/qvq-72b-preview/)<br> | Qwen | 2024-12-25 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://qwenlm.github.io/blog/qvq-72b-preview/) |
 | ![Star](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-VL2.svg?style=social&label=Star)<br>[**DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**](https://arxiv.org/pdf/2412.10302)<br> | arXiv | 2024-12-13 | [Github](https://github.com/deepseek-ai/DeepSeek-VL2) | - |