diff --git a/README.md b/README.md
index f82bb53..0b4ca14 100644
--- a/README.md
+++ b/README.md
@@ -145,6 +145,7 @@ This is the first work to correct hallucination in multimodal large language mod
 | ![Star](https://img.shields.io/github/stars/inst-it/inst-it.svg?style=social&label=Star)<br>[**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https://arxiv.org/pdf/2412.03565) | arXiv | 2024-12-04 | [Github](https://github.com/inst-it/inst-it) | - |
 | ![Star](https://img.shields.io/github/stars/TimeMarker-LLM/TimeMarker.svg?style=social&label=Star)<br>[**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https://arxiv.org/pdf/2411.18211) | arXiv | 2024-11-27 | [Github](https://github.com/TimeMarker-LLM/TimeMarker/) | - |
 | ![Star](https://img.shields.io/github/stars/IDEA-Research/ChatRex.svg?style=social&label=Star)<br>[**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363) | arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo |
+| ![Star](https://img.shields.io/github/stars/ai4colonoscopy/IntelliScope.svg?style=social&label=Star)<br>[**[ColonGPT] Frontiers in Intelligent Colonoscopy**](https://arxiv.org/abs/2410.17241) | arXiv | 2024-10-22 | [Github](https://github.com/ai4colonoscopy/IntelliScope) | Local Demo |
 | ![Star](https://img.shields.io/github/stars/Vision-CAIR/LongVU.svg?style=social&label=Star)<br>[**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https://arxiv.org/pdf/2410.17434) | arXiv | 2024-10-22 | [Github](https://github.com/Vision-CAIR/LongVU) | [Demo](https://huggingface.co/spaces/Vision-CAIR/LongVU) |
 | ![Star](https://img.shields.io/github/stars/shikiw/Modality-Integration-Rate.svg?style=social&label=Star)<br>[**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https://arxiv.org/pdf/2410.07167) | arXiv | 2024-10-09 | [Github](https://github.com/shikiw/Modality-Integration-Rate) | - |
 | ![Star](https://img.shields.io/github/stars/rese1f/aurora.svg?style=social&label=Star)<br>[**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https://arxiv.org/pdf/2410.03051) | arXiv | 2024-10-04 | [Github](https://github.com/rese1f/aurora) | Local Demo |
@@ -595,6 +596,7 @@ This is the first work to correct hallucination in multimodal large language mod
 ## Datasets of Multimodal Instruction Tuning
 | Name | Paper | Link | Notes |
 |:-----|:-----:|:----:|:-----:|
+| **ColonINST** | [Frontiers in Intelligent Colonoscopy](https://arxiv.org/abs/2410.17241) | [Link](https://github.com/ai4colonoscopy/IntelliScope) | A medical multimodal instruction-tuning dataset (62 categories, 300K+ colonoscopy images, 450K+ tuning pairs) |
 | **Inst-IT Dataset** | [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://arxiv.org/pdf/2412.03565) | [Link](https://github.com/inst-it/inst-it) | An instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images |
 | **E.T. Instruct 164K** | [E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding](https://arxiv.org/pdf/2409.18111) | [Link](https://github.com/PolyU-ChenLab/ETBench) | An instruction-tuning dataset for time-sensitive video understanding |
 | **MSQA** | [Multi-modal Situated Reasoning in 3D Scenes](https://arxiv.org/pdf/2409.02389) | [Link](https://msr3d.github.io/) | A large scale dataset for multi-modal situated reasoning in 3D scenes |