Description:
To help users understand how multi-modal models process text along with other data types (e.g., images, audio), add a notebook that compares different multi-modal NLP techniques.
Tasks:
- Compare CLIP (Contrastive Language-Image Pretraining), BLIP, Flamingo, and OpenAI’s GPT-4V.
- Apply models to text-to-image retrieval, image captioning, and multi-modal reasoning tasks.
- Evaluate image captioning with BLEU and CIDEr, and text-to-image retrieval with retrieval precision metrics (e.g., precision@k).
- Summarize key takeaways for different applications.
- Name the notebook multi_modal_nlp_comparison.ipynb.
- Update the README file with relevant references.
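
To make the retrieval evaluation task concrete, here is a minimal sketch of how precision@k could be computed for text-to-image retrieval. It assumes the notebook already has a text embedding and a set of image embeddings from some model (for example CLIP); the toy vectors, the function names `cosine_similarity`, `rank_images`, and `precision_at_k`, and the choice of k are all hypothetical placeholders, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(image_embeddings)), key=lambda i: scores[i], reverse=True)


def precision_at_k(ranked_indices, relevant_indices, k):
    """Fraction of the top-k retrieved images that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_indices[:k]
    hits = sum(1 for i in top_k if i in relevant_indices)
    return hits / k


# Toy data (assumption): 2-D vectors standing in for real model embeddings.
text_embedding = [1.0, 0.0]
image_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

ranking = rank_images(text_embedding, image_embeddings)
# Images 0 and 2 are treated as the relevant ones for this query.
p_at_2 = precision_at_k(ranking, relevant_indices={0, 2}, k=2)
print(ranking, p_at_2)
```

In the notebook, the toy vectors would be replaced by embeddings from each compared model, so the same metric code can be reused across models.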