This repository provides an Image Caption Generation Model that combines computer vision and natural language processing to generate captions for images. The model uses a pre-trained Inception-ResNet-V2 for image feature extraction and a GRU-based architecture with Attention to generate meaningful captions from the extracted features.
- Pre-trained Feature Extractor: Utilizes Inception-ResNet-V2 for extracting rich image features.
- Attention Mechanism: A GRU-based decoder with an Attention mechanism to improve caption accuracy by focusing on relevant parts of the image.
- Text Processing: Captions are preprocessed by tokenizing them and adding special `<start>`/`<end>` tokens to preserve sentence structure (see the sketch after this list).
- Trained on COCO: The model is trained on the COCO Captions dataset, one of the most widely used datasets for image captioning tasks.
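
As a rough illustration of the text-processing step, captions can be wrapped in `<start>`/`<end>` markers and tokenized with Keras utilities. This is only a minimal sketch: the vocabulary size, filter string, and example captions below are illustrative assumptions, not the exact settings used in `model.py`.

```python
# Sketch of caption preprocessing: wrap each caption in <start>/<end>,
# fit a tokenizer (keeping < and > out of the filtered characters so the
# special tokens survive), and pad the sequences to a common length.
import tensorflow as tf

captions = [
    "a dog runs across the grass",
    "two people ride bicycles down a street",
]
captions = [f"<start> {c} <end>" for c in captions]

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=5000,                 # illustrative vocabulary cap
    oov_token="<unk>",
    filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~',
)
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")
print(padded.shape)  # (number of captions, longest caption length)
```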
- Image Feature Extraction: The model first extracts high-level features from images using Inception-ResNet-V2 (pre-trained on ImageNet), giving it a compact representation of the image content (see the encoder sketch after this list).
- Caption Generation: A GRU-based decoder generates the caption word by word from the image features. The Attention mechanism lets the decoder focus on the most relevant regions of the image at each step, which makes the generated captions more accurate (see the decoder sketch after this list).
- Training Process: The model is trained on the COCO Captions dataset, which pairs images with detailed human-written captions. Training adjusts the model so that it learns the correspondence between image features and their textual descriptions.
- Generating Captions: Once trained, the model can generate a caption for any input image. Generation begins with the `<start>` token and stops when the model emits the `<end>` token (see the decoding loop in the sketch after this list).
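
To make the feature-extraction step concrete, here is a minimal sketch of encoding an image with the pre-trained Inception-ResNet-V2 from `tf.keras.applications`. The file name, input size, and reshaping of the feature map into a grid of regions are assumptions about a typical setup, not necessarily what `model.py` does.

```python
# Sketch: extract image features with a pre-trained Inception-ResNet-V2.
# The classification head is dropped so the network returns a spatial
# feature map that an attention-based decoder can attend over.
import tensorflow as tf

def build_feature_extractor():
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet"
    )
    return tf.keras.Model(base.input, base.output)

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))  # typical input size for this backbone
    return tf.keras.applications.inception_resnet_v2.preprocess_input(img)

extractor = build_feature_extractor()
features = extractor(tf.expand_dims(load_image("example.jpg"), 0))  # placeholder path
# Output is roughly (1, 8, 8, 1536); flatten the spatial grid into 64 regions.
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
print(features.shape)  # (1, 64, 1536)
```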
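The decoding side can be sketched as a GRU combined with Bahdanau-style (additive) attention, plus a greedy loop that starts from `<start>` and stops at `<end>`. All layer sizes, class names, and the toy vocabulary below are placeholders chosen for illustration; the configuration in `model.py` may differ, and the untrained decoder here will of course produce arbitrary words.

```python
# Sketch: GRU decoder with Bahdanau-style attention, and a greedy decoding
# loop that begins at <start> and stops at <end>. Sizes are illustrative.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)           # attention over regions
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.attention = BahdanauAttention(units)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token, features, hidden):
        context, weights = self.attention(features, hidden)
        x = self.embedding(token)                          # (batch, 1, embed_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state, weights

def generate_caption(features, decoder, word_index, index_word, max_len=20):
    """Greedy decoding: feed <start>, keep the most likely word each step,
    and stop when <end> is produced or max_len is reached."""
    hidden = tf.zeros((1, 512))
    token = tf.constant([[word_index["<start>"]]])
    words = []
    for _ in range(max_len):
        logits, hidden, _ = decoder(token, features, hidden)
        next_id = int(tf.argmax(logits, axis=-1)[0])
        word = index_word.get(next_id, "<unk>")
        if word == "<end>":
            break
        words.append(word)
        token = tf.constant([[next_id]])
    return " ".join(words)

# Placeholder inputs so the sketch runs standalone: random "image features"
# and a toy vocabulary. In the real pipeline these come from the encoder
# and the fitted tokenizer.
vocab = ["<pad>", "<start>", "<end>", "<unk>", "a", "dog"]
word_index = {w: i for i, w in enumerate(vocab)}
index_word = {i: w for w, i in word_index.items()}
decoder = Decoder(vocab_size=len(vocab))
fake_features = tf.random.normal((1, 64, 1536))
print(generate_caption(fake_features, decoder, word_index, index_word))
```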
- Download the COCO Captions dataset for training (if not already downloaded).
- Run the `model.py` script to train the model. It will save the trained model to a file named `image_caption_model.h5`.
- Use the `predict_caption()` function to generate captions for new images. Simply provide the path to the image you'd like to caption (see the example below).
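
A quick usage example, assuming `predict_caption()` is importable from `model.py` and returns the caption as a string (both are assumptions about how the script is laid out; the image path is a placeholder):

```python
# Hypothetical usage of the trained model on a new image.
from model import predict_caption

caption = predict_caption("path/to/your_image.jpg")
print(caption)
```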
- Streamlit Integration: Currently, the model is implemented via a Python script. You can easily integrate a Streamlit UI to allow users to upload images and generate captions through a web interface.
- Fine-tuning: The model can be fine-tuned on a different dataset to improve performance for specific image domains, such as medical images or product images.
Feel free to fork this repository, make changes, and open pull requests! Contributions to improve the model or the documentation are always welcome.