This project generates descriptive captions for images using the Flickr8k dataset. The approach combines deep learning for image understanding with natural language processing for caption generation.
- data/: Contains the Flickr8k dataset images and captions.
- notebooks/: Jupyter notebooks used for data preprocessing, model training, and evaluation.
- models/: Trained models and checkpoints.
- src/: Source code for data loading, preprocessing, and model definition.
To run this project, you will need the following installed:
- Python 3.7+
- TensorFlow
- Keras
- NumPy
- Pandas
- Matplotlib
- scikit-learn
- NLTK
- tqdm
- OpenCV
You can install the required packages using the following command:
pip install -r requirements.txt

The Flickr8k dataset is used for this project. It consists of about 8,000 images and about 40,000 captions. Each image has five different captions, providing diverse descriptions.
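Because each image comes with five captions, a common first step is to group the captions by image. The sketch below assumes the tab-separated layout used by the original Flickr8k token file (`<image>.jpg#<index>`, a tab, then the caption); other distributions of the dataset use a CSV layout instead, so treat the format and the function name `parse_captions` as assumptions rather than a description of this project's loader.

```python
from collections import defaultdict


def parse_captions(lines):
    """Group captions by image file name.

    Assumes tab-separated lines of the form "<image>.jpg#<index>" + caption.
    """
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        image_tag, caption = line.split("\t", 1)
        image_name = image_tag.split("#", 1)[0]  # drop the "#0".."#4" suffix
        captions[image_name].append(caption.strip())
    return dict(captions)


# Illustrative sample lines (not taken from the dataset).
sample = [
    "dog.jpg#0\tA dog runs across the grass .",
    "dog.jpg#1\tA brown dog is running outside .",
    "kid.jpg#0\tA child climbs a set of stairs .",
]
grouped = parse_captions(sample)
print(len(grouped["dog.jpg"]))  # 2
```

In practice the `lines` argument would be the opened captions file from the data/ directory.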
The data preprocessing involves:
- Loading Images: Images are loaded and resized to a fixed size.
- Loading Captions: Captions are loaded and tokenized.
- Data Augmentation: Images are augmented using random flips, rotations, and contrast adjustments.
- Text Vectorization: Captions are vectorized using a custom standardization function.
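The text side of these steps can be sketched in plain Python. The example below is an illustration under assumptions, not the notebook's actual code: the `<start>`/`<end>` markers, the `[UNK]` token, the vocabulary size, and the helper names are all hypothetical. In a Keras pipeline, a function like `standardize` would typically be passed as the `standardize` argument of the `TextVectorization` layer.

```python
import re
import string
from collections import Counter

# Keep "<" and ">" so the hypothetical "<start>"/"<end>" markers survive.
STRIP_CHARS = string.punctuation.replace("<", "").replace(">", "")


def standardize(caption):
    """Lowercase, strip punctuation, and wrap in start/end markers."""
    caption = caption.lower()
    caption = re.sub(f"[{re.escape(STRIP_CHARS)}]", "", caption)
    return "<start> " + " ".join(caption.split()) + " <end>"


def build_vocab(captions, max_tokens=10000):
    """Map the most frequent tokens to ids; 0 is padding, 1 is unknown."""
    counts = Counter(tok for c in captions for tok in standardize(c).split())
    vocab = {"": 0, "[UNK]": 1}
    for token, _ in counts.most_common(max_tokens - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def vectorize(caption, vocab, seq_len=8):
    """Turn a caption into a fixed-length list of token ids."""
    ids = [vocab.get(tok, 1) for tok in standardize(caption).split()]
    ids = ids[:seq_len]  # truncate long captions
    return ids + [0] * (seq_len - len(ids))  # pad short captions with 0


vocab = build_vocab(["A dog runs.", "A dog sleeps."])
print(standardize("A dog runs!"))  # <start> a dog runs <end>
print(vectorize("A cat runs.", vocab))
```

The word "cat" is not in the vocabulary, so it maps to the unknown id 1, and the trailing positions are filled with the padding id 0.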
The model combines a Convolutional Neural Network (CNN), which encodes image features, with a Recurrent Neural Network (RNN), which generates the caption text. Key steps include:
- Image Feature Extraction: Using a pre-trained CNN (e.g., InceptionV3) to extract features from images.
- Sequence Modeling: Using an RNN (e.g., LSTM) to generate captions based on the image features.
- Training: The model is trained with a custom loss function that combines categorical cross-entropy and BLEU scores.
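Once trained, a CNN plus RNN captioner usually produces text one token at a time, feeding each predicted word back into the decoder. The sketch below shows greedy decoding, the simplest such strategy. It is an assumption about how captions are generated, not code from this project: `next_token_probs` is a hypothetical stand-in for the trained decoder, and the toy vocabulary exists only to make the example runnable.

```python
def greedy_decode(next_token_probs, image_features, index_to_word,
                  start_id, end_id, max_len=20):
    """Generate a caption token by token, always taking the likeliest word.

    next_token_probs(image_features, token_ids) is a hypothetical stand-in
    for the trained decoder: it returns one probability per vocabulary entry.
    """
    token_ids = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(image_features, token_ids)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == end_id:
            break  # the decoder signalled the end of the caption
        token_ids.append(next_id)
    return " ".join(index_to_word[i] for i in token_ids[1:])


# Toy stand-in decoder that ignores the image and walks a fixed sequence.
VOCAB = ["<start>", "<end>", "a", "dog", "runs"]
SCRIPT = {0: 2, 2: 3, 3: 4, 4: 1}  # previous token id -> next token id


def toy_decoder(image_features, token_ids):
    probs = [0.0] * len(VOCAB)
    probs[SCRIPT[token_ids[-1]]] = 1.0
    return probs


caption = greedy_decode(toy_decoder, image_features=None,
                        index_to_word=VOCAB, start_id=0, end_id=1)
print(caption)  # a dog runs
```

With a real model, `image_features` would be the vector produced by the pre-trained CNN, and `next_token_probs` would wrap the decoder's prediction call.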
The model's performance is evaluated using BLEU scores. BLEU (Bilingual Evaluation Understudy) measures how closely a generated caption matches the reference captions by comparing overlapping n-grams, with a penalty for captions that are too short.
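To make the metric concrete, here is a minimal sketch of sentence-level BLEU with uniform weights and no smoothing. It is meant for illustration only; the function names are hypothetical, and for real evaluation NLTK's `nltk.translate.bleu_score.sentence_bleu` or `corpus_bleu` would be the more practical choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


candidate = "a dog runs on grass".split()
references = ["a dog runs on the grass".split(),
              "a brown dog is running".split()]
score = bleu(candidate, references)
print(round(score, 3))  # 0.866
```

Here all five unigrams match a reference and three of the four bigrams do, so the score is the geometric mean of 1.0 and 0.75. The brevity penalty is 1 because a reference of the same length exists.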
To run the notebook and train the model, execute the following command in the notebooks directory:
jupyter notebook flickr.ipynb

Ensure that the dataset is placed in the data/ directory, and the notebook has access to the required resources.
The model achieves competitive BLEU scores on the validation and test sets, demonstrating its ability to generate coherent and relevant captions for images.
This project is based on various research papers and open-source projects in the field of image captioning and deep learning. Special thanks to the authors of these works for their valuable contributions.
This project is licensed under the MIT License. See the LICENSE file for more details.