This project implements an image captioning model using a CNN-LSTM architecture. The model takes an image as input and generates a descriptive caption using natural language processing techniques. It is trained on a dataset containing images and their corresponding textual descriptions.
- The model is trained on the Flickr8k dataset.
- Flickr8k consists of 8,000 images, each paired with multiple reference captions.
To augment the training data and improve generalization, images were horizontally flipped (a minimal sketch is shown below).
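The exact augmentation code lives in the training pipeline; the snippet below is only a sketch of random horizontal flipping, assuming a TensorFlow/Keras pipeline (consistent with the Xception backbone used later).

```python
# Sketch of the flip augmentation (assumption: TensorFlow/Keras pipeline).
import tensorflow as tf

def augment(image: tf.Tensor) -> tf.Tensor:
    """Randomly mirror an image left-right with 50% probability."""
    return tf.image.random_flip_left_right(image)

# Example: applied on-the-fly inside a tf.data input pipeline.
# dataset = dataset.map(lambda img, caption: (augment(img), caption))
```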
The model consists of three main components (a minimal architecture sketch follows this list):
- Image Feature Extractor (CNN)
  - Uses Xception to extract features from images.
- Sequence Processor (LSTM)
  - An embedding layer processes input text sequences.
  - An LSTM network learns dependencies between words in a sentence.
- Decoder (Dense Layer with Softmax)
  - Combines image features and text sequences.
  - Generates the next word in the caption.
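The saved model file is the authoritative reference for the exact layers; the sketch below shows one common way to wire such a merge-style CNN-LSTM captioner in Keras. The 256-unit sizes, `vocab_size`, and `max_length` values are assumptions, not the project's actual hyperparameters.

```python
# Sketch of a merge-style CNN-LSTM decoder (layer sizes are assumptions;
# inspect the saved model in Netron for the real architecture).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7577   # hypothetical vocabulary size
max_length = 34     # hypothetical maximum caption length

# Image branch: 2048-dim Xception feature vector -> dense projection.
img_input = Input(shape=(2048,))
img_features = Dropout(0.5)(img_input)
img_features = Dense(256, activation="relu")(img_features)

# Text branch: word indices -> embedding -> LSTM.
txt_input = Input(shape=(max_length,))
txt_features = Embedding(vocab_size, 256, mask_zero=True)(txt_input)
txt_features = Dropout(0.5)(txt_features)
txt_features = LSTM(256)(txt_features)

# Decoder: merge both branches and predict the next word.
decoder = add([img_features, txt_features])
decoder = Dense(256, activation="relu")(decoder)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```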
To inspect the model architecture in detail, upload the saved model file to Netron.
The model is evaluated using the following metrics:
📌 BLEU-1: 0.6131
📌 BLEU-2: 0.5453
📌 BLEU-3: 0.4483
📌 BLEU-4: 0.3635
📌 ROUGE-L: 0.3314
📌 CIDEr: 0.0497
📌 SPICE: 0.0451
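The reported scores come from the scripts in `evaluation/`; for reference, a minimal sketch of computing corpus-level BLEU with NLTK is shown below (the `references` and `hypotheses` values are hypothetical placeholders).

```python
# Illustration only: corpus-level BLEU with NLTK on hypothetical data.
from nltk.translate.bleu_score import corpus_bleu

# One list of reference token lists per image, one generated token list per image.
references = [[["a", "man", "swims", "in", "the", "water"]]]
hypotheses = [["man", "in", "the", "water"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.4f}  BLEU-4: {bleu4:.4f}")
```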
```bash
git clone https://github.com/yourusername/image-captioning.git
cd image-captioning
pip install -r requirements.txt
mkdir data
```
```bash
python utils/preprocess.py
python utils/feature_extract.py
python utils/data_loader.py
```

You can also use pretrained weights.
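What `utils/feature_extract.py` does internally is project-specific; a minimal sketch of extracting a single 2048-dim Xception feature vector per image (an assumption consistent with the architecture above) looks like this. The file path is hypothetical.

```python
# Sketch of Xception feature extraction (assumed behaviour of
# utils/feature_extract.py; the image path is hypothetical).
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# pooling="avg" yields one 2048-dim vector per image instead of a feature map.
extractor = Xception(include_top=False, pooling="avg")

def extract_features(image_path: str) -> np.ndarray:
    image = load_img(image_path, target_size=(299, 299))  # Xception input size
    array = preprocess_input(img_to_array(image))
    return extractor.predict(array[np.newaxis, ...], verbose=0)[0]

# features = extract_features("data/example.jpg")
```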
Train the model:

```bash
python train.py
```

To test the model with your own images:

```bash
python test.py --image_path path/to/image.jpg
```
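`test.py` handles loading the trained model and tokenizer; the core decoding step in this kind of captioner is typically a greedy loop like the sketch below. The `model`, `tokenizer`, `photo_features`, `max_length`, and the `startseq`/`endseq` tokens are assumptions, not verified names from the repository.

```python
# Greedy decoding sketch (hypothetical placeholders for the project's objects).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    """Generate a caption word by word, feeding each prediction back in.

    photo_features is expected to have shape (1, 2048).
    """
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_features, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```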
Run the Streamlit interface for uploading images and generating captions:

```bash
streamlit run Streamlit.py
```
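`Streamlit.py` in the repository is the actual app; the snippet below is only a sketch of how such an interface is typically wired, where `extract_features` and `generate_caption` are the hypothetical helpers sketched earlier.

```python
# Minimal Streamlit sketch (Streamlit.py in the repo is the real app;
# extract_features and generate_caption are hypothetical helpers).
import streamlit as st
from PIL import Image

st.title("Image Captioning Demo")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Uploaded image")
    # features = extract_features_from_pil(image)
    # st.write(generate_caption(model, tokenizer, features, max_length))
```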
Evaluate the model with NLP metrics commonly used for image captioning:

```bash
python evaluation/test_cap.py
python evaluation/evaluation.py
```

Example output from the model:
| Input Image | ![]() |
|---|---|
| Generated Caption | "man in the water" |
🔹 Train on a larger dataset for improved generalization.
🔹 Experiment with Transformer-based models (e.g., ViT + GPT-2, BLIP).
👤 Aditya Nikam, student at IIT Kanpur. Contact: [email protected] / [email protected]
