- The goal of this project is to build a Deep Learning model that generates textual descriptions for given photographs, combining techniques from both Computer Vision and Natural Language Processing.
- ResNet50 weights pre-trained on the ImageNet dataset are used to extract features from the training images. These ResNet50 features are then combined with an LSTM decoder into a single network to train the caption generator.
- The dataset used for this project can be downloaded from Kaggle.
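The combined architecture can be sketched as a Keras "merge" model: one branch takes the pooled ResNet50 feature vector, the other runs the partial caption through an embedding and an LSTM, and the two are added before predicting the next word. This is a minimal illustrative sketch; the layer sizes (256 units, 0.5 dropout) and exact topology are assumptions, not the repository's configuration.

```python
# Hypothetical sketch of the ResNet50-encoder + LSTM-decoder captioning model.
# Layer sizes and dropout rates are assumptions, not this project's exact setup.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model


def build_caption_model(vocab_size, max_length, feature_dim=2048):
    # Image branch: ResNet50 pooled features (2048-d) -> dense projection
    img_in = Input(shape=(feature_dim,))
    img = Dropout(0.5)(img_in)
    img = Dense(256, activation="relu")(img)

    # Text branch: partial caption -> embedding -> LSTM
    txt_in = Input(shape=(max_length,))
    txt = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt = Dropout(0.5)(txt)
    txt = LSTM(256)(txt)

    # Merge both branches and predict a distribution over the next word
    merged = add([img, txt])
    merged = Dense(256, activation="relu")(merged)
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

At inference time such a model is called repeatedly, feeding back each predicted word until an end token is produced.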
In order to execute the notebook `test_model_image_generators.ipynb` or the Python script `test_model_image_generators.py`, several preprocessing stages must be conducted and the model must be trained in advance. The folder `/preprocess` contains Python scripts with the functions described below:
- `extract_image_features.py`: extracts features from the training dataset with a fine-tuned ResNet50 using pre-trained ImageNet weights. All extracted features are saved as the pickle file `/preprocess/features.pkl` for later use (the file is not uploaded since it is too heavy).
- `generate_tokenizer.py`: generates a tokenizer saved as the pickle file `/preprocess/tokenizer.pkl`.
- `preprocess_text_data.py`: preprocesses the descriptions of every training image and saves them as `/preprocess/descriptions.txt`.
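The artifacts produced by these scripts might be organized as sketched below: a pickled dict mapping image ids to feature vectors, and a plain-text file with one image-id/caption pair per line. The ids, shapes, and file layout here are illustrative assumptions, not the scripts' guaranteed output format.

```python
# Illustrative sketch of the preprocessing artifacts; the keys, vector sizes,
# and line format are assumptions about how the pickle/text files are laid out.
import pickle

# features.pkl: assumed to map image id -> extracted feature vector
features = {"img_001": [0.12, 0.98, 0.33],
            "img_002": [0.45, 0.07, 0.81]}
with open("features.pkl", "wb") as f:
    pickle.dump(features, f)

# descriptions.txt: assumed "image_id caption" layout, one pair per line
pairs = [("img_001", "a dog runs on grass"),
         ("img_002", "two people ride bikes")]
with open("descriptions.txt", "w") as f:
    for img_id, caption in pairs:
        f.write(f"{img_id} {caption}\n")

# Reload the features the same way train.py would
with open("features.pkl", "rb") as f:
    reloaded = pickle.load(f)
```

Storing features once in a pickle avoids re-running the ResNet50 forward pass on every training epoch.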
After obtaining the three files `features.pkl`, `tokenizer.pkl`, and `descriptions.txt` in the folder `/preprocess`, run `train.py` to start training the model. Note that the training process consumes a lot of time and computational power (a GPU and >8 GB of memory are required). After training finishes, the entire model and its weights are saved as the file `image_captioning_model.h5` (this file is not uploaded since it is too heavy).
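Once the model is trained and saved, captions are typically produced by greedy decoding: start from a start token, repeatedly predict the next word, and stop at an end token or the maximum length. The sketch below is a hedged illustration of that loop; the `startseq`/`endseq` tokens and the shape of `predict_fn` are assumptions about the training setup, not this repository's exact interface.

```python
# Hedged sketch of greedy caption decoding with a trained captioning model.
# predict_fn(photo_features, padded_sequence) is assumed to return next-word
# probabilities; "startseq"/"endseq" token names are assumptions.
import numpy as np


def generate_caption(predict_fn, word_to_id, id_to_word,
                     photo_features, max_length):
    """Greedily decode a caption, one predicted word at a time."""
    caption = ["startseq"]
    for _ in range(max_length):
        # Encode the partial caption and right-pad with zeros to max_length
        seq = [word_to_id[w] for w in caption]
        seq = seq + [0] * (max_length - len(seq))
        probs = predict_fn(photo_features, np.array([seq]))
        word = id_to_word[int(np.argmax(probs))]
        caption.append(word)
        if word == "endseq":
            break
    # Strip the start token (and the end token, if reached) before returning
    body = caption[1:-1] if caption[-1] == "endseq" else caption[1:]
    return " ".join(body)
```

With the real model, `predict_fn` would wrap `model.predict` on the saved `image_captioning_model.h5` and the vocabularies would come from `tokenizer.pkl`.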
