This project develops the Char-CNN-RNN model on our own dataset and uses it to embed our textual data.
pip3 install torch==2.4.0 torchvision==0.19.0 pillow==10.4.0 tqdm==4.66.5
Dataset used in the project:
First, you need to create a dataset with txt and img files. The dataset structure should be in the following format.
Note
The dataset can be single-class or multi-class, and images can be .jpg, .png, or .jpeg.
dataset/
├── text/
│ ├── class1/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
│ ├── class2/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
│ ├── class3/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
└── images/
├── class1/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
├── class2/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
├── class3/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
Image data was prepared as stated in Section 5 of the paper "Learning Deep Representations of Fine-grained Visual Descriptions".
Each image is cropped at the top-left, bottom-left, top-right, bottom-right, and center; each of these five crops is also taken from a horizontally flipped copy, yielding 10 crops per image.
These crops are then converted to 1024-dimensional feature vectors using GoogleNet. Processing every crop of every image produces a .t7 file of size 60 (number of images) x 1024 (feature-vector size) x 10 (number of crops) for each class.
To preprocess images, enter the path of your image folder on line 72 in the img2t7.py file, and then run it.
python3 img2t7.py
Text data was prepared as specified in Section 5 of the paper "Learning Deep Representations of Fine-grained Visual Descriptions".
Each .txt file contains 10 lines and is read line by line. Each line is processed to exactly 201 characters: longer lines are truncated, and shorter lines are zero-padded.
Each character is assigned a numerical value, converting the character data into numerical form. All .txt files of a class are combined into a single .t7 file of size 60 (number of txt files) x 201 (character count) x 10 (line count).
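The truncate/pad-to-201 and character-to-integer steps can be sketched as follows. The alphabet and index scheme here are assumptions; txt2t7.py defines its own vocabulary.

```python
import numpy as np

# Assumed character vocabulary; index 0 is reserved for padding/unknown
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{} "
CHAR2IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}

def encode_line(line, max_len=201):
    """Map a line to a fixed-length vector of character indices."""
    ids = [CHAR2IDX.get(c, 0) for c in line.lower()[:max_len]]   # truncate
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int64)  # zero-pad

vec = encode_line("This bird has a red head.")
print(vec.shape)  # (201,)
```

Encoding all 10 lines of each of the 60 files of a class and stacking the results gives the 60 x 201 x 10 array described above.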
To preprocess text files, enter the path of your text folder on line 49 in the txt2t7.py file, and then run it.
python3 txt2t7.py
Once the image and text files are prepared, each class should have a .t7 file.
The folder structure should be as follows:
dataset/
├── text/
│ ├── class1/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
│ ├── class2/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
│ ├── class3/
│ │ ├── file1.txt
│ │ ├── file2.txt
│ │ └── ...
│ ├── class1.t7
│ ├── class2.t7
│ ├── class3.t7
└── images/
├── class1/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
├── class2/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
├── class3/
│ ├── file1.jpg
│ ├── file2.jpg
│ └── ...
├── class1.t7
├── class2.t7
├── class3.t7
To train the model, run the command below in the project folder, entering the path of your dataset in data_dir.
Note
For a multi-class model, line 41 in the sje_train.py file should read MultimodalDataset. For a single-class model, change line 41 to SinglemodalDataset.
python3 sje_train.py --seed 123 --use_gpu True --dataset birds --model_type cvpr --data_dir "file path" --train_split trainval --learning_rate 0.0007 --symmetric True --epochs 200 --checkpoint_dir ckpt --save_file sje_cub_c10_hybrid
After training, your model will be saved in the ckpt folder. To test it, run the command below, adding the dataset path to data_dir and the trained-model path to model_path.
python3 sje_eval.py --seed 123 --use_gpu True --dataset birds --model_type cvpr --data_dir "file path" --eval_split test --num_txts_eval 0 --print_class_stats True --batch_size 40 --model_path "file path"
Note
Download pre-trained models
If you see the error _pickle.UnpicklingError: invalid load key, '\x03' (or a similar problem) when loading pre-trained models, the file is most likely in the legacy Torch7 format: install the torchfile module and use torchfile.load() instead of torch.load().
To embed text with the trained model, enter the model path in model_path on line 42 of the Text_embedding.py file. Then enter the path of the txt files you want to embed in root_dir, and run it.
python3 Text_embedding.py