Hello, Vladimir.
First of all, congratulations on such a fantastic project. I was introduced to this work through the many papers that cite it and build upon it. I enjoyed your video presentation, and I think you are doing a very good job of keeping up with all the repo issues.
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
./sample/best_cap_model.pt
[{'start': 0.0, 'end': 148.1, 'sentence': ' unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk '}, ...]
(output truncated: every entry in the list is identical to the one shown)
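A quick way to confirm that every proposal is affected is to summarize the printed predictions. The snippet below is only a sketch: the helper name summarize_predictions and the "unk" token string are assumptions for illustration and are not part of BMT. It expects a list of dicts with the 'start', 'end' and 'sentence' keys seen in the output above.

```python
from collections import Counter
from typing import Dict, List, Union

Prediction = Dict[str, Union[float, str]]


def summarize_predictions(
    predictions: List[Prediction], unk_token: str = "unk"
) -> Dict[str, float]:
    """Count distinct segments and the share of unknown tokens in the captions.

    `unk_token` is an assumption based on the printed output, not a BMT constant.
    """
    # Each (start, end) pair identifies one temporal segment.
    segments = Counter((p["start"], p["end"]) for p in predictions)
    # Whitespace tokenization is enough for a rough diagnostic.
    tokens = [tok for p in predictions for tok in str(p["sentence"]).split()]
    unk_share = (
        sum(tok == unk_token for tok in tokens) / len(tokens) if tokens else 0.0
    )
    return {
        "num_predictions": len(predictions),
        "num_distinct_segments": len(segments),
        "unk_share": unk_share,
    }
```

A single distinct segment combined with an unk share of 1.0 would suggest that both the proposals and the captions have collapsed, rather than only the vocabulary lookup failing.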
I ran the sample code single_video_prediction.py on the given example (women_long_jump.mp4) without major issues (I had to change the CUDA and PyTorch versions in the conda environment, as reported in #45).
However, when I tried the code on a custom video, let's call it my_video.mp4, I got some errors. VGGish was unable to extract a .wav file from the audio because it had no aac codec (I checked with ffprobe my_video.mp4 and the audio used the opus codec instead of aac). So I replaced two lines in BMT/submodules/video_features/models/vggish/utils/utils.py with the following, which resolved the issue:
After obtaining the i3d and vggish features, I tried running BMT on the video using the following command:
Obtaining:
Seeing that it was iterating over a 0-d tensor, I tried removing the NMS and ran it again with:
This produced a list of sentences made up only of the token "UNK" (the output shown above).
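For reference, the codec check and the direct .wav extraction described above could be sketched roughly as follows. This is not the code from video_features: the helper names (probe_audio_codec_cmd, extract_wav_cmd) and the 16 kHz mono PCM settings are assumptions for illustration. The command lines use only standard ffprobe and ffmpeg options.

```python
import subprocess
from typing import List


def probe_audio_codec_cmd(video_path: str) -> List[str]:
    """Build an ffprobe command that prints the codec of the first audio stream."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",              # first audio stream only
        "-show_entries", "stream=codec_name",  # print just the codec name
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]


def extract_wav_cmd(video_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that decodes the audio track straight to a .wav file.

    The mono / 16 kHz / 16-bit PCM settings are assumptions, not values from the repo.
    """
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        "-ac", "1",              # mono
        "-ar", str(sample_rate),
        wav_path,
    ]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Example (requires ffmpeg/ffprobe on PATH and an existing file):
#   print(run(probe_audio_codec_cmd("my_video.mp4")))   # e.g. the codec name
#   run(extract_wav_cmd("my_video.mp4", "my_video.wav"))
```

Because ffmpeg decodes whatever audio codec the container holds, going straight to .wav avoids depending on the input being aac.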
I am a bit at a loss here, as I do not have much experience working with text and audio (only with image and video). Could you point me in the right direction? I am unsure what the root cause might be. I suspect it could be one of the following:
- Using torch 1.4.0 instead of 1.2.0, as it was the closest version that could work with my GPU. I kept torchtext at version 0.3.1 (the same as yours). However, the code works for the example video you provide, so it seems unlikely that this is the root cause.
- Extracting the .wav file directly from the .mp4, skipping the intermediate step of obtaining an .aac file. I do not see any drawback in doing so; in fact, it seems like a more portable option. However, I remain unsure whether you did this for a specific reason I am unaware of.
Desktop (please complete the following information):
Your conda environment
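The package versions mentioned above (torch, torchtext) can be collected from inside the active environment with a small helper. This is a sketch only: installed_versions is a hypothetical name, and importlib.metadata is assumed to be available, which requires Python 3.8 or newer.

```python
from importlib import metadata  # standard library since Python 3.8
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            versions[name] = None
    return versions


# Example:
#   print(installed_versions(["torch", "torchtext"]))
```

Posting this output alongside the issue makes it easier to compare the environment with the one the repository was tested on.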