This is very meaningful work. I have one question: why did you choose to train the caption module on a subset? Thank you in advance for your response.