I tested the robustness of the VIMA model for various words.
For example, I modified this task
Put the {dragged_texture} object in {scene} into the {base_texture} object.
into
jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.
which is not making any sense for human.
I expected the model not to perform well, however, the success rate was almost 100%
I need further investigation but I think this model only sees images, overfitted for only images.