Hi, solid work! I have a question and would appreciate your guidance. We retrained Octo on data from a new simulation environment. With text as the goal, the success rate reaches 60%, but with an image as the goal it drops to 0%. Is this a limitation of the Octo model itself? The paper mentions that image goals should perform better than text goals, so I am confused by this discrepancy.