- We found persistent <b>intention-action gaps</b> in all VLAs. While they often recognize what to do when faced with out-of-distribution objects or instructions, potentially helped by the underlying VLM trained on internet-scale data, their execution accruacy drops significantly.
344 + All VLAs exhibit persistent <b>intention-action gaps</b>. They correctly interpret out-of-distribution objects or instructions, thanks to their pretrained VLM, but their execution accuracy still falls sharply.
- We found persistent <b>intention-action gaps</b> in all VLAs. While they often recognize what to do when faced with out-of-distribution objects or instructions, potentially helped by the underlying VLM trained on internet-scale data, their execution accruacy drops significantly.`;
367 + All VLAs exhibit persistent <b>intention-action gaps</b>. They correctly interpret out-of-distribution objects or instructions, thanks to their pretrained VLM, but their execution accuracy still falls sharply.`;
When faced with out-of-distribution objects, VLAs shows <b>robustness with intention</b>: They still knows which item to approach. However, they <b>struggle with execution</b> as the grasping often falls short.
376 - <br>
377 - Interestingly, we noticed that when the source object remains unchanged and the target object is changed to something with similar size and shape, which should not increase task difficulty, the grasp and task success rate can still drop significantly. Our hypothsis is that the end-to-end nature of VLAs may be the reason.`;
376 + <br><br>
377 + Interestingly, even when the source is unchanged and the target is swapped for one of similar size and shape, which shouldn't raise difficulty, the grasp and task success rate can still fall sharply. We hypothesize this stems from the end-to-end nature of VLAs.`;
- While the underlying VLMs shows robustness with language complexity, the VLAs shows <b>significant performance drop</b> when faced with complex language instructions.
386 - <br>Magma, which employs joint vision-language co-training, appears to be the most robust, suggesting that its training recipe may help VLAs better preserve the advanced language capability of their underlying VLMs`;
385 + VLAs suffer a <b>significant performance drop</b> on complex language instructions, even though their underlying VLMs handle such complexity well.
386 + <br><br>
387 + Magma, using joint vision-language co-training, appears relatively robust, suggesting this approach helps VLAs retain their VLM's advanced linguistic capabilities.`;
- While the underlying VLMs often exhibit strong vision-language reasoning, we observed that VLAs can <b>struggle</b> with <b>commonsense</b> and <b>visual-language thinking</b>, especially when <b>coupled with distractor objects</b>.
394 - <br> In the example below, VLAs fail to distinguish between <b>orange juice</b> and <b>orange</b> when both are present, despite the fact that when only one of them is presented it can consistently succeed and the underlying VLM can also easily recognize the difference.`;
394 + Although the underlying VLMs often demonstrate strong vision-language reasoning, we observed that VLAs <b>struggle</b> with <b>commonsense</b> and <b>visual-language thinking</b>, especially in the presence of <b>distractor objects</b>.<br>
395 + <br>
396 + For example, when both <b>orange juice</b> and <b>orange</b> appear together, VLAs frequently confuse them, even though they succeed reliably when only one is present and the underlying VLM can easily tell them apart.`;