
Commit 1986fc5

website typo fixed
1 parent 731a7a4 commit 1986fc5

1 file changed: 10 additions, 8 deletions

index.html

Lines changed: 10 additions & 8 deletions
@@ -341,7 +341,7 @@ <h2 class="title is-3 has-text-centered">Benchmarking Results</h2>
 
   <div class="content has-text-justified">
     <p id="text-content">
-     We found persistent <b>intention-action gaps</b> in all VLAs. While they often recognize what to do when faced with out-of-distribution objects or instructions, potentially helped by the underlying VLM trained on internet-scale data, their execution accruacy drops significantly.
+     All VLAs exhibit persistent <b>intention-action gaps</b>. They correctly interpret out-of-distribution objects or instructions, thanks to their pretrained VLM, but their execution accuracy still falls sharply.
     </p>
     <center>
       <img id="image-content" src="./static/images/radarmap.png" style="width: 62vw;" />
@@ -364,7 +364,7 @@ <h2 class="title is-3 has-text-centered">Benchmarking Results</h2>
 
   if (buttonNumber === 1) {
     textElement.innerHTML = `
-     We found persistent <b>intention-action gaps</b> in all VLAs. While they often recognize what to do when faced with out-of-distribution objects or instructions, potentially helped by the underlying VLM trained on internet-scale data, their execution accruacy drops significantly.`;
+     All VLAs exhibit persistent <b>intention-action gaps</b>. They correctly interpret out-of-distribution objects or instructions, thanks to their pretrained VLM, but their execution accuracy still falls sharply.`;
     imageElement.classList.remove('hidden');
     imageElement.src = "./static/images/radarmap.png";
     imageElement.style.width = "62vw";
@@ -373,25 +373,27 @@ <h2 class="title is-3 has-text-centered">Benchmarking Results</h2>
   } else if (buttonNumber === 2) {
     textElement.innerHTML = `
       When faced with out-of-distribution objects, VLAs shows <b>robustness with intention</b>: They still knows which item to approach. However, they <b>struggle with execution</b> as the grasping often falls short.
-      <br>
-      Interestingly, we noticed that when the source object remains unchanged and the target object is changed to something with similar size and shape, which should not increase task difficulty, the grasp and task success rate can still drop significantly. Our hypothsis is that the end-to-end nature of VLAs may be the reason.`;
+      <br><br>
+      Interestingly, even when the source is unchanged and the target is swapped for one of similar size and shape, which shouldn't raise difficulty, the grasp and task success rate can still fall sharply. We hypothesize this stems from the end-to-end nature of VLAs.`;
     imageElement.classList.remove('hidden');
     imageElement.src = "./static/images/ood_objects.svg";
     // imageElement.style.width = "20vw";
     // imageElement.style.height = "auto";
     // imageElement.alt = "Image 2";
   } else if (buttonNumber === 3) {
     textElement.innerHTML = `
-      While the underlying VLMs shows robustness with language complexity, the VLAs shows <b>significant performance drop</b> when faced with complex language instructions.
-      <br>Magma, which employs joint vision-language co-training, appears to be the most robust, suggesting that its training recipe may help VLAs better preserve the advanced language capability of their underlying VLMs`;
+      VLAs suffer a <b>significant performance drop</b> on complex language instructions, even though their underlying VLMs handle such complexity well.
+      <br><br>
+      Magma, using joint vision-language co-training, appears relatively robust, suggesting this approach helps VLAs retain their VLM's advanced linguistic capabilities.`;
     imageElement.classList.remove('hidden');
     imageElement.src = "./static/images/language_complexity_result.png";
     // imageElement.style.width = "62vw";
     // imageElement.style.height = "auto";
   } else if (buttonNumber === 4) {
     textElement.innerHTML = `
-      While the underlying VLMs often exhibit strong vision-language reasoning, we observed that VLAs can <b>struggle</b> with <b>commonsense</b> and <b>visual-language thinking</b>, especially when <b>coupled with distractor objects</b>.
-      <br> In the example below, VLAs fail to distinguish between <b>orange juice</b> and <b>orange</b> when both are present, despite the fact that when only one of them is presented it can consistently succeed and the underlying VLM can also easily recognize the difference.`;
+      Although the underlying VLMs often demonstrate strong vision-language reasoning, we observed that VLAs <b>struggle</b> with <b>commonsense</b> and <b>visual-language thinking</b>, especially in the presence of <b>distractor objects</b>.<br>
+      <br>
+      For example, when both <b>orange juice</b> and <b>orange</b> appear together, VLAs frequently confuse them, even though they succeed reliably when only one is present and the underlying VLM can easily tell them apart.`;
     imageElement.classList.remove('hidden');
     imageElement.src = "./static/images/distract.svg";
     // imageElement.style.width = "62vw";
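For context, the snippets above edit the page's button-driven content switcher: each branch of the if/else on buttonNumber rewrites textElement.innerHTML and points imageElement at a different figure. Below is a minimal sketch of how that handler presumably hangs together; the element ids text-content and image-content come from the diff, while the function name updateContent and the .result-button wiring are assumptions for illustration, not the site's actual code.

<script>
  // Sketch only: element ids match the diff; updateContent and the
  // .result-button selector are hypothetical names for illustration.
  const textElement = document.getElementById('text-content');
  const imageElement = document.getElementById('image-content');

  function updateContent(buttonNumber) {
    if (buttonNumber === 1) {
      textElement.innerHTML = 'All VLAs exhibit persistent <b>intention-action gaps</b>. ...';
      imageElement.classList.remove('hidden');
      imageElement.src = './static/images/radarmap.png';
      imageElement.style.width = '62vw';
    } else if (buttonNumber === 2) {
      textElement.innerHTML = 'When faced with out-of-distribution objects ...';
      imageElement.classList.remove('hidden');
      imageElement.src = './static/images/ood_objects.svg';
    }
    // Buttons 3 and 4 follow the same pattern with their own text and image
    // (language_complexity_result.png and distract.svg).
  }

  // Hypothetical wiring: one button per result category.
  document.querySelectorAll('.result-button').forEach((btn, i) => {
    btn.addEventListener('click', () => updateContent(i + 1));
  });
</script>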
