Truly generalist policies require perceptual ability <b>beyond</b> the object distributions <b>encountered during training or fine-tuning</b>.
<br><br>
- In SimplerEnv, which assumes the fine-tuning dataset is BridgeV2, all manipulation tasks are <code>Put {Source} on {Target}</code>. Therefore, we introduce <b>four</b> categories of <b>out-of-distribution objects</b> that resemble the original objects in affordances and grasping difficulty.\
+ In SimplerEnv, which assumes the fine-tuning dataset is BridgeV2, all manipulation tasks are <code>Put {Source} on {Target}</code>. Therefore, we introduce <b>four</b> categories of <b>out-of-distribution objects</b> that resemble the original objects in affordances and grasping difficulty.
<br>
<br><b>OOD Source:</b> Source object not present in BridgeV2, but target object is.
<br><b>OOD Target:</b> Target object not present in BridgeV2, but source object is.
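To make the two categories concrete, here is a minimal sketch of how such OOD variants could be enumerated from the base template. The object inventories and helper function are illustrative assumptions, not the benchmark's actual lists or code.

```python
# Illustrative sketch only: enumerating OOD variants of the BridgeV2-style
# template "Put {Source} on {Target}". The object sets below are placeholder
# assumptions, not the benchmark's real inventories.
SEEN_OBJECTS = ["carrot", "eggplant", "spoon"]   # present in BridgeV2 (examples)
OOD_OBJECTS = ["avocado", "whisk", "sponge"]     # unseen stand-ins (examples)

def make_tasks(sources, targets):
    return [f"Put {s} on {t}" for s in sources for t in targets if s != t]

ood_source_tasks = make_tasks(OOD_OBJECTS, SEEN_OBJECTS)   # OOD Source
ood_target_tasks = make_tasks(SEEN_OBJECTS, OOD_OBJECTS)   # OOD Target
```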
- <br><b>(1) High Gradient Points:</b> By design, in 3D Gaussian Splatting training, points with high gradients indicate rapid changes in spatial/geometric features and larger discrepancies between the rendering and the ground-truth image, signaling a need for further optimization. \
- Therefore, we select a few clusters of points with high gradients, as seen in <i>Rabbit 3</i>.\
- <br><br><b>2.2 Common-Sense Reranking:</b> \
- <br><b>(2) Semantic Grounding:</b> In the first step, along with classification, we also ask GPT-4o for a list of this object's parts. Using a zero-shot open-vocabulary part segmentation model such as <a href=\"https://arxiv.org/abs/2212.01558\">PartSLIP</a>, we can ground this semantic information to our reconstruction, as seen in <i>Rabbit 2</i>.\
- <br><b>(3) Common-Sense Ranking:</b> We then ask GPT-4o which part should have priority when it comes to touching. \
- Then, without violating the geometric ranking, we rank the points within the same cluster by their part priority. This ensures that even if the part segmentation fails, the robot still has points to touch, as seen in <i>Rabbit 4</i>`;
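For the procedure described in the removed lines above, a hedged sketch of the two-stage ordering: geometric ranking by accumulated gradients, then common-sense reranking within spatial clusters. All names here (`grad_accum`, part labels, part priorities) are assumptions about a 3DGS-style pipeline, not an actual repository API.

```python
# Sketch under assumptions: the training loop accumulates per-Gaussian
# positional gradient norms (grad_accum), and part labels come from a
# PartSLIP-style segmentation; priorities come from a GPT-4o query.
import numpy as np
from sklearn.cluster import DBSCAN

def select_touch_points(xyz, grad_accum, part_labels, part_priority, top_frac=0.05):
    """xyz: (N, 3) Gaussian centers; grad_accum: (N,) gradient norms;
    part_labels: (N,) part name per point; part_priority: part -> rank
    (lower = touch first)."""
    # (1) Geometric ranking: keep the fraction of points with the highest
    # accumulated gradients (under-optimized regions).
    k = max(1, int(top_frac * len(xyz)))
    idx = np.argsort(grad_accum)[-k:]
    # Group the selected points into a few spatial clusters.
    labels = DBSCAN(eps=0.02, min_samples=5).fit_predict(xyz[idx])
    # (3) Common-sense reranking: reorder points within each cluster by part
    # priority without breaking the cluster-level (geometric) ordering.
    order = []
    for c in sorted(set(labels) - {-1}):          # -1 = DBSCAN noise
        members = idx[labels == c]
        members = sorted(members, key=lambda i: part_priority.get(part_labels[i], 99))
        order.extend(members)
    return order  # Gaussian indices, in suggested touch order
```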
+ To probe whether VLAs inherit the advanced <b>language generalization</b> abilities of their underlying VLMs, we augment the original SimplerEnv instructions with <b>three</b> types of complex <b>linguistic variations</b>.
+ <br>
+ <br><b>Language Action:</b> Paraphrase verbs to be compositional and less frequent than in BridgeV2 (e.g., <code>Put {Source} on {Target}</code> → <code>Pick up {Source} and lay on top of {Target}</code>).
+ <br><b>Language Negation:</b> Add negation such as <code>not, don't</code> referring to irrelevant objects (e.g., <code>Put {Source} on {Target}</code> → <code>Put {Source} on {Target}, not {Irrelevant}</code>).
+ <br><b>Language Appearance:</b> Replace an object's name with a description of its appearance (e.g., <code>Put eggplant on {Target}</code> → <code>Put the purple object on {Target}</code>).`;
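A small sketch of the three perturbations applied to the base template follows; the exact paraphrase and negation wordings are examples from the text, not the full generated set.

```python
# Illustrative only: the three linguistic variations of
# "Put {Source} on {Target}" described above.
def language_action(source, target):
    # Verb paraphrase, compositional and rare in BridgeV2.
    return f"Pick up {source} and lay on top of {target}"

def language_negation(source, target, irrelevant):
    # Negation attached to an irrelevant object.
    return f"Put {source} on {target}, not {irrelevant}"

def language_appearance(descriptor, target):
    # Object name replaced by an appearance description,
    # e.g., descriptor = "the purple object" for an eggplant.
    return f"Put {descriptor} on {target}"

print(language_negation("eggplant", "towel", "spoon"))
# -> Put eggplant on towel, not spoon
```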
- <br><b>(1) Touch Patch Transformation:</b> Based on photometric stereo, we acquire the depth and normals of the touch spot, which gives us a dense point cloud. \
- <br><b>(2) Anchor Gaussian Points:</b> The dense point cloud is added as anchor Gaussian points.\
- <br><b>(3) Optimization:</b> We then apply Gaussian normal supervision to further optimize the 3D Gaussians.";
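A hedged sketch of these three steps in PyTorch: it assumes the photometric-stereo stage already yields a per-pixel depth map and normal map for the touch patch, and that camera intrinsics/extrinsics are known. The `gaussians.add_points` call is a hypothetical API, not a real library function.

```python
# Sketch under assumptions: touch_depth (H, W), touch normals, intrinsics K,
# and cam2world (4x4) are given by the upstream touch/calibration pipeline.
import torch

def backproject(depth, K, cam2world):
    """(1) Lift the (H, W) touch-patch depth map to a dense world-space cloud."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    rays = (torch.linalg.inv(K) @ pix.T).T                  # camera-frame rays
    cam = rays * depth.reshape(-1, 1)                       # scale by depth
    world = (cam2world[:3, :3] @ cam.T).T + cam2world[:3, 3]
    return world                                            # (H*W, 3)

# (2) Anchor Gaussian points: add the dense touch cloud as anchors.
# anchors = backproject(touch_depth, K, cam2world)
# gaussians.add_points(anchors)   # hypothetical API; points kept as anchors

# (3) Optimization: a normal-supervision term comparing predicted normals
# against the photometric-stereo normals (added to the usual losses).
def normal_loss(pred_normals, touch_normals):
    cos = torch.nn.functional.cosine_similarity(pred_normals, touch_normals, dim=-1)
    return (1.0 - cos).mean()
```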
+ textElement.innerHTML=`
+ To be useful in the real world, a generalist policy should be able to operate in <b>visually cluttered</b> environments and possess <b>commonsense</b>.
+ <br><br>
+ SimplerEnv focuses on minimalist scenes with no semantic ambiguity. We therefore add <b>three</b> types of advanced tasks that require visual-language reasoning and commonsense.
+ <br>
+ <br><b>Object Distraction:</b> Introduce objects not relevant to the task.
+ <br><b>Commonsense:</b> Modify instructions so that they require commonsense reasoning (e.g., <code>Put carrot on {Target}</code> → <code>Put the vegetable that rabbits like on {Target}</code>).
+ <br><b>Commonsense + Object Distraction:</b> Introduce distractor objects that need commonsense to distinguish (e.g., introduce an orange object for the task <code>Put orange juice on {Target}</code>).`;
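For concreteness, a sketch of how one base task could be expanded into the three advanced variants; the dict schema and object names are assumptions for illustration, not the benchmark's actual format.

```python
# Illustrative only: three advanced variants of one base task.
base_task = {"instruction": "Put carrot on plate", "objects": ["carrot", "plate"]}

object_distraction = {**base_task,
                      "objects": base_task["objects"] + ["spoon", "sponge"]}  # irrelevant clutter

commonsense = {**base_task,
               "instruction": "Put the vegetable that rabbits like on plate"}  # must infer "carrot"

commonsense_distraction = {"instruction": "Put orange juice on plate",
                           "objects": ["orange juice", "plate", "orange"]}     # name-alike distractor
```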