Thanks for releasing this great work. While reading the paper, I noticed a potential inconsistency in Fig. 8 (Latent action analysis). In Group B (Place Things), the LIBERO image pair seems to show an action that looks more like a pick-up motion than a place motion. Specifically:
The robot gripper appears to be approaching and lifting an object, so the visual progression resembles a grasping/picking trajectory rather than a placing one. This is slightly confusing for the reader, since the other examples in Group B are clearly place actions.
Would you mind checking whether:
- This LIBERO image pair was intended to be in Group A (Pick Up Things) instead, or
- The latent-action clustering indeed placed it in the “place” cluster for some reason?
Either a clarification or a corrected figure would be helpful for understanding the latent action consistency analysis. Thanks again for the excellent work!