In the code that distills vision-based policies with DAgger, I noticed that the student policy is given inputs that are hard to obtain in the real world, such as object position and rotation, and that these are taken unchanged from the state-based policy's observations. This defeats the purpose of visual distillation: a student that receives the same state inputs as the state-based policy can, in principle, match its performance by relying on those inputs alone and ignoring the visual input entirely. More critically, because this baseline series (UniDexGrasp and UniDexGrasp++) is influential, subsequent studies (e.g., ResDex) have adopted the same vision configuration. This implies that the performance reported for vision-based policies in these works is not valid as a measure of vision-only performance (it is potentially significantly overestimated), and that the policies as configured are impractical for real-world deployment.
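To make the concern concrete, here is a minimal sketch of the kind of filtering I would expect before observations reach the vision-based student. The key names (`object_pos`, `object_rot`, `robot_proprio`, `point_cloud`) and both helper functions are hypothetical and are not taken from the repository; the actual observation layout there may differ.

```python
# Hypothetical observation keys; these names are illustrative assumptions,
# not identifiers from the UniDexGrasp code base.
PRIVILEGED_KEYS = frozenset({"object_pos", "object_rot"})


def build_student_obs(full_obs: dict) -> dict:
    """Return only the entries a deployable vision-based student may see.

    Privileged simulator state (ground-truth object pose) is dropped, so the
    student has to infer it from its visual input.
    """
    return {k: v for k, v in full_obs.items() if k not in PRIVILEGED_KEYS}


def leaked_keys(student_obs: dict) -> list:
    """List any privileged keys that are still present in the student input."""
    return sorted(PRIVILEGED_KEYS & set(student_obs))


if __name__ == "__main__":
    full_obs = {
        "robot_proprio": [0.0] * 4,         # joint state, available on hardware
        "point_cloud": [[0.0, 0.0, 0.0]],   # visual input
        "object_pos": [0.1, 0.2, 0.3],      # simulator-only ground truth
        "object_rot": [0.0, 0.0, 0.0, 1.0], # simulator-only ground truth
    }
    student_obs = build_student_obs(full_obs)
    print(sorted(student_obs))
    print(leaked_keys(full_obs))
```

The state-based teacher can still consume the full observation to produce DAgger action labels; only the student's input is filtered. A check like `leaked_keys` in the distillation loop would flag the situation described above.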