After reading the code, I couldn't find any estimation or supervision of pose in the training process. Additionally, the "Iterative" mentioned in the paper seems to correspond only to the number of iterations of self-cross attention. Have I misunderstood something?