I attempted to reproduce the scores for UI-TARS-250705 (41.8%) using the script available at run_multienv_uitars15_v1.py.
However, I encountered some bugs, for example, when constructing messages using the last 5 images during the reproduction process. After resolving these issues, I was unable to achieve the scores reported on the leaderboard (our reproduction 31.3%).