|
325 | 325 | - 🔑 Key: [framework], [dataset], [benchmark], [instruction synthesis], [UI-E2I-Synth], [UI-I2E-Bench], [GPT-4o] |
326 | 326 | - 📖 TLDR: This paper introduces **UI-E2I-Synth**, a large-scale data synthesis pipeline that generates diverse GUI grounding instructions using GPT-4o, addressing challenges like implicit instructions and unbalanced element types. It also presents **UI-I2E-Bench**, a new benchmark with detailed annotations for evaluating GUI instruction grounding. Models trained on the synthesized data demonstrate superior performance across multiple platforms, highlighting the effectiveness of the proposed approach. |
327 | 327 |
|
328 | | -- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127) |
329 | | - - Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He |
330 | | - - 🏛️ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST |
331 | | - - 📅 Date: April 14, 2025 |
332 | | - - 📑 Publisher: arXiv |
333 | | - - 💻 Env: [GUI] |
334 | | - - 🔑 Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid] |
335 | | - - 📖 TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development. |
336 | | - |
337 | 328 | - [RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users](https://scai.cs.jhu.edu/projects/RealWebAssist/) |
338 | 329 | - Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu |
339 | 330 | - 🏛️ Institutions: JHU, Amazon |
|
343 | 334 | - 🔑 Key: [benchmark], [dataset], [GUI grounding], [speech input], [spatial reasoning], [temporal reasoning], [multi-step planning], [routine learning], [RealWebAssist] |
344 | 335 | - 📖 TLDR: RealWebAssist introduces a benchmark for evaluating AI agents' ability to assist with long-horizon web tasks using sequential instructions from real-world users. The dataset includes 1,885 instructions across 107 tasks on 66 websites, featuring challenges like ambiguous instructions, GUI grounding, and evolving user goals. Evaluations show that current state-of-the-art models struggle with these complex, realistic scenarios. |
345 | 336 |
|
| 337 | +- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127) |
| 338 | + - Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He |
| 339 | + - 🏛️ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST |
| 340 | + - 📅 Date: April 14, 2025 |
| 341 | + - 📑 Publisher: arXiv |
| 342 | + - 💻 Env: [GUI] |
| 343 | + - 🔑 Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid] |
| 344 | + - 📖 TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development. |
| 345 | + |
346 | 346 | - [AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories](https://agent-reward-bench.github.io/) |
347 | 347 | - Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy |
348 | 348 | - 🏛️ Institutions: McGill, Mila, Google DeepMind, Polytechnique Montréal, ServiceNow Research |
|
478 | 478 | - 🔑 Key: [framework], [dataset], [benchmark], [AndroidLab] |
479 | 479 | - 📖 TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. |
480 | 480 |
|
481 | | -- [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/) |
482 | | - - Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao |
483 | | - - 🏛️ Institutions: Shanghai AI Lab, SJTU, HKU, MIT |
484 | | - - 📅 Date: October 30, 2024 |
485 | | - - 📑 Publisher: ICLR 2025 |
486 | | - - 💻 Env: [GUI] |
487 | | - - 🔑 Key: [model], [dataset], [benchmark], [OS-Atlas] |
488 | | - - 📖 TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms. |
489 | | - |
490 | 481 | - [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) |
491 | 482 | - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu |
492 | 483 | - 🏛️ Institutions: UCLA, Salesforce AI Research |
|
496 | 487 | - 🔑 Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] |
497 | 488 | - 📖 TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. |
498 | 489 |
|
| 490 | +- [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/) |
| 491 | + - Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao |
| 492 | + - 🏛️ Institutions: Shanghai AI Lab, SJTU, HKU, MIT |
| 493 | + - 📅 Date: October 30, 2024 |
| 494 | + - 📑 Publisher: ICLR 2025 |
| 495 | + - 💻 Env: [GUI] |
| 496 | + - 🔑 Key: [model], [dataset], [benchmark], [OS-Atlas] |
| 497 | + - 📖 TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms. |
| 498 | + |
499 | 499 | - [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) |
500 | 500 | - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu |
501 | 501 | - 🏛️ Institutions: XJTU, Shanghai AI Lab, HKU |
|