Skip to content

Commit a25c242

Browse files
author
github-actions
committed
auto updates by github workflow
1 parent 6c2f220 commit a25c242

10 files changed

Lines changed: 156 additions & 120 deletions

File tree

β€ŽREADME.mdβ€Ž

Lines changed: 46 additions & 37 deletions
Large diffs are not rendered by default.

β€Žpaper_by_env/paper_gui.mdβ€Ž

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -226,15 +226,6 @@
226226
- πŸ”‘ Key: [framework], [dataset], [benchmark], [instruction synthesis], [UI-E2I-Synth], [UI-I2E-Bench], [GPT-4o]
227227
- πŸ“– TLDR: This paper introduces **UI-E2I-Synth**, a large-scale data synthesis pipeline that generates diverse GUI grounding instructions using GPT-4o, addressing challenges like implicit instructions and unbalanced element types. It also presents **UI-I2E-Bench**, a new benchmark with detailed annotations for evaluating GUI instruction grounding. Models trained on the synthesized data demonstrate superior performance across multiple platforms, highlighting the effectiveness of the proposed approach.
228228

229-
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
230-
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
231-
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
232-
- πŸ“… Date: April 14, 2025
233-
- πŸ“‘ Publisher: arXiv
234-
- πŸ’» Env: [GUI]
235-
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
236-
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
237-
238229
- [GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents](https://arxiv.org/abs/2504.10458)
239230
- Xiaobo Xia, Run Luo
240231
- πŸ›οΈ Institutions: Unknown
@@ -244,6 +235,15 @@
244235
- πŸ”‘ Key: [framework], [reinforcement learning], [unified action space], [GRPO], [GUI-R1]
245236
- πŸ“– TLDR: GUI-R1 introduces a reinforcement learning framework that enhances the capabilities of Large Vision-Language Models (LVLMs) for GUI agents. By employing a unified action space and a rule-based reward function, the model efficiently learns to perform high-level tasks across multiple platforms, including Windows, Linux, MacOS, Android, and Web. Utilizing only 0.02% of the data compared to previous methods, GUI-R1 demonstrates superior performance on eight benchmarks, showcasing the potential of reinforcement learning in GUI agent tasks.
246237

238+
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
239+
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
240+
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
241+
- πŸ“… Date: April 14, 2025
242+
- πŸ“‘ Publisher: arXiv
243+
- πŸ’» Env: [GUI]
244+
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
245+
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
246+
247247
- [On the Robustness of GUI Grounding Models Against Image Attacks](https://arxiv.org/abs/2504.04716)
248248
- Haoren Zhao, Tianyi Chen, Zhen Wang
249249
- πŸ›οΈ Institutions: HDU, Microsoft

β€Žpaper_by_env/paper_web.mdβ€Ž

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,12 @@
1+
- [PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction](https://arxiv.org/abs/2510.15863)
2+
- Simon Yu, Gang Li, Weiyan Shi, Peng Qi
3+
- πŸ›οΈ Institutions: NEU, Uniphore
4+
- πŸ“… Date: October 17, 2025
5+
- πŸ“‘ Publisher: arXiv
6+
- πŸ’» Env: [Web]
7+
- πŸ”‘ Key: [framework], [skill learning], [compositional skills], [transfer generalization], [Polymorphism], [PolySkill], [continual learning]
8+
- πŸ“– TLDR: This paper introduces **PolySkill**, a framework that enables LLM-based web agents to acquire generalizable and composable skills by decoupling a skill's abstract goal from its concrete implementation, inspired by polymorphic abstraction in software engineering. Evaluated on the Mind2Web benchmark, PolySkill significantly enhances skill reuse, improves task success rates on both seen and unseen websites, and reduces the number of interaction steps, demonstrating effective continual learning and skill transfer capabilities.
9+
110
- [Magentic-UI: Towards Human-in-the-loop Agentic Systems](https://arxiv.org/abs/2507.22358)
211
- Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, Saleema Amershi
312
- πŸ›οΈ Institutions: MSR

β€Žpaper_by_key/paper_benchmark.mdβ€Ž

Lines changed: 18 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -325,15 +325,6 @@
325325
- πŸ”‘ Key: [framework], [dataset], [benchmark], [instruction synthesis], [UI-E2I-Synth], [UI-I2E-Bench], [GPT-4o]
326326
- πŸ“– TLDR: This paper introduces **UI-E2I-Synth**, a large-scale data synthesis pipeline that generates diverse GUI grounding instructions using GPT-4o, addressing challenges like implicit instructions and unbalanced element types. It also presents **UI-I2E-Bench**, a new benchmark with detailed annotations for evaluating GUI instruction grounding. Models trained on the synthesized data demonstrate superior performance across multiple platforms, highlighting the effectiveness of the proposed approach.
327327

328-
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
329-
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
330-
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
331-
- πŸ“… Date: April 14, 2025
332-
- πŸ“‘ Publisher: arXiv
333-
- πŸ’» Env: [GUI]
334-
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
335-
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
336-
337328
- [RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users](https://scai.cs.jhu.edu/projects/RealWebAssist/)
338329
- Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
339330
- πŸ›οΈ Institutions: JHU, Amazon
@@ -343,6 +334,15 @@
343334
- πŸ”‘ Key: [benchmark], [dataset], [GUI grounding], [speech input], [spatial reasoning], [temporal reasoning], [multi-step planning], [routine learning], [RealWebAssist]
344335
- πŸ“– TLDR: RealWebAssist introduces a benchmark for evaluating AI agents' ability to assist with long-horizon web tasks using sequential instructions from real-world users. The dataset includes 1,885 instructions across 107 tasks on 66 websites, featuring challenges like ambiguous instructions, GUI grounding, and evolving user goals. Evaluations show that current state-of-the-art models struggle with these complex, realistic scenarios.
345336

337+
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
338+
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
339+
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
340+
- πŸ“… Date: April 14, 2025
341+
- πŸ“‘ Publisher: arXiv
342+
- πŸ’» Env: [GUI]
343+
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
344+
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
345+
346346
- [AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories](https://agent-reward-bench.github.io/)
347347
- Xing Han LΓΉ, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina StaΕ„czak, Peter Shaw, Christopher J. Pal, Siva Reddy
348348
- πŸ›οΈ Institutions: McGill, Mila, Google DeepMind, Polytechnique MontrΓ©al, ServiceNow Research
@@ -478,15 +478,6 @@
478478
- πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab]
479479
- πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
480480

481-
- [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/)
482-
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
483-
- πŸ›οΈ Institutions: Shanghai AI Lab, SJTU, HKU, MIT
484-
- πŸ“… Date: October 30, 2024
485-
- πŸ“‘ Publisher: ICLR 2025
486-
- πŸ’» Env: [GUI]
487-
- πŸ”‘ Key: [model], [dataset], [benchmark], [OS-Atlas]
488-
- πŸ“– TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms.
489-
490481
- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252)
491482
- Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
492483
- πŸ›οΈ Institutions: UCLA, Salesforce AI Research
@@ -496,6 +487,15 @@
496487
- πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting]
497488
- πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages.
498489

490+
- [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/)
491+
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
492+
- πŸ›οΈ Institutions: Shanghai AI Lab, SJTU, HKU, MIT
493+
- πŸ“… Date: October 30, 2024
494+
- πŸ“‘ Publisher: ICLR 2025
495+
- πŸ’» Env: [GUI]
496+
- πŸ”‘ Key: [model], [dataset], [benchmark], [OS-Atlas]
497+
- πŸ“– TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms.
498+
499499
- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603)
500500
- Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu
501501
- πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU

β€Žpaper_by_key/paper_dataset.mdβ€Ž

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -253,15 +253,6 @@
253253
- πŸ”‘ Key: [framework], [dataset], [benchmark], [instruction synthesis], [UI-E2I-Synth], [UI-I2E-Bench], [GPT-4o]
254254
- πŸ“– TLDR: This paper introduces **UI-E2I-Synth**, a large-scale data synthesis pipeline that generates diverse GUI grounding instructions using GPT-4o, addressing challenges like implicit instructions and unbalanced element types. It also presents **UI-I2E-Bench**, a new benchmark with detailed annotations for evaluating GUI instruction grounding. Models trained on the synthesized data demonstrate superior performance across multiple platforms, highlighting the effectiveness of the proposed approach.
255255

256-
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
257-
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
258-
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
259-
- πŸ“… Date: April 14, 2025
260-
- πŸ“‘ Publisher: arXiv
261-
- πŸ’» Env: [GUI]
262-
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
263-
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
264-
265256
- [RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users](https://scai.cs.jhu.edu/projects/RealWebAssist/)
266257
- Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
267258
- πŸ›οΈ Institutions: JHU, Amazon
@@ -271,6 +262,15 @@
271262
- πŸ”‘ Key: [benchmark], [dataset], [GUI grounding], [speech input], [spatial reasoning], [temporal reasoning], [multi-step planning], [routine learning], [RealWebAssist]
272263
- πŸ“– TLDR: RealWebAssist introduces a benchmark for evaluating AI agents' ability to assist with long-horizon web tasks using sequential instructions from real-world users. The dataset includes 1,885 instructions across 107 tasks on 66 websites, featuring challenges like ambiguous instructions, GUI grounding, and evolving user goals. Evaluations show that current state-of-the-art models struggle with these complex, realistic scenarios.
273264

265+
- [Breaking the Data Barrier -- Building GUI Agents Through Task Generalization](https://arxiv.org/abs/2504.10127)
266+
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
267+
- πŸ›οΈ Institutions: ZJU, WestlakeU, Shanghai AI Lab, HKU, HKUST
268+
- πŸ“… Date: April 14, 2025
269+
- πŸ“‘ Publisher: arXiv
270+
- πŸ’» Env: [GUI]
271+
- πŸ”‘ Key: [framework], [dataset], [benchmark], [mid-training], [reasoning], [GUIMid]
272+
- πŸ“– TLDR: This paper introduces a mid-training approach to enhance GUI agents by leveraging diverse, reasoning-intensive datasets such as mathematical and coding tasks. The authors demonstrate that training Vision Language Models (VLMs) on these data-rich tasks before fine-tuning on limited GUI trajectory data significantly improves performance on GUI benchmarks like WebArena and AndroidWorld. Notably, text-only mathematical reasoning data led to substantial cross-modal gains, highlighting the effectiveness of task generalization in overcoming data scarcity challenges in GUI agent development.
273+
274274
- [AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories](https://agent-reward-bench.github.io/)
275275
- Xing Han LΓΉ, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina StaΕ„czak, Peter Shaw, Christopher J. Pal, Siva Reddy
276276
- πŸ›οΈ Institutions: McGill, Mila, Google DeepMind, Polytechnique MontrΓ©al, ServiceNow Research

0 commit comments

Comments
Β (0)