Commit daecca0

update 30 march 2025
1 parent ce5699a commit daecca0

File tree

6 files changed: +220 −9 lines changed


README.md

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ If you find our work useful, please consider citing:
 
 The **[Searchable Paper Page](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/)** is a web-based interface that allows you to search and filter through the papers in our survey. You can also view the papers by category, platform, and date.
 
-Last updated: **March 1st, 2025**.
+Last updated: **March 30th, 2025**.
 
 ---
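The README above mentions filtering papers by category, platform, and date. A minimal sketch of such a filter over the JSON data files in this commit (the `filter_papers` helper is a hypothetical illustration, not code from the repository; the entry shape matches `data/benchmark.json`):

```python
from datetime import datetime

def filter_papers(entries, platform=None, since=None):
    """Return entries whose Platform contains `platform` and whose Date
    (e.g. "March 2025") is not earlier than `since`."""
    out = []
    for e in entries:
        if platform and platform not in e.get("Platform", ""):
            continue
        if since and datetime.strptime(e["Date"], "%B %Y") < since:
            continue
        out.append(e)
    return out

# Two sample entries taken from this commit's data/benchmark.json additions.
sample = [
    {"Name": "WebGames: Challenging General-Purpose Web-Browsing AI Agents",
     "Platform": "Web", "Date": "February 2025"},
    {"Name": "AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents",
     "Platform": "Mobile Android", "Date": "March 2025"},
]

web_only = filter_papers(sample, platform="Web")
print([e["Name"] for e in web_only])  # only the WebGames entry
```

The `"%B %Y"` format matches the "Month YYYY" date convention used throughout these JSON files.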

data/benchmark.json

Lines changed: 34 additions & 2 deletions

@@ -409,7 +409,7 @@
   },
   {
     "Name": "Beyond Pass or Fail: A Multi-dimensional Benchmark for Mobile UI Navigation",
-    "Platform": "Android",
+    "Platform": "Mobile Android",
     "Date": "January 2025",
     "Paper_Url": "https://arxiv.org/abs/2501.02863",
     "Highlight": "Provides a fully automated benchmarking suite and introduces a multi-dimensional evaluation framework.",
@@ -422,5 +422,37 @@
     "Paper_Url": "https://arxiv.org/abs/2502.08047",
     "Highlight": "First GUI benchmark designed to evaluate dynamic GUI interactions by incorporating various initial states.",
     "Code_Url": ""
+  },
+  {
+    "Name": "AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks",
+    "Platform": "Mobile Android",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.13053",
+    "Highlight": "Introduces the Active Environment Injection Attack (AEIA) framework that actively manipulates environmental elements (e.g., notifications) in mobile operating systems to mislead multimodal LLM-powered agents.",
+    "Code_Url": ""
+  },
+  {
+    "Name": "WebGames: Challenging General-Purpose Web-Browsing AI Agents",
+    "Platform": "Web",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.18356",
+    "Highlight": "A comprehensive benchmark designed to evaluate the capabilities of general-purpose web-browsing AI agents through 50+ interactive challenges. It uniquely provides a hermetic testing environment with verifiable ground-truth solutions.",
+    "Code_Url": "https://github.com/convergence-ai/webgames"
+  },
+  {
+    "Name": "AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents",
+    "Platform": "Mobile Android",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.02403",
+    "Highlight": "Introduces a fully autonomous evaluation framework for mobile agents, eliminating the need for manual task reward signal definition and extensive evaluation code development.",
+    "Code_Url": ""
+  },
+  {
+    "Name": "SafeArena: Evaluating the Safety of Autonomous Web Agents",
+    "Platform": "Web",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.04957",
+    "Highlight": "The first benchmark specifically designed to evaluate the deliberate misuse of web agents by testing their ability to complete both safe and harmful tasks.",
+    "Code_Url": "https://safearena.github.io"
   }
-]
+]
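All of these data files share the same six-key entry shape, so a small consistency check can catch problems like the missing trailing periods and stray spaces fixed elsewhere in this commit. A sketch (the `check_entry` helper and its rules are assumptions inferred from the entries above, not part of the repository):

```python
import re

# Keys every entry in data/*.json appears to carry.
REQUIRED_KEYS = {"Name", "Platform", "Date", "Paper_Url", "Highlight", "Code_Url"}
# Dates in these files follow the "Month YYYY" convention.
DATE_RE = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}$"
)

def check_entry(entry):
    """Return a list of problems found in one paper entry (empty if clean)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not DATE_RE.match(entry.get("Date", "")):
        problems.append(f"bad date: {entry.get('Date')!r}")
    if entry.get("Paper_Url") and not entry["Paper_Url"].startswith("https://"):
        problems.append(f"bad paper URL: {entry['Paper_Url']!r}")
    return problems

# One of the entries added to data/benchmark.json in this commit.
entry = {
    "Name": "SafeArena: Evaluating the Safety of Autonomous Web Agents",
    "Platform": "Web",
    "Date": "March 2025",
    "Paper_Url": "https://arxiv.org/abs/2503.04957",
    "Highlight": "The first benchmark specifically designed to evaluate the deliberate misuse of web agents by testing their ability to complete both safe and harmful tasks.",
    "Code_Url": "https://safearena.github.io",
}
print(check_entry(entry))  # []
```

Running this over each array before committing would flag, for example, the capitalized `Https://` URL left in `data/framework.json`.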

data/dataset.json

Lines changed: 9 additions & 1 deletion

@@ -262,6 +262,14 @@
     "Paper_Url": "https://arxiv.org/abs/2404.16048",
     "Highlight": "Integrates images, action sequences, task descriptions, and spatial grounding into a unified dataset.",
     "Code_Url": "https://github.com/superagi/GUIDE"
+  },
+  {
+    "Name": "Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents",
+    "Platform": "Web",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.11357",
+    "Highlight": "Largest-scale web trajectory dataset to date; dynamically explores web pages to create contextually relevant tasks.",
+    "Code_Url": ""
   }
 
-]
+]

data/framework.json

Lines changed: 92 additions & 1 deletion

@@ -494,5 +494,96 @@
     "Paper_Url": "https://dl.acm.org/doi/abs/10.1145/3716132",
     "Highlight": "Enables UI automation through free-form textual prompts, eliminating the need for users to script automation tasks.",
     "Code_Url": "Https://github.com/PromptRPA/Prompt2TaskDataset"
+  },
+  {
+    "Name": "Symbiotic Cooperation for Web Agents: Harnessing Complementary Strengths of Large and Small LLMs",
+    "Platform": "Web",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.07942",
+    "Highlight": "Introduces an iterative, symbiotic learning process between large and small LLMs for web automation, enhancing both data synthesis and task performance through speculative data synthesis, multi-task learning, and privacy-preserving hybrid modes.",
+    "Code_Url": ""
+  },
+  {
+    "Name": "PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC",
+    "Platform": "Windows computers",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.14282",
+    "Highlight": "PC-Agent's hierarchical multi-agent design enables efficient decomposition of complex PC tasks. Its Active Perception Module enhances fine-grained GUI understanding by combining accessibility structures, OCR, and intention grounding.",
+    "Code_Url": "https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent"
+  },
+  {
+    "Name": "Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration",
+    "Platform": "Android",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.17110",
+    "Highlight": "Introduces video-guided learning, allowing the agent to acquire operational knowledge efficiently.",
+    "Code_Url": "https://github.com/X-PLUG/MobileAgent"
+  },
+  {
+    "Name": "MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions",
+    "Platform": "Android",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.16796",
+    "Highlight": "Introduces an app-oriented multi-agent framework with self-evolution, overcoming the complexity of cross-app interactions by dynamically recruiting specialized agents.",
+    "Code_Url": "https://github.com/XiaoMi/MobileSteward"
+  },
+  {
+    "Name": "Programming with Pixels: Computer-Use Meets Software Engineering",
+    "Platform": "Computers",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.18525",
+    "Highlight": "Shifts software engineering agents from API-based tool interactions to direct GUI-based computer use, allowing agents to interact with an IDE as a human developer would.",
+    "Code_Url": "https://programmingwithpixels.com"
+  },
+  {
+    "Name": "AppAgentX: Evolving GUI Agents as Proficient Smartphone Users",
+    "Platform": "Mobile Android",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.02268",
+    "Highlight": "Introduces an evolutionary mechanism that enables dynamic learning from past interactions and replaces inefficient low-level operations with high-level actions.",
+    "Code_Url": "https://appagentx.github.io/"
+  },
+  {
+    "Name": "LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications",
+    "Platform": "Web",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.02950",
+    "Highlight": "First open-source, production-ready web agent integrating tree search for multi-step task execution.",
+    "Code_Url": "https://github.com/PathOnAI/LiteWebAgent"
+  },
+  {
+    "Name": "CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning",
+    "Platform": "Mobile Android",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.03743",
+    "Highlight": "Introduces a basis subtask framework, where subtasks are predefined based on human task decomposition patterns, ensuring better executability and efficiency.",
+    "Code_Url": "https://github.com/Yuqi-Zhou/CHOP"
+  },
+  {
+    "Name": "Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization",
+    "Platform": "Mobile Android, Web",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.14496",
+    "Highlight": "A multi-agent reinforcement learning framework that introduces a Credit Re-Assignment (CR) strategy, using LLMs instead of environment-specific rewards to enhance performance and generalization.",
+    "Code_Url": "https://github.com/THUNLP-MT/CollabUIAgents"
+  },
+  {
+    "Name": "Automating the enterprise with foundation models",
+    "Platform": "Web",
+    "Date": "May 2024",
+    "Paper_Url": "https://arxiv.org/abs/2405.03710",
+    "Highlight": "Eliminates the high setup costs, brittle execution, and burdensome maintenance associated with traditional RPA by learning from video and text documentation.",
+    "Code_Url": "https://github.com/HazyResearch/eclair-agents"
+  },
+  {
+    "Name": "Towards Ethical and Personalized Web Navigation Agents: A Framework for User-Aligned Task Execution",
+    "Platform": "Web",
+    "Date": "March 2025",
+    "Paper_Url": "https://dl.acm.org/doi/abs/10.1145/3701551.3707420",
+    "Highlight": "Proposes user-aligned task execution, where the agent adapts to individual user preferences in an ethical manner.",
+    "Code_Url": "/"
   }
-]
+
+
+]

data/gui-testing.json

Lines changed: 26 additions & 1 deletion

@@ -102,5 +102,30 @@
     "Paper_Url": "https://arxiv.org/abs/2411.17933",
     "Highlight": "Leverages multimodal LLMs to perform UI test transfers without requiring source code access",
     "Code_Url": ""
+  },
+  {
+    "Name": "UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design",
+    "Platform": "Web platforms",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.12561",
+    "Highlight": "Enables LLM-powered automated usability testing by simulating thousands of user interactions, collecting both qualitative and quantitative data, and providing researchers with early feedback before real-user studies.",
+    "Code_Url": "https://uxagent.hailab.io"
+  },
+  {
+    "Name": "Guardian: A Runtime Framework for LLM-Based UI Exploration",
+    "Platform": "Mobile Android",
+    "Date": "September 2024",
+    "Paper_Url": "https://dl.acm.org/doi/abs/10.1145/3650212.3680334",
+    "Highlight": "Autonomously explores mobile applications, interacting with the UI to validate core functionalities.",
+    "Code_Url": ""
+  },
+  {
+    "Name": "Test-Agent: A Multimodal App Automation Testing Framework Based on the Large Language Model",
+    "Platform": "Mobile Android, iOS, Harmony OS",
+    "Date": "October 2024",
+    "Paper_Url": "https://ieeexplore.ieee.org/abstract/document/10778901/",
+    "Highlight": "Eliminates the need for pre-written test scripts by leveraging LLMs and multimodal perception to generate and execute test cases automatically.",
+    "Code_Url": ""
   }
-]
+]
+
data/models.json

Lines changed: 58 additions & 3 deletions

@@ -308,9 +308,64 @@
     "Platform": "Web, Desktop (Windows, MacOS, Linux), Mobile (Android, iOS)",
     "Date": "May 2024",
     "Paper_Url": "https://arxiv.org/abs/2410.05243",
-    "Highlight": "Utilizes hierarchical screen parsing and spatially enhanced element descriptions to enhance LVLMs without additional training.",
+    "Highlight": "A universal GUI grounding model that relies solely on vision, eliminating the need for text-based representations.",
     "Code_Url": ""
+  },
+  {
+    "Name": "VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning",
+    "Platform": "Mobile",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.07949",
+    "Highlight": "Addresses sparse-reward, long-horizon tasks for RL by autonomously breaking a complicated goal into subgoals.",
+    "Code_Url": "https://ai-agents-2030.github.io/VSC-RL"
+  },
+  {
+    "Name": "Magma: A Foundation Model for Multimodal AI Agents",
+    "Platform": "Web, Mobile, Desktop, Robotics",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.13130",
+    "Highlight": "Jointly trains on heterogeneous datasets, enabling generalization across digital and physical tasks.",
+    "Code_Url": "https://microsoft.github.io/Magma/"
+  },
+  {
+    "Name": "Digi-Q: Learning Q-Value Functions for Training Device-Control Agents",
+    "Platform": "Mobile Android",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.15760",
+    "Highlight": "Introduces a VLM-based Q-function for GUI agent training, enabling reinforcement learning without online interactions.",
+    "Code_Url": "https://github.com/DigiRL-agent/digiq"
+  },
+  {
+    "Name": "VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model",
+    "Platform": "Mobile Android",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.18906",
+    "Highlight": "Unlike traditional RL methods that require environment interactions, VEM enables training purely on offline data with a Value Environment Model.",
+    "Code_Url": "https://github.com/microsoft/GUI-Agent-RL"
+  },
+  {
+    "Name": "AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs",
+    "Platform": "Web, Mobile",
+    "Date": "February 2025",
+    "Paper_Url": "https://arxiv.org/abs/2502.01977",
+    "Highlight": "Automatically labels UI elements based on interaction-induced changes, making it scalable and high-quality.",
+    "Code_Url": "https://autogui-project.github.io/"
+  },
+  {
+    "Name": "Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks",
+    "Platform": "Mobile Android",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.00401",
+    "Highlight": "Improves reasoning without requiring large-scale training data.",
+    "Code_Url": "https://github.com/ZrW00/GUIPivot"
+  },
+  {
+    "Name": "WinClick: GUI Grounding with Multimodal Large Language Models",
+    "Platform": "Windows Computer",
+    "Date": "March 2025",
+    "Paper_Url": "https://arxiv.org/abs/2503.04730",
+    "Highlight": "The first GUI grounding model specifically tailored for Windows.",
+    "Code_Url": "https://github.com/zackhuiiiii/WinSpot"
   }
+]
 
-
-]
