# adjacent.yaml — non-canonical entries useful as supporting context.

- title: 'Seed1.8 Model Card: Towards Generalized Real-World Agency'
  link: https://arxiv.org/abs/2603.20633
  authors:
  - ByteDance Seed
  institutions:
  - ByteDance Seed
  date: '2026-03-21'
  publisher: arXiv
  envs: []
  keywords:
  - foundation model
  - generalist agency
  - tool use
  - code execution
  - GUI interaction
  - Seed1.8
  tldr: |-
    Seed1.8 is a foundation model for generalized real-world agency that unifies search, code generation and execution, and GUI interaction under one agentic interface. The model card emphasizes strong language-vision performance plus latency- and cost-aware inference modes for deployment.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2603.20633'
  sources:
    arxiv: https://arxiv.org/abs/2603.20633
- title: 'MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning'
  link: https://arxiv.org/abs/2603.12266
  authors:
  - Haozhan Shen
  - Shilin Yan
  - Hongwei Xue
  - Shuaiqi Lu
  - Xiaojun Tang
  - Guannan Zhang
  - Tiancheng Zhao
  - Jianwei Yin
  institutions:
  - Accio Team
  - Alibaba Group
  - Zhejiang University
  - ZJU-BJ
  date: '2026-03-12'
  publisher: arXiv
  envs: []
  keywords:
  - benchmark
  - compositional reasoning
  - visual grounding
  - VPIR
  - programmatic verification
  - MM-CondChain
  tldr: |-
    MM-CondChain is a benchmark for visually grounded deep compositional reasoning built from multi-layer conditional chains whose steps are programmatically verified through VPIR. It spans natural images, charts, and GUI trajectories, and shows that even the strongest MLLMs remain weak on deep chained reasoning.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2603.12266'
  sources:
    arxiv: https://arxiv.org/abs/2603.12266
- title: 'You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents'
  link: https://arxiv.org/abs/2603.11862
  authors:
  - Ching-Yu Kao
  - Xinfeng Li
  - Shenyu Dai
  - Tianze Qiu
  - Pengcheng Zhou
  - Eric Hanchen Jiang
  - Philip Sperl
  institutions:
  - Fraunhofer AISEC
  - NTU
  - KTH
  - NUS
  - UCLA
  date: '2026-03-12'
  publisher: arXiv
  envs: []
  keywords:
  - security
  - documentation injection
  - privacy
  - data leakage
  - ReadSecBench
  - Trusted Executor Dilemma
  tldr: |-
    This paper studies documentation-embedded instruction injection in high-privilege LLM agents and frames the failure mode as the Trusted Executor Dilemma. It introduces ReadSecBench, shows exfiltration success up to 85%, and finds that both rule-based and LLM-based defenses still fail to catch the attacks reliably.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2603.11862'
  sources:
    arxiv: https://arxiv.org/abs/2603.11862
- title: 'OpenClaw-RL: Train Any Agent Simply by Talking'
  link: https://arxiv.org/abs/2603.10165
  authors:
  - Yinjie Wang
  - Xuyang Chen
  - Xiaolong Jin
  - Mengdi Wang
  - Ling Yang
  institutions:
  - University of Chicago
  - Princeton University
  - Peking University
  date: '2026-03-10'
  publisher: arXiv
  envs: []
  keywords:
  - reinforcement learning
  - agent training
  - next-state signals
  - process reward model
  - on-policy distillation
  - OpenClaw-RL
  tldr: |-
    OpenClaw-RL is an asynchronous RL framework that treats next-state signals from live interactions as a universal learning source. It combines scalar rewards from a process-reward judge with hindsight-guided on-policy distillation, and trains agents across conversations, terminals, GUI tasks, SWE, and tool use in one online loop.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2603.10165'
  sources:
    arxiv: https://arxiv.org/abs/2603.10165
- title: 'Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward'
  link: https://arxiv.org/abs/2602.12430
  authors:
  - Renjun Xu
  - Yang Yan
  institutions:
  - Zhejiang University
  date: '2026-02-12'
  publisher: arXiv
  envs: []
  keywords:
  - survey
  - agent skills
  - SKILL.md
  - MCP
  - skill acquisition
  - agent security
  tldr: |-
    This survey reviews the emerging agent-skills ecosystem for LLMs, covering architectural foundations such as SKILL.md and MCP, methods for acquiring and refining skills, deployment patterns for agent systems, and the security problems introduced by portable, dynamically loaded capabilities.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2602.12430'
  sources:
    arxiv: https://arxiv.org/abs/2602.12430
- title: Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
  link: https://arxiv.org/abs/2512.09706
  authors:
  - Kaichen He
  - Zihao Wang
  - Muyao Li
  - Anji Liu
  - Yitao Liang
  institutions:
  - Peking University
  - National University of Singapore
  date: '2025-12-10'
  publisher: arXiv
  envs: []
  keywords:
  - reinforcement learning
  - heterogeneous action space
  - multi-turn GRPO
  - Minecraft agent
  - cross-level actions
  - CrossAgent
  tldr: |-
    CrossAgent studies how a single agent model can switch among heterogeneous action spaces, including APIs, GUI events, and lower-level commands, without hand-written routing rules. Its training pipeline combines supervised fine-tuning with multi-turn GRPO and reports state-of-the-art results on 800+ Minecraft tasks, making it relevant to GUI work as a broader action-space unification result rather than a direct GUI paper.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2512.09706'
  sources:
    arxiv: https://arxiv.org/abs/2512.09706
- title: 'Single-Agent Scaling Fails Multi-Agent Intelligence: Towards Foundation Models with Native Multi-Agent Intelligence'
  link: https://arxiv.org/abs/2512.08743
  authors:
  - Shuyue Hu
  - Haoyang Yan
  - Yiqun Zhang
  - Yang Chen
  - Dongzhan Zhou
  - Lei Bai
  institutions:
  - Shanghai Artificial Intelligence Laboratory
  date: '2025-12-09'
  publisher: arXiv
  envs: []
  keywords:
  - multi-agent systems
  - foundation models
  - multi-agent intelligence
  - evaluation
  - scaling
  - survey
  tldr: |-
    This paper argues that stronger single-agent foundation models do not automatically become strong multi-agent systems, and evaluates 41 open models on seven single-agent and multi-agent benchmarks to show the gap directly. It uses GUI interaction as one example of native single-agent capability, but its main contribution is a broader multi-agent intelligence agenda rather than GUI research itself.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2512.08743'
  sources:
    arxiv: https://arxiv.org/abs/2512.08743
- title: 'Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents'
  link: https://arxiv.org/abs/2510.24702
  authors:
  - Yueqi Song
  - Ketan Ramaneti
  - Zaid Sheikh
  - Ziru Chen
  - Boyu Gou
  - Tianbao Xie
  - Yiheng Xu
  - Danyang Zhang
  - Apurva Gandhi
  - Fan Yang
  - Joseph Liu
  - Tianyue Ou
  - Zhihao Yuan
  - Frank Xu
  - Shuyan Zhou
  - Xingyao Wang
  - Xiang Yue
  - Tao Yu
  - Huan Sun
  - Yu Su
  - Graham Neubig
  institutions:
  - Carnegie Mellon University
  - The Ohio State University
  - University of Hong Kong
  - Duke University
  - Fujitsu Research
  - All Hands AI
  date: '2025-10-28'
  publisher: ICLR 2026 (Oral)
  envs: []
  keywords:
  - framework
  - data protocol
  - training data
  - supervised fine-tuning
  - dataset unification
  - ADP
  tldr: |-
    Agent Data Protocol (ADP) standardizes heterogeneous agent trajectories into a lightweight schema and conversion pipeline so diverse agent datasets can plug into multiple SFT pipelines without per-dataset engineering. Converting 13 existing datasets into ADP and fine-tuning on the unified corpus improves base models by about 20% on average across coding, browsing, tool-use, and research benchmarks.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2510.24702'
  sources:
    arxiv: https://arxiv.org/abs/2510.24702
    openreview: https://openreview.net/forum?id=tG6301ORHd
    homepage: https://agentdataprotocol.com
- title: Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
  link: https://arxiv.org/abs/2510.13913
  authors:
  - Shrey Pandit
  - Xuan-Phi Nguyen
  - Yifei Ming
  - Austin Xu
  - Jiayu Wang
  - Caiming Xiong
  - Shafiq Joty
  institutions:
  - Salesforce AI Research
  - University of Wisconsin-Madison
  date: '2025-10-15'
  publisher: arXiv
  envs: []
  keywords:
  - training data
  - data synthesis
  - progressive difficulty
  - deep research
  - tool-use diversity
  tldr: |-
    This paper synthesizes training data for deep-research web agents by progressively increasing question difficulty until a baseline agent fails, then using that agent again for validation and filtering. The resulting corpus is aimed at long-horizon online-tool use rather than browser-native GUI interaction, but it is relevant as adjacent training-data work for agent systems.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2510.13913'
  sources:
    arxiv: https://arxiv.org/abs/2510.13913
- title: 'Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation'
  link: https://arxiv.org/abs/2510.11977
  authors:
  - Sayash Kapoor
  - Benedikt Stroebl
  - Peter Kirgis
  - Nitya Nadgir
  - Zachary S. Siegel
  - Boyi Wei
  - Tianci Xue
  - Ziru Chen
  - Felix Chen
  - Saiteja Utpala
  - Franck Ndzomga
  - Dheeraj Oruganty
  - Sophie Luskin
  - Kangheng Liu
  - Botao Yu
  - Amit Arora
  - Dongyoon Hahm
  - Harsh Trivedi
  - Huan Sun
  - Juyong Lee
  - Tengjun Jin
  - Yifan Mai
  - Yifei Zhou
  - Yuxuan Zhu
  - Rishi Bommasani
  - Daniel Kang
  - Dawn Song
  - Peter Henderson
  - Yu Su
  - Percy Liang
  - Arvind Narayanan
  institutions:
  - Princeton University
  - Independent Researcher
  - The Ohio State University
  - Microsoft Research
  - Amazon
  - Georgetown University
  - KAIST
  - Stony Brook University
  - University of Illinois Urbana-Champaign
  - Stanford University
  - xAI
  - University of California, Berkeley
  date: '2025-10-13'
  publisher: ICLR 2026 (Poster)
  envs: []
  keywords:
  - evaluation infrastructure
  - leaderboard
  - evaluation harness
  - cost tracking
  - log inspection
  - agent traces
  - HAL
  tldr: |-
    HAL provides standardized infrastructure for evaluating agents across models, scaffolds, and benchmarks rather than introducing a new agent. It reports results from 21,730 rollouts across 9 models and 9 benchmarks, tracks costs and full traces, and uses LLM-aided log inspection to surface behaviors such as benchmark gaming and unsafe actions.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2510.11977'
  sources:
    arxiv: https://arxiv.org/abs/2510.11977
- title: Reliable Weak-to-Strong Monitoring of LLM Agents
  link: https://arxiv.org/abs/2508.19461
  authors:
  - Neil Kale
  - Chen Bo Calvin Zhang
  - Kevin Zhu
  - Ankit Aich
  - Paula Rodriguez
  - Scale Red Team
  - Christina Q. Knight
  - Zifan Wang
  institutions:
  - Scale AI
  - Carnegie Mellon University
  - Massachusetts Institute of Technology
  date: '2025-08-26'
  publisher: ICLR 2025
  envs: []
  keywords:
  - safety
  - LLM agent
  - monitoring
  - red-teaming
  tldr: |-
    This paper stress-tests monitoring systems that detect covert misbehavior in LLM agents. Its monitor red-teaming (MRT) workflow varies agent and monitor awareness and applies adversarial evasion strategies, with evaluation on SHADE-Arena for tool-calling agents and CUA-SHADE-Arena for computer-use agents.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2508.19461'
  sources:
    arxiv: https://arxiv.org/abs/2508.19461
- title: 'MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers'
  link: https://arxiv.org/abs/2508.14704
  authors:
  - Ziyang Luo
  - Zhiqi Shen
  - Wenzhuo Yang
  - Zirui Zhao
  - Prathyusha Jwalapuram
  - Amrita Saha
  - Doyen Sahoo
  - Silvio Savarese
  - Caiming Xiong
  - Junnan Li
  institutions:
  - Salesforce AI Research
  date: '2025-08-20'
  publisher: arXiv
  envs: []
  keywords:
  - benchmark
  - dataset
  - framework
  - long-horizon reasoning
  - unknown-tools challenge
  - execution-based evaluation
  - MCP-Universe
  tldr: |-
    MCP-Universe introduces what the authors describe as the first comprehensive benchmark for evaluating large language models (LLMs) through interactions with real-world Model Context Protocol (MCP) servers. It spans six core domains (Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching) across 11 MCP servers, and uses execution-based evaluators (format, static, dynamic) to assess agent performance. State-of-the-art models such as GPT-5 (43.72% success), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) still show significant limitations. The benchmark highlights challenges in long-context reasoning and unfamiliar tool handling, and provides an open-source, extensible evaluation framework with UI support.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2508.14704'
  sources:
    arxiv: https://arxiv.org/abs/2508.14704
- title: 'BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent'
  link: https://arxiv.org/abs/2508.06600
  authors:
  - Zijian Chen
  - Xueguang Ma
  - Shengyao Zhuang
  - Ping Nie
  - Kai Zou
  - Andrew Liu
  - Joshua Green
  - Kshama Patel
  - Ruoxi Meng
  - Mingyi Su
  - Sahel Sharifymoghaddam
  - Yanxi Li
  - Haoran Hong
  - Xinyu Shi
  - Xuye Liu
  - Nandan Thakur
  - Crystina Zhang
  - Luyu Gao
  - Wenhu Chen
  - Jimmy Lin
  institutions:
  - University of Waterloo
  - CSIRO
  - Independent
  - Carnegie Mellon University
  - The University of Queensland
  date: '2025-08-08'
  publisher: arXiv
  envs: []
  keywords:
  - benchmark
  - dataset
  - agentic search
  - deep research
  - BrowseComp-Plus
  tldr: |-
    BrowseComp-Plus is a fixed-corpus benchmark for evaluating deep-research agents. It enables controlled, fair, and transparent comparisons by providing human-verified supporting documents and challenging negative documents for each query. Results vary widely: an open-source model (Search-R1 + BM25) achieves only 3.86% accuracy, GPT-5 reaches 55.9%, and GPT-5 with the Qwen3-Embedding-8B retriever achieves 70.1% with fewer queries. These gaps highlight the importance of retrieval quality and allow retrieval and reasoning components to be analyzed separately.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2508.06600'
  sources:
    arxiv: https://arxiv.org/abs/2508.06600
- title: 'Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training'
  link: https://arxiv.org/abs/2508.00414
  authors:
  - Tianqing Fang
  - Zhisong Zhang
  - Xiaoyang Wang
  - Rui Wang
  - Can Qin
  - Yuxuan Wan
  - Jun-Yu Ma
  - Ce Zhang
  - Jiaqi Chen
  - Xiyun Li
  - Hongming Zhang
  - Haitao Mi
  - Dong Yu
  institutions:
  - Tencent AI Lab
  date: '2025-08-01'
  publisher: arXiv
  envs: []
  keywords:
  - framework
  - dataset
  - model
  - deep research
  - reflection
  - voting
  - agent foundation models
  tldr: |-
    Cognitive Kernel-Pro is a fully open-source, multi-module agent framework intended to make advanced AI agent development more accessible. It curates high-quality training data across four domains (web, files, code, and general reasoning) and adds test-time strategies such as reflection and voting to improve robustness. On the GAIA benchmark, its open 8B-parameter model outperforms previous open-source agents such as WebDancer and WebSailor. Code is publicly available.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2508.00414'
  sources:
    arxiv: https://arxiv.org/abs/2508.00414
- title: 'Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation'
  link: https://arxiv.org/abs/2505.11604
  authors:
  - Kyudan Jung
  - Hojun Cho
  - Jooyeol Yun
  - Soyoung Yang
  - Jaehyeok Jang
  - Jaegul Choo
  institutions:
  - Chung-Ang University
  - KAIST AI
  date: '2025-05-16'
  publisher: arXiv
  envs: []
  keywords:
  - benchmark
  - structured data manipulation
  - slide editing
  - object model editing
  - TSBench
  tldr: |-
    Talk to Your Slides targets slide editing through language-driven manipulation of the underlying document object model rather than GUI-native visual interaction. It is relevant to GUI work because it compares against GUI-based baselines and introduces the TSBench benchmark, but its primary interaction mechanism is structured document editing rather than direct GUI control.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2505.11604'
  sources:
    arxiv: https://arxiv.org/abs/2505.11604
- title: 'Magma: A Foundation Model for Multimodal AI Agents'
  link: https://openaccess.thecvf.com/content/CVPR2025/html/Yang_Magma_A_Foundation_Model_for_Multimodal_AI_Agents_CVPR_2025_paper.html
  authors:
  - Jianwei Yang
  - Reuben Tan
  - Qianhui Wu
  - Ruijie Zheng
  - Baolin Peng
  - Yongyuan Liang
  - Yu Gu
  - Mu Cai
  - Seonghyeon Ye
  - Joel Jang
  - Yuquan Deng
  - Lars Liden
  - Jianfeng Gao
  institutions:
  - Microsoft Research
  - University of Maryland
  - University of Wisconsin-Madison
  - KAIST
  - University of Washington
  date: '2025-02-18'
  publisher: CVPR 2025
  envs: []
  keywords:
  - model
  - foundation model
  - SoM
  - ToM
  - UI navigation
  - robot manipulation
  - Magma
  tldr: |-
    Magma is a multimodal foundation model for agentic tasks spanning digital and physical environments rather than a GUI-specific paper. It is relevant here because it reports strong UI navigation results and uses Set-of-Mark and Trace-of-Mark supervision, but its main contribution is a broader agentic model covering robotics as well as GUI tasks.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    publisher_page: https://openaccess.thecvf.com/content/CVPR2025/html/Yang_Magma_A_Foundation_Model_for_Multimodal_AI_Agents_CVPR_2025_paper.html
- title: 'Language Agents: Foundations, Prospects, and Risks'
  link: https://aclanthology.org/2024.emnlp-tutorials.3/
  authors:
  - Yu Su
  - Diyi Yang
  - Shunyu Yao
  - Tao Yu
  institutions:
  - The Ohio State University
  - Stanford University
  - Princeton University
  - The University of Hong Kong
  date: '2024-11-30'
  publisher: EMNLP 2024 Tutorial Abstracts
  envs: []
  keywords:
  - survey
  - tutorial
  - reasoning
  - planning
  - memory
  - multi-agent systems
  - safety
  tldr: |-
    This tutorial surveys language agents, autonomous systems powered by large language models that execute complex tasks from language instructions. It covers their theoretical foundations, potential applications, associated risks, and future directions, including reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    publisher_page: https://aclanthology.org/2024.emnlp-tutorials.3/
- title: 'MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning'
  link: https://proceedings.iclr.cc/paper_files/paper/2025/hash/a2c3c86679300047c740c9900f19ddac-Abstract-Conference.html
  authors:
  - Haotian Zhang
  - Mingfei Gao
  - Zhe Gan
  - Philipp Dufter
  - Nina Wenzel
  - Forrest Huang
  - Dhruti Shah
  - Xianzhi Du
  - Bowen Zhang
  - Yanghao Li
  - Sam Dodge
  - Keen You
  - Zhen Yang
  - Aleksei Timofeev
  - Mingze Xu
  - Hong-You Chen
  - Jean-Philippe Fauconnier
  - Zhengfeng Lai
  - Haoxuan You
  - Zirui Wang
  - Afshin Dehghan
  - Peter Grasch
  - Yinfei Yang
  institutions:
  - Apple
  date: '2024-09-30'
  publisher: ICLR 2025 (Poster)
  envs: []
  keywords:
  - model
  - MM1.5
  - vision language model
  - visual grounding
  - reasoning
  - data-centric
  - analysis
  tldr: |-
    This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    homepage: https://proceedings.iclr.cc/paper_files/paper/2025/hash/a2c3c86679300047c740c9900f19ddac-Abstract-Conference.html
- title: 'TinyAgent: Function Calling at the Edge'
  link: https://aclanthology.org/2024.emnlp-demo.9/
  authors:
  - Lutfi Eren Erdogan
  - Nicholas Lee
  - Siddharth Jha
  - Sehoon Kim
  - Ryan Tabrizi
  - Suhong Moon
  - Coleman Richard Charles Hooper
  - Gopala Anumanchipalli
  - Kurt Keutzer
  - Amir Gholami
  institutions:
  - UC Berkeley
  - ICSI
  date: '2024-09-01'
  publisher: EMNLP 2024 System Demonstrations
  envs: []
  keywords:
  - framework
  - dataset
  - function calling
  - LLMCompiler
  - quantization
  - TinyAgent
  tldr: |-
    TinyAgent is an edge deployment framework for small function-calling language models, paired with a curated training dataset, tool retrieval, and quantization for local inference. Its GUI relevance comes mainly from the MacBook assistant demo and local agent deployment setting, not from a primary contribution to GUI interaction research itself.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    publisher_page: https://aclanthology.org/2024.emnlp-demo.9/
- title: 'VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents'
  link: https://proceedings.iclr.cc/paper_files/paper/2025/hash/eea71dc576381b88f2a0ca4dedc2140d-Abstract-Conference.html
  authors:
  - Xiao Liu
  - Tianjie Zhang
  - Yu Gu
  - Iat Long Iong
  - Yifan Xu
  - Xixuan Song
  - Shudan Zhang
  - Hanyu Lai
  - Xinyi Liu
  - Hanlin Zhao
  - Jiadai Sun
  - Xinyue Yang
  - Yu Yang
  - Zehan Qi
  - Shuntian Yao
  - Xueqiao Sun
  - Siyi Cheng
  - Qinkai Zheng
  - Hao Yu
  - Hanchen Zhang
  - Wenyi Hong
  - Ming Ding
  - Lihang Pan
  - Xiaotao Gu
  - Aohan Zeng
  - Zhengxiao Du
  - Chan Hee Song
  - Yu Su
  - Yuxiao Dong
  - Jie Tang
  institutions:
  - Tsinghua University
  - Zhejiang University
  - Peking University
  - The Ohio State University
  date: '2024-08-12'
  publisher: ICLR 2025
  envs: []
  keywords:
  - benchmark
  - dataset
  - visual foundation agents
  - embodied tasks
  - visual design
  - VisualAgentBench
  - VAB
  tldr: |-
    VisualAgentBench benchmarks large multimodal models as general visual foundation agents across embodied tasks, GUI tasks, and visual design rather than focusing only on GUI interaction. It also releases trajectory data for behavior cloning, making it relevant to GUI work as a broader visual-agent benchmark rather than a direct GUI paper.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    homepage: https://proceedings.iclr.cc/paper_files/paper/2025/hash/eea71dc576381b88f2a0ca4dedc2140d-Abstract-Conference.html
- title: 'Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions'
  link: https://aclanthology.org/2025.acl-long.1087/
  authors:
  - Xinbei Ma
  - Yiting Wang
  - Yao Yao
  - Tongxin Yuan
  - Aston Zhang
  - Zhuosheng Zhang
  - Hai Zhao
  institutions:
  - Shanghai Jiao Tong University
  - Meta
  date: '2024-08-05'
  publisher: ACL 2025
  envs: []
  keywords:
  - safety
  - robustness
  - environmental distraction
  - multimodal LLM agent
  tldr: |-
    This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    publisher_page: https://aclanthology.org/2025.acl-long.1087/
- title: 'MindSearch: Mimicking Human Minds Elicits Deep AI Searcher'
  link: https://openreview.net/forum?id=xgtXkyqw1f
  authors:
  - Zehui Chen
  - Kuikun Liu
  - Qiuchen Wang
  - Jiangning Liu
  - Wenwei Zhang
  - Kai Chen
  - Feng Zhao
  institutions:
  - University of Science and Technology of China
  - Shanghai AI Laboratory
  date: '2024-07-29'
  publisher: ICLR 2025 (Poster)
  envs: []
  keywords:
  - framework
  - information seeking
  - planning
  - AI search
  - MindSearch
  tldr: |-
    This paper presents MindSearch, a novel approach to web information seeking and integration that mimics human cognitive processes. The system uses a multi-agent framework consisting of a WebPlanner and WebSearcher. The WebPlanner models multi-step information seeking as a dynamic graph construction process, decomposing complex queries into sub-questions. The WebSearcher performs hierarchical information retrieval for each sub-question. MindSearch demonstrates significant improvements in response quality and depth compared to existing AI search solutions, processing information from over 300 web pages in just 3 minutes.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    openreview: https://openreview.net/forum?id=xgtXkyqw1f
- title: 'Octo-planner: On-device Language Model for Planner-Action Agents'
  link: https://arxiv.org/abs/2406.18082
  authors:
  - Nexa AI Team
  institutions:
  - Nexa AI
  date: '2024-06-26'
  publisher: arXiv
  envs: []
  keywords:
  - model
  - planner-action framework
  - Octo-planner
  - on-device planning
  tldr: |-
    Octo-planner is an on-device planner for a planner-action agent framework that separates task decomposition from action execution. Built on Phi-3 Mini and paired with an Octopus action model, it targets low-latency planning and execution on resource-constrained devices.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2406.18082'
  sources:
    arxiv: https://arxiv.org/abs/2406.18082
- title: 'VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought'
  link: https://openreview.net/forum?id=5G7MRfPngt
  authors:
  - Gabriel Herbert Sarch
  - Lawrence Jang
  - Michael J. Tarr
  - William W. Cohen
  - Kenneth Marino
  - Katerina Fragkiadaki
  institutions:
  - Carnegie Mellon University
  - Google DeepMind
  date: '2024-06-20'
  publisher: NeurIPS 2024 (Spotlight)
  envs: []
  keywords:
  - memory
  - In-Context Abstraction Learning
  - programs of thought
  - retrieval augmentation
  - ICAL
  tldr: |-
    ICAL turns sub-optimal demonstrations and feedback into reusable multimodal memories that improve VLM and LLM agents across TEACh, VisualWebArena, and Ego4D. It is relevant to GUI work because one evaluation domain is web agents, but the method itself is a broader embodied-agent memory approach rather than a direct GUI paper.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    openreview: https://openreview.net/forum?id=5G7MRfPngt
- title: 'Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent'
  link: https://arxiv.org/abs/2404.11459
  authors:
  - Wei Chen
  - Zhiyuan Li
  institutions:
  - Stanford University
  date: '2024-04-17'
  publisher: arXiv
  envs: []
  keywords:
  - model
  - functional tokens
  - on-device agent
  - edge deployment
  - Octopus v3
  tldr: |-
    Octopus v3 is a sub-billion multimodal AI agent model designed for efficient on-device deployment, with the paper centered on its functional-token mechanism and edge-device constraints rather than GUI-native interaction. It is relevant to GUI research as a lightweight multimodal agent backbone, but it is broader than a direct GUI-agent paper.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2404.11459'
  sources:
    arxiv: https://arxiv.org/abs/2404.11459
- title: Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking
  link: https://arxiv.org/abs/2404.08860
  authors:
  - Lei Ding
  - Jeshwanth Bheemanpally
  - Yi Zhang
  institutions:
  - University of California, Santa Cruz
  date: '2024-04-13'
  publisher: SIGIR 2024
  envs: []
  keywords:
  - framework
  - reranking
  - verification
  - technical support search
  - instruction execution
  tldr: |-
    Targets mobile "how-to" search by automatically executing step-by-step instructions from retrieved pages in a controlled Android environment and reranking results based on actual success. The paper frames this as a verification-driven ranking pipeline for technical-support search rather than a pure mobile-control benchmark.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2404.08860'
  sources:
    arxiv: https://arxiv.org/abs/2404.08860
- title: 'AgentStudio: A Toolkit for Building General Virtual Agents'
  link: https://openreview.net/forum?id=axUf8BOjnH
  authors:
  - Longtao Zheng
  - Zhiyuan Huang
  - Zhenghai Xue
  - Xinrun Wang
  - Bo An
  - Shuicheng Yan
  institutions:
  - Nanyang Technological University
  - Skywork AI
  - ETH Zurich
  date: '2024-03-26'
  publisher: ICLR 2025
  envs: []
  keywords:
  - benchmark
  - dataset
  - general virtual agents
  - GroundUI
  - IDMBench
  - CriticBench
  tldr: |-
    AgentStudio packages environments, tools, benchmarks, and datasets for general virtual agents with mixed GUI and API action spaces. It is relevant here because GUI interaction is one supported modality and because it contributes GroundUI, IDMBench, and CriticBench, but the paper is broader than a direct GUI-agent study.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    openreview: https://openreview.net/forum?id=axUf8BOjnH
- title: 'Cradle: Empowering Foundation Agents Towards General Computer Control'
  link: https://arxiv.org/abs/2403.03186
  authors:
  - Weihao Tan
  - Wentao Zhang
  - Xinrun Xu
  - Haochong Xia
  - Ziluo Ding
  - Boyu Li
  - Bohan Zhou
  - Junpeng Yue
  - Jiechuan Jiang
  - Yewen Li
  - Ruyi An
  - Molei Qin
  - Chuqiao Zong
  - Longtao Zheng
  - Yujie Wu
  - Xiaoqiang Chai
  - Yifei Bi
  - Tianbao Xie
  - Pengjie Gu
  - Xiyun Li
  - Ceyao Zhang
  - Long Tian
  - Chaojie Wang
  - Xinrun Wang
  - Börje F. Karlsson
  - Bo An
  - Shuicheng Yan
  - Zongqing Lu
  institutions:
  - Skywork AI
  - Beijing Academy of Artificial Intelligence
  - Nanyang Technological University
  - Peking University
  - Institute of Software, Chinese Academy of Sciences
  - The University of Hong Kong
  - The Chinese University of Hong Kong, Shenzhen
  date: '2024-03-05'
  publisher: arXiv
  envs: []
  keywords:
  - framework
  - Cradle
  - general computer control
  - screen-only control
  - memory
  - self-reflection
  tldr: |-
    Cradle formulates general computer control as screenshot input plus keyboard-and-mouse output, and instantiates that setting with a modular multimodal agent for software and video games. It matters to GUI work because it demonstrates screen-only control on real software and evaluates on OSWorld, but the paper is framed as a broader general-computer-control agenda rather than a direct GUI paper.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2403.03186'
  sources:
    arxiv: https://arxiv.org/abs/2403.03186
- title: Improving Language Understanding from Screenshots
  link: https://arxiv.org/abs/2402.14073
  authors:
  - Tianyu Gao
  - Zirui Wang
  - Adithya Bhaskar
  - Danqi Chen
  institutions:
  - Princeton Language and Intelligence (PLI)
  - Princeton University
  date: '2024-02-21'
  publisher: arXiv
  envs: []
  keywords:
  - screenshot language models
  - PTP
  - patch-and-text prediction
  - language understanding
  - plain-text-rendered screenshots
  tldr: |-
    This paper studies screenshot language models in a simplified plain-text-rendered setting and improves them with a patch-and-text prediction objective. It is relevant here because screenshot pretraining can transfer to UI-style inputs, but the paper is about general screenshot language understanding rather than direct GUI-agent behavior.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  arxiv_id: '2402.14073'
  sources:
    arxiv: https://arxiv.org/abs/2402.14073
- title: Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
  link: https://proceedings.neurips.cc/paper_files/paper/2024/hash/b6e9d6f4f3428cd5f3f9e9bbae2cab10-Abstract-Conference.html
  authors:
  - Wenkai Yang
  - Xiaohan Bi
  - Yankai Lin
  - Sishuo Chen
  - Jie Zhou
  - Xu Sun
  institutions:
  - Renmin University of China
  - Peking University
  - Tencent
  date: '2024-02-17'
  publisher: NeurIPS 2024
  envs: []
  keywords:
  - backdoor attacks
  - agent security
  - query-trigger attacks
  - observation-trigger attacks
  - reasoning-step attacks
  tldr: |-
    This paper analyzes backdoor attacks against generic LLM-based agents, including attacks that trigger from user queries or intermediate observations and attacks that alter intermediate reasoning while preserving the final answer. It matters for GUI work because web-shopping agents are one evaluation setting, but the contribution is a broader LLM-agent security analysis rather than a GUI-specific study.
  relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
  sources:
    publisher_page: https://proceedings.neurips.cc/paper_files/paper/2024/hash/b6e9d6f4f3428cd5f3f9e9bbae2cab10-Abstract-Conference.html
    homepage: https://proceedings.neurips.cc/paper_files/paper/2024/hash/b6e9d6f4f3428cd5f3f9e9bbae2cab10-Abstract-Conference.html
- title: A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
  link: https://arxiv.org/abs/2402.10196
  authors:
  - Lingbo Mo
  - Zeyi Liao
  - Boyuan Zheng
  - Yu Su
  - Chaowei Xiao
  - Huan Sun
  institutions:
  - The Ohio State University
  - University of Wisconsin-Madison
  date: '2024-02-15'
  publisher: arXiv
  envs: []
  keywords:
  - safety