Switch source-of-truth to papers.yaml; site reads YAML directly
This is the structured-format migration the workflow has needed for a
while. Adding a new paper used to mean conforming to a strict 9-field
markdown schema where it was hard to capture extras (BibTeX,
OpenReview link, dataset link, …). All of that lives natively in YAML
and IDE/editor tooling renders it cleanly.
Format
------
papers.yaml is the new canonical store. Each entry:
  - title: …
    link: …                  # primary canonical link
    authors: […]
    institutions: […]
    date: "YYYY-MM-DD" | "YYYY-MM"
    publisher: …
    envs: [Web | Mobile | Desktop | "General GUI"]
    keywords: […]
    tldr: |
      multi-line summary
    arxiv_id: optional
    sources:                 # all keys optional
      arxiv: …
      openreview: …
      publisher_page: …
      homepage: …
      code: …
      dataset: …
    bibtex: |                # auto-generated by migration
      @inproceedings{…}
    bibtex_confirmed: false  # set true after verifying
adjacent.yaml uses the same schema for non-canonical entries.
Contributors can edit both files directly to fix BibTeX, add missing
sources, etc.
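A minimal sketch of how an entry in this shape could be sanity-checked before committing. The field names come from the schema above; the function name validate_entry, the choice to treat the nine legacy fields as required, and the date regex are assumptions for illustration, not checks the repository is known to run.

```python
import re

# Field names follow the schema in this commit message. Which fields are
# mandatory is an assumption made for this sketch.
REQUIRED_FIELDS = ("title", "link", "authors", "institutions", "date",
                   "publisher", "envs", "keywords", "tldr")
ALLOWED_ENVS = {"Web", "Mobile", "Desktop", "General GUI"}
SOURCE_KEYS = {"arxiv", "openreview", "publisher_page", "homepage", "code", "dataset"}
# Accepts "YYYY-MM-DD" or "YYYY-MM".
DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?$")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    date = entry.get("date")
    if date is not None and not DATE_RE.match(str(date)):
        problems.append(f"bad date: {date!r}")
    for env in entry.get("envs", []):
        if env not in ALLOWED_ENVS:
            problems.append(f"unknown env: {env}")
    for key in entry.get("sources", {}):
        if key not in SOURCE_KEYS:
            problems.append(f"unknown source key: {key}")
    return problems
```

The function works on an already-parsed entry (a plain dict), so it is independent of which YAML loader produced it.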
Pipeline
--------
scripts/migrate_to_yaml.py is the one-shot converter that produced
the YAML from the legacy ALL_PAPERS.md format. It enriches the
sources block from paper_db/papers/<id>/entity.json's canonical_links
when available (OpenReview forum, GitHub repo, project homepage).
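The enrichment step might look roughly like the sketch below. The layout of canonical_links inside entity.json is not shown in this commit, so the sketch assumes it is a flat mapping from link kind to URL using the same key names as the sources block; the function name enrich_sources and the never-overwrite policy are likewise assumptions.

```python
def enrich_sources(sources: dict, canonical_links: dict) -> dict:
    """Fill missing source keys from canonical_links, keeping existing values.

    Assumes canonical_links maps a link kind ("openreview", "code",
    "homepage") to a URL; the real entity.json layout may differ.
    """
    merged = dict(sources)  # copy so the caller's dict is not mutated
    for key in ("openreview", "code", "homepage"):
        url = canonical_links.get(key)
        if url and not merged.get(key):
            merged[key] = url
    return merged
```

Existing values win in this sketch, so hand-edited links in the YAML would survive a re-run of the enrichment.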
update_template_or_data/utils/scripts/sort_by_date.py was rewritten:
it now reads papers.yaml + adjacent.yaml, sorts newest-first, writes
them back, and emits *derived* ALL_PAPERS.md / ADJACENT_PAPERS.md
mirrors without the bibtex blocks. paper_db's ingest_from_all_papers.py
keeps working because the derived markdown files retain the original
9-field shape.
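The newest-first sort and the bibtex-free mirrors can be sketched as follows. This is an illustration under assumptions, not the code in sort_by_date.py: the helper names are hypothetical, and placing a month-only date after every full date in the same month is a choice made here.

```python
def date_key(entry: dict) -> str:
    """Pad "YYYY-MM" to "YYYY-MM-00" so both date shapes compare as strings."""
    date = str(entry.get("date", ""))
    return date if len(date) == 10 else f"{date}-00"


def sort_newest_first(entries: list[dict]) -> list[dict]:
    # sorted() stays stable with reverse=True, so entries that share a date
    # keep their original relative order.
    return sorted(entries, key=date_key, reverse=True)


def strip_bibtex(entry: dict) -> dict:
    """Copy of the entry without the bibtex fields, for the derived mirrors."""
    return {k: v for k, v in entry.items() if k not in ("bibtex", "bibtex_confirmed")}
```

Because ISO-style dates sort correctly as strings, no date parsing is needed; padding the shorter form is enough to make the two shapes comparable.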
scripts/update_repo.sh now does just two steps: run sort_by_date.py,
then assemble_readme.py. The legacy normalize_institutions.py and
lint_keys.py were dropped from the canonical pipeline (they operated
on ALL_PAPERS.md by line regex; they remain in scripts/ as ad-hoc
utilities).
scripts/sync_dates_from_paper_db.py was ported to read/write
papers.yaml so its --write mode keeps the source-of-truth canonical.
Site
----
site/src/lib/parsePapers.ts now parses YAML directly via js-yaml
instead of regexing markdown. Each Paper now carries:
- bibtex / bibtexConfirmed (used by the Copy BibTeX button)
- sources map (arxiv / openreview / publisher_page / homepage /
code / dataset; rendered as a labelled link row inside the
expanded card)
Card UI:
- Expanded card surfaces a "Sources" row with one entry per
populated source key.
- "Copy BibTeX" button shows a small "verified" badge when
bibtex_confirmed is true. Title text differs by status so a hover
reveals "Verified - copied from official source" vs
"Auto-generated - please verify before citing".
Detail page builds its BibTeX through the same path as the card:
prefer the stored bibtex, otherwise synthesize one.
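The prefer-stored / synthesize fallback can be sketched as below. The site itself is TypeScript; this is a language-neutral illustration written in Python to match the pipeline scripts. The function name bibtex_for, the @misc entry type, and the citation-key scheme are all assumptions, not the site's actual output.

```python
import re


def bibtex_for(paper: dict) -> tuple[str, bool]:
    """Return (bibtex, verified): use the stored block if present, else synthesize."""
    stored = (paper.get("bibtex") or "").strip()
    if stored:
        return stored, bool(paper.get("bibtex_confirmed"))
    authors = paper.get("authors") or ["Unknown"]
    year = str(paper.get("date", ""))[:4]
    # Hypothetical citation key: first author's last name plus the year.
    key = re.sub(r"\W+", "", authors[0].split()[-1]).lower() + year
    fields = {
        "title": paper.get("title", ""),
        "author": " and ".join(authors),
        "year": year,
        "url": paper.get("link", ""),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    # A synthesized entry is never reported as verified.
    return f"@misc{{{key},\n{body}\n}}", False
```

Returning the verified flag alongside the text lets a caller pick the badge and hover title from one result.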
Docs
----
CLAUDE.md (workspace root) and paper_repo/CLAUDE.md fully rewritten
to describe the YAML schema, the new pipeline, and the verify-bibtex
flow. The "How to Add a Paper" section now shows the YAML schema
directly. Dependencies updated (pyyaml in requirements.txt and
pyproject.toml; js-yaml in site/package.json).
Tests
-----
tests/test_local_update_workflow.py replaced its two paper_by_*
assertions with one that exercises the new YAML round-trip:
papers.yaml in → derived ALL_PAPERS.md (no bibtex) + fragments out,
plus legacy paper_by_* dirs cleaned up. All four tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
     - TLDR: Stress-tests LLM agent monitoring systems for detecting covert misbehavior using a monitor red-teaming (MRT) workflow varying agent/monitor awareness and adversarial evasion strategies, evaluated on SHADE-Arena for tool-calling agents and CUA-SHADE-Arena for computer-use agents.
 
-- [BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent](https://arxiv.org/abs/2508.06600)
-    - Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
-    - Institutions: University of Waterloo, CSIRO, Independent, Carnegie Mellon University, The University of Queensland
-    - Date: August 08, 2025
-    - Publisher: arXiv
-    - Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
-    - TLDR: Introduces **BrowseComp-Plus**, a fixed-corpus benchmark for evaluating deep-research agents. It enables controlled, fair, and transparent comparisons by providing human-verified supporting and challenging negative documents for each query. Results reveal significant performance variation: for example, an open-source model (Search-R1 + BM25) only achieves 3.86% accuracy, while GPT-5 reaches 55.9%, and GPT-5 with Qwen3-Embedding-8B retriever achieves 70.1% with fewer queries, highlighting the critical importance of retrieval quality and enabling disentangled analysis of retrieval vs. reasoning components.
-
 - [MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers](https://arxiv.org/abs/2508.14704)
     - TLDR: MCP-Universe introduces the first comprehensive benchmark for evaluating large language models (LLMs) through interactions with real-world Model Context Protocol (MCP) servers. It spans six core domains (Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching) across 11 MCP servers. The benchmark employs execution-based evaluators (format, static, dynamic) to rigorously assess agent performance. Despite progress, state-of-the-art models like GPT-5 (43.72% success), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) show significant limitations. The benchmark highlights challenges in long-context reasoning and unfamiliar tool handling, and provides an open-source extensible evaluation framework with UI support to accelerate future research.
 
+- [BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent](https://arxiv.org/abs/2508.06600)
+    - Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
+    - Institutions: University of Waterloo, CSIRO, Independent, Carnegie Mellon University, The University of Queensland
+    - Date: August 08, 2025
+    - Publisher: arXiv
+    - Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
+    - TLDR: Introduces **BrowseComp-Plus**, a fixed-corpus benchmark for evaluating deep-research agents. It enables controlled, fair, and transparent comparisons by providing human-verified supporting and challenging negative documents for each query. Results reveal significant performance variation: for example, an open-source model (Search-R1 + BM25) only achieves 3.86% accuracy, while GPT-5 reaches 55.9%, and GPT-5 with Qwen3-Embedding-8B retriever achieves 70.1% with fewer queries, highlighting the critical importance of retrieval quality and enabling disentangled analysis of retrieval vs. reasoning components.
+
 - [Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training](https://arxiv.org/abs/2508.00414)
     - TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
 
-- [Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions](https://aclanthology.org/2025.acl-long.1087/)
-    - Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
-    - Institutions: Shanghai Jiao Tong University, Meta
-    - Date: August 05, 2024
-    - Publisher: ACL 2025
-    - Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
-    - TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
-
 - [TinyAgent: Function Calling at the Edge](https://aclanthology.org/2024.emnlp-demo.9/)
     - Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
     - Institutions: UC Berkeley, ICSI
@@ -199,6 +190,15 @@ These entries are kept for reference only. The main list and generated environme
     - TLDR: VisualAgentBench benchmarks large multimodal models as general visual foundation agents across embodied tasks, GUI tasks, and visual design rather than focusing only on GUI interaction. It also releases trajectory data for behavior cloning, making it relevant to GUI work as a broader visual-agent benchmark rather than a direct GUI paper.
 
+- [Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions](https://aclanthology.org/2025.acl-long.1087/)
+    - Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
+    - Institutions: Shanghai Jiao Tong University, Meta
+    - Date: August 05, 2024
+    - Publisher: ACL 2025
+    - Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
+    - TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
+
 - [MindSearch: Mimicking Human Minds Elicits Deep AI Searcher](https://openreview.net/forum?id=xgtXkyqw1f)
     - TLDR: This position paper studies mobile UI layout generation with LLMs and proposes a UI grammar to represent hierarchical screen structure. Its focus is controllable UI generation rather than GUI action or environment interaction, so it stays outside the direct-GUI agent main list.
 
-- [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/abs/2310.10634)
-    - Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
-    - Institutions: The University of Hong Kong, Sea AI Lab, Salesforce Research
-    - Date: October 16, 2023
-    - Publisher: COLM 2024
-    - Relation: Adjacent to GUI research because it is a broader language-agent platform with web browsing as only one mode
-    - TLDR: OpenAgents is a deployment-oriented platform for language agents in everyday use, spanning data analysis, API-tool use, and web browsing behind a shared user interface. It matters here because it includes a web-agent mode, but the paper is broader than direct GUI-agent research.
-
 - [Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441)
     - TLDR: Introduces Set-of-Mark prompting, where segmented image regions are overlaid with explicit marks before being passed to a multimodal model. The paper shows that simple region marking can unlock much stronger zero-shot grounding from GPT-4V without fine-tuning.
 
+- [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/abs/2310.10634)
+    - Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
+    - Institutions: The University of Hong Kong, Sea AI Lab, Salesforce Research
+    - Date: October 16, 2023
+    - Publisher: COLM 2024
+    - Relation: Adjacent to GUI research because it is a broader language-agent platform with web browsing as only one mode
+    - TLDR: OpenAgents is a deployment-oriented platform for language agents in everyday use, spanning data analysis, API-tool use, and web browsing behind a shared user interface. It matters here because it includes a web-agent mode, but the paper is broader than direct GUI-agent research.
+
 - [Reflexion: Language Agents with Verbal Reinforcement Learning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html)