Commit 7343a06

boyugou and claude committed
Switch source-of-truth to papers.yaml; site reads YAML directly
This is the structured-format migration the workflow has needed for a while. Adding a new paper used to mean conforming to a strict 9-field markdown schema where it was hard to capture extras (BibTeX, OpenReview link, dataset link, …). All of that lives natively in YAML and IDE/editor tooling renders it cleanly.

Format
------

papers.yaml is the new canonical store. Each entry:

  - title: …
    link: …                  # primary canonical link
    authors: […]
    institutions: […]
    date: "YYYY-MM-DD" | "YYYY-MM"
    publisher: …
    envs: [Web | Mobile | Desktop | "General GUI"]
    keywords: […]
    tldr: |
      multi-line summary
    arxiv_id: optional
    sources:                 # all keys optional
      arxiv: …
      openreview: …
      publisher_page: …
      homepage: …
      code: …
      dataset: …
    bibtex: |                # auto-generated by migration
      @inproceedings{…}
    bibtex_confirmed: false  # set true after verifying

adjacent.yaml is the same schema for non-canonical entries. Both files support contributors editing them directly to fix BibTeX, add missing sources, etc.

Pipeline
--------

scripts/migrate_to_yaml.py is the one-shot converter that produced the YAML from the legacy ALL_PAPERS.md format. It enriches the sources block from paper_db/papers/<id>/entity.json's canonical_links when available (OpenReview forum, GitHub repo, project homepage).

update_template_or_data/utils/scripts/sort_by_date.py was rewritten: it now reads papers.yaml + adjacent.yaml, sorts newest-first, writes them back, and emits *derived* ALL_PAPERS.md / ADJACENT_PAPERS.md mirrors without the bibtex blocks. paper_db's ingest_from_all_papers.py keeps working because the derived markdown files retain the original 9-field shape.

scripts/update_repo.sh now does just two steps: run sort_by_date.py, then assemble_readme.py. The legacy normalize_institutions.py and lint_keys.py were dropped from the canonical pipeline (they operated on ALL_PAPERS.md by line regex; they remain in scripts/ as ad-hoc utilities).

scripts/sync_dates_from_paper_db.py was ported to read/write papers.yaml so its --write mode keeps the source-of-truth canonical.

Site
----

site/src/lib/parsePapers.ts now parses YAML directly via js-yaml instead of regexing markdown. Each Paper now carries:

  - bibtex / bibtexConfirmed (used by the Copy BibTeX button)
  - sources map (arxiv / openreview / publisher_page / homepage / code / dataset; rendered as a labelled link row inside the expanded card)

Card UI:

  - Expanded card surfaces a "Sources" row with one entry per populated source key.
  - "Copy BibTeX" button shows a small "verified" badge when bibtex_confirmed is true. Title text differs by status so a hover reveals "Verified - copied from official source" vs "Auto-generated - please verify before citing".

Detail page builds bib via the same prefer-stored / synthesize fallback path.

Docs
----

CLAUDE.md (workspace root) and paper_repo/CLAUDE.md fully rewritten to describe the YAML schema, the new pipeline, and the verify-bibtex flow. The "How to Add a Paper" section now shows the YAML schema directly.

Dependencies updated (pyyaml in requirements.txt and pyproject.toml; js-yaml in site/package.json).

Tests
-----

tests/test_local_update_workflow.py replaced its two paper_by_* assertions with one that exercises the new YAML round-trip: papers.yaml in → derived ALL_PAPERS.md (no bibtex) + fragments out, plus legacy paper_by_* dirs cleaned up. All four tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
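The sort-and-derive step described under Pipeline can be sketched roughly as follows. This is a minimal illustration and not the repository's sort_by_date.py: it works on plain dicts instead of loading papers.yaml, the helper names (`date_key`, `sort_newest_first`, `to_markdown`) and the sample entries are hypothetical, and the rendered markdown is a simplified stand-in for the 9-field shape. Treating a "YYYY-MM" date as the first of the month is also an assumption.

```python
from datetime import date


def date_key(raw: str) -> date:
    """Parse "YYYY-MM-DD" or "YYYY-MM"; a missing day sorts as the 1st (assumption)."""
    parts = [int(p) for p in raw.split("-")]
    day = parts[2] if len(parts) == 3 else 1
    return date(parts[0], parts[1], day)


def sort_newest_first(papers: list[dict]) -> list[dict]:
    """Return entries newest-first; sorted() is stable, so ties keep input order."""
    return sorted(papers, key=lambda p: date_key(p["date"]), reverse=True)


def to_markdown(paper: dict) -> str:
    """Render a simplified mirror entry; the bibtex and sources keys are left out."""
    keys = ", ".join("[" + k + "]" for k in paper["keywords"])
    lines = [
        f"- [{paper['title']}]({paper['link']})",
        f"    - Date: {paper['date']}",
        f"    - Key: {keys}",
    ]
    return "\n".join(lines)


# Hypothetical sample entries, not taken from papers.yaml.
papers = [
    {"title": "Older paper", "link": "https://example.org/a", "date": "2023-10",
     "keywords": ["benchmark"], "bibtex": "@misc{older}"},
    {"title": "Newer paper", "link": "https://example.org/b", "date": "2025-08-08",
     "keywords": ["framework", "dataset"], "bibtex": "@misc{newer}"},
]

ordered = sort_newest_first(papers)
mirror = "\n\n".join(to_markdown(p) for p in ordered)
print(mirror)
```

Keeping the BibTeX only in the YAML store and leaving it out of the rendered mirror is what lets a markdown consumer keep parsing the old field layout, which is the property the commit message relies on for ingest_from_all_papers.py.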
1 parent f2b0523 commit 7343a06

22 files changed

Lines changed: 24706 additions & 756 deletions

ADJACENT_PAPERS.md

Lines changed: 27 additions & 27 deletions
@@ -109,15 +109,6 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [safety], [LLM agent], [monitoring], [red-teaming]
 - 📖 TLDR: Stress-tests LLM agent monitoring systems for detecting covert misbehavior using a monitor red-teaming (MRT) workflow varying agent/monitor awareness and adversarial evasion strategies, evaluated on SHADE-Arena for tool-calling agents and CUA-SHADE-Arena for computer-use agents.
 
-- [BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent](https://arxiv.org/abs/2508.06600)
-- Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
-- 🏛️ Institutions: University of Waterloo, CSIRO, Independent, Carnegie Mellon University, The University of Queensland
-- 📅 Date: August 08, 2025
-- 📑 Publisher: arXiv
-- 📌 Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
-- 🔑 Key: [benchmark], [dataset], [agentic search], [deep research], [BrowseComp-plus]
-- 📖 TLDR: Introduces **BrowseComp-Plus**, a fixed-corpus benchmark for evaluating deep-research agents. It enables controlled, fair, and transparent comparisons by providing human-verified supporting and challenging negative documents for each query. Results reveal significant performance variation - for example, an open-source model (Search-R1 + BM25) only achieves 3.86% accuracy, while GPT-5 reaches 55.9%, and GPT-5 with Qwen3-Embedding-8B retriever achieves 70.1% with fewer queries - highlighting the critical importance of retrieval quality and enabling disentangled analysis of retrieval vs. reasoning components.
-
 - [MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers](https://arxiv.org/abs/2508.14704)
 - Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
 - 🏛️ Institutions: Salesforce AI Research
@@ -127,6 +118,15 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [benchmark], [dataset], [framework], [long-horizon reasoning], [unknown-tools challenge], [execution-based evaluation], [MCP-universe]
 - 📖 TLDR: MCP-Universe introduces the first comprehensive benchmark for evaluating large language models (LLMs) through interactions with real-world Model Context Protocol (MCP) servers. It spans six core domains - Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching - across 11 MCP servers. The benchmark employs execution-based evaluators (format, static, dynamic) to rigorously assess agent performance. Despite progress, state-of-the-art models like GPT-5 (43.72% success), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) show significant limitations. The benchmark highlights challenges in long-context reasoning and unfamiliar tool handling, and provides an open-source extensible evaluation framework with UI support to accelerate future research.
 
+- [BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent](https://arxiv.org/abs/2508.06600)
+- Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
+- 🏛️ Institutions: University of Waterloo, CSIRO, Independent, Carnegie Mellon University, The University of Queensland
+- 📅 Date: August 08, 2025
+- 📑 Publisher: arXiv
+- 📌 Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
+- 🔑 Key: [benchmark], [dataset], [agentic search], [deep research], [BrowseComp-plus]
+- 📖 TLDR: Introduces **BrowseComp-Plus**, a fixed-corpus benchmark for evaluating deep-research agents. It enables controlled, fair, and transparent comparisons by providing human-verified supporting and challenging negative documents for each query. Results reveal significant performance variation - for example, an open-source model (Search-R1 + BM25) only achieves 3.86% accuracy, while GPT-5 reaches 55.9%, and GPT-5 with Qwen3-Embedding-8B retriever achieves 70.1% with fewer queries - highlighting the critical importance of retrieval quality and enabling disentangled analysis of retrieval vs. reasoning components.
+
 - [Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training](https://arxiv.org/abs/2508.00414)
 - Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu
 - 🏛️ Institutions: Tencent AI Lab
@@ -172,15 +172,6 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
 - 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
 
-- [Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions](https://aclanthology.org/2025.acl-long.1087/)
-- Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
-- 🏛️ Institutions: Shanghai Jiao Tong University, Meta
-- 📅 Date: August 05, 2024
-- 📑 Publisher: ACL 2025
-- 📌 Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
-- 🔑 Key: [safety], [robustness], [environmental distraction], [multimodal LLM agent]
-- 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
-
 - [TinyAgent: Function Calling at the Edge](https://aclanthology.org/2024.emnlp-demo.9/)
 - Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
 - 🏛️ Institutions: UC Berkeley, ICSI
@@ -199,6 +190,15 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [benchmark], [dataset], [visual foundation agents], [embodied tasks], [visual design], [VisualAgentBench], [VAB]
 - 📖 TLDR: VisualAgentBench benchmarks large multimodal models as general visual foundation agents across embodied tasks, GUI tasks, and visual design rather than focusing only on GUI interaction. It also releases trajectory data for behavior cloning, making it relevant to GUI work as a broader visual-agent benchmark rather than a direct GUI paper.
 
+- [Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions](https://aclanthology.org/2025.acl-long.1087/)
+- Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
+- 🏛️ Institutions: Shanghai Jiao Tong University, Meta
+- 📅 Date: August 05, 2024
+- 📑 Publisher: ACL 2025
+- 📌 Relation: Adjacent to GUI research (not part of the canonical direct-GUI main list)
+- 🔑 Key: [safety], [robustness], [environmental distraction], [multimodal LLM agent]
+- 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
+
 - [MindSearch: Mimicking Human Minds Elicits Deep AI Searcher](https://openreview.net/forum?id=xgtXkyqw1f)
 - Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao
 - 🏛️ Institutions: University of Science and Technology of China, Shanghai AI Laboratory
@@ -316,15 +316,6 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [position paper], [UI grammar], [mobile UI layout generation], [controllable generation]
 - 📖 TLDR: This position paper studies mobile UI layout generation with LLMs and proposes a UI grammar to represent hierarchical screen structure. Its focus is controllable UI generation rather than GUI action or environment interaction, so it stays outside the direct-GUI agent main list.
 
-- [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/abs/2310.10634)
-- Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
-- 🏛️ Institutions: The University of Hong Kong, Sea AI Lab, Salesforce Research
-- 📅 Date: October 16, 2023
-- 📑 Publisher: COLM 2024
-- 📌 Relation: Adjacent to GUI research because it is a broader language-agent platform with web browsing as only one mode
-- 🔑 Key: [platform], [deployment], [web browsing], [data analysis], [API tools], [OpenAgents]
-- 📖 TLDR: OpenAgents is a deployment-oriented platform for language agents in everyday use, spanning data analysis, API-tool use, and web browsing behind a shared user interface. It matters here because it includes a web-agent mode, but the paper is broader than direct GUI-agent research.
-
 - [Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441)
 - Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
 - 🏛️ Institutions: Microsoft Research
@@ -334,6 +325,15 @@ These entries are kept for reference only. The main list and generated environme
 - 🔑 Key: [visual prompting], [Set-of-Mark], [visual grounding], [zero-shot], [region marking]
 - 📖 TLDR: Introduces Set-of-Mark prompting, where segmented image regions are overlaid with explicit marks before being passed to a multimodal model. The paper shows that simple region marking can unlock much stronger zero-shot grounding from GPT-4V without fine-tuning.
 
+- [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/abs/2310.10634)
+- Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
+- 🏛️ Institutions: The University of Hong Kong, Sea AI Lab, Salesforce Research
+- 📅 Date: October 16, 2023
+- 📑 Publisher: COLM 2024
+- 📌 Relation: Adjacent to GUI research because it is a broader language-agent platform with web browsing as only one mode
+- 🔑 Key: [platform], [deployment], [web browsing], [data analysis], [API tools], [OpenAgents]
+- 📖 TLDR: OpenAgents is a deployment-oriented platform for language agents in everyday use, spanning data analysis, API-tool use, and web browsing behind a shared user interface. It matters here because it includes a web-agent mode, but the paper is broader than direct GUI-agent research.
+
 - [Reflexion: Language Agents with Verbal Reinforcement Learning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html)
 - Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
 - 🏛️ Institutions: Northeastern University, Massachusetts Institute of Technology, Princeton University
