Add arXiv Omi integration app #7438
Greptile Summary
This PR adds a self-contained arXiv integration plugin under
Confidence Score: 4/5
The plugin is a new standalone directory with no changes to the main backend; the only risk is in the arXiv integration logic itself, which is well-isolated. The code is clean and well-structured. The two issues found (double XML traversal in _format_entry, and silent rejection of old-style versioned arXiv IDs like hep-th/9905001v1 in _safe_paper_id) are both minor and non-blocking for the common case. No auth, secrets, or database access is involved.
plugins/omi-arxiv-app/main.py: specifically the _safe_paper_id version-stripping logic and _format_entry's double call to _entry_authors.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Omi as Omi Client
participant App as omi-arxiv-app (FastAPI)
participant arXiv as arXiv Atom API
Omi->>App: GET /.well-known/omi-tools.json
App-->>Omi: tool manifest (search_papers, get_paper_details, search_author)
Omi->>App: POST /tools/search_papers
App->>App: _build_search_query() sanitize inputs
App->>arXiv: "GET /api/query?search_query=...&max_results=N"
arXiv-->>App: Atom XML feed
App->>App: _parse_entries() + _format_entry()
App-->>Omi: "ChatToolResponse {result}"
Omi->>App: POST /tools/get_paper_details
App->>App: "_safe_paper_id() validate & clean"
App->>arXiv: "GET /api/query?id_list=XXXX&max_results=1"
arXiv-->>App: Atom XML feed
App-->>Omi: "ChatToolResponse {result}"
Omi->>App: POST /tools/search_author
App->>arXiv: "GET /api/query?search_query=au:Name&sortBy=submittedDate"
arXiv-->>App: Atom XML feed
App-->>Omi: "ChatToolResponse {result}"
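The search_papers leg of the diagram (sanitize inputs, then call `/api/query` with `search_query` and `max_results`) could look roughly like the sketch below. It is not the plugin's `_build_search_query`: the function name `build_search_url`, the base URL value, the `all:` phrase query, and the cap of 20 results are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed value; the PR only shows the constant's name, not its contents.
ARXIV_API_URL = "https://export.arxiv.org/api/query"


def build_search_url(query: str, max_results: int = 5) -> str:
    """Build an arXiv query URL from untrusted user input (hypothetical helper)."""
    # Drop double quotes so the input cannot break out of the quoted phrase,
    # then collapse runs of whitespace into single spaces.
    cleaned = " ".join(query.replace('"', " ").split())
    # Clamp the page size so one tool call cannot request an unbounded feed.
    limit = max(1, min(max_results, 20))
    params = {
        "search_query": f'all:"{cleaned}"',
        "start": 0,
        "max_results": limit,
    }
    # urlencode percent-escapes the colon and quotes and turns spaces into "+".
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

Clamping `max_results` on the server side keeps the response size predictable regardless of what the client sends.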
Reviews (1): Last reviewed commit: "Add arXiv Omi integration app"
authors = ", ".join(_entry_authors(entry)[:5]) or "unknown authors"
if len(_entry_authors(entry)) > 5:
    authors += ", et al."
_entry_authors is called twice on lines 158–160, traversing the XML tree twice. Storing the result in a variable avoids the redundant walk.
- authors = ", ".join(_entry_authors(entry)[:5]) or "unknown authors"
- if len(_entry_authors(entry)) > 5:
-     authors += ", et al."
+ all_authors = _entry_authors(entry)
+ authors = ", ".join(all_authors[:5]) or "unknown authors"
+ if len(all_authors) > 5:
+     authors += ", et al."
Addressed in 9f1991b: _entry_authors(entry) is now stored once in all_authors and reused for formatting.
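For context, a self-contained sketch of the single-traversal pattern against an Atom entry follows. The helper names `entry_authors` and `format_authors` are hypothetical stand-ins for the plugin's `_entry_authors` and the relevant part of `_format_entry`, whose actual bodies are not shown in this thread; only the Atom namespace URI and the formatting rule from the suggestion are taken as given.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace, as used in Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def entry_authors(entry: ET.Element) -> list[str]:
    """Collect non-empty author names from one Atom <entry> (hypothetical helper)."""
    names = []
    for name in entry.findall("atom:author/atom:name", ATOM_NS):
        text = (name.text or "").strip()
        if text:
            names.append(text)
    return names


def format_authors(entry: ET.Element, limit: int = 5) -> str:
    """Format at most `limit` authors, walking the XML tree only once."""
    all_authors = entry_authors(entry)  # single traversal, result reused below
    authors = ", ".join(all_authors[:limit]) or "unknown authors"
    if len(all_authors) > limit:
        authors += ", et al."
    return authors
```

An entry with no `<author>` elements falls through to the "unknown authors" default, because joining an empty list yields an empty string.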
candidate = candidate.removeprefix("https://arxiv.org/abs/")
candidate = candidate.removeprefix("http://arxiv.org/abs/")
candidate = candidate.removeprefix("arXiv:")
candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) else candidate
Old-style versioned arXiv IDs such as hep-th/9905001v1 or cs/9901001v2 pass through the version-strip branch unchanged (it only fires for new-format YYMM.NNNNvN IDs), so fullmatch on \d{7} fails and _safe_paper_id returns None. Users who paste a versioned legacy ID will get a confusing "invalid paper ID" error. Extend the split to cover old-format IDs as well.
- candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) else candidate
+ candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) or re.match(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}v\d+$", candidate) else candidate
Addressed in 9f1991b: _safe_paper_id now strips versions for both new-format and legacy arXiv IDs such as hep-th/9905001v1.
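One caveat with `split("v", 1)` on legacy IDs is that it cuts at the first "v" in the string, so an archive name that itself contains a "v" (for example solv-int/9901001v2) would be truncated to "sol". A possible alternative, sketched below, strips only a trailing `vN` suffix with an anchored regex. This is not necessarily what 9f1991b does; the function name `normalize_paper_id` and the compiled pattern names are assumptions, while the ID patterns are the ones quoted in the suggestion above.

```python
import re
from typing import Optional

NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}")                # e.g. 2101.00001
OLD_STYLE = re.compile(r"[a-z\-]+(\.[A-Z]{2})?/\d{7}")   # e.g. hep-th/9905001
VERSION_SUFFIX = re.compile(r"v\d+$")                    # trailing v1, v2, ...


def normalize_paper_id(raw: str) -> Optional[str]:
    """Return a bare arXiv ID without version suffix, or None if it is invalid."""
    candidate = raw.strip()
    for prefix in ("https://arxiv.org/abs/", "http://arxiv.org/abs/", "arXiv:"):
        candidate = candidate.removeprefix(prefix)
    # Anchored at the end of the string, so a "v" inside the archive name is untouched.
    candidate = VERSION_SUFFIX.sub("", candidate)
    if NEW_STYLE.fullmatch(candidate) or OLD_STYLE.fullmatch(candidate):
        return candidate
    return None
```

Stripping the suffix before validation means a single `fullmatch` per format covers both versioned and unversioned input.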
async def _request_arxiv(params: dict[str, Any]) -> str:
    client = await _get_arxiv_client()
    response = await client.get(ARXIV_API_URL, params=params)
    response.raise_for_status()
    return response.text
No rate-limit protection for arXiv API calls
arXiv's API terms of use ask automated clients to make no more than one request every three seconds over a single connection. With no rate-limiting or retry-with-backoff logic here, concurrent users could collectively trigger 429/503 responses that surface as opaque httpx.HTTPError messages. Consider adding a simple delay between requests or at minimum returning a user-friendly message when arXiv returns a 429.
Addressed in 9f1991b: arXiv requests are serialized with a small delay to stay below 3 req/s, and 429/503 responses now return user-friendly messages.
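The general shape of such a fix (serialize outbound calls behind a lock, enforce a minimum gap, and map 429/503 to readable text) is sketched below using only the standard library. The class name `RequestThrottle`, the helper names, and the message wording are hypothetical and not taken from the commit; the minimum interval is left as a parameter to be set according to arXiv's current usage policy.

```python
import asyncio
import time
from typing import Optional


class RequestThrottle:
    """Serialize callers and keep at least `min_interval` seconds between them."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_start: Optional[float] = None

    async def wait_turn(self) -> None:
        # The lock makes concurrent callers queue, so the gap is enforced globally.
        async with self._lock:
            if self._last_start is not None:
                remaining = self._min_interval - (time.monotonic() - self._last_start)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_start = time.monotonic()


def friendly_error(status_code: int) -> str:
    """Translate an upstream HTTP status into a message suitable for end users."""
    if status_code == 429:
        return "arXiv is rate-limiting requests right now. Please try again in a few seconds."
    if status_code == 503:
        return "arXiv is temporarily unavailable. Please try again shortly."
    return f"arXiv returned an unexpected error (HTTP {status_code})."


async def measure_burst(calls: int, min_interval: float) -> float:
    """Fire `calls` concurrent waiters and return the total elapsed seconds."""
    throttle = RequestThrottle(min_interval)  # created inside the running loop
    start = time.monotonic()
    await asyncio.gather(*(throttle.wait_turn() for _ in range(calls)))
    return time.monotonic() - start
```

In a request helper, `await throttle.wait_turn()` would run immediately before the HTTP call, and `friendly_error(response.status_code)` would replace the raw exception text in the tool response. A single in-process throttle only covers one worker; multiple processes would need shared coordination.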
Summary
plugins/omi-arxiv-app
Validation
python -m py_compile plugins/omi-arxiv-app/main.py
git diff --check
type: object
schemas
search_papers
Candidate integration app for #3120.