Add arXiv Omi integration app#7438

Open
wly12312 wants to merge 2 commits into
BasedHardware:mainfrom
wly12312:omi-arxiv-app

@wly12312

Summary

  • add a standalone no-auth arXiv integration app under plugins/omi-arxiv-app
  • expose Omi chat tools for paper search, paper metadata lookup, and recent author papers
  • include Railway/Heroku deployment files and local usage docs

Validation

  • python -m py_compile plugins/omi-arxiv-app/main.py
  • git diff --check
  • FastAPI TestClient manifest checks for all 3 tools and type: object schemas
  • helper checks for string/invalid limits, arXiv paper ID cleanup, category validation, and manifest schemas
  • live arXiv smoke check for search_papers
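
The helper check for string and invalid limits could look roughly like the sketch below. The function name `safe_limit`, the default of 5, and the cap of 20 are assumptions made for illustration; the actual `_safe_limit` in main.py may behave differently.

```python
def safe_limit(value, default=5, maximum=20):
    """Coerce a user-supplied limit into an int within [1, maximum].

    Hypothetical stand-in for the PR's _safe_limit helper; the default
    and maximum values are illustrative assumptions.
    """
    try:
        limit = int(value)
    except (TypeError, ValueError, OverflowError):
        # None, non-numeric strings, NaN and infinity fall back to the default.
        return default
    # Clamp out-of-range values instead of rejecting them.
    return max(1, min(limit, maximum))


print(safe_limit("3"), safe_limit("abc"), safe_limit(100), safe_limit(0))
```

Clamping rather than raising keeps a chat tool usable when the model passes a limit as a string or asks for too many results.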

Candidate integration app for #3120.

@greptile-apps
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR adds a self-contained arXiv integration plugin under plugins/omi-arxiv-app that exposes three Omi chat tools — paper search, paper detail lookup, and author search — backed by the public arXiv Atom API with no auth or environment variables required.

  • main.py: FastAPI app with a shared httpx.AsyncClient managed by a lifespan context, input sanitization helpers (_safe_paper_id, _safe_category, _safe_limit), Atom feed XML parsing, and a /.well-known/omi-tools.json manifest endpoint.
  • Deployment files (Procfile, railway.toml, runtime.txt): ready for Railway or Heroku, pinning Python 3.11.9 and starting uvicorn on $PORT.
  • requirements.txt: fully pinned dependency set (fastapi, uvicorn, pydantic, httpx) with no conflicts.
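
For readers unfamiliar with the Atom parsing step, here is a minimal sketch using only the standard library. The function name `parse_entries`, the returned dictionary shape, and the sample feed are assumptions made for illustration, not the PR's actual `_parse_entries` or `_format_entry`.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_entries(feed_xml: str) -> list[dict]:
    """Extract id, title and author names from each <entry> in an Atom feed."""
    root = ET.fromstring(feed_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        authors = [
            (name.text or "").strip()
            for name in entry.findall("atom:author/atom:name", ATOM_NS)
        ]
        papers.append(
            {
                "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip(),
                # Titles can wrap across lines, so collapse runs of whitespace.
                "title": " ".join(raw_title.split()),
                "authors": authors,
            }
        )
    return papers


# Made-up sample feed for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/hep-th/9905001v1</id>
    <title>An  Example
      Title</title>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
  </entry>
</feed>"""

print(parse_entries(SAMPLE_FEED))
```

Collecting the author list once per entry, as done here, also sidesteps the double traversal flagged in the review comments.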

Confidence Score: 4/5

The plugin is a new standalone directory with no changes to the main backend; the only risk is in the arXiv integration logic itself, which is well-isolated.

The code is clean and well-structured. The two issues found — double XML traversal in _format_entry and silent rejection of old-style versioned arXiv IDs like hep-th/9905001v1 in _safe_paper_id — are both minor and non-blocking for the common case. No auth, secrets, or database access is involved.

plugins/omi-arxiv-app/main.py — specifically the _safe_paper_id version-stripping logic and _format_entry's double call to _entry_authors.

Important Files Changed

Filename | Overview
plugins/omi-arxiv-app/main.py | Core FastAPI app with three Omi chat tools backed by the arXiv Atom API; minor issues with double XML traversal in _format_entry and old-style versioned IDs silently rejected by _safe_paper_id
plugins/omi-arxiv-app/requirements.txt | Pins fastapi, uvicorn, pydantic, and httpx to specific versions; compatible set with no obvious conflicts
plugins/omi-arxiv-app/Procfile | Standard Heroku/Railway Procfile; starts uvicorn on $PORT with a sensible default of 8080
plugins/omi-arxiv-app/railway.toml | Railway deployment config with NIXPACKS builder, /health check, and ON_FAILURE restart policy
plugins/omi-arxiv-app/runtime.txt | Pins Python 3.11.9 for Heroku/Railway runtime
plugins/omi-arxiv-app/README.md | Clear local dev and deployment docs with working curl examples for all three tools
plugins/omi-arxiv-app/.gitignore | Standard Python gitignore for .venv, pycache, and .pyc files

Sequence Diagram

sequenceDiagram
    participant Omi as Omi Client
    participant App as omi-arxiv-app (FastAPI)
    participant arXiv as arXiv Atom API

    Omi->>App: GET /.well-known/omi-tools.json
    App-->>Omi: tool manifest (search_papers, get_paper_details, search_author)

    Omi->>App: POST /tools/search_papers
    App->>App: _build_search_query() sanitize inputs
    App->>arXiv: "GET /api/query?search_query=...&max_results=N"
    arXiv-->>App: Atom XML feed
    App->>App: _parse_entries() + _format_entry()
    App-->>Omi: "ChatToolResponse {result}"

    Omi->>App: POST /tools/get_paper_details
    App->>App: "_safe_paper_id() validate & clean"
    App->>arXiv: "GET /api/query?id_list=XXXX&max_results=1"
    arXiv-->>App: Atom XML feed
    App-->>Omi: "ChatToolResponse {result}"

    Omi->>App: POST /tools/search_author
    App->>arXiv: "GET /api/query?search_query=au:Name&sortBy=submittedDate"
    arXiv-->>App: Atom XML feed
    App-->>Omi: "ChatToolResponse {result}"

Reviews (1): Last reviewed commit: "Add arXiv Omi integration app"

Comment thread plugins/omi-arxiv-app/main.py Outdated
Comment on lines +158 to +160
authors = ", ".join(_entry_authors(entry)[:5]) or "unknown authors"
if len(_entry_authors(entry)) > 5:
    authors += ", et al."
Contributor

P2 _entry_authors is called twice on lines 158–160, traversing the XML tree twice. Storing the result in a variable avoids the redundant walk.

Suggested change
- authors = ", ".join(_entry_authors(entry)[:5]) or "unknown authors"
- if len(_entry_authors(entry)) > 5:
-     authors += ", et al."
+ all_authors = _entry_authors(entry)
+ authors = ", ".join(all_authors[:5]) or "unknown authors"
+ if len(all_authors) > 5:
+     authors += ", et al."

Author

Addressed in 9f1991b: _entry_authors(entry) is now stored once in all_authors and reused for formatting.

Comment thread plugins/omi-arxiv-app/main.py Outdated
candidate = candidate.removeprefix("https://arxiv.org/abs/")
candidate = candidate.removeprefix("http://arxiv.org/abs/")
candidate = candidate.removeprefix("arXiv:")
candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) else candidate
Contributor

P2 Old-style versioned arXiv IDs such as hep-th/9905001v1 or cs/9901001v2 pass through the version-strip branch unchanged (it only fires for new-format YYMM.NNNNvN IDs), so fullmatch on \d{7} fails and _safe_paper_id returns None. Users who paste a versioned legacy ID will get a confusing "invalid paper ID" error. Extend the split to cover old-format IDs as well.

Suggested change
- candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) else candidate
+ candidate = candidate.split("v", 1)[0] if re.match(r"^\d{4}\.\d{4,5}v\d+$", candidate) or re.match(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}v\d+$", candidate) else candidate

Author

Addressed in 9f1991b: _safe_paper_id now strips versions for both new-format and legacy arXiv IDs such as hep-th/9905001v1.
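
One possible way to strip versions from both ID formats is to capture the base ID with a single regular expression instead of splitting on "v". This is a sketch, not the code in 9f1991b; the name `strip_arxiv_version` and the exact pattern are assumptions. Capturing the base also avoids cutting an ID at a "v" that is part of the archive name (solv-int, for example), which a plain `split("v", 1)` would do.

```python
import re
from typing import Optional

# Base ID is either new-style (YYMM.NNNN or YYMM.NNNNN) or legacy
# (archive[.SUBJECT]/NNNNNNN); an optional vN suffix is matched but not captured.
_ARXIV_ID = re.compile(
    r"(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?"
)


def strip_arxiv_version(candidate: str) -> Optional[str]:
    """Return the ID without its version suffix, or None if it is not an arXiv ID."""
    match = _ARXIV_ID.fullmatch(candidate.strip())
    return match.group("base") if match else None


for raw in ("2301.12345v2", "hep-th/9905001v1", "cs/9901001v2", "not-an-id"):
    print(raw, "->", strip_arxiv_version(raw))
```

Because validation and version stripping happen in the same match, the function cannot return a partially cleaned string.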

Comment on lines +186 to +190
async def _request_arxiv(params: dict[str, Any]) -> str:
    client = await _get_arxiv_client()
    response = await client.get(ARXIV_API_URL, params=params)
    response.raise_for_status()
    return response.text
Contributor

P2 No rate-limit protection for arXiv API calls

arXiv's usage policy asks automated clients to stay under 3 requests per second. With no rate-limiting or retry-with-backoff logic here, concurrent users could collectively trigger 429/503 responses that surface as opaque httpx.HTTPError messages. Consider adding a simple delay between requests or at minimum returning a user-friendly message when arXiv returns a 429.

Author

Addressed in 9f1991b: arXiv requests are serialized with a small delay to stay below 3 req/s, and 429/503 responses now return user-friendly messages.
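
A minimal sketch of how serialized, spaced-out requests and friendly error messages might be structured, using only asyncio. The class `MinIntervalThrottle`, the function `friendly_error`, and the message wording are hypothetical and not taken from 9f1991b. The interval is a parameter, so it should be set from arXiv's current API terms of use rather than from this example.

```python
import asyncio
import time
from typing import Optional


class MinIntervalThrottle:
    """Serialize callers and keep at least `min_interval` seconds between them."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_start: Optional[float] = None

    async def wait(self) -> None:
        # Holding the lock while sleeping makes concurrent callers queue up.
        async with self._lock:
            if self._last_start is not None:
                elapsed = time.monotonic() - self._last_start
                remaining = self._min_interval - elapsed
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_start = time.monotonic()


# Illustrative wording; the messages in the PR may differ.
FRIENDLY_ERRORS = {
    429: "arXiv is rate-limiting requests right now. Please try again in a moment.",
    503: "arXiv is temporarily unavailable. Please try again shortly.",
}


def friendly_error(status_code: int) -> str:
    """Map an upstream HTTP status to a message suitable for a chat reply."""
    return FRIENDLY_ERRORS.get(
        status_code, f"arXiv request failed with HTTP {status_code}."
    )
```

In a handler, `await throttle.wait()` would precede the `client.get(...)` call, and the response status would be passed through `friendly_error` before building the tool response. Note that a single in-process throttle only limits one worker; several uvicorn workers would each keep their own timer.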
