Add ArXiv search MCP tool with tests and dependency#1387
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new ArXiv MCP server module with four tools (search, get, sections, read-chunk), accompanying tests and docs, and updates CI/workflow to conditionally use GHCR PAT, adjusts install extras, and adds a pytest Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant MCP as "MCP Server\n(arxiv tool)"
participant Cache as "Process Cache"
participant RL as "RateLimiter\n(file lock / asyncio)"
participant OpenAlex
participant arXiv as "arXiv Atom/HTML"
Client->>MCP: call arxiv-search / arxiv-get (query or paper_id)
MCP->>Cache: check cache(key=sha256(inputs))
alt cache hit
Cache-->>MCP: return cached result
MCP-->>Client: respond with cached result
else cache miss
MCP->>RL: acquire rate limit slot
RL-->>MCP: granted
alt prefer OpenAlex
MCP->>OpenAlex: request metadata (retries/backoff)
alt OpenAlex 429 / budget error
MCP->>OpenAlex: retry without api_key (polite tier)
end
OpenAlex-->>MCP: metadata or not-found
else fallback to arXiv
MCP->>arXiv: fetch Atom or HTML (ar5iv fallback)
arXiv-->>MCP: atom/xml or html
end
MCP->>MCP: normalize IDs, extract/reconstruct abstract, extract sections/chunks
MCP->>Cache: store result (LRU eviction if needed)
RL-->>MCP: release
MCP-->>Client: return assembled response
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)
130-131: Cache grows but never evicts entries.The cache stops accepting new entries at
_CACHE_MAX_SIZEbut never removes old ones. In long-running processes, the first 256 queries will be cached permanently, and all subsequent unique queries will bypass caching entirely.Consider using an LRU cache or periodic eviction if cache effectiveness matters for repeated queries over time.
♻️ Optional: Use functools.lru_cache or implement LRU eviction
+from functools import lru_cache + +@lru_cache(maxsize=_CACHE_MAX_SIZE) +def _cached_search(query: str, max_results: int) -> str: + # Move fetch logic here + ...Or implement simple LRU by removing oldest entry when full.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 130 - 131, The in-memory dictionary _cache currently never evicts entries when full (checked against _CACHE_MAX_SIZE), so implement LRU eviction: switch _cache to an ordered mapping (e.g., collections.OrderedDict) or use functools.lru_cache on the function that computes result_str, then when inserting into _cache (where code currently does if len(_cache) < _CACHE_MAX_SIZE: _cache[key] = result_str) ensure that if len(_cache) >= _CACHE_MAX_SIZE you remove the oldest entry (popitem(last=False) for OrderedDict) before assigning _cache[key] = result_str; keep the existing key/result_str logic and names (_cache, _CACHE_MAX_SIZE, key, result_str) so the rest of the code continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_mcp_clients.py`:
- Around line 1035-1039: The test test_arxiv_get_invalid_id calls the real
arxiv_get function and should be marked as a live/network test; add the
pytest.mark.live marker decorator to the test function (e.g., annotate
test_arxiv_get_invalid_id with `@pytest.mark.live`) so CI can skip it when live
tests are disabled and to match other live tests in the suite.
---
Nitpick comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 130-131: The in-memory dictionary _cache currently never evicts
entries when full (checked against _CACHE_MAX_SIZE), so implement LRU eviction:
switch _cache to an ordered mapping (e.g., collections.OrderedDict) or use
functools.lru_cache on the function that computes result_str, then when
inserting into _cache (where code currently does if len(_cache) <
_CACHE_MAX_SIZE: _cache[key] = result_str) ensure that if len(_cache) >=
_CACHE_MAX_SIZE you remove the oldest entry (popitem(last=False) for
OrderedDict) before assigning _cache[key] = result_str; keep the existing
key/result_str logic and names (_cache, _CACHE_MAX_SIZE, key, result_str) so the
rest of the code continues to work.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 33c83249-85db-47b9-afce-0f5380f99ce3
📒 Files selected for processing (5)
.github/workflows/tests.ymldocs/agentic_inference/tool_calling.mdnemo_skills/mcp/servers/arxiv_tool.pytests/test_mcp_clients.pytools/requirements.txt
Add ArxivSearchTool as a first-party MCP server for scientific paper retrieval via the free arXiv API (no API key required). Includes: - New nemo_skills/mcp/servers/arxiv_tool.py with arxiv-search and arxiv-get tools, rate limiting, retry logic, and response caching - Declare `arxiv` dependency in tools/requirements.txt - CI: install [tools] extra so MCP tool deps are available in tests - Tests: config validation, offline stdio subprocess integration (list_tools, hide_args), invalid-id handling, and live API tests - Docs: add ArxivSearchTool to built-in tools reference Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com> Made-with: Cursor
0ca4934 to
b95bc5d
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)
132-133: Consider adding simple eviction when cache reaches capacity.Current behavior stops caching new keys once full, which quickly reduces cache usefulness. A tiny FIFO/LRU-style eviction would keep hit-rate stable.
Also applies to: 169-170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 132 - 133, The cache currently skips adding new entries when full; modify the insertion logic around _cache, _CACHE_MAX_SIZE, key, and result_str so that when len(_cache) >= _CACHE_MAX_SIZE you evict an entry before inserting (e.g., FIFO by popping the oldest key or LRU via OrderedDict.move_to_end/popitem). Update both insertion sites (the block using if len(_cache) < _CACHE_MAX_SIZE: _cache[key] = result_str and the similar block at the other occurrence) to perform eviction then _cache[key] = result_str, ensuring the eviction policy is consistent and thread-safe if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 101-103: The code currently caps max_results with MAX_RESULTS but
doesn't check for non-positive values; add an explicit validation in the same
function that handles arXiv queries (look for the max_results parameter and the
existing MAX_RESULTS cap in arxiv_tool.py) to raise a clear error (e.g.,
ValueError) if max_results is less than 1 before proceeding or calling the arXiv
API, then apply the existing upper-bound cap (max_results = min(max_results,
MAX_RESULTS)).
- Around line 127-136: Remove the broad "except Exception" handlers that convert
unexpected failures into error strings; specifically, in the block that calls
_with_retry(_fetch) (which builds result_str and caches into _cache up to
_CACHE_MAX_SIZE) and the similar handler later near line 172, delete or replace
the generic except clauses so unexpected errors (AttributeError, TypeError,
etc.) propagate instead of returning "ArXiv search failed: ...". If you must
handle errors, catch only the specific expected exceptions (e.g., network/HTTP
exceptions already handled by _with_retry) or re-raise after logging; do not
swallow all Exceptions or return error strings.
---
Nitpick comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 132-133: The cache currently skips adding new entries when full;
modify the insertion logic around _cache, _CACHE_MAX_SIZE, key, and result_str
so that when len(_cache) >= _CACHE_MAX_SIZE you evict an entry before inserting
(e.g., FIFO by popping the oldest key or LRU via
OrderedDict.move_to_end/popitem). Update both insertion sites (the block using
if len(_cache) < _CACHE_MAX_SIZE: _cache[key] = result_str and the similar block
at the other occurrence) to perform eviction then _cache[key] = result_str,
ensuring the eviction policy is consistent and thread-safe if needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a9da9a12-b23c-4042-b042-3f8bf2b7548f
📒 Files selected for processing (5)
.github/workflows/tests.ymldocs/agentic_inference/tool_calling.mdnemo_skills/mcp/servers/arxiv_tool.pytests/test_mcp_clients.pytools/requirements.txt
✅ Files skipped from review due to trivial changes (2)
- tools/requirements.txt
- docs/agentic_inference/tool_calling.md
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/tests.yml
- tests/test_mcp_clients.py
…dling - Replace stop-at-capacity cache with LRU eviction via OrderedDict - Clamp max_results to [1, MAX_RESULTS] instead of only capping upper bound - Remove broad except-Exception handlers; let unexpected errors propagate (expected transient errors are already handled by _with_retry) - Add @pytest.mark.live to test_arxiv_get_invalid_id (makes real API call) - Wrap stdio integration tests in try/finally for guaranteed shutdown Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com> Made-with: Cursor
There was a problem hiding this comment.
♻️ Duplicate comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)
77-95:⚠️ Potential issue | 🟠 MajorNarrow retry exception handling to transient failures only.
Line 84 catches all exceptions, so non-transient bugs/input issues can be retried for up to ~135s before failing. Restrict retries to explicit transient exceptions and let unexpected failures surface immediately.
Proposed fix
@@ _MAX_DELAY = 60.0 _CACHE_MAX_SIZE = 256 +_RETRYABLE_ERRORS = (ConnectionError, TimeoutError, OSError) @@ def _with_retry(fn): @@ - except Exception as e: + except _RETRYABLE_ERRORS as e: if attempt == _NUM_RETRIES: raise logger.warning(As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 77 - 95, The _with_retry helper currently catches all exceptions; change it to only retry on transient/network-related errors by catching a specific tuple of transient exception types (e.g., requests.exceptions.RequestException, socket.timeout, TimeoutError, http.client.HTTPException or your project's equivalent) and re-raising any other exception immediately so non-transient input/logic errors surface; keep the existing backoff logic and calls to _rate_limit, preserve use of _INITIAL_DELAY/_NUM_RETRIES/_MAX_DELAY, and continue to log retries via logger when a transient exception is caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 77-95: The _with_retry helper currently catches all exceptions;
change it to only retry on transient/network-related errors by catching a
specific tuple of transient exception types (e.g.,
requests.exceptions.RequestException, socket.timeout, TimeoutError,
http.client.HTTPException or your project's equivalent) and re-raising any other
exception immediately so non-transient input/logic errors surface; keep the
existing backoff logic and calls to _rate_limit, preserve use of
_INITIAL_DELAY/_NUM_RETRIES/_MAX_DELAY, and continue to log retries via logger
when a transient exception is caught.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b0d39d84-46ef-4e5d-a8c1-335314f9b9df
📒 Files selected for processing (2)
nemo_skills/mcp/servers/arxiv_tool.pytests/test_mcp_clients.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_mcp_clients.py
Use direct arXiv/OpenAlex API clients with bounded paper chunking and shared request throttling so retrieval stays robust under parallel eval workloads. Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com> Made-with: Cursor
99cf1de to
1cfe897
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)
537-540:⚠️ Potential issue | 🟡 MinorReject invalid numeric inputs instead of silently rewriting them.
0/negative values are currently coerced into defaults or minima (max_results=0 -> 1,max_sections=0 -> 40,max_chars=0 -> PAPER_CHUNK_LIMIT, negativeoffset -> 0), so a bad MCP call looks valid instead of failing fast.As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
Also applies to: 616-616, 663-664
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 537 - 540, Replace the silent clamping logic for numeric args with explicit validation that raises an error: instead of changing max_results when it exceeds MAX_RESULTS or is <1, validate in the input handling code (the check around max_results/MAX_RESULTS) and raise a ValueError (or custom BadRequest) with a clear message like "max_results must be between 1 and {MAX_RESULTS}"; apply the same pattern to max_sections, max_chars (vs PAPER_CHUNK_LIMIT), and offset so invalid or out-of-range user inputs are rejected rather than coerced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Line 87: _paper_cache is an unbounded dict that stores full extracted paper
text and will leak memory; replace it with a bounded LRU cache (e.g., use
cachetools.LRUCache(maxsize=...) or implement an OrderedDict-based LRU) and
update all uses of _paper_cache to call its .get/.__setitem__ semantics so older
entries are evicted when the cache exceeds maxsize; alternatively store only
metadata/paths and lazy-load full text to avoid pinning large strings in memory.
- Around line 220-225: The extraction in _extract_arxiv_id() lowercases the
whole `low` and then manipulates `pid`, causing old-style IDs like
"math.OV/0301234v1" to become "math.ov/0301234v1" and then get truncated by
naive split("v")[0]; to fix, only lowercase/normalize the URL host portion but
preserve the original case of the arXiv identifier portion, and replace the
blind split("v") logic with a targeted version-stripping that matches a trailing
"v<digits>" (e.g., use a regex to remove /v\d+$) so old-style IDs (with
subject/class before the slash) remain intact; update the code paths around the
`low` and `pid` handling in _extract_arxiv_id() (and the analogous block at the
other occurrence) to implement these changes.
- Around line 584-593: The fallback after _http_get_json currently only checks a
bare new-style regex on paper_id and misses other arXiv forms; update the
exception handling to call _extract_arxiv_id(paper_id) and, if it returns a
non-empty arXiv id, pass that id into _arxiv_api_get instead of the bare regex
check. Specifically, in the except block around the _http_get_json call, replace
the re.match(...) check with extraction via _extract_arxiv_id(paper_id.strip()),
then try await _arxiv_api_get(extracted_id) on success and include both OpenAlex
and arXiv errors in the returned message as before.
---
Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 537-540: Replace the silent clamping logic for numeric args with
explicit validation that raises an error: instead of changing max_results when
it exceeds MAX_RESULTS or is <1, validate in the input handling code (the check
around max_results/MAX_RESULTS) and raise a ValueError (or custom BadRequest)
with a clear message like "max_results must be between 1 and {MAX_RESULTS}";
apply the same pattern to max_sections, max_chars (vs PAPER_CHUNK_LIMIT), and
offset so invalid or out-of-range user inputs are rejected rather than coerced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5a2486d4-f229-4535-9cea-e4201822009c
📒 Files selected for processing (3)
nemo_skills/mcp/servers/arxiv_tool.pypyproject.tomltests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_mcp_clients.py
There was a problem hiding this comment.
♻️ Duplicate comments (5)
nemo_skills/mcp/servers/arxiv_tool.py (5)
554-557:⚠️ Potential issue | 🟠 MajorAvoid broad
except Exceptionin tool entrypoints.Lines 554, 630, and 682 catch all exceptions and convert them to strings. That can hide unexpected defects instead of surfacing them.
🔧 Suggested fix (example pattern)
- try: - result = await _arxiv_api_search(query, max_results) - except Exception as e: - return f"arXiv search failed: {e}" + try: + result = await _arxiv_api_search(query, max_results) + except (httpx.RequestError, httpx.TimeoutException, RuntimeError, ValueError) as e: + return f"arXiv search failed: {e}"Apply the same narrowing to
arxiv_sectionsandarxiv_read_chunk.As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."
Also applies to: 628-631, 680-683
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 554 - 557, The broad except blocks in the arXiv tool entrypoints (the call to _arxiv_api_search in arxiv_search, and similar blocks in arxiv_sections and arxiv_read_chunk) should be replaced with narrow exception handling for expected failures (e.g., network/HTTP errors like aiohttp.ClientError or requests.exceptions.RequestException, asyncio.TimeoutError, and known parsing/value errors) and allow unexpected exceptions to propagate; update each try/except to catch only these specific exceptions, convert those to the existing user-facing error string, and re-raise any other exceptions (or omit a catch) so real bugs are not silently swallowed.
596-599:⚠️ Potential issue | 🟠 MajorUse
_extract_arxiv_id()in the OpenAlex fallback path.Line 596 only matches bare new-style IDs, so valid inputs like
arXiv:..., arxiv.org URLs, DOI-form arXiv ids, and old-style IDs skip native arXiv fallback.🔧 Suggested fix
- if re.match(r"^\d{4}\.\d{4,5}(v\d+)?$", paper_id.strip()): + if arxiv_id := _extract_arxiv_id(paper_id): try: - return await _arxiv_api_get(paper_id.strip()) + return await _arxiv_api_get(arxiv_id) except Exception as e2: return f"Paper lookup failed (OpenAlex: {e}; arXiv: {e2})."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 596 - 599, The current arXiv native fallback only attempts _arxiv_api_get when paper_id matches strict new-style IDs, missing inputs like "arXiv:...", arxiv.org URLs, DOI-form IDs, or old-style IDs; update the OpenAlex fallback branch so it calls _extract_arxiv_id(paper_id) and, if that returns a non-empty id, pass that extracted id into _arxiv_api_get (i.e., replace the strict re.match gate with using _extract_arxiv_id to normalize paper_id before calling _arxiv_api_get); ensure you still catch exceptions around the await _arxiv_api_get(...) call as currently done.
87-87:⚠️ Potential issue | 🟠 MajorBound
_paper_cacheto avoid unbounded memory growth.Line 87 defines
_paper_cacheas unbounded, and Line 262 stores full paper text per unique id. In a long-lived MCP process this can grow without limit.🔧 Suggested fix
+from collections import OrderedDict @@ -_paper_cache: dict[str, tuple[str, str]] = {} +PAPER_CACHE_MAX_SIZE = 64 +_paper_cache: OrderedDict[str, tuple[str, str]] = OrderedDict() @@ - if cache_key in _paper_cache: - return _paper_cache[cache_key] + if cache_key in _paper_cache: + _paper_cache.move_to_end(cache_key) + return _paper_cache[cache_key] @@ - _paper_cache[cache_key] = (url, text) + _paper_cache[cache_key] = (url, text) + _paper_cache.move_to_end(cache_key) + if len(_paper_cache) > PAPER_CACHE_MAX_SIZE: + _paper_cache.popitem(last=False) return url, textAlso applies to: 262-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` at line 87, _paper_cache is an unbounded dict that will grow indefinitely (entries added around where full paper text is stored at lines ~262-263); replace it with a bounded cache (e.g., an LRU with a fixed max size) or enforce eviction on insert so only recent N papers are kept, or change stored value to a lightweight summary/metadata instead of full text. Update the global _paper_cache declaration and the code paths that write to it (refer to _paper_cache and the code that stores the full paper text around lines 262-263) to use the bounded cache/eviction policy and ensure concurrent access remains safe if used from multiple coroutines/threads.
547-548:⚠️ Potential issue | 🟡 MinorReject invalid
max_resultsinstead of silently rewriting it.Line 547-548 coerces invalid values to
1. Invalid user input should fail clearly rather than being silently changed.🔧 Suggested fix
- if max_results < 1: - max_results = 1 + if max_results < 1: + raise ValueError("max_results must be >= 1")As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 547 - 548, The code currently coerces invalid max_results to 1 (the "if max_results < 1" block) which hides bad user input; change this to reject invalid values by raising a clear exception (e.g., raise ValueError("max_results must be >= 1, got {max_results}")) where max_results is validated, so callers see the error instead of silently continuing; update any docstring or error handling in the surrounding function that references max_results to reflect this validation.
225-229:⚠️ Potential issue | 🟠 MajorOld-style arXiv IDs can be corrupted and collide in cache keys.
Line 225/228 split using
low(lowercased input), and Line 243 usessplit("v")[0]. This can alias old-style IDs (e.g., subject class containingv) and return wrong cached text.🔧 Suggested fix
- if "arxiv.org/abs/" in low or "arxiv.org/pdf/" in low or "arxiv.org/html/" in low: - pid = re.split(r"arxiv\.org/(?:abs|pdf|html)/", low, maxsplit=1)[-1] + if "arxiv.org/abs/" in low or "arxiv.org/pdf/" in low or "arxiv.org/html/" in low: + pid = re.split(r"arxiv\.org/(?:abs|pdf|html)/", pid, flags=re.I, maxsplit=1)[-1] pid = pid.split("?")[0].split("#")[0].replace(".pdf", "") if "10.48550/arxiv." in low: - pid = re.split(r"10\.48550/arxiv\.", low, flags=re.I, maxsplit=1)[-1] + pid = re.split(r"10\.48550/arxiv\.", pid, flags=re.I, maxsplit=1)[-1] @@ - cache_key = arxiv_id.split("v")[0] + cache_key = re.sub(r"v\d+$", "", arxiv_id, flags=re.I)Also applies to: 243-243
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 225 - 229, The code currently lowercases the input into low and then uses naive splits (pid = re.split(..., low)[-1] and pid.split("v")[0]) which can corrupt old-style arXiv IDs (subject-class forms containing "v") and cause cache key collisions; update arxiv_tool.py to extract the identifier with a single regex capture from the original URL/string (not blindly lowercasing) — e.g., capture after arxiv.org/(abs|pdf|html)/ or 10.48550/arxiv. and strip query/fragments and ".pdf" — and remove version suffix only via a targeted regex that strips a trailing "v" followed by digits (e.g., r'^(.*?)(?:v\d+)?$') instead of using split("v")[0]; adjust uses of variables pid and low accordingly so old-style IDs keep their internal "v" characters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 554-557: The broad except blocks in the arXiv tool entrypoints
(the call to _arxiv_api_search in arxiv_search, and similar blocks in
arxiv_sections and arxiv_read_chunk) should be replaced with narrow exception
handling for expected failures (e.g., network/HTTP errors like
aiohttp.ClientError or requests.exceptions.RequestException,
asyncio.TimeoutError, and known parsing/value errors) and allow unexpected
exceptions to propagate; update each try/except to catch only these specific
exceptions, convert those to the existing user-facing error string, and re-raise
any other exceptions (or omit a catch) so real bugs are not silently swallowed.
- Around line 596-599: The current arXiv native fallback only attempts
_arxiv_api_get when paper_id matches strict new-style IDs, missing inputs like
"arXiv:...", arxiv.org URLs, DOI-form IDs, or old-style IDs; update the OpenAlex
fallback branch so it calls _extract_arxiv_id(paper_id) and, if that returns a
non-empty id, pass that extracted id into _arxiv_api_get (i.e., replace the
strict re.match gate with using _extract_arxiv_id to normalize paper_id before
calling _arxiv_api_get); ensure you still catch exceptions around the await
_arxiv_api_get(...) call as currently done.
- Line 87: _paper_cache is an unbounded dict that will grow indefinitely
(entries added around where full paper text is stored at lines ~262-263);
replace it with a bounded cache (e.g., an LRU with a fixed max size) or enforce
eviction on insert so only recent N papers are kept, or change stored value to a
lightweight summary/metadata instead of full text. Update the global
_paper_cache declaration and the code paths that write to it (refer to
_paper_cache and the code that stores the full paper text around lines 262-263)
to use the bounded cache/eviction policy and ensure concurrent access remains
safe if used from multiple coroutines/threads.
- Around line 547-548: The code currently coerces invalid max_results to 1 (the
"if max_results < 1" block) which hides bad user input; change this to reject
invalid values by raising a clear exception (e.g., raise ValueError("max_results
must be >= 1, got {max_results}")) where max_results is validated, so callers
see the error instead of silently continuing; update any docstring or error
handling in the surrounding function that references max_results to reflect this
validation.
- Around line 225-229: The code currently lowercases the input into low and then
uses naive splits (pid = re.split(..., low)[-1] and pid.split("v")[0]) which can
corrupt old-style arXiv IDs (subject-class forms containing "v") and cause cache
key collisions; update arxiv_tool.py to extract the identifier with a single
regex capture from the original URL/string (not blindly lowercasing) — e.g.,
capture after arxiv.org/(abs|pdf|html)/ or 10.48550/arxiv. and strip
query/fragments and ".pdf" — and remove version suffix only via a targeted regex
that strips a trailing "v" followed by digits (e.g., r'^(.*?)(?:v\d+)?$')
instead of using split("v")[0]; adjust uses of variables pid and low accordingly
so old-style IDs keep their internal "v" characters.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8cce5a2-ee4b-47cb-b29d-ed1e91169191
📒 Files selected for processing (3)
nemo_skills/mcp/servers/arxiv_tool.pypyproject.tomltests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
| # workspace mount when available; fall back to /tmp for local smoke tests. | ||
| ARXIV_REQUEST_INTERVAL = float(os.getenv("ARXIV_REQUEST_INTERVAL", "4.0")) | ||
| ARXIV_RATE_LIMIT_LOCK = os.getenv( | ||
| "ARXIV_RATE_LIMIT_LOCK", |
There was a problem hiding this comment.
Have you tested on clusters? just to make sure you can write to tempfile.gettempdir() on cluster and it exists.
There was a problem hiding this comment.
this also looks like a global lock. If you have 20 requests of arxiv, are they basically blocking each other?
There was a problem hiding this comment.
The default lock path is based on tempfile.gettempdir(), which should exist on the target Python runtime; the implementation also creates the parent directory and falls back to /tmp/arxiv_api_rate_limit.lock if creating/opening the configured path fails. For clusters that want a shared filesystem lock instead of node-local temp, the path is configurable with ARXIV_RATE_LIMIT_LOCK.
There was a problem hiding this comment.
Yes, requests that hit arXiv’s native API are intentionally serialized by this lock. arXiv asks clients to wait ~3 seconds between export API requests, and in our eval setting many chunk workers can run concurrently, so the shared lock prevents a burst of 20 workers from stampeding the public API and getting 429s. Metadata lookups served from cache or OpenAlex do not take this lock.
jiacheng-xu
left a comment
There was a problem hiding this comment.
A general question: do we consider using https://pypi.org/project/arxiv/ ?
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
gwarmstrong
left a comment
There was a problem hiding this comment.
Hi @tamohannes , Thank you for the contributions! I would encourage you to please switch to a tool path that avoids the MCP protocol (we have found it to cause real latency and stability issues during RL) in favor of a direct tool approach. The changes should be fairly minimal--please see the DirectPythonTool for an example.
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 219-221: The DOI normalization currently uses case-sensitive ops
on pid causing cases like "DOI:10..." to become malformed; change the extraction
to operate on the already-lowercased variable (low) and strip whitespace so you
always build a normalized lower-case identifier. Specifically, in the
arxiv_tool.py block that checks low.startswith("doi:") ... replace the pid-based
extraction with something like extracting from low (e.g., doi =
low.split("doi.org/")[-1].replace("doi:", "").strip()) and then return
f"doi:{doi}" so the returned DOI is consistently lowercased and trimmed.
- Around line 738-747: _arxiv_api_get currently parses feed-level <title> and
returns a paper even when the Atom feed has no <entry>, causing invalid IDs to
be treated as found; instead call _parse_arxiv_atom(text) to get entries and
then use _format_arxiv_entry on the first entry, and if the parsed entries list
is empty treat it as "not found" (return the failure result string or fallback
behavior expected by tests). Update _arxiv_api_get to replace the manual
regex/title/summary/pub extraction with a call to _parse_arxiv_atom(text), check
if entries is empty and return the same "failed"/"not found" path, otherwise
pass entries[0] to _format_arxiv_entry and continue.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3a37a011-0d88-4dea-a0b2-df65e6b8e572
📒 Files selected for processing (4)
.github/workflows/tests.ymlnemo_skills/mcp/servers/arxiv_tool.pypyproject.tomltests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/tests.yml
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
Updated this PR to use the direct \ path instead of launching a stdio MCP server. The PR diff is now limited to the tool implementation, docs reference, and focused tests; workflow/pyproject changes were removed. @coderabbitai review |
|
Tip For best results, initiate chat on the files or code changes.
[review] |
|
Updated this PR to use the direct Tool path instead of launching a stdio MCP server. The PR diff is now limited to the tool implementation, docs reference, and focused tests; workflow/pyproject changes were removed. @coderabbitai review |
|
Tip For best results, initiate chat on the files or code changes.
[review] |
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
Updated to the direct Tool implementation path (no stdio MCP server/client). The final diff is scoped to the tool file, docs reference, and focused tests only. @coderabbitai review |
|
Tip For best results, initiate chat on the files or code changes.
[review] |
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
Addressed the latest ArXiv comments: DOI normalization now handles uppercase DOI prefixes/URLs, and the arXiv Atom fallback returns a not-found message when the feed has no entry. @coderabbitai review |
|
Tip For best results, initiate chat on the files or code changes.
[review] |
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
|
Closing in favor of #1418, reopened from an internal NVIDIA-NeMo/Skills branch so CI can access repository secrets. |
Summary
ArxivSearchToolas a first-party MCP server for scientific paper retrieval via the free arXiv API (no API key required)nemo_skills/mcp/servers/arxiv_tool.pywitharxiv-searchandarxiv-gettools, rate limiting, exponential-backoff retry logic, and in-memory response cachingarxivdependency intools/requirements.txt[tools]extra alongside[dev]so MCP tool dependencies are available during testslist_tools,hide_argsfiltering), invalid-ID handling, and@pytest.mark.liveAPI testsArxivSearchToolto the built-in tools reference indocs/agentic_inference/tool_calling.mdUsage
ns generate \ --cluster local \ --input_file data.jsonl \ --output_dir outputs \ --model Qwen/Qwen3-8B \ --server_type vllm \ --server_gpus 1 \ ++tool_modules=[nemo_skills.mcp.servers.arxiv_tool::ArxivSearchTool] \ ++inference.tokens_to_generate=8192Test plan
pytest tests/test_mcp_clients.py::TestArxivTool -m "not live"— config, invalid-id, stdio list_tools, hide_args (all offline)pytest tests/test_mcp_clients.py::TestArxivTool -m live— real arXiv API (search + get)pytest tests/test_mcp_clients.py -m "not live and not gpu"— all pass)Summary by CodeRabbit
New Features
Documentation
Tests
Chores