Skip to content

Add ArXiv search MCP tool with tests and dependency#1387

Closed
tamohannes wants to merge 14 commits into
NVIDIA-NeMo:mainfrom
tamohannes:feat/arxiv-mcp-tool
Closed

Add ArXiv search MCP tool with tests and dependency#1387
tamohannes wants to merge 14 commits into
NVIDIA-NeMo:mainfrom
tamohannes:feat/arxiv-mcp-tool

Conversation

@tamohannes

@tamohannes tamohannes commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add ArxivSearchTool as a first-party MCP server for scientific paper retrieval via the free arXiv API (no API key required)
  • New nemo_skills/mcp/servers/arxiv_tool.py with arxiv-search and arxiv-get tools, rate limiting, exponential-backoff retry logic, and in-memory response caching
  • Declare arxiv dependency in tools/requirements.txt
  • CI: install [tools] extra alongside [dev] so MCP tool dependencies are available during tests
  • Tests: config validation, offline stdio subprocess integration (list_tools, hide_args filtering), invalid-ID handling, and @pytest.mark.live API tests
  • Docs: add ArxivSearchTool to the built-in tools reference in docs/agentic_inference/tool_calling.md

Usage

ns generate \
  --cluster local \
  --input_file data.jsonl \
  --output_dir outputs \
  --model Qwen/Qwen3-8B \
  --server_type vllm \
  --server_gpus 1 \
  ++tool_modules=[nemo_skills.mcp.servers.arxiv_tool::ArxivSearchTool] \
  ++inference.tokens_to_generate=8192

Test plan

  • pytest tests/test_mcp_clients.py::TestArxivTool -m "not live" — config, invalid-id, stdio list_tools, hide_args (all offline)
  • pytest tests/test_mcp_clients.py::TestArxivTool -m live — real arXiv API (search + get)
  • Existing test suite unaffected (pytest tests/test_mcp_clients.py -m "not live and not gpu" — all pass)

Summary by CodeRabbit

  • New Features

    • ArXiv paper search and retrieval: relevance search, paper metadata, section extraction, and chunked reading — no API key required.
  • Documentation

    • Built-in tools reference updated to include the new ArXiv tool.
  • Tests

    • Added live and unit tests validating search/get flows, schema exposure, and tool listing; introduced a new "live" marker.
  • Chores

    • CI workflow improvements: conditional registry login, retrying image builds, gated test steps, and a fork-safe local test path.

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new ArXiv MCP server module with four tools (search, get, sections, read-chunk), accompanying tests and docs, and updates CI/workflow to conditionally use GHCR PAT, adjusts install extras, and adds a pytest live marker.

Changes

Cohort / File(s) Summary
ArXiv MCP Server
nemo_skills/mcp/servers/arxiv_tool.py
New MCP server implementing arxiv-search, arxiv-get, arxiv-sections, arxiv-read-chunk; includes arXiv ID normalization, HTML→text parsing with ar5iv fallback, OpenAlex lookup with retry/backoff and polite-tier fallback, in-memory caches, and rate-limiting.
Tests
tests/test_mcp_clients.py
Adds TestArxivTool suite: live integration tests for search/get (including invalid-id handling), unit validation for input bounds, and stdio subprocess tests validating tool listing, schema exposure/hiding, and proper shutdown.
Documentation & Config
docs/agentic_inference/tool_calling.md, pyproject.toml
Documents new ArxivSearchTool in built-in tools list and adds a live pytest marker to config.
CI Workflow
.github/workflows/tests.yml
Derives HAS_GHCR_PAT from secrets.GHCR_PAT, conditionally gates GHCR login and image-build caching, changes editable extras to ".[dev,tools]", wraps Docker build in retry_docker_build() (3 attempts, 30s delay), adjusts pytest marker selection to not gpu and not live, and adds a fork-safe local MCP test path when GHCR PAT is absent.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant MCP as "MCP Server\n(arxiv tool)"
  participant Cache as "Process Cache"
  participant RL as "RateLimiter\n(file lock / asyncio)"
  participant OpenAlex
  participant arXiv as "arXiv Atom/HTML"

  Client->>MCP: call arxiv-search / arxiv-get (query or paper_id)
  MCP->>Cache: check cache(key=sha256(inputs))
  alt cache hit
    Cache-->>MCP: return cached result
    MCP-->>Client: respond with cached result
  else cache miss
    MCP->>RL: acquire rate limit slot
    RL-->>MCP: granted
    alt prefer OpenAlex
      MCP->>OpenAlex: request metadata (retries/backoff)
      alt OpenAlex 429 / budget error
        MCP->>OpenAlex: retry without api_key (polite tier)
      end
      OpenAlex-->>MCP: metadata or not-found
    else fallback to arXiv
      MCP->>arXiv: fetch Atom or HTML (ar5iv fallback)
      arXiv-->>MCP: atom/xml or html
    end
    MCP->>MCP: normalize IDs, extract/reconstruct abstract, extract sections/chunks
    MCP->>Cache: store result (LRU eviction if needed)
    RL-->>MCP: release
    MCP-->>Client: return assembled response
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 64.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary change: adding an ArXiv search MCP tool with accompanying tests and dependency management.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)

130-131: Cache grows but never evicts entries.

The cache stops accepting new entries at _CACHE_MAX_SIZE but never removes old ones. In long-running processes, the first 256 queries will be cached permanently, and all subsequent unique queries will bypass caching entirely.

Consider using an LRU cache or periodic eviction if cache effectiveness matters for repeated queries over time.

♻️ Optional: Use functools.lru_cache or implement LRU eviction
+from functools import lru_cache
+
+@lru_cache(maxsize=_CACHE_MAX_SIZE)
+def _cached_search(query: str, max_results: int) -> str:
+    # Move fetch logic here
+    ...

Or implement simple LRU by removing oldest entry when full.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 130 - 131, The in-memory
dictionary _cache currently never evicts entries when full (checked against
_CACHE_MAX_SIZE), so implement LRU eviction: switch _cache to an ordered mapping
(e.g., collections.OrderedDict) or use functools.lru_cache on the function that
computes result_str, then when inserting into _cache (where code currently does
if len(_cache) < _CACHE_MAX_SIZE: _cache[key] = result_str) ensure that if
len(_cache) >= _CACHE_MAX_SIZE you remove the oldest entry (popitem(last=False)
for OrderedDict) before assigning _cache[key] = result_str; keep the existing
key/result_str logic and names (_cache, _CACHE_MAX_SIZE, key, result_str) so the
rest of the code continues to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_mcp_clients.py`:
- Around line 1035-1039: The test test_arxiv_get_invalid_id calls the real
arxiv_get function and should be marked as a live/network test; add the
pytest.mark.live marker decorator to the test function (e.g., annotate
test_arxiv_get_invalid_id with `@pytest.mark.live`) so CI can skip it when live
tests are disabled and to match other live tests in the suite.

---

Nitpick comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 130-131: The in-memory dictionary _cache currently never evicts
entries when full (checked against _CACHE_MAX_SIZE), so implement LRU eviction:
switch _cache to an ordered mapping (e.g., collections.OrderedDict) or use
functools.lru_cache on the function that computes result_str, then when
inserting into _cache (where code currently does if len(_cache) <
_CACHE_MAX_SIZE: _cache[key] = result_str) ensure that if len(_cache) >=
_CACHE_MAX_SIZE you remove the oldest entry (popitem(last=False) for
OrderedDict) before assigning _cache[key] = result_str; keep the existing
key/result_str logic and names (_cache, _CACHE_MAX_SIZE, key, result_str) so the
rest of the code continues to work.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 33c83249-85db-47b9-afce-0f5380f99ce3

📥 Commits

Reviewing files that changed from the base of the PR and between 94478e5 and 0ca4934.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml
  • docs/agentic_inference/tool_calling.md
  • nemo_skills/mcp/servers/arxiv_tool.py
  • tests/test_mcp_clients.py
  • tools/requirements.txt

Comment thread tests/test_mcp_clients.py Outdated
Add ArxivSearchTool as a first-party MCP server for scientific paper
retrieval via the free arXiv API (no API key required). Includes:

- New nemo_skills/mcp/servers/arxiv_tool.py with arxiv-search and
  arxiv-get tools, rate limiting, retry logic, and response caching
- Declare `arxiv` dependency in tools/requirements.txt
- CI: install [tools] extra so MCP tool deps are available in tests
- Tests: config validation, offline stdio subprocess integration
  (list_tools, hide_args), invalid-id handling, and live API tests
- Docs: add ArxivSearchTool to built-in tools reference

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
Made-with: Cursor
@tamohannes tamohannes force-pushed the feat/arxiv-mcp-tool branch from 0ca4934 to b95bc5d Compare April 21, 2026 17:35

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)

132-133: Consider adding simple eviction when cache reaches capacity.

Current behavior stops caching new keys once full, which quickly reduces cache usefulness. A tiny FIFO/LRU-style eviction would keep hit-rate stable.

Also applies to: 169-170

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 132 - 133, The cache
currently skips adding new entries when full; modify the insertion logic around
_cache, _CACHE_MAX_SIZE, key, and result_str so that when len(_cache) >=
_CACHE_MAX_SIZE you evict an entry before inserting (e.g., FIFO by popping the
oldest key or LRU via OrderedDict.move_to_end/popitem). Update both insertion
sites (the block using if len(_cache) < _CACHE_MAX_SIZE: _cache[key] =
result_str and the similar block at the other occurrence) to perform eviction
then _cache[key] = result_str, ensuring the eviction policy is consistent and
thread-safe if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 101-103: The code currently caps max_results with MAX_RESULTS but
doesn't check for non-positive values; add an explicit validation in the same
function that handles arXiv queries (look for the max_results parameter and the
existing MAX_RESULTS cap in arxiv_tool.py) to raise a clear error (e.g.,
ValueError) if max_results is less than 1 before proceeding or calling the arXiv
API, then apply the existing upper-bound cap (max_results = min(max_results,
MAX_RESULTS)).
- Around line 127-136: Remove the broad "except Exception" handlers that convert
unexpected failures into error strings; specifically, in the block that calls
_with_retry(_fetch) (which builds result_str and caches into _cache up to
_CACHE_MAX_SIZE) and the similar handler later near line 172, delete or replace
the generic except clauses so unexpected errors (AttributeError, TypeError,
etc.) propagate instead of returning "ArXiv search failed: ...". If you must
handle errors, catch only the specific expected exceptions (e.g., network/HTTP
exceptions already handled by _with_retry) or re-raise after logging; do not
swallow all Exceptions or return error strings.

---

Nitpick comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 132-133: The cache currently skips adding new entries when full;
modify the insertion logic around _cache, _CACHE_MAX_SIZE, key, and result_str
so that when len(_cache) >= _CACHE_MAX_SIZE you evict an entry before inserting
(e.g., FIFO by popping the oldest key or LRU via
OrderedDict.move_to_end/popitem). Update both insertion sites (the block using
if len(_cache) < _CACHE_MAX_SIZE: _cache[key] = result_str and the similar block
at the other occurrence) to perform eviction then _cache[key] = result_str,
ensuring the eviction policy is consistent and thread-safe if needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a9da9a12-b23c-4042-b042-3f8bf2b7548f

📥 Commits

Reviewing files that changed from the base of the PR and between 0ca4934 and b95bc5d.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml
  • docs/agentic_inference/tool_calling.md
  • nemo_skills/mcp/servers/arxiv_tool.py
  • tests/test_mcp_clients.py
  • tools/requirements.txt
✅ Files skipped from review due to trivial changes (2)
  • tools/requirements.txt
  • docs/agentic_inference/tool_calling.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/tests.yml
  • tests/test_mcp_clients.py

Comment thread nemo_skills/mcp/servers/arxiv_tool.py
Comment thread nemo_skills/mcp/servers/arxiv_tool.py Outdated
…dling

- Replace stop-at-capacity cache with LRU eviction via OrderedDict
- Clamp max_results to [1, MAX_RESULTS] instead of only capping upper bound
- Remove broad except-Exception handlers; let unexpected errors propagate
  (expected transient errors are already handled by _with_retry)
- Add @pytest.mark.live to test_arxiv_get_invalid_id (makes real API call)
- Wrap stdio integration tests in try/finally for guaranteed shutdown

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
Made-with: Cursor

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)

77-95: ⚠️ Potential issue | 🟠 Major

Narrow retry exception handling to transient failures only.

Line 84 catches all exceptions, so non-transient bugs/input issues can be retried for up to ~135s before failing. Restrict retries to explicit transient exceptions and let unexpected failures surface immediately.

Proposed fix
@@
 _MAX_DELAY = 60.0
 _CACHE_MAX_SIZE = 256
+_RETRYABLE_ERRORS = (ConnectionError, TimeoutError, OSError)
@@
 def _with_retry(fn):
@@
-        except Exception as e:
+        except _RETRYABLE_ERRORS as e:
             if attempt == _NUM_RETRIES:
                 raise
             logger.warning(

As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 77 - 95, The _with_retry
helper currently catches all exceptions; change it to only retry on
transient/network-related errors by catching a specific tuple of transient
exception types (e.g., requests.exceptions.RequestException, socket.timeout,
TimeoutError, http.client.HTTPException or your project's equivalent) and
re-raising any other exception immediately so non-transient input/logic errors
surface; keep the existing backoff logic and calls to _rate_limit, preserve use
of _INITIAL_DELAY/_NUM_RETRIES/_MAX_DELAY, and continue to log retries via
logger when a transient exception is caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 77-95: The _with_retry helper currently catches all exceptions;
change it to only retry on transient/network-related errors by catching a
specific tuple of transient exception types (e.g.,
requests.exceptions.RequestException, socket.timeout, TimeoutError,
http.client.HTTPException or your project's equivalent) and re-raising any other
exception immediately so non-transient input/logic errors surface; keep the
existing backoff logic and calls to _rate_limit, preserve use of
_INITIAL_DELAY/_NUM_RETRIES/_MAX_DELAY, and continue to log retries via logger
when a transient exception is caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b0d39d84-46ef-4e5d-a8c1-335314f9b9df

📥 Commits

Reviewing files that changed from the base of the PR and between b95bc5d and 585d1bc.

📒 Files selected for processing (2)
  • nemo_skills/mcp/servers/arxiv_tool.py
  • tests/test_mcp_clients.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_mcp_clients.py

Use direct arXiv/OpenAlex API clients with bounded paper chunking and shared request throttling so retrieval stays robust under parallel eval workloads.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
Made-with: Cursor
@tamohannes tamohannes force-pushed the feat/arxiv-mcp-tool branch from 99cf1de to 1cfe897 Compare April 28, 2026 23:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
nemo_skills/mcp/servers/arxiv_tool.py (1)

537-540: ⚠️ Potential issue | 🟡 Minor

Reject invalid numeric inputs instead of silently rewriting them.

0/negative values are currently coerced into defaults or minima (max_results=0 -> 1, max_sections=0 -> 40, max_chars=0 -> PAPER_CHUNK_LIMIT, negative offset -> 0), so a bad MCP call looks valid instead of failing fast.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

Also applies to: 616-616, 663-664

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 537 - 540, Replace the
silent clamping logic for numeric args with explicit validation that raises an
error: instead of changing max_results when it exceeds MAX_RESULTS or is <1,
validate in the input handling code (the check around max_results/MAX_RESULTS)
and raise a ValueError (or custom BadRequest) with a clear message like
"max_results must be between 1 and {MAX_RESULTS}"; apply the same pattern to
max_sections, max_chars (vs PAPER_CHUNK_LIMIT), and offset so invalid or
out-of-range user inputs are rejected rather than coerced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Line 87: _paper_cache is an unbounded dict that stores full extracted paper
text and will leak memory; replace it with a bounded LRU cache (e.g., use
cachetools.LRUCache(maxsize=...) or implement an OrderedDict-based LRU) and
update all uses of _paper_cache to call its .get/.__setitem__ semantics so older
entries are evicted when the cache exceeds maxsize; alternatively store only
metadata/paths and lazy-load full text to avoid pinning large strings in memory.
- Around line 220-225: The extraction in _extract_arxiv_id() lowercases the
whole `low` and then manipulates `pid`, causing old-style IDs like
"math.OV/0301234v1" to become "math.ov/0301234v1" and then get truncated by
naive split("v")[0]; to fix, only lowercase/normalize the URL host portion but
preserve the original case of the arXiv identifier portion, and replace the
blind split("v") logic with a targeted version-stripping that matches a trailing
"v<digits>" (e.g., use a regex to remove /v\d+$) so old-style IDs (with
subject/class before the slash) remain intact; update the code paths around the
`low` and `pid` handling in _extract_arxiv_id() (and the analogous block at the
other occurrence) to implement these changes.
- Around line 584-593: The fallback after _http_get_json currently only checks a
bare new-style regex on paper_id and misses other arXiv forms; update the
exception handling to call _extract_arxiv_id(paper_id) and, if it returns a
non-empty arXiv id, pass that id into _arxiv_api_get instead of the bare regex
check. Specifically, in the except block around the _http_get_json call, replace
the re.match(...) check with extraction via _extract_arxiv_id(paper_id.strip()),
then try await _arxiv_api_get(extracted_id) on success and include both OpenAlex
and arXiv errors in the returned message as before.

---

Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 537-540: Replace the silent clamping logic for numeric args with
explicit validation that raises an error: instead of changing max_results when
it exceeds MAX_RESULTS or is <1, validate in the input handling code (the check
around max_results/MAX_RESULTS) and raise a ValueError (or custom BadRequest)
with a clear message like "max_results must be between 1 and {MAX_RESULTS}";
apply the same pattern to max_sections, max_chars (vs PAPER_CHUNK_LIMIT), and
offset so invalid or out-of-range user inputs are rejected rather than coerced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a2486d4-f229-4535-9cea-e4201822009c

📥 Commits

Reviewing files that changed from the base of the PR and between 585d1bc and 99cf1de.

📒 Files selected for processing (3)
  • nemo_skills/mcp/servers/arxiv_tool.py
  • pyproject.toml
  • tests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_mcp_clients.py

Comment thread nemo_skills/mcp/servers/arxiv_tool.py Outdated
Comment thread nemo_skills/mcp/servers/arxiv_tool.py
Comment thread nemo_skills/mcp/servers/arxiv_tool.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (5)
nemo_skills/mcp/servers/arxiv_tool.py (5)

554-557: ⚠️ Potential issue | 🟠 Major

Avoid broad except Exception in tool entrypoints.

Lines 554, 630, and 682 catch all exceptions and convert them to strings. That can hide unexpected defects instead of surfacing them.

🔧 Suggested fix (example pattern)
-    try:
-        result = await _arxiv_api_search(query, max_results)
-    except Exception as e:
-        return f"arXiv search failed: {e}"
+    try:
+        result = await _arxiv_api_search(query, max_results)
+    except (httpx.RequestError, httpx.TimeoutException, RuntimeError, ValueError) as e:
+        return f"arXiv search failed: {e}"

Apply the same narrowing to arxiv_sections and arxiv_read_chunk.

As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

Also applies to: 628-631, 680-683

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 554 - 557, The broad
except blocks in the arXiv tool entrypoints (the call to _arxiv_api_search in
arxiv_search, and similar blocks in arxiv_sections and arxiv_read_chunk) should
be replaced with narrow exception handling for expected failures (e.g.,
network/HTTP errors like aiohttp.ClientError or
requests.exceptions.RequestException, asyncio.TimeoutError, and known
parsing/value errors) and allow unexpected exceptions to propagate; update each
try/except to catch only these specific exceptions, convert those to the
existing user-facing error string, and re-raise any other exceptions (or omit a
catch) so real bugs are not silently swallowed.

596-599: ⚠️ Potential issue | 🟠 Major

Use _extract_arxiv_id() in the OpenAlex fallback path.

Line 596 only matches bare new-style IDs, so valid inputs like arXiv:..., arxiv.org URLs, DOI-form arXiv ids, and old-style IDs skip native arXiv fallback.

🔧 Suggested fix
-            if re.match(r"^\d{4}\.\d{4,5}(v\d+)?$", paper_id.strip()):
+            if arxiv_id := _extract_arxiv_id(paper_id):
                 try:
-                    return await _arxiv_api_get(paper_id.strip())
+                    return await _arxiv_api_get(arxiv_id)
                 except Exception as e2:
                     return f"Paper lookup failed (OpenAlex: {e}; arXiv: {e2})."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 596 - 599, The current
arXiv native fallback only attempts _arxiv_api_get when paper_id matches strict
new-style IDs, missing inputs like "arXiv:...", arxiv.org URLs, DOI-form IDs, or
old-style IDs; update the OpenAlex fallback branch so it calls
_extract_arxiv_id(paper_id) and, if that returns a non-empty id, pass that
extracted id into _arxiv_api_get (i.e., replace the strict re.match gate with
using _extract_arxiv_id to normalize paper_id before calling _arxiv_api_get);
ensure you still catch exceptions around the await _arxiv_api_get(...) call as
currently done.

87-87: ⚠️ Potential issue | 🟠 Major

Bound _paper_cache to avoid unbounded memory growth.

Line 87 defines _paper_cache as unbounded, and Line 262 stores full paper text per unique id. In a long-lived MCP process this can grow without limit.

🔧 Suggested fix
+from collections import OrderedDict
@@
-_paper_cache: dict[str, tuple[str, str]] = {}
+PAPER_CACHE_MAX_SIZE = 64
+_paper_cache: OrderedDict[str, tuple[str, str]] = OrderedDict()
@@
-    if cache_key in _paper_cache:
-        return _paper_cache[cache_key]
+    if cache_key in _paper_cache:
+        _paper_cache.move_to_end(cache_key)
+        return _paper_cache[cache_key]
@@
-                        _paper_cache[cache_key] = (url, text)
+                        _paper_cache[cache_key] = (url, text)
+                        _paper_cache.move_to_end(cache_key)
+                        if len(_paper_cache) > PAPER_CACHE_MAX_SIZE:
+                            _paper_cache.popitem(last=False)
                         return url, text

Also applies to: 262-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` at line 87, _paper_cache is an
unbounded dict that will grow indefinitely (entries added around where full
paper text is stored at lines ~262-263); replace it with a bounded cache (e.g.,
an LRU with a fixed max size) or enforce eviction on insert so only recent N
papers are kept, or change stored value to a lightweight summary/metadata
instead of full text. Update the global _paper_cache declaration and the code
paths that write to it (refer to _paper_cache and the code that stores the full
paper text around lines 262-263) to use the bounded cache/eviction policy and
ensure concurrent access remains safe if used from multiple coroutines/threads.

547-548: ⚠️ Potential issue | 🟡 Minor

Reject invalid max_results instead of silently rewriting it.

Line 547-548 coerces invalid values to 1. Invalid user input should fail clearly rather than being silently changed.

🔧 Suggested fix
-    if max_results < 1:
-        max_results = 1
+    if max_results < 1:
+        raise ValueError("max_results must be >= 1")

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 547 - 548, The code
currently coerces invalid max_results to 1 (the "if max_results < 1" block)
which hides bad user input; change this to reject invalid values by raising a
clear exception (e.g., raise ValueError("max_results must be >= 1, got
{max_results}")) where max_results is validated, so callers see the error
instead of silently continuing; update any docstring or error handling in the
surrounding function that references max_results to reflect this validation.

225-229: ⚠️ Potential issue | 🟠 Major

Old-style arXiv IDs can be corrupted and collide in cache keys.

Line 225/228 split using low (lowercased input), and Line 243 uses split("v")[0]. This can alias old-style IDs (e.g., subject class containing v) and return wrong cached text.

🔧 Suggested fix
-    if "arxiv.org/abs/" in low or "arxiv.org/pdf/" in low or "arxiv.org/html/" in low:
-        pid = re.split(r"arxiv\.org/(?:abs|pdf|html)/", low, maxsplit=1)[-1]
+    if "arxiv.org/abs/" in low or "arxiv.org/pdf/" in low or "arxiv.org/html/" in low:
+        pid = re.split(r"arxiv\.org/(?:abs|pdf|html)/", pid, flags=re.I, maxsplit=1)[-1]
         pid = pid.split("?")[0].split("#")[0].replace(".pdf", "")
     if "10.48550/arxiv." in low:
-        pid = re.split(r"10\.48550/arxiv\.", low, flags=re.I, maxsplit=1)[-1]
+        pid = re.split(r"10\.48550/arxiv\.", pid, flags=re.I, maxsplit=1)[-1]
@@
-    cache_key = arxiv_id.split("v")[0]
+    cache_key = re.sub(r"v\d+$", "", arxiv_id, flags=re.I)

Also applies to: 243-243

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/mcp/servers/arxiv_tool.py` around lines 225 - 229, The code
currently lowercases the input into low and then uses naive splits (pid =
re.split(..., low)[-1] and pid.split("v")[0]) which can corrupt old-style arXiv
IDs (subject-class forms containing "v") and cause cache key collisions; update
arxiv_tool.py to extract the identifier with a single regex capture from the
original URL/string (not blindly lowercasing) — e.g., capture after
arxiv.org/(abs|pdf|html)/ or 10.48550/arxiv. and strip query/fragments and
".pdf" — and remove version suffix only via a targeted regex that strips a
trailing "v" followed by digits (e.g., r'^(.*?)(?:v\d+)?$') instead of using
split("v")[0]; adjust uses of variables pid and low accordingly so old-style IDs
keep their internal "v" characters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 554-557: The broad except blocks in the arXiv tool entrypoints
(the call to _arxiv_api_search in arxiv_search, and similar blocks in
arxiv_sections and arxiv_read_chunk) should be replaced with narrow exception
handling for expected failures (e.g., network/HTTP errors like
aiohttp.ClientError or requests.exceptions.RequestException,
asyncio.TimeoutError, and known parsing/value errors) and allow unexpected
exceptions to propagate; update each try/except to catch only these specific
exceptions, convert those to the existing user-facing error string, and re-raise
any other exceptions (or omit a catch) so real bugs are not silently swallowed.
- Around line 596-599: The current arXiv native fallback only attempts
_arxiv_api_get when paper_id matches strict new-style IDs, missing inputs like
"arXiv:...", arxiv.org URLs, DOI-form IDs, or old-style IDs; update the OpenAlex
fallback branch so it calls _extract_arxiv_id(paper_id) and, if that returns a
non-empty id, pass that extracted id into _arxiv_api_get (i.e., replace the
strict re.match gate with using _extract_arxiv_id to normalize paper_id before
calling _arxiv_api_get); ensure you still catch exceptions around the await
_arxiv_api_get(...) call as currently done.
- Line 87: _paper_cache is an unbounded dict that will grow indefinitely
(entries added around where full paper text is stored at lines ~262-263);
replace it with a bounded cache (e.g., an LRU with a fixed max size) or enforce
eviction on insert so only recent N papers are kept, or change stored value to a
lightweight summary/metadata instead of full text. Update the global
_paper_cache declaration and the code paths that write to it (refer to
_paper_cache and the code that stores the full paper text around lines 262-263)
to use the bounded cache/eviction policy and ensure concurrent access remains
safe if used from multiple coroutines/threads.
- Around line 547-548: The code currently coerces invalid max_results to 1 (the
"if max_results < 1" block) which hides bad user input; change this to reject
invalid values by raising a clear exception (e.g., raise ValueError("max_results
must be >= 1, got {max_results}")) where max_results is validated, so callers
see the error instead of silently continuing; update any docstring or error
handling in the surrounding function that references max_results to reflect this
validation.
- Around line 225-229: The code currently lowercases the input into low and then
uses naive splits (pid = re.split(..., low)[-1] and pid.split("v")[0]) which can
corrupt old-style arXiv IDs (subject-class forms containing "v") and cause cache
key collisions; update arxiv_tool.py to extract the identifier with a single
regex capture from the original URL/string (not blindly lowercasing) — e.g.,
capture after arxiv.org/(abs|pdf|html)/ or 10.48550/arxiv. and strip
query/fragments and ".pdf" — and remove version suffix only via a targeted regex
that strips a trailing "v" followed by digits (e.g., r'^(.*?)(?:v\d+)?$')
instead of using split("v")[0]; adjust uses of variables pid and low accordingly
so old-style IDs keep their internal "v" characters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8cce5a2-ee4b-47cb-b29d-ed1e91169191

📥 Commits

Reviewing files that changed from the base of the PR and between 99cf1de and 1cfe897.

📒 Files selected for processing (3)
  • nemo_skills/mcp/servers/arxiv_tool.py
  • pyproject.toml
  • tests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

# workspace mount when available; fall back to /tmp for local smoke tests.
ARXIV_REQUEST_INTERVAL = float(os.getenv("ARXIV_REQUEST_INTERVAL", "4.0"))
ARXIV_RATE_LIMIT_LOCK = os.getenv(
"ARXIV_RATE_LIMIT_LOCK",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested on clusters? just to make sure you can write to tempfile.gettempdir() on cluster and it exists.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also looks like a global lock. If you have 20 requests of arxiv, are they basically blocking each other?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default lock path is based on tempfile.gettempdir(), which should exist on the target Python runtime; the implementation also creates the parent directory and falls back to /tmp/arxiv_api_rate_limit.lock if creating/opening the configured path fails. For clusters that want a shared filesystem lock instead of node-local temp, the path is configurable with ARXIV_RATE_LIMIT_LOCK.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, requests that hit arXiv’s native API are intentionally serialized by this lock. arXiv asks clients to wait ~3 seconds between export API requests, and in our eval setting many chunk workers can run concurrently, so the shared lock prevents a burst of 20 workers from stampeding the public API and getting 429s. Metadata lookups served from cache or OpenAlex do not take this lock.

@jiacheng-xu jiacheng-xu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question: do we consider using https://pypi.org/project/arxiv/ ?

tamohannes added 2 commits May 1, 2026 11:46
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>

@gwarmstrong gwarmstrong left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tamohannes , Thank you for the contributions! I would encourage you to please switch to a tool path that avoids the MCP protocol (we have found it to cause real latency and stability issues during RL) in favor of a direct tool approach. The changes should be fairly minimal--please see the DirectPythonTool for an example.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/mcp/servers/arxiv_tool.py`:
- Around line 219-221: The DOI normalization currently uses case-sensitive ops
on pid causing cases like "DOI:10..." to become malformed; change the extraction
to operate on the already-lowercased variable (low) and strip whitespace so you
always build a normalized lower-case identifier. Specifically, in the
arxiv_tool.py block that checks low.startswith("doi:") ... replace the pid-based
extraction with something like extracting from low (e.g., doi =
low.split("doi.org/")[-1].replace("doi:", "").strip()) and then return
f"doi:{doi}" so the returned DOI is consistently lowercased and trimmed.
- Around line 738-747: _arxiv_api_get currently parses feed-level <title> and
returns a paper even when the Atom feed has no <entry>, causing invalid IDs to
be treated as found; instead call _parse_arxiv_atom(text) to get entries and
then use _format_arxiv_entry on the first entry, and if the parsed entries list
is empty treat it as "not found" (return the failure result string or fallback
behavior expected by tests). Update _arxiv_api_get to replace the manual
regex/title/summary/pub extraction with a call to _parse_arxiv_atom(text), check
if entries is empty and return the same "failed"/"not found" path, otherwise
pass entries[0] to _format_arxiv_entry and continue.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a37a011-0d88-4dea-a0b2-df65e6b8e572

📥 Commits

Reviewing files that changed from the base of the PR and between 99cf1de and 9d87966.

📒 Files selected for processing (4)
  • .github/workflows/tests.yml
  • nemo_skills/mcp/servers/arxiv_tool.py
  • pyproject.toml
  • tests/test_mcp_clients.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yml

Comment thread nemo_skills/mcp/servers/arxiv_tool.py
Comment thread nemo_skills/mcp/servers/arxiv_tool.py
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

Updated this PR to use the direct \ path instead of launching a stdio MCP server. The PR diff is now limited to the tool implementation, docs reference, and focused tests; workflow/pyproject changes were removed. @coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@tamohannes Thanks for the update! Triggering a fresh review of the updated PR — focused on the tool implementation (arxiv_tool.py), docs reference, and tests.

[review]

@tamohannes

Copy link
Copy Markdown
Collaborator Author

Updated this PR to use the direct Tool path instead of launching a stdio MCP server. The PR diff is now limited to the tool implementation, docs reference, and focused tests; workflow/pyproject changes were removed. @coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@tamohannes Thanks for the update! Triggering a fresh review of the updated PR — focused on the tool implementation (arxiv_tool.py), docs reference, and tests, with the workflow/pyproject changes removed.

[review]

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

Updated to the direct Tool implementation path (no stdio MCP server/client). The final diff is scoped to the tool file, docs reference, and focused tests only. @coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@tamohannes Thanks for the update! Triggering a fresh review focused on the direct Tool implementation in arxiv_tool.py, the docs reference, and the scoped tests.

[review]

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

Addressed the latest ArXiv comments: DOI normalization now handles uppercase DOI prefixes/URLs, and the arXiv Atom fallback returns a not-found message when the feed has no entry. @coderabbitai review

@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@tamohannes Thanks for addressing those comments! Triggering a fresh review to verify the DOI normalization fix (uppercase DOI: prefixes/URLs) and the arXiv Atom fallback not-found messaging, along with any other changes in the updated diff.

[review]

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
@tamohannes

Copy link
Copy Markdown
Collaborator Author

Closing in favor of #1418, reopened from an internal NVIDIA-NeMo/Skills branch so CI can access repository secrets.

@tamohannes tamohannes closed this May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants