Skip to content

fix: retain tool catalog on transient Uplink manifest fetch failure#762

Merged
DaleSeo merged 5 commits into
apollographql:mainfrom
Staticsubh:fix/uplink-manifest-error-keep-catalog
Jun 22, 2026
Merged

fix: retain tool catalog on transient Uplink manifest fetch failure#762
DaleSeo merged 5 commits into
apollographql:mainfrom
Staticsubh:fix/uplink-manifest-error-keep-catalog

Conversation

@Staticsubh

Copy link
Copy Markdown
Contributor

Summary

Closes #761.

When operations.source: uplink is configured, a transient Uplink persisted-query manifest fetch failure (network timeout, DNS failure, HTTP 5xx, retryable retry_later response) previously caused the server to replace its active tool catalog with an empty list. The server continued reporting /health UP, sent tools/list_changed to connected clients, and those clients received {"tools": []} with no error signal.

This PR fixes the bug by mirroring the existing CollectionError keep-last-good pattern for manifest fetch failures.

Root Cause

manifest_poller.rs:44–47 converted any Err from the Uplink stream into Event::UpdateManifest(vec![]) — treating a transient infrastructure failure as an authoritative "empty collection" signal:

// Before (bug): error → silently wipes catalog
Err(e) => {
    tracing::error!("error from manifest stream: {}", e);
    Event::UpdateManifest(vec![])
}

Changes

  • apollo-mcp-registry — Added ManifestError(BoxError) variant to the manifest event enum. manifest_poller.rs now emits ManifestError(e) on Err instead of UpdateManifest(vec![]).
  • apollo-mcp-server — Added ManifestError(BoxError) to ServerEvent and OperationError. operation_source.rs forwards ManifestError through the event stream. State machine handles ManifestError while Running by retaining the existing catalog and logging the error (exactly as CollectionError is handled); treats it as fatal during startup before the first successful load.
  • Tests — Two new regression tests in states.rs: one confirming the server stays alive with its catalog intact on ManifestError while Running; one confirming it is fatal during startup.

Behavior

Uplink response Before After
HTTP 200, non-empty manifest Apply catalog ✅ Apply catalog ✅
HTTP 200, empty manifest Clear catalog ✅ Clear catalog ✅
Network error / timeout Silent empty catalog Retain catalog, log error ✅
HTTP 5xx / retry_later Silent empty catalog Retain catalog, log error ✅

Test Plan

  • cargo test -p apollo-mcp-server manifest_error — both new regression tests pass
  • cargo test -p apollo-mcp-server collection_error — existing tests still pass
  • cargo clippy --all-targets -- --deny warnings — no new warnings
  • Manual: start server with operations.source: uplink, simulate Uplink unreachable, confirm tools/list returns previous tools and /health logs a structured error

🤖 Generated with Claude Code

Previously, any Err from the Uplink persisted-query manifest stream was
converted into Event::UpdateManifest(vec![]) — an authoritative empty
update — causing the running server to wipe its tool catalog. The server
continued reporting /health UP and sent tools/list_changed to clients,
which then received {"tools": []}  with no error signal (fixes apollographql#761).

The fix mirrors the existing CollectionError keep-last-good pattern:
- Add ManifestError(BoxError) to the manifest event enum and server event
- Emit ManifestError on Err instead of UpdateManifest(vec![])
- Handle ManifestError while Running by retaining the existing catalog
  and logging the error; treat it as fatal during startup
- Add regression tests for both the keep-alive and fatal-startup cases

Authoritative empty manifests (HTTP 200 with a valid empty payload)
continue to clear the catalog as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Staticsubh Staticsubh requested a review from a team as a code owner June 9, 2026 18:35
@apollo-librarian

apollo-librarian Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

✅ AI Style Review — No Changes Detected

No MDX files were changed in this pull request.

Review Log: View detailed log

This review is AI-generated. Please use common sense when accepting these suggestions, as they may not always be accurate or appropriate for your specific context.

Subhranil Sengupta and others added 3 commits June 10, 2026 00:14
tower is a dev-dependency in apollo-mcp-server, so tower::BoxError is
not available in lib code. Replace with the equivalent std type
Box<dyn std::error::Error + Send + Sync + 'static> in event.rs and
errors.rs to fix CI compile errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three tests to bring patch coverage above 80%:
- event.rs (registry): debug format of ManifestError variant
- errors.rs (server): Display impl of OperationError::Manifest
- operation_source.rs (server): ManifestError forwarded when
  LocalStatic manifest file is missing (also covers the Err branch
  in manifest_poller.rs into_stream)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@DaleSeo DaleSeo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, @Staticsubh. This looks great! I have a few comments.

Comment thread crates/apollo-mcp-server/src/server/states.rs
Comment thread crates/apollo-mcp-server/src/server/states.rs Outdated
Comment thread crates/apollo-mcp-registry/src/uplink/persisted_queries/manifest_poller.rs Outdated
Comment thread crates/apollo-mcp-server/src/errors.rs Outdated
- Remove duplicate log in manifest_poller.rs (state machine already logs)
- Downgrade Running-branch log to warn (keep-last-good path, not an error)
- Drop "Uplink" from OperationError::Manifest message (local manifests too)
- Strengthen test to assert catalog is retained, not just state is Running

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DaleSeo DaleSeo merged commit 36ab95b into apollographql:main Jun 22, 2026
13 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: transient Uplink persisted-query manifest fetch failure silently clears tool catalog while /health reports UP

2 participants