fix: retain tool catalog on transient Uplink manifest fetch failure #762
Merged
DaleSeo merged 5 commits on Jun 22, 2026
Previously, any Err from the Uplink persisted-query manifest stream was
converted into Event::UpdateManifest(vec![]) — an authoritative empty
update — causing the running server to wipe its tool catalog. The server
continued reporting /health UP and sent tools/list_changed to clients,
which then received {"tools": []} with no error signal (fixes apollographql#761).
The fix mirrors the existing CollectionError keep-last-good pattern:
- Add ManifestError(BoxError) to the manifest event enum and server event
- Emit ManifestError on Err instead of UpdateManifest(vec![])
- Handle ManifestError while Running by retaining the existing catalog
and logging the error; treat it as fatal during startup
- Add regression tests for both the keep-alive and fatal-startup cases
Authoritative empty manifests (HTTP 200 with a valid empty payload)
continue to clear the catalog as before.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
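To make the keep-last-good rule concrete, here is a minimal sketch of how a state machine might react to the two events. It assumes a simplified model: State, ServerEvent, on_event, the BoxError alias, and the use of plain tool-name strings as the catalog are hypothetical stand-ins, not the actual apollo-mcp-server types.

```rust
use std::error::Error;

// Illustrative alias for the boxed error carried by ManifestError.
type BoxError = Box<dyn Error + Send + Sync + 'static>;

// Hypothetical, simplified events: a catalog is just a list of tool names here.
enum ServerEvent {
    UpdateManifest(Vec<String>),
    ManifestError(BoxError),
}

// Hypothetical, simplified server states.
#[derive(Debug, PartialEq)]
enum State {
    Starting,
    Running { tools: Vec<String> },
    Failed(String),
}

impl State {
    fn on_event(self, event: ServerEvent) -> State {
        match (self, event) {
            // A successful fetch is authoritative, even when it is empty.
            (State::Starting | State::Running { .. }, ServerEvent::UpdateManifest(tools)) => {
                State::Running { tools }
            }
            // No catalog has been loaded yet, so there is nothing to keep: fail.
            (State::Starting, ServerEvent::ManifestError(e)) => State::Failed(e.to_string()),
            // Keep-last-good: retain the current catalog and only warn.
            (State::Running { tools }, ServerEvent::ManifestError(e)) => {
                eprintln!("warning: manifest fetch failed, keeping {} tools: {e}", tools.len());
                State::Running { tools }
            }
            // A failed server ignores further events.
            (failed @ State::Failed(_), _) => failed,
        }
    }
}
```

The point of the split is to distinguish "no data" (a fetch error) from "empty data" (an authoritative response with no operations): only the latter is allowed to clear the catalog.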
✅ AI Style Review: No changes detected. No MDX files were changed in this pull request.
tower is a dev-dependency in apollo-mcp-server, so tower::BoxError is not available in lib code. Replace it with the equivalent std type Box<dyn std::error::Error + Send + Sync + 'static> in event.rs and errors.rs to fix the CI compile errors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
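A short sketch of why the swap works without tower: the std boxed trait object has the same shape as tower::BoxError, and std's From impls let ? and .into() produce it from any Send + Sync error (or a &str). The BoxError alias, ManifestEvent, and fetch_manifest below are illustrative names, not the crate's code.

```rust
use std::error::Error;
use std::io;

// std-only equivalent of tower::BoxError; the alias name is illustrative.
pub type BoxError = Box<dyn Error + Send + Sync + 'static>;

// Hypothetical manifest event enum carrying the boxed error.
#[derive(Debug)]
pub enum ManifestEvent {
    UpdateManifest(Vec<String>),
    ManifestError(BoxError),
}

// Compile-time check: the event can be moved across threads or tasks.
fn assert_send_sync<T: Send + Sync>() {}

// Hypothetical fetch that fails with an io::Error; `?` boxes it into BoxError.
fn fetch_manifest() -> Result<Vec<String>, BoxError> {
    let failure: Result<Vec<String>, io::Error> =
        Err(io::Error::new(io::ErrorKind::TimedOut, "uplink request timed out"));
    let tools = failure?;
    Ok(tools)
}
```

Because the concrete error is preserved inside the box, callers can still downcast_ref to io::Error when they need the original kind.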
Add three tests to bring patch coverage above 80%:
- event.rs (registry): Debug format of the ManifestError variant
- errors.rs (server): Display impl of OperationError::Manifest
- operation_source.rs (server): ManifestError is forwarded when the LocalStatic manifest file is missing (this also covers the Err branch of into_stream in manifest_poller.rs)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
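The Err branch and forwarding path that these tests exercise can be sketched with an iterator standing in for the async stream, an assumption made to keep the example dependency-free. ManifestEvent, ServerEvent, to_manifest_event, and forward are hypothetical simplifications of the registry and server types.

```rust
use std::error::Error;

type BoxError = Box<dyn Error + Send + Sync + 'static>;

// Hypothetical registry-side event (apollo-mcp-registry).
#[derive(Debug)]
enum ManifestEvent {
    UpdateManifest(Vec<String>),
    ManifestError(BoxError),
}

// Hypothetical server-side event (apollo-mcp-server).
#[derive(Debug)]
enum ServerEvent {
    UpdateManifest(Vec<String>),
    ManifestError(BoxError),
}

// Poller mapping: an Err is reported as an error event instead of being
// turned into an authoritative empty UpdateManifest.
fn to_manifest_event(result: Result<Vec<String>, BoxError>) -> ManifestEvent {
    match result {
        Ok(ops) => ManifestEvent::UpdateManifest(ops),
        // Before the fix, this arm produced UpdateManifest(vec![]).
        Err(e) => ManifestEvent::ManifestError(e),
    }
}

// Forwarding in the operation source: each registry event maps one-to-one.
impl From<ManifestEvent> for ServerEvent {
    fn from(event: ManifestEvent) -> Self {
        match event {
            ManifestEvent::UpdateManifest(ops) => ServerEvent::UpdateManifest(ops),
            ManifestEvent::ManifestError(e) => ServerEvent::ManifestError(e),
        }
    }
}

// Stand-in for the stream: map fetch results into server events in order.
fn forward(results: impl IntoIterator<Item = Result<Vec<String>, BoxError>>) -> Vec<ServerEvent> {
    results
        .into_iter()
        .map(to_manifest_event)
        .map(ServerEvent::from)
        .collect()
}
```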
DaleSeo (Member) reviewed on Jun 16, 2026 and left a comment:
Thanks for the fix, @Staticsubh. This looks great! I have a few comments.
- Remove the duplicate log in manifest_poller.rs (the state machine already logs)
- Downgrade the Running-branch log to warn (keep-last-good path, not an error)
- Drop "Uplink" from the OperationError::Manifest message (it also covers local manifests)
- Strengthen the test to assert that the catalog is retained, not just that the state is Running
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
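A minimal sketch of a source-neutral Display impl for the manifest error. The OperationError shape and the exact message text are assumptions for illustration; the PR does not show the real wording. Exposing the boxed cause through source() keeps the underlying Uplink or file error available to callers.

```rust
use std::error::Error;
use std::fmt;

type BoxError = Box<dyn Error + Send + Sync + 'static>;

// Hypothetical subset of OperationError; only the Manifest variant is shown.
#[derive(Debug)]
enum OperationError {
    Manifest(BoxError),
}

impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Source-neutral wording: the manifest may come from Uplink or a local file.
            OperationError::Manifest(e) => write!(f, "failed to load persisted query manifest: {e}"),
        }
    }
}

impl Error for OperationError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            OperationError::Manifest(e) => {
                // Drop the Send + Sync bounds to match the source() signature.
                let inner: &(dyn Error + 'static) = &**e;
                Some(inner)
            }
        }
    }
}
```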
DaleSeo approved these changes on Jun 22, 2026.
Summary
Closes #761.
When operations.source: uplink is configured, a transient Uplink persisted-query manifest fetch failure (network timeout, DNS failure, HTTP 5xx, or a retryable retry_later response) previously caused the server to replace its active tool catalog with an empty list. The server continued reporting /health UP, sent tools/list_changed to connected clients, and those clients received {"tools": []} with no error signal.
This PR fixes the bug by mirroring the existing CollectionError keep-last-good pattern for manifest fetch failures.
Root Cause
manifest_poller.rs:44–47 converted any Err from the Uplink stream into Event::UpdateManifest(vec![]), treating a transient infrastructure failure as an authoritative "empty collection" signal.
Changes
- apollo-mcp-registry: Added a ManifestError(BoxError) variant to the manifest event enum. manifest_poller.rs now emits ManifestError(e) on Err instead of UpdateManifest(vec![]).
- apollo-mcp-server: Added ManifestError(BoxError) to ServerEvent and a Manifest variant to OperationError. operation_source.rs forwards ManifestError through the event stream. The state machine handles ManifestError while Running by retaining the existing catalog and logging the error (exactly as CollectionError is handled), and treats it as fatal during startup, before the first successful load.
- Tests: two regression tests in states.rs, one confirming the server stays alive with its catalog intact on ManifestError while Running, and one confirming it is fatal during startup.
Behavior
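- Transient fetch failure (timeout, DNS failure, HTTP 5xx, retryable retry_later) while Running: previously the catalog was wiped; now the existing catalog is retained and the error is logged.
- The same failure during startup, before the first successful load: fatal.
- Authoritative empty manifest (HTTP 200 with a valid empty payload): the catalog is cleared, as before.
Test Plan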
- cargo test -p apollo-mcp-server manifest_error: both new regression tests pass
- cargo test -p apollo-mcp-server collection_error: existing tests still pass
- cargo clippy --all-targets -- --deny warnings: no new warnings
- With operations.source: uplink, simulate an unreachable Uplink; confirm tools/list returns the previous tools and /health logs a structured error
🤖 Generated with Claude Code