Commit 07da810

docs(spec): add clarifications for OAuth token refresh reliability (smart-mcp-proxy#23)
Related smart-mcp-proxy#23

Adds clarifications from interactive review session:
- Exponential backoff retry (10s→20s→40s→80s→5min) for failed refreshes
- Surface refresh failures as degraded health status across CLI/menubar/web
- Prometheus metrics for refresh operations (counter + histogram)
- Refresh schedule lifecycle: exists iff tokens exist

## Changes
- Add Clarifications section with session Q&As
- Add FR-009 (backoff retry), FR-010 (health surfacing), FR-011 (metrics)
- Update edge case for network partition with specific retry intervals
- Clarify Refresh Schedule entity lifecycle invariant

1 parent fa9b446 commit 07da810

1 file changed

Lines changed: 13 additions & 2 deletions

specs/023-oauth-state-persistence/spec.md

@@ -5,6 +5,14 @@
 **Status**: Draft
 **Input**: Fix token refresh so OAuth servers survive restarts and proactively refresh before expiration
 
+## Clarifications
+
+### Session 2026-01-12
+
+- Q: What happens when a proactive refresh attempt fails before token expiration? → A: Immediate retry with exponential backoff (10s, 20s, 40s...) up to expiration time, surfacing failures as health status for user visibility across CLI, menubar, and web UI.
+- Q: Should refresh operations emit metrics for observability? → A: Full Prometheus metrics only (`mcpproxy_oauth_refresh_total`, `mcpproxy_oauth_refresh_duration_seconds`), leveraging existing MetricsManager infrastructure.
+- Q: When should refresh schedules be created/destroyed? → A: Schedule exists iff tokens exist (created post-auth or when loaded at startup, destroyed when tokens removed). Implementation details deferred to planning.
+
 ## Problem Statement
 
 OAuth-enabled MCP servers require manual re-authentication after any mcpproxy downtime (restart, laptop sleep, weekend). Despite having refresh tokens stored in the database, the automatic token refresh is not working reliably.
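
The lifecycle rule from the third clarification in this hunk (a refresh schedule exists iff tokens exist) can be illustrated with a minimal Python sketch. The spec defers implementation details to planning, so everything here is an assumption made for illustration: the `RefreshRegistry` class, its method names, and the dictionary layout are hypothetical and do not describe the actual RefreshManager.

```python
class RefreshRegistry:
    """Hypothetical model that keeps refresh schedules in lockstep with stored tokens."""

    def __init__(self, stored_tokens=None):
        self.tokens = {}     # server name -> token data
        self.schedules = {}  # server name -> schedule data
        # Tokens loaded at startup get a schedule immediately.
        for server, token in (stored_tokens or {}).items():
            self.store_token(server, token)

    def store_token(self, server, token):
        """Called post-auth or at startup: storing a token creates its schedule."""
        self.tokens[server] = token
        # Field names follow the Refresh Schedule entity in the spec.
        self.schedules.setdefault(server, {"retry_count": 0, "last_error": None})

    def remove_token(self, server):
        """Removing a token destroys its schedule in the same step."""
        self.tokens.pop(server, None)
        self.schedules.pop(server, None)

    def invariant_holds(self):
        """True when the set of scheduled servers equals the set of servers with tokens."""
        return self.tokens.keys() == self.schedules.keys()
```

Routing every token mutation through two methods is one way to make the invariant hard to violate; a real implementation would also need to handle concurrency and persistence, which this sketch ignores.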
@@ -80,7 +88,7 @@ As a developer, when automatic token refresh fails, I want to understand why it
 - System should detect this on the subsequent connection attempt and report the specific error rather than entering a retry loop.
 
 - What happens during a network partition?
-- System should use exponential backoff for refresh retries, with clear status indicating retry attempts.
+- System should use exponential backoff for refresh retries (10s → 20s → 40s → 80s → 5min cap), with health status showing "Refresh retry pending" and next attempt time.
 
 ## Requirements *(mandatory)*
 
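
The retry schedule in the network-partition edge case above can be sketched in Python as follows. This is a sketch under stated assumptions, not the project's implementation: the function names are hypothetical, and it assumes pure doubling from 10s with a hard 300s cap, which yields a 160s step between 80s and the cap. The spec's listing could also be read as jumping straight from 80s to 5 minutes. Stopping once the next attempt would land at or after token expiration is likewise an assumed reading of "up to expiration time".

```python
from datetime import datetime, timedelta
from typing import Optional

BASE_DELAY_SECONDS = 10   # first retry delay named in the spec
MAX_DELAY_SECONDS = 300   # 5-minute cap named in the spec


def retry_delay_seconds(attempt: int) -> int:
    """Delay before retry number `attempt` (0-based): doubles each time, capped at 5 minutes."""
    return min(BASE_DELAY_SECONDS * 2 ** attempt, MAX_DELAY_SECONDS)


def next_retry_at(now: datetime, attempt: int, token_expires_at: datetime) -> Optional[datetime]:
    """Return the time of the next retry, or None if it would not fall before token expiration."""
    candidate = now + timedelta(seconds=retry_delay_seconds(attempt))
    return candidate if candidate < token_expires_at else None
```

Under these assumptions, attempts 0 through 6 wait 10, 20, 40, 80, 160, 300, and 300 seconds. The value returned by `next_retry_at` is the kind of "next attempt time" a health status message could display.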
@@ -94,11 +102,14 @@ As a developer, when automatic token refresh fails, I want to understand why it
 - **FR-006**: System MUST display distinct health status messages for different refresh failure types (expired refresh token, network error, provider error).
 - **FR-007**: System MUST remove or correct the current misleading "OAuth token refresh successful" logging that reports success when `Start()` returns nil.
 - **FR-008**: System MUST rate-limit refresh attempts to no more than one per 10 seconds per server.
+- **FR-009**: System MUST implement exponential backoff retry (10s, 20s, 40s, 80s, capped at 5 minutes) when proactive refresh fails, continuing attempts until token expiration.
+- **FR-010**: System MUST surface ongoing refresh failures as degraded health status on the upstream server, visible in CLI (`upstream list`), menubar, and web control panel.
+- **FR-011**: System MUST emit Prometheus metrics for OAuth refresh operations: `mcpproxy_oauth_refresh_total` (counter with labels: server, result) and `mcpproxy_oauth_refresh_duration_seconds` (histogram with labels: server, result).
 
 ### Key Entities
 
 - **OAuth Token**: Access token, refresh token, expiration timestamp, token type, scope. Stored in database with server identifier.
-- **Refresh Schedule**: Server name, scheduled refresh time, retry count, last error. Managed by RefreshManager.
+- **Refresh Schedule**: Server name, scheduled refresh time, retry count, last error. Managed by RefreshManager. Lifecycle invariant: exists iff tokens exist for server.
 - **Health Status**: Level (healthy/degraded/unhealthy), summary, detail (including refresh error), suggested action.
 
 ## Success Criteria *(mandatory)*

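To make the shape of the FR-011 metrics concrete, here is a stdlib-only Python sketch that records refresh outcomes and renders them in the Prometheus text exposition format. It is an in-memory stand-in, not the MetricsManager integration the spec refers to. The `RefreshMetrics` class, the bucket boundaries, and the `result` label values used in the test data (`success`, `failure`) are all assumptions; the spec fixes only the metric names and the `server` and `result` labels.

```python
from collections import defaultdict

# Bucket upper bounds in seconds. The spec does not define these; they are assumed.
BUCKETS = (0.1, 0.5, 1.0, 5.0, 30.0)


class RefreshMetrics:
    """Hypothetical in-memory stand-in for the counter and histogram named in FR-011."""

    def __init__(self):
        self.total = defaultdict(int)  # (server, result) -> number of refresh attempts
        # One slot per bucket bound, plus a final slot for +Inf.
        self.bucket_counts = defaultdict(lambda: [0] * (len(BUCKETS) + 1))
        self.duration_sum = defaultdict(float)

    def observe(self, server: str, result: str, seconds: float) -> None:
        """Record one refresh attempt and its duration."""
        key = (server, result)
        self.total[key] += 1
        self.duration_sum[key] += seconds
        # Histogram buckets are cumulative: count every bound >= the observation.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[key][i] += 1
        self.bucket_counts[key][-1] += 1  # the +Inf bucket counts everything

    def render(self) -> str:
        """Render all series as Prometheus text exposition lines."""
        lines = []
        bounds = [str(b) for b in BUCKETS] + ["+Inf"]
        for key, count in sorted(self.total.items()):
            labels = f'server="{key[0]}",result="{key[1]}"'
            lines.append(f"mcpproxy_oauth_refresh_total{{{labels}}} {count}")
            for bound, n in zip(bounds, self.bucket_counts[key]):
                lines.append(
                    f'mcpproxy_oauth_refresh_duration_seconds_bucket{{{labels},le="{bound}"}} {n}'
                )
            lines.append(f"mcpproxy_oauth_refresh_duration_seconds_sum{{{labels}}} {self.duration_sum[key]}")
            lines.append(f"mcpproxy_oauth_refresh_duration_seconds_count{{{labels}}} {count}")
        return "\n".join(lines)
```

Keeping `result` to a small fixed set of values bounds label cardinality, which matters because each distinct (server, result) pair becomes its own set of time series.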