Commit 03a5a82

rohansonecha and claude authored
[API] Clear request-scoped cache in gpu-metrics endpoint (#9265)
* [API] Add on_gpu_metrics_collect plugin hook

  Add a lifecycle hook to BasePlugin that the metrics server calls before collecting GPU metrics. This allows plugins to sync process-level state (e.g. KUBECONFIG) into the metrics server process, which runs separately from request worker processes.

  Without this, credentials uploaded via the credential manager plugin only update the environment in the worker process that handled the upload. The metrics server in the main process never sees the change, requiring a pod restart for /gpu-metrics to discover newly uploaded kubeconfigs.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Load plugins in metrics server on startup

  The metrics server runs as a separate uvicorn instance in a background thread, so plugins loaded by the main API server are not available in its context. Add a startup event that loads plugins if they haven't been loaded yet, enabling plugin hooks like on_gpu_metrics_collect to work.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Ensure KUBECONFIG includes credential manager path in metrics server

  Instead of relying on plugin loading (which fails in the metrics server since install() requires a FastAPI app context), directly ensure KUBECONFIG includes the credential manager kubeconfig path before each gpu-metrics collection.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Add /gpu-metrics-debug endpoint for diagnostics

  Temporary debug endpoint to inspect plugin state, KUBECONFIG, and discovered contexts from within the metrics server process.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Clear request-scoped cache before gpu-metrics collection

  The metrics server runs as a daemon thread where request-scoped caches (kubernetes API clients, context names) are never cleared automatically. This causes stale results from boot time to persist indefinitely: if a kubeconfig file didn't exist at boot, context discovery caches the failure and never retries.

  Call clear_request_level_cache() before each gpu-metrics scrape, matching the pattern used by the billing daemon.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Enhance debug endpoint to show cache clear effect

  Shows contexts before and after clearing request-scoped cache to verify the stale cache hypothesis.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [API] Clear request-scoped cache in gpu-metrics endpoint

  The metrics server runs as a daemon thread where request-scoped caches are never cleared automatically. This causes stale results from boot time to persist: if a kubeconfig didn't exist at boot, context discovery caches the failure and never retries.

  Add clear_request_level_cache() before each gpu-metrics scrape, matching the pattern used by other daemon threads (billing, gpu healer).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dddfe09 commit 03a5a82

1 file changed

Lines changed: 6 additions & 0 deletions

sky/server/metrics.py

@@ -20,6 +20,7 @@
 from sky import global_user_state
 from sky import sky_logging
 from sky.metrics import utils as metrics_utils
+from sky.utils import annotations
 
 logger = sky_logging.init_logger(__name__)
 
@@ -199,6 +200,11 @@ def metrics() -> fastapi.Response:
 @metrics_app.get('/gpu-metrics')
 async def gpu_metrics() -> fastapi.Response:
     """Gets the GPU metrics from multiple external k8s clusters"""
+    # The metrics server runs as a daemon thread, not as a normal request
+    # handler, so request-scoped caches (e.g. kubernetes API clients,
+    # context names) are never cleared automatically. Clear them on each
+    # scrape so that newly uploaded kubeconfigs are discovered.
+    annotations.clear_request_level_cache()
     contexts = core.get_all_contexts()
     all_metrics: List[str] = []
     successful_contexts = 0
