Skip to content

Cache: serializing keyMutex in FetchOrSave + Save() canceled by request context (src/lib/cache) #23335

@air1bzz

Description

@air1bzz

Description

Two issues in src/lib/cache/ affect the FetchOrSave("cfgs", ...) path under concurrent load. This path is hit twice per HTTP request (security middleware + auth.Login, both via config.AuthMode(ctx)), so the effect is doubled per request.

Bug 1 — keyMutex serializes concurrent requests on L1 miss (src/lib/cache/util.go)

FetchOrSave uses a global keyMutex to guard the cache-fill. On a cold L1 cache, only one caller runs builder(); all other concurrent callers queue on Lock() and each performs its own Redis Fetch sequentially after the lock is released. The result the leader computed is not shared with the waiters.

singleflight.Group is the appropriate primitive here, and is already used elsewhere in Harbor for the same anti-pattern (src/controller/quota/controller.go, "prevent cache penetration").

Bug 2 — Save() uses the (cancelable) request context (src/lib/cache/helper.go)

FetchOrSave calls c.Save(ctx, ...) synchronously with the HTTP request context. If the client disconnects before Redis responds (slow Redis: BGSAVE, reconnection after idle timeout, network), the context is canceled and Save fails — so Redis L2 is never populated and the next L1 miss re-queries Postgres.

This is amplified by error wrapping in src/lib/cache/redis/redis.go: Fetch() wraps all errors (including context.Canceled) as cache.ErrNotFound, so a canceled fetch is indistinguishable from a real cache miss and builder() runs again on every request.

Both behaviors are present on the current main branch as well as on v2.15.1.

Affected version

v2.15.1 (observed), and main. Reproduced on 3 independent registries.

Symptom

Recurring warning, ~60 per 30 requests (2 calls/request):

[WARNING] [/lib/cache/helper.go:66]: failed to save value to cache, error: context canceled

Constant Postgres pressure on properties (a 2-row table): ~1700 seq scans over 26h, i.e. one full read roughly every TTL window, instead of being served from Redis L2.

Reproduction

Bug 1 (serialization). Hold an exclusive lock on properties for ~3s, then fire 30 parallel requests against an endpoint that goes through config.AuthMode, on a cold L1 cache (fresh / restarted core):

DO $$ BEGIN LOCK TABLE properties IN ACCESS EXCLUSIVE MODE; PERFORM pg_sleep(3); END $$;

Measure the latency spread (max − min) across the 30 responses. With the mutex, the waiters drain sequentially after unlock → spread in the hundreds of ms. With singleflight, all waiters get the leader's result simultaneously → spread dominated by scheduler jitter.

Bug 2 (Save canceled). Block Redis for ~3s (CLIENT PAUSE / DEBUG SLEEP), then send 30 requests with a 1s client timeout (curl --max-time 1) so the request context is canceled before Redis recovers. Count context canceled warnings and check EXISTS cache:cfgs.

Note: a cold L1 must be guaranteed by restarting the core process, not by waiting for TTL expiry — the internal scheduler periodically self-calls (CORE_LOCAL_URL) and can repopulate L1 during the wait.

Observed vs. expected

Test Setup Current Expected
Latency spread properties locked 3s, 30 parallel req, cold L1 542 ms < 150 ms (shared result)
context canceled warnings Redis blocked 3s, --max-time 1 60 0
cache:cfgs EXISTS after Redis recovers same 0 (never written) 1
Postgres reads (delta) same +60 +1

Proposed fix

Replace keyMutex with singleflight.Group and make the inner Fetch/Save non-cancelable via context.WithoutCancel(ctx). FetchOrSave's signature and external semantics are unchanged; the existing TestSaveCalledOnlyOneTime still holds (single entry into the singleflight.Do callback). The redis.go error-wrapping issue (treating transient errors as ErrNotFound) is related but is a larger change, best handled separately.

I have a working patch and validation results, and am happy to open a PR and contribute.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions