Description
Two issues in src/lib/cache/ affect the FetchOrSave("cfgs", ...) path under concurrent load. This path is hit twice per HTTP request (security middleware + auth.Login, both via config.AuthMode(ctx)), so the effect is doubled per request.
Bug 1 — keyMutex serializes concurrent requests on L1 miss (src/lib/cache/util.go)
FetchOrSave uses a global keyMutex to guard the cache-fill. On a cold L1 cache, only one caller runs builder(); all other concurrent callers queue on Lock() and each performs its own Redis Fetch sequentially after the lock is released. The result the leader computed is not shared with the waiters.
singleflight.Group is the appropriate primitive here, and is already used elsewhere in Harbor for the same anti-pattern (src/controller/quota/controller.go, "prevent cache penetration").
Bug 2 — Save() uses the (cancelable) request context (src/lib/cache/helper.go)
FetchOrSave calls c.Save(ctx, ...) synchronously with the HTTP request context. If the client disconnects before Redis responds (slow Redis: BGSAVE, reconnection after idle timeout, network), the context is canceled and Save fails — so Redis L2 is never populated and the next L1 miss re-queries Postgres.
This is amplified by error wrapping in src/lib/cache/redis/redis.go: Fetch() wraps all errors (including context.Canceled) as cache.ErrNotFound, so a canceled fetch is indistinguishable from a real cache miss and builder() runs again on every request.
Both behaviors are present on the current main branch as well as on v2.15.1.
Affected version
v2.15.1 (observed), and main. Reproduced on 3 independent registries.
Symptom
Recurring warning, ~60 per 30 requests (2 calls/request):
[WARNING] [/lib/cache/helper.go:66]: failed to save value to cache, error: context canceled
Constant Postgres pressure on properties (a 2-row table): ~1700 seq scans over 26h, i.e. one full read roughly every TTL window, instead of being served from Redis L2.
Reproduction
Bug 1 (serialization). Hold an exclusive lock on properties for ~3s, then fire 30 parallel requests against an endpoint that goes through config.AuthMode, on a cold L1 cache (fresh / restarted core):
DO $$ BEGIN LOCK TABLE properties IN ACCESS EXCLUSIVE MODE; PERFORM pg_sleep(3); END $$;
Measure the latency spread (max − min) across the 30 responses. With the mutex, the waiters drain sequentially after unlock → spread in the hundreds of ms. With singleflight, all waiters get the leader's result simultaneously → spread dominated by scheduler jitter.
Bug 2 (Save canceled). Block Redis for ~3s (CLIENT PAUSE / DEBUG SLEEP), then send 30 requests with a 1s client timeout (curl --max-time 1) so the request context is canceled before Redis recovers. Count context canceled warnings and check EXISTS cache:cfgs.
Note: a cold L1 must be guaranteed by restarting the core process, not by waiting for TTL expiry — the internal scheduler periodically self-calls (CORE_LOCAL_URL) and can repopulate L1 during the wait.
Observed vs. expected
| Test |
Setup |
Current |
Expected |
| Latency spread |
properties locked 3s, 30 parallel req, cold L1 |
542 ms |
< 150 ms (shared result) |
context canceled warnings |
Redis blocked 3s, --max-time 1 |
60 |
0 |
cache:cfgs EXISTS after Redis recovers |
same |
0 (never written) |
1 |
| Postgres reads (delta) |
same |
+60 |
+1 |
Proposed fix
Replace keyMutex with singleflight.Group and make the inner Fetch/Save non-cancelable via context.WithoutCancel(ctx). FetchOrSave's signature and external semantics are unchanged; the existing TestSaveCalledOnlyOneTime still holds (single entry into the singleflight.Do callback). The redis.go error-wrapping issue (treating transient errors as ErrNotFound) is related but is a larger change, best handled separately.
I have a working patch and validation results, and am happy to open a PR and contribute.
Description
Two issues in
src/lib/cache/affect theFetchOrSave("cfgs", ...)path under concurrent load. This path is hit twice per HTTP request (security middleware +auth.Login, both viaconfig.AuthMode(ctx)), so the effect is doubled per request.Bug 1 —
keyMutexserializes concurrent requests on L1 miss (src/lib/cache/util.go)FetchOrSaveuses a globalkeyMutexto guard the cache-fill. On a cold L1 cache, only one caller runsbuilder(); all other concurrent callers queue onLock()and each performs its own RedisFetchsequentially after the lock is released. The result the leader computed is not shared with the waiters.singleflight.Groupis the appropriate primitive here, and is already used elsewhere in Harbor for the same anti-pattern (src/controller/quota/controller.go, "prevent cache penetration").Bug 2 —
Save()uses the (cancelable) request context (src/lib/cache/helper.go)FetchOrSavecallsc.Save(ctx, ...)synchronously with the HTTP request context. If the client disconnects before Redis responds (slow Redis: BGSAVE, reconnection after idle timeout, network), the context is canceled andSavefails — so Redis L2 is never populated and the next L1 miss re-queries Postgres.This is amplified by error wrapping in
src/lib/cache/redis/redis.go:Fetch()wraps all errors (includingcontext.Canceled) ascache.ErrNotFound, so a canceled fetch is indistinguishable from a real cache miss andbuilder()runs again on every request.Both behaviors are present on the current
mainbranch as well as onv2.15.1.Affected version
v2.15.1(observed), andmain. Reproduced on 3 independent registries.Symptom
Recurring warning, ~60 per 30 requests (2 calls/request):
Constant Postgres pressure on
properties(a 2-row table): ~1700 seq scans over 26h, i.e. one full read roughly every TTL window, instead of being served from Redis L2.Reproduction
Bug 1 (serialization). Hold an exclusive lock on
propertiesfor ~3s, then fire 30 parallel requests against an endpoint that goes throughconfig.AuthMode, on a cold L1 cache (fresh / restarted core):Measure the latency spread (max − min) across the 30 responses. With the mutex, the waiters drain sequentially after unlock → spread in the hundreds of ms. With
singleflight, all waiters get the leader's result simultaneously → spread dominated by scheduler jitter.Bug 2 (Save canceled). Block Redis for ~3s (
CLIENT PAUSE/DEBUG SLEEP), then send 30 requests with a 1s client timeout (curl --max-time 1) so the request context is canceled before Redis recovers. Countcontext canceledwarnings and checkEXISTS cache:cfgs.Observed vs. expected
context canceledwarnings--max-time 1cache:cfgs EXISTSafter Redis recoversProposed fix
Replace
keyMutexwithsingleflight.Groupand make the innerFetch/Savenon-cancelable viacontext.WithoutCancel(ctx).FetchOrSave's signature and external semantics are unchanged; the existingTestSaveCalledOnlyOneTimestill holds (single entry into thesingleflight.Docallback). Theredis.goerror-wrapping issue (treating transient errors asErrNotFound) is related but is a larger change, best handled separately.I have a working patch and validation results, and am happy to open a PR and contribute.