Cache: serializing keyMutex in FetchOrSave + Save() canceled by request context (src/lib/cache)

## Description

Two issues in `src/lib/cache/` affect the `FetchOrSave("cfgs", ...)` path under concurrent load. This path is hit twice per HTTP request (security middleware + `auth.Login`, both via `config.AuthMode(ctx)`), so the effect is doubled per request.

### Bug 1 — `keyMutex` serializes concurrent requests on L1 miss (`src/lib/cache/util.go`)

`FetchOrSave` uses a global `keyMutex` to guard the cache-fill. On a cold L1 cache, only one caller runs `builder()`; all other concurrent callers queue on `Lock()` and each performs its own Redis `Fetch` sequentially after the lock is released. The result the leader computed is not shared with the waiters.

`singleflight.Group` is the appropriate primitive here, and is already used elsewhere in Harbor for the same anti-pattern (`src/controller/quota/controller.go`, "prevent cache penetration").

### Bug 2 — `Save()` uses the (cancelable) request context (`src/lib/cache/helper.go`)

`FetchOrSave` calls `c.Save(ctx, ...)` synchronously with the HTTP request context. If the client disconnects before Redis responds (slow Redis: BGSAVE, reconnection after idle timeout, network), the context is canceled and `Save` fails — so Redis L2 is never populated and the next L1 miss re-queries Postgres.

This is amplified by error wrapping in `src/lib/cache/redis/redis.go`: `Fetch()` wraps **all** errors (including `context.Canceled`) as `cache.ErrNotFound`, so a canceled fetch is indistinguishable from a real cache miss and `builder()` runs again on every request.

Both behaviors are present on the current `main` branch as well as on `v2.15.1`.

## Affected version

`v2.15.1` (observed), and `main`. Reproduced on 3 independent registries.

## Symptom

Recurring warning, ~60 per 30 requests (2 calls/request):

```
[WARNING] [/lib/cache/helper.go:66]: failed to save value to cache, error: context canceled
```

Constant Postgres pressure on `properties` (a 2-row table): ~1700 seq scans over 26h, i.e. one full read roughly every TTL window, instead of being served from Redis L2.

## Reproduction

**Bug 1 (serialization).** Hold an exclusive lock on `properties` for ~3s, then fire 30 parallel requests against an endpoint that goes through `config.AuthMode`, on a cold L1 cache (fresh / restarted core):

```sql
DO $$ BEGIN LOCK TABLE properties IN ACCESS EXCLUSIVE MODE; PERFORM pg_sleep(3); END $$;
```

Measure the latency spread (max − min) across the 30 responses. With the mutex, the waiters drain sequentially after unlock → spread in the hundreds of ms. With `singleflight`, all waiters get the leader's result simultaneously → spread dominated by scheduler jitter.

**Bug 2 (Save canceled).** Block Redis for ~3s (`CLIENT PAUSE` / `DEBUG SLEEP`), then send 30 requests with a 1s client timeout (`curl --max-time 1`) so the request context is canceled before Redis recovers. Count `context canceled` warnings and check `EXISTS cache:cfgs`.

> Note: a cold L1 must be guaranteed by restarting the core process, not by waiting for TTL expiry — the internal scheduler periodically self-calls (`CORE_LOCAL_URL`) and can repopulate L1 during the wait.

## Observed vs. expected

| Test | Setup | Current | Expected |
|------|-------|---------|----------|
| Latency spread | properties locked 3s, 30 parallel req, cold L1 | 542 ms | < 150 ms (shared result) |
| `context canceled` warnings | Redis blocked 3s, `--max-time 1` | 60 | 0 |
| `cache:cfgs EXISTS` after Redis recovers | same | 0 (never written) | 1 |
| Postgres reads (delta) | same | +60 | +1 |

## Proposed fix

Replace `keyMutex` with `singleflight.Group` and make the inner `Fetch`/`Save` non-cancelable via `context.WithoutCancel(ctx)`. `FetchOrSave`'s signature and external semantics are unchanged; the existing `TestSaveCalledOnlyOneTime` still holds (single entry into the `singleflight.Do` callback). The `redis.go` error-wrapping issue (treating transient errors as `ErrNotFound`) is related but is a larger change, best handled separately.

I have a working patch and validation results, and am happy to open a PR and contribute.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache: serializing keyMutex in FetchOrSave + Save() canceled by request context (src/lib/cache) #23335

Description

Bug 1 — `keyMutex` serializes concurrent requests on L1 miss (`src/lib/cache/util.go`)

Bug 2 — `Save()` uses the (cancelable) request context (`src/lib/cache/helper.go`)

Affected version

Symptom

Reproduction

Observed vs. expected

Proposed fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Test	Setup	Current	Expected
Latency spread	properties locked 3s, 30 parallel req, cold L1	542 ms	< 150 ms (shared result)
`context canceled` warnings	Redis blocked 3s, `--max-time 1`	60	0
`cache:cfgs EXISTS` after Redis recovers	same	0 (never written)	1
Postgres reads (delta)	same	+60	+1

Cache: serializing keyMutex in FetchOrSave + Save() canceled by request context (src/lib/cache) #23335

Description

Description

Bug 1 — keyMutex serializes concurrent requests on L1 miss (src/lib/cache/util.go)

Bug 2 — Save() uses the (cancelable) request context (src/lib/cache/helper.go)

Affected version

Symptom

Reproduction

Observed vs. expected

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug 1 — `keyMutex` serializes concurrent requests on L1 miss (`src/lib/cache/util.go`)

Bug 2 — `Save()` uses the (cancelable) request context (`src/lib/cache/helper.go`)