MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat

## MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat

### Summary

When the MEK (Master Encryption Key) ConfigMap is regenerated with new key material (which happens on Helm reinstall, ArgoCD prune/recreate, or namespace cleanup), the config deserialization logic in `postgres.py` wraps already-encrypted JWE values in an additional JWE layer instead of detecting the key mismatch. Each extra layer base64url-encodes the previous JWE string, so config values grow by roughly 4/3 (~1.33×) per MEK regeneration, and faster for small values because of the fixed JWE overhead. The growth eventually breaks JWT auth, workflow submission, and K8s Secret creation.

### Affected code

Two interacting issues:

**1. Static `kid` in MEK generation** — [`mek-configmap.yaml` L57](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/deployments/charts/quick-start/templates/mek-configmap.yaml#L57):

```bash
JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"key1","kty":"oct"}'
```

Every MEK regeneration produces a key with `kid="key1"`. When [`get_mek("key1")`](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/secret_manager/secret_manager.py#L92-L98) is called, it returns the current (new) key — the kid matches, so no `OSMONotFoundError` is raised, but the key material is wrong.

**2. Except handler treats undecryptable JWE as plaintext** — [`postgres.py` L2697–2701](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/connectors/postgres.py#L2697-L2701):

```python
except (JWException, osmo_errors.OSMONotFoundError):
    # Encrypt the plain text secret
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value
```

When [`deserialize()`](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/connectors/postgres.py#L2686) succeeds (the value is structurally valid JWE) but [`decrypt()`](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/secret_manager/secret_manager.py#L165-L182) fails (wrong key material), the except handler does not distinguish between "genuinely unencrypted plaintext" and "JWE encrypted with a lost key". It re-encrypts the entire JWE string, producing `JWE_new(JWE_old(plaintext))`.

The same pattern exists in [`decrypt_credential()`](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/connectors/postgres.py#L606-L633) at [L624–626](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/connectors/postgres.py#L624-L626).

**Why `get_uek()` doesn't catch it** — When `uid=''` (which is the case for config secrets), [`get_uek()`](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/secret_manager/secret_manager.py#L100-L104) returns the MEK directly. Since the kid is always `"key1"` and `get_mek("key1")` returns the current (new) MEK, the key material mismatch is invisible.
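
The effect of a static `kid` can be illustrated with a toy key lookup. This is a minimal sketch, not OSMO's implementation: `MekStore` and `regenerate` are hypothetical names, and the store is just a dict indexed by `kid`, standing in for whatever `get_mek()` reads from the ConfigMap.

```python
import secrets


class MekStore:
    """Toy stand-in for a kid-indexed key lookup (hypothetical, not OSMO code)."""

    def __init__(self):
        self._keys = {}

    def install(self, kid, material):
        # Installing under an existing kid silently replaces the old material.
        self._keys[kid] = material

    def get_mek(self, kid):
        if kid not in self._keys:
            raise KeyError(f"no MEK with kid={kid!r}")
        return self._keys[kid]


def regenerate(kid):
    """Simulate a recreated ConfigMap: a fresh store holding one new random key."""
    store = MekStore()
    material = secrets.token_bytes(32)
    store.install(kid, material)
    return store, material


# Static kid: the lookup succeeds after regeneration but returns different bytes.
_, old_material = regenerate("key1")
store, _ = regenerate("key1")
static_kid_mismatch = store.get_mek("key1") != old_material

# Unique kid: looking up the old kid fails loudly instead.
store, _ = regenerate(secrets.token_hex(8))
try:
    store.get_mek("key1")
    unique_kid_detected = False
except KeyError:
    unique_kid_detected = True

print(static_kid_mismatch, unique_kid_detected)  # True True
```

With a static `kid` the only signal of a mismatch is the later decryption failure, which is exactly the case the except handler misreads as plaintext.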

### Growth cycle

```
Initial:         plaintext                                        ~490 bytes
After MEK regen: JWE_1(plaintext)                                 ~683 bytes
After 2nd regen: JWE_2(JWE_1(plaintext))                        ~1,100 bytes
After 3rd regen: JWE_3(JWE_2(JWE_1(plaintext)))                 ~1,700 bytes
...
After N regens:  JWE_N(JWE_{N-1}(...(JWE_1(plaintext))...))     ~1.33^N × initial
```

Growth is per MEK regeneration event, not per pod restart with the same MEK: when the key material is unchanged, `decrypt()` succeeds and the stored value is left as-is.

### Affected configs

Three values in the `configs` PostgreSQL table contain `pydantic.SecretStr` fields:

| Config key | Nested SecretStr field | Healthy size |
|---|---|---|
| `service_auth` | `keys.<kid>.private_key` | ~2,900 B |
| `backend_images` | `credential.auth` | ~490 B |
| `workflow_data` | `credential.access_key` | ~460 B |

### Failure modes

1. **JWT auth failure** — `service_auth.keys.<kid>.private_key` becomes a JWE string instead of a JWK JSON object. `jwcrypto.jwk.JWK.from_json()` fails with `JSONDecodeError`. Result: `POST /api/auth/jwt/access_token` returns 500.

2. **K8s Secret size limit** — OSMO syncs `backend_images` to a K8s Secret. After enough MEK regenerations, the value exceeds the 1 MB K8s Secret limit. Result: `422 Unprocessable Entity` on Secret creation, service CrashLoopBackOff.

3. **Workflow credential parsing** — `workflow_data.credential` becomes a JWE string instead of the expected `{endpoint, access_key_id, access_key}` dict. Result: `CreateGroup` fails with `OSMOServerError: Workflow data credential is not set`.
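
Failure mode 1 can be illustrated with the standard library alone. In this sketch, `parse_private_key` is a hypothetical consumer that uses `json.loads` as a stand-in for JWK parsing, and the JWE-like string carries dummy segments after a real header. The exception that `jwcrypto` surfaces in practice may be wrapped differently.

```python
import base64
import json

# Build a string shaped like a compact JWE: a base64url JSON header followed by
# four more dot-separated segments (dummy values here, not real ciphertext).
header = json.dumps({"alg": "A256GCMKW", "enc": "A256GCM"}, separators=(",", ":"))
header_b64 = base64.urlsafe_b64encode(header.encode()).rstrip(b"=").decode()
jwe_like = ".".join([header_b64, "AAAA", "BBBB", "CCCC", "DDDD"])


def parse_private_key(value):
    """Hypothetical consumer that expects a JWK JSON document."""
    return json.loads(value)


# A healthy value parses into a dict.
healthy = parse_private_key('{"kty":"oct","k":"c2VjcmV0"}')

# An undecrypted JWE string is not JSON, so parsing fails.
try:
    parse_private_key(jwe_like)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

print(healthy["kty"], parse_failed)  # oct True
```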

### Trigger conditions

The MEK ConfigMap (`mek-config`) is regenerated when:
- ArgoCD prunes/recreates the ConfigMap (e.g., when `osmo-init` Application is deleted and recreated)
- Helm uninstall + reinstall
- Namespace cleanup or manual ConfigMap deletion
- Any operation that deletes the ConfigMap while the Helm `pre-install` hook is active

### Reproducer

Minimal Python script using only `jwcrypto`:

```python
from jwcrypto import jwe, jwk
from jwcrypto.common import JWException, json_encode

ALG, ENC = 'A256GCMKW', 'A256GCM'

def make_mek():
    """Simulate MEK generation: always kid='key1', random key material."""
    return jwk.JWK.generate(kty='oct', size=256, kid='key1')

def encrypt(plaintext, mek):
    """Encrypt a string into a compact JWE under the given MEK."""
    token = jwe.JWE(plaintext.encode(), json_encode({'alg': ALG, 'enc': ENC, 'kid': mek.key_id}))
    token.add_recipient(mek)
    return token.serialize(compact=True)

def decrypt_or_reencrypt(value, mek):
    """Simulates the logic at postgres.py L2685-2701."""
    token = jwe.JWE()
    try:
        token.deserialize(value)
        token.decrypt(mek)
        return token.payload.decode(), value  # success
    except JWException:
        # BUG: re-encrypts the JWE string as if it were plaintext
        return value, encrypt(value, mek)

# Simulate MEK regeneration cycle
secret = '{"kty":"RSA","n":"abc...","d":"xyz..."}'  # ~40 bytes
mek_0 = make_mek()
encrypted = encrypt(secret, mek_0)
print(f"Initial: {len(encrypted)} bytes")

for i in range(1, 8):
    mek_new = make_mek()  # new key material, same kid="key1"
    _, encrypted = decrypt_or_reencrypt(encrypted, mek_new)
    print(f"After MEK regen {i}: {len(encrypted)} bytes")
```

Output:
```
Initial: 262 bytes
After MEK regen 1: 573 bytes
After MEK regen 2: 987 bytes
After MEK regen 3: 1539 bytes
After MEK regen 4: 2275 bytes
After MEK regen 5: 3259 bytes
After MEK regen 6: 4571 bytes
After MEK regen 7: 6319 bytes
```
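
The output above is consistent with a simple recurrence: each wrap base64url-encodes the previous string and adds a roughly constant overhead for the header, encrypted key, IV, tag, and separators. The sketch below models that. The 223-byte overhead is an assumption inferred from the first rows of the output (later rows differ by a byte or two), and `regens_until` is a hypothetical helper for estimating how many regenerations it takes to cross a size limit.

```python
def b64url_len(n):
    """Length of the unpadded base64url encoding of n bytes."""
    return (4 * n + 2) // 3


def next_size(size, overhead=223):
    """Size after one more wrap: encoded payload plus an assumed fixed overhead."""
    return b64url_len(size) + overhead


def regens_until(limit, start, overhead=223):
    """Count the wraps needed for a value of `start` bytes to exceed `limit`."""
    size, count = start, 0
    while size <= limit:
        size = next_size(size, overhead)
        count += 1
    return count


# Replay the first four wraps from the reproducer's 262-byte starting value.
sizes = [262]
for _ in range(4):
    sizes.append(next_size(sizes[-1]))
print(sizes)  # [262, 573, 987, 1539, 2275]

# Estimate the number of wraps before a ~490-byte value passes 1 MiB.
print(regens_until(1024 * 1024, 490))
```

Under this model the per-step ratio starts above 2× for small values and approaches 4/3 as the fixed overhead becomes negligible, which matches the "~1.33×, faster for small values" description in the summary.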

### Proposed fix

Two complementary changes:

**Fix 1 — Don't re-wrap existing JWE** ([`postgres.py` L2697](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/src/utils/connectors/postgres.py#L2697)):

```python
except (JWException, osmo_errors.OSMONotFoundError):
    if secret.startswith("eyJ"):  # already a JWE compact serialization
        logging.error(
            "Cannot decrypt config key '%s': MEK key material mismatch. "
            "Value will remain encrypted with the previous key.", top_level_key)
        return secret, None  # preserve as-is, don't double-wrap
    # Genuinely unencrypted plaintext — encrypt it
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value
```
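
The `eyJ` prefix test also matches other base64url-encoded JSON objects, such as JWS/JWT tokens. A stricter structural check is sketched below using only the standard library. `looks_like_compact_jwe` is a hypothetical helper, not an existing OSMO function, and it assumes secrets are stored in JWE compact serialization: five dot-separated segments whose first segment decodes to a JSON header containing `alg` and `enc`.

```python
import base64
import json


def looks_like_compact_jwe(value):
    """Return True if value is structurally a JWE compact serialization."""
    parts = value.split(".")
    if len(parts) != 5:
        return False
    header_b64 = parts[0]
    try:
        # Restore the padding that base64url encoding in JOSE strips.
        padded = header_b64 + "=" * (-len(header_b64) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers binascii.Error, JSONDecodeError, and UnicodeDecodeError.
        return False
    return isinstance(header, dict) and "alg" in header and "enc" in header


def _b64url_json(obj):
    """Encode a dict as unpadded base64url JSON (used only for the demo below)."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


jwe_like = _b64url_json({"alg": "A256GCMKW", "enc": "A256GCM"}) + ".a.b.c.d"
jws_like = _b64url_json({"alg": "HS256"}) + ".payload.signature"
plaintext = '{"kty":"RSA","n":"abc...","d":"xyz..."}'

print(looks_like_compact_jwe(jwe_like))   # True
print(looks_like_compact_jwe(jws_like))   # False
print(looks_like_compact_jwe(plaintext))  # False
```

A check like this only says the value is shaped like a JWE. It cannot tell which key produced it, so it complements a unique `kid` rather than replacing it.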

**Fix 2 — Unique `kid` per MEK generation** ([`mek-configmap.yaml` L57](https://github.com/NVIDIA/OSMO/blob/d1e8a17db97ea7015277a7a7e8b6bd4e90770c74/deployments/charts/quick-start/templates/mek-configmap.yaml#L57)):

```bash
KID=$(openssl rand -hex 8)
JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"'$KID'","kty":"oct"}'
```

Fix 1 prevents the damage (defense in depth). Fix 2 makes the key mismatch detectable — `get_mek(old_kid)` would raise `OSMONotFoundError` instead of silently returning the wrong key.

### Environment

- OSMO version: 6.2-rc6 (also reproducible on `main` @ d1e8a17)
- Deployment: EKS 1.31, ArgoCD-managed
- Database: Aurora PostgreSQL Serverless v2


MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat #731