MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat #731
Summary
When the MEK (Master Encryption Key) ConfigMap is regenerated with new key material — which happens on Helm reinstall, ArgoCD prune/recreate, or namespace cleanup — the config deserialization logic in postgres.py wraps already-encrypted JWE values in an additional JWE layer instead of detecting the key mismatch. This causes config values to grow ~1.33× per MEK regeneration (faster for small values due to fixed JWE overhead), eventually breaking JWT auth, workflow submission, and K8s Secret creation.
Affected code
Two interacting issues:
1. Static kid in MEK generation — mek-configmap.yaml L57:
```sh
JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"key1","kty":"oct"}'
```

Every MEK regeneration produces a key with kid="key1". When get_mek("key1") is called, it returns the current (new) key: the kid matches, so no OSMONotFoundError is raised, but the key material is wrong.
2. Except handler treats undecryptable JWE as plaintext — postgres.py L2697–2701:
```python
except (JWException, osmo_errors.OSMONotFoundError):
    # Encrypt the plain text secret
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value
```

When deserialize() succeeds (the value is structurally valid JWE) but decrypt() fails (wrong key material), the except handler does not distinguish between "genuinely unencrypted plaintext" and "JWE encrypted with a lost key". It re-encrypts the entire JWE string, producing JWE_new(JWE_old(plaintext)).
The same pattern exists in decrypt_credential() at L624–626.
Why get_uek() doesn't catch it — When uid='' (which is the case for config secrets), get_uek() returns the MEK directly. Since the kid is always "key1" and get_mek("key1") returns the current (new) MEK, the key material mismatch is invisible.
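The masking effect of the static kid can be illustrated with a small, self-contained sketch. It does not use OSMO code: `KeyStore`, `regenerate`, `get_key`, and `fingerprint` are hypothetical stand-ins for the MEK ConfigMap and get_mek(), and a plain dict lookup is assumed to behave like the real kid lookup.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical stand-in for the MEK ConfigMap: maps kid -> key bytes."""

    def __init__(self):
        self._keys = {}

    def regenerate(self, kid):
        """Simulate a ConfigMap recreate: old keys are lost, one new key exists."""
        self._keys = {kid: secrets.token_bytes(32)}

    def get_key(self, kid):
        """Look up a key by kid; raises KeyError if the kid is unknown."""
        return self._keys[kid]


def fingerprint(key):
    """Short digest used to compare key material without printing it."""
    return hashlib.sha256(key).hexdigest()[:12]


# Static kid: the lookup still succeeds after regeneration,
# but it returns different key material.
store = KeyStore()
store.regenerate("key1")
old_fp = fingerprint(store.get_key("key1"))
store.regenerate("key1")
new_fp = fingerprint(store.get_key("key1"))
static_kid_mismatch_hidden = old_fp != new_fp

# Unique kid: the old kid is absent, so the mismatch surfaces as an error.
store.regenerate("kid-a")
store.regenerate("kid-b")
try:
    store.get_key("kid-a")
    unique_kid_detected = False
except KeyError:
    unique_kid_detected = True

print(static_kid_mismatch_hidden, unique_kid_detected)
```

With a static kid the only signal left is the decryption failure itself, which the except handler then misreads as "this value is plaintext".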
Growth cycle
```text
Initial:          plaintext                                    ~490 bytes
After MEK regen:  JWE_1(plaintext)                             ~683 bytes
After 2nd regen:  JWE_2(JWE_1(plaintext))                      ~1,100 bytes
After 3rd regen:  JWE_3(JWE_2(JWE_1(plaintext)))               ~1,700 bytes
...
After N regens:   JWE_N(JWE_{N-1}(...(JWE_1(plaintext))...))   ~1.33^N × initial
```
Growth is per MEK regeneration event, not per pod restart with the same MEK.
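The growth can be approximated with a simple recurrence: each new layer base64url-encodes the previous value (a 4/3 expansion) and adds a roughly constant amount for the header, encrypted key, IV, tag, and separators. The sketch below is a rough model, not a measurement. The 223-byte overhead is an assumption inferred from the differences between successive sizes in the reproducer output in this issue, and `next_size` and `regens_until_exceeds` are hypothetical helper names.

```python
FIXED_OVERHEAD = 223  # ASSUMPTION: per-layer bytes for header, key, IV, tag, dots


def next_size(size, overhead=FIXED_OVERHEAD):
    """Estimated length after wrapping a `size`-byte value in one more JWE layer."""
    b64_len = (4 * size + 2) // 3  # unpadded base64url length, i.e. ceil(4n/3)
    return b64_len + overhead


def regens_until_exceeds(initial, limit, overhead=FIXED_OVERHEAD):
    """Count wrapping steps until the estimated size exceeds `limit` bytes."""
    size, count = initial, 0
    while size <= limit:
        size = next_size(size, overhead)
        count += 1
    return count


if __name__ == "__main__":
    size = 573  # size after the first re-wrap in the reproducer output
    for i in range(2, 6):
        size = next_size(size)
        print(f"estimated size after regen {i}: {size} bytes")
    print("wraps to exceed 1 MiB starting from 490 B:",
          regens_until_exceeds(490, 1024 * 1024))
```

Under these assumptions the estimate reproduces the 987, 1539, and 2275 bytes reported by the reproducer and drifts by a few bytes at later steps, so it is suitable for order-of-magnitude estimates only.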
Affected configs
Three values in the configs PostgreSQL table contain pydantic.SecretStr fields:
| Config key | Nested SecretStr field | Healthy size |
|---|---|---|
| `service_auth` | `keys.<kid>.private_key` | ~2,900 B |
| `backend_images` | `credential.auth` | ~490 B |
| `workflow_data` | `credential.access_key` | ~460 B |
Failure modes
- JWT auth failure: `service_auth.keys.<kid>.private_key` becomes a JWE string instead of a JWK JSON object. `jwcrypto.jwk.JWK.from_json()` fails with `JSONDecodeError`. Result: `POST /api/auth/jwt/access_token` returns 500.
- K8s Secret size limit: OSMO syncs `backend_images` to a K8s Secret. After enough MEK regenerations, the value exceeds the 1 MiB K8s Secret limit. Result: `422 Unprocessable Entity` on Secret creation, service CrashLoopBackOff.
- Workflow credential parsing: `workflow_data.credential` becomes a JWE string instead of the expected `{endpoint, access_key_id, access_key}` dict. Result: `CreateGroup` fails with `OSMOServerError: Workflow data credential is not set`.
Trigger conditions
The MEK ConfigMap (mek-config) is regenerated when:
- ArgoCD prunes/recreates the ConfigMap (e.g., when the `osmo-init` Application is deleted and recreated)
- Helm uninstall + reinstall
- Namespace cleanup or manual ConfigMap deletion
- Any operation that deletes the ConfigMap while the Helm `pre-install` hook is active
Reproducer
Minimal Python script using only jwcrypto:
```python
from jwcrypto import jwe, jwk
from jwcrypto.common import json_encode

ALG, ENC = 'A256GCMKW', 'A256GCM'


def make_mek():
    """Simulate MEK generation: always kid='key1', random key material."""
    return jwk.JWK.generate(kty='oct', size=256, kid='key1')


def encrypt(plaintext, mek):
    token = jwe.JWE(plaintext.encode(), json_encode({'alg': ALG, 'enc': ENC, 'kid': mek.key_id}))
    token.add_recipient(mek)
    return token.serialize(True)


def decrypt_or_reencrypt(value, mek):
    """Simulates the logic at postgres.py L2685-2701."""
    token = jwe.JWE()
    try:
        token.deserialize(value)
        token.decrypt(mek)
        return token.payload.decode(), value  # success
    except Exception:
        # BUG: re-encrypts the JWE string as if it were plaintext
        return value, encrypt(value, mek)


# Simulate MEK regeneration cycle
secret = '{"kty":"RSA","n":"abc...","d":"xyz..."}'  # ~40 bytes
mek_0 = make_mek()
encrypted = encrypt(secret, mek_0)
print(f"Initial: {len(encrypted)} bytes")

for i in range(1, 8):
    mek_new = make_mek()  # new key material, same kid="key1"
    _, encrypted = decrypt_or_reencrypt(encrypted, mek_new)
    print(f"After MEK regen {i}: {len(encrypted)} bytes")
```

Output:
```text
Initial: 262 bytes
After MEK regen 1: 573 bytes
After MEK regen 2: 987 bytes
After MEK regen 3: 1539 bytes
After MEK regen 4: 2275 bytes
After MEK regen 5: 3259 bytes
After MEK regen 6: 4571 bytes
After MEK regen 7: 6319 bytes
```
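The per-step growth factor implied by this output can be computed directly. In the short snippet below, the sizes are copied from the output above and nothing else is assumed. The factor starts above 2 for the small initial value and falls toward the 4/3 base64 expansion factor as the fixed per-layer overhead matters less.

```python
# Sizes in bytes, copied from the reproducer output.
sizes = [262, 573, 987, 1539, 2275, 3259, 4571, 6319]

# Growth factor for each MEK regeneration: size after / size before.
ratios = [after / before for before, after in zip(sizes, sizes[1:])]

for step, ratio in enumerate(ratios, start=1):
    print(f"regen {step}: x{ratio:.3f}")
```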
Proposed fix
Two complementary changes:
Fix 1: Don't re-wrap existing JWE (postgres.py L2697):

```python
except (JWException, osmo_errors.OSMONotFoundError):
    if secret.startswith("eyJ"):  # already a JWE compact serialization
        logging.error(
            "Cannot decrypt config key '%s': MEK key material mismatch. "
            "Value will remain encrypted with the previous key.", top_level_key)
        return secret, None  # preserve as-is, don't double-wrap
    # Genuinely unencrypted plaintext: encrypt it
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value
```

Fix 2: Unique kid per MEK generation (mek-configmap.yaml L57):

```sh
KID=$(openssl rand -hex 8)
JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"'$KID'","kty":"oct"}'
```

Fix 1 prevents the damage (defense in depth). Fix 2 makes the key mismatch detectable: get_mek(old_kid) would raise OSMONotFoundError instead of silently returning the wrong key.
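One caveat on Fix 1: a prefix test for "eyJ" also matches any base64url-encoded JSON object, for example a signed JWT stored as a secret. A somewhat stricter shape check is sketched below as one possible alternative. `looks_like_compact_jwe` is a hypothetical helper, not part of OSMO or jwcrypto, and it only inspects the structure of the value; it cannot tell which key encrypted it.

```python
import base64
import json


def looks_like_compact_jwe(value):
    """Heuristic: True if `value` has the shape of a JWE compact serialization.

    Checks for five dot-separated parts whose first part base64url-decodes
    to a JSON object containing both "alg" and "enc".
    """
    parts = value.split(".")
    if len(parts) != 5:
        return False
    header_b64 = parts[0]
    # Compact serializations omit base64 padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    try:
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers binascii.Error, JSON and UTF-8 decode errors
        return False
    return isinstance(header, dict) and "alg" in header and "enc" in header


if __name__ == "__main__":
    jwe_header = base64.urlsafe_b64encode(
        b'{"alg":"A256GCMKW","enc":"A256GCM","kid":"key1"}'
    ).rstrip(b"=").decode()
    print(looks_like_compact_jwe(jwe_header + ".a.b.c.d"))          # JWE-shaped
    print(looks_like_compact_jwe("eyJhbGciOiJIUzI1NiJ9.e30.sig"))   # 3-part JWS
    print(looks_like_compact_jwe('{"kty":"RSA"}'))                  # plain JSON
```

The three-part JWS example starts with "eyJ" but is rejected by the part count, which is the case the plain prefix test would misclassify.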
Environment
- OSMO version: 6.2-rc6 (also reproducible on `main` @ d1e8a17)
- Deployment: EKS 1.31, ArgoCD-managed
- Database: Aurora PostgreSQL Serverless v2