MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat #731

@KeitaW

Summary

When the MEK (Master Encryption Key) ConfigMap is regenerated with new key material — which happens on Helm reinstall, ArgoCD prune/recreate, or namespace cleanup — the config deserialization logic in postgres.py wraps already-encrypted JWE values in an additional JWE layer instead of detecting the key mismatch. This causes config values to grow ~1.33× per MEK regeneration (faster for small values due to fixed JWE overhead), eventually breaking JWT auth, workflow submission, and K8s Secret creation.

Affected code

Two interacting issues:

1. Static kid in MEK generation (mek-configmap.yaml L57):

JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"key1","kty":"oct"}'

Every MEK regeneration produces a key with kid="key1". When get_mek("key1") is called, it returns the current (new) key — the kid matches, so no OSMONotFoundError is raised, but the key material is wrong.

2. Except handler treats undecryptable JWE as plaintext (postgres.py L2697–2701):

except (JWException, osmo_errors.OSMONotFoundError):
    # Encrypt the plain text secret
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value

When deserialize() succeeds (the value is structurally valid JWE) but decrypt() fails (wrong key material), the except handler does not distinguish between "genuinely unencrypted plaintext" and "JWE encrypted with a lost key". It re-encrypts the entire JWE string, producing JWE_new(JWE_old(plaintext)).

The same pattern exists in decrypt_credential() at L624–626.

Why get_uek() doesn't catch it — When uid='' (which is the case for config secrets), get_uek() returns the MEK directly. Since the kid is always "key1" and get_mek("key1") returns the current (new) MEK, the key material mismatch is invisible.

Growth cycle

Initial:         plaintext                                        ~490 bytes
After MEK regen: JWE_1(plaintext)                                 ~683 bytes
After 2nd regen: JWE_2(JWE_1(plaintext))                        ~1,100 bytes
After 3rd regen: JWE_3(JWE_2(JWE_1(plaintext)))                 ~1,700 bytes
...
After N regens:  JWE_N(JWE_{N-1}(...(JWE_1(plaintext))...))     ~1.33^N × initial

Growth is per MEK regeneration event, not per pod restart with the same MEK.
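
The growth per regeneration can be approximated with a simple recurrence: each wrap base64url-encodes the previous value (a 4/3 expansion) and adds a fixed amount for the header, encrypted key, IV, tag, and separators. The sketch below is a rough model rather than a measurement. The 223-byte overhead is an assumption chosen so that the model matches the first steps of the reproducer output in this issue (262 → 573 → 987), and the function names are hypothetical.

```python
# Rough size model for repeated JWE wrapping (an assumption, not measured OSMO behavior).
ASSUMED_OVERHEAD = 223  # header + encrypted key + IV + tag + separators (assumed value)


def wrapped_size(size: int, overhead: int = ASSUMED_OVERHEAD) -> int:
    """Estimated compact JWE size after wrapping a value of `size` bytes."""
    b64_len = -(-4 * size // 3)  # unpadded base64url length, i.e. ceil(4 * size / 3)
    return b64_len + overhead


def regens_until(start: int, limit: int) -> int:
    """Number of wraps before the estimated size exceeds `limit` bytes."""
    count, size = 0, start
    while size <= limit:
        size = wrapped_size(size)
        count += 1
    return count


if __name__ == "__main__":
    size = 262
    for i in range(1, 4):
        size = wrapped_size(size)
        print(f"After wrap {i}: ~{size} bytes")
    # 1 MiB is the Kubernetes Secret size limit referenced under "Failure modes".
    print("Wraps to exceed 1 MiB from 490 B:", regens_until(490, 1024 * 1024))
```

Because the fixed overhead dominates for small inputs, the per-step ratio starts well above 4/3 and approaches it only as the value grows.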

Affected configs

Three values in the configs PostgreSQL table contain pydantic.SecretStr fields:

Config key       Nested SecretStr field     Healthy size
service_auth     keys.<kid>.private_key     ~2,900 B
backend_images   credential.auth            ~490 B
workflow_data    credential.access_key      ~460 B

Failure modes

  1. JWT auth failure: service_auth.keys.<kid>.private_key becomes a JWE string instead of a JWK JSON object. jwcrypto.jwk.JWK.from_json() fails with JSONDecodeError. Result: POST /api/auth/jwt/access_token returns 500.

  2. K8s Secret size limit: OSMO syncs backend_images to a K8s Secret. After enough MEK regenerations, the value exceeds the 1 MB K8s Secret limit. Result: 422 Unprocessable Entity on Secret creation, service CrashLoopBackOff.

  3. Workflow credential parsing: workflow_data.credential becomes a JWE string instead of the expected {endpoint, access_key_id, access_key} dict. Result: CreateGroup fails with OSMOServerError: Workflow data credential is not set.

Trigger conditions

The MEK ConfigMap (mek-config) is regenerated when:

  • ArgoCD prunes/recreates the ConfigMap (e.g., when osmo-init Application is deleted and recreated)
  • Helm uninstall + reinstall
  • Namespace cleanup or manual ConfigMap deletion
  • Any operation that deletes the ConfigMap while the Helm pre-install hook is active

Reproducer

Minimal Python script using only jwcrypto:

from jwcrypto import jwe, jwk
from jwcrypto.common import json_encode

ALG, ENC = 'A256GCMKW', 'A256GCM'

def make_mek():
    """Simulate MEK generation — always kid='key1', random key material."""
    return jwk.JWK.generate(kty='oct', size=256, kid='key1')

def encrypt(plaintext, mek):
    token = jwe.JWE(plaintext.encode(), json_encode({'alg': ALG, 'enc': ENC, 'kid': mek.key_id}))
    token.add_recipient(mek)
    return token.serialize(True)

def decrypt_or_reencrypt(value, mek):
    """Simulates the logic at postgres.py L2685-2701."""
    token = jwe.JWE()
    try:
        token.deserialize(value)
        token.decrypt(mek)
        return token.payload.decode(), value  # success
    except Exception:
        # BUG: re-encrypts the JWE string as if it were plaintext
        return value, encrypt(value, mek)

# Simulate MEK regeneration cycle
secret = '{"kty":"RSA","n":"abc...","d":"xyz..."}'  # ~40 bytes
mek_0 = make_mek()
encrypted = encrypt(secret, mek_0)
print(f"Initial: {len(encrypted)} bytes")

for i in range(1, 8):
    mek_new = make_mek()  # new key material, same kid="key1"
    _, encrypted = decrypt_or_reencrypt(encrypted, mek_new)
    print(f"After MEK regen {i}: {len(encrypted)} bytes")

Output:

Initial: 262 bytes
After MEK regen 1: 573 bytes
After MEK regen 2: 987 bytes
After MEK regen 3: 1539 bytes
After MEK regen 4: 2275 bytes
After MEK regen 5: 3259 bytes
After MEK regen 6: 4571 bytes
After MEK regen 7: 6319 bytes

Proposed fix

Two complementary changes:

Fix 1 — Don't re-wrap existing JWE (postgres.py L2697):

except (JWException, osmo_errors.OSMONotFoundError):
    if secret.startswith("eyJ"):  # already a JWE compact serialization
        logging.error(
            "Cannot decrypt config key '%s': MEK key material mismatch. "
            "Value will remain encrypted with the previous key.", top_level_key)
        return secret, None  # preserve as-is, don't double-wrap
    # Genuinely unencrypted plaintext — encrypt it
    encrypted = postgres.secret_manager.encrypt(secret, '')
    encrypt_keys.add(top_level_key)
    return secret, encrypted.value
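
The startswith("eyJ") test is a heuristic: signed JWTs in compact form usually start with the same characters, since they also begin with a base64url-encoded JSON header. If a stricter check is wanted, one option is to verify the five-segment compact JWE shape and the presence of alg and enc in the protected header. The helper below is a hypothetical sketch using only the standard library; looks_like_compact_jwe is not an existing OSMO or jwcrypto function.

```python
import base64
import json


def looks_like_compact_jwe(value: str) -> bool:
    """Return True if `value` has the shape of a JWE compact serialization.

    This checks structure only; it does not attempt decryption.
    """
    parts = value.split(".")
    if len(parts) != 5:  # header.encrypted_key.iv.ciphertext.tag
        return False
    header_b64 = parts[0]
    try:
        # base64url segments are unpadded, so restore padding before decoding.
        padded = header_b64 + "=" * (-len(header_b64) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers binascii.Error, JSONDecodeError, UnicodeDecodeError
        return False
    return isinstance(header, dict) and "alg" in header and "enc" in header


if __name__ == "__main__":
    hdr = base64.urlsafe_b64encode(
        json.dumps({"alg": "A256GCMKW", "enc": "A256GCM", "kid": "key1"}).encode()
    ).rstrip(b"=").decode()
    print(looks_like_compact_jwe(f"{hdr}.a.b.c.d"))       # JWE-shaped token
    print(looks_like_compact_jwe('{"kty":"RSA"}'))         # plaintext JSON secret
    print(looks_like_compact_jwe(f"{hdr}.e30.signature"))  # three segments, JWT-shaped
```

With a check like this, the except handler could branch on looks_like_compact_jwe(secret) instead of the prefix test, at the cost of a little more parsing per failed decrypt.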

Fix 2 — Unique kid per MEK generation (mek-configmap.yaml L57):

KID=$(openssl rand -hex 8)
JWK_JSON='{"k":"'$RANDOM_KEY'","kid":"'$KID'","kty":"oct"}'

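Fix 1 prevents the damage (defense in depth). Fix 2 makes the key mismatch detectable: get_mek(old_kid) would raise OSMONotFoundError instead of silently returning the wrong key. Because the existing except handler also catches OSMONotFoundError, Fix 2 on its own would still reach the re-encryption branch, so it should be applied together with Fix 1.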
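
The effect of the kid change can be illustrated with a toy key store. This is a simplified sketch, not OSMO code: get_mek here is a dictionary lookup standing in for the real function, KeyNotFound stands in for osmo_errors.OSMONotFoundError, and the key records are plain dicts rather than jwcrypto keys.

```python
import secrets


class KeyNotFound(LookupError):
    """Stand-in for osmo_errors.OSMONotFoundError (hypothetical)."""


def generate_mek(kid=None):
    """Create a toy key record; a random kid mirrors `openssl rand -hex 8`."""
    return {
        "kty": "oct",
        "kid": kid or secrets.token_hex(8),
        "k": secrets.token_urlsafe(32),
    }


def get_mek(store, kid):
    """Look up a key by kid (simplified stand-in for the real get_mek())."""
    try:
        return store[kid]
    except KeyError:
        raise KeyNotFound(kid) from None


def regenerate(static_kid: bool):
    """Return (old_key, store_after_regen) for one simulated ConfigMap regeneration."""
    kid = "key1" if static_kid else None
    old = generate_mek(kid)
    new = generate_mek(kid)
    return old, {new["kid"]: new}  # the store only holds the new key


if __name__ == "__main__":
    old, store = regenerate(static_kid=True)
    found = get_mek(store, old["kid"])
    print("static kid: lookup succeeds, same material?", found["k"] == old["k"])

    old, store = regenerate(static_kid=False)
    try:
        get_mek(store, old["kid"])
    except KeyNotFound:
        print("unique kid: old kid not found, mismatch is detectable")
```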

Environment

  • OSMO version: 6.2-rc6 (also reproducible on main @ d1e8a17)
  • Deployment: EKS 1.31, ArgoCD-managed
  • Database: Aurora PostgreSQL Serverless v2
