fix: prevent MEK re-encryption from double-wrapping JWE config values #732
KeitaW wants to merge 5 commits into NVIDIA:main
Conversation
When the MEK ConfigMap is regenerated with new key material (e.g. via
ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the
_decrypt() except handler in postgres.py treats undecryptable JWE
ciphertext as "plaintext" and wraps it in another JWE layer. This
causes config values to grow ~1.33x per MEK regeneration, eventually
breaking JWT auth (500), workflow submission, and K8s Secret creation
(422 when backend_images exceeds the 1MB limit).
Root cause: two interacting issues:
1. Static kid="key1" in MEK generation makes key material mismatches
invisible — get_mek("key1") returns the new (wrong) key without error
2. The except handler at postgres.py L2697 does not distinguish between
genuinely unencrypted plaintext and JWE encrypted with a lost key
Fix:
- Add _is_jwe_compact() heuristic (eyJ prefix + 4 dots = 5 JWE segments)
to detect existing JWE before the re-encryption path
- Raise OSMOServerError on MEK mismatch instead of silently double-wrapping
or leaking ciphertext to application code
- Generate unique kid per MEK (key-<random hex>) in mek-configmap.yaml,
deploy-k8s.sh, and start_service_kind.py
- Add existence check in deploy-k8s.sh to skip MEK generation if ConfigMap
already exists
Fixes: NVIDIA#731
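As a sketch of the heuristic described above (a standalone illustration of the described check, not the patch itself):

```python
def is_jwe_compact(value: str) -> bool:
    # JWE compact serialization has 5 base64url segments separated by 4 dots:
    # header.encrypted_key.iv.ciphertext.tag. The base64url encoding of a
    # JSON header ('{"...') always starts with 'eyJ'.
    return value.startswith('eyJ') and value.count('.') == 4

# A JWS/JWT has only 3 segments (2 dots), so it is not mistaken for JWE.
print(is_jwe_compact('eyJhbGciOiJBMjU2S1cifQ.aa.bb.cc.dd'))  # True
print(is_jwe_compact('eyJhbGciOiJIUzI1NiJ9.payload.sig'))    # False
print(is_jwe_compact('plain config value'))                  # False
```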
c02821e to c77a4e0
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: The PR gives each MEK generation a unique per-generation `kid`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 warnings
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/utils/connectors/postgres.py (1)
637-645: Preserve the original decrypt failure when rethrowing. Both blocks replace `JWException`/`OSMONotFoundError` with `OSMOServerError` but drop the original cause. Capture the exception and re-raise with `from error` so incident logs still show whether this was a missing `kid` or an unwrap/decrypt failure.

Minimal change:

```diff
-        except (JWException, osmo_errors.OSMONotFoundError):
+        except (JWException, osmo_errors.OSMONotFoundError) as error:
             if self._is_jwe_compact(value):
                 raise osmo_errors.OSMOServerError(
                     ...
-                )
+                ) from error
```

Apply the same pattern in `_decrypt()`.

Also applies to: 2718-2734
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/connectors/postgres.py` around lines 637 - 645, The exception handlers that catch (JWException, osmo_errors.OSMONotFoundError) currently raise a new OSMOServerError but drop the original exception; update both spots (the handler inside the credential decrypt logic and the separate _decrypt() method) to capture the caught exception (e.g., "except (JWException, osmo_errors.OSMONotFoundError) as err") and re-raise OSMOServerError using "from err" so the original cause (missing kid vs unwrap/decrypt failure) is preserved in tracebacks.
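The `from err` chaining this prompt asks for can be sketched as follows (the exception classes here are local stand-ins, not the project's real `osmo_errors` imports):

```python
class OSMOServerError(Exception):
    """Local stand-in for osmo_errors.OSMOServerError."""

def decrypt(value: str) -> str:
    try:
        raise KeyError('kid key-deadbeef not found')  # simulated MEK mismatch
    except KeyError as err:
        # 'from err' records the original failure as __cause__, so tracebacks
        # still show whether this was a missing kid or an unwrap failure.
        raise OSMOServerError('MEK mismatch; restore the original key') from err

try:
    decrypt('eyJ...')
except OSMOServerError as exc:
    print(type(exc.__cause__).__name__)  # KeyError
```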
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deployments/charts/quick-start/templates/mek-configmap.yaml`:
- Around line 56-59: The hook uses KID="key-$(openssl rand -hex 8)" but the
alpine/k8s:1.28.4 hook image doesn't ship openssl; either ensure openssl is
installed in the hook runtime (add an apk add openssl step in the
pre-install/pre-upgrade hook) or replace the openssl call with a POSIX-safe
alternative that exists in the image (e.g., use /dev/urandom + od/xxd or a
shell-based hex generator). Update the template line that defines KID (the
KID="key-$(openssl rand -hex 8)" expression) and the hook commands so the chosen
tool is available before KID is evaluated.
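One POSIX-safe alternative along the lines this comment suggests, sketched with `od` (no `openssl` dependency; the variable name matches the template, the exact flags are this sketch's choice):

```shell
# Generate 16 hex characters from /dev/urandom without openssl:
# -An suppresses the address column, -N8 reads 8 bytes, -tx1 prints hex bytes.
KID="key-$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
echo "$KID"
```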
In `@deployments/scripts/deploy-k8s.sh`:
- Around line 278-283: The current key-generation uses local var=$(...) which
masks command-substitution failures; change the pattern in deploy-k8s.sh so you
first declare the locals (e.g., local random_key; local kid; local jwk_json;
local encoded_jwk) and then assign using separate statements
(random_key="$(openssl rand -base64 32 | tr -d '\n')" etc.), so any failure in
openssl or base64 will cause the script to exit under set -euo pipefail; update
the assignments for random_key, kid, jwk_json and encoded_jwk accordingly and
keep the existing transformations (tr -d '\n', openssl rand -hex 8, base64).
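The declare-then-assign pattern described here can be sketched like this (using `head`/`od` in place of `openssl` purely so the sketch is self-contained; the function name is illustrative):

```shell
set -euo pipefail

generate_mek_env() {
    # 'local var=$(cmd)' would mask cmd's exit status (ShellCheck SC2155);
    # declaring first lets 'set -e' catch a failing command substitution.
    local random_key
    local kid
    random_key="$(head -c 32 /dev/urandom | base64 | tr -d '\n')"
    kid="key-$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
    printf '%s\n' "$kid"
}

generate_mek_env
```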
In `@src/utils/connectors/postgres.py`:
- Around line 606-617: The current _is_jwe_compact(value: str) only checks shape
and causes false positives; change detection to require an OSMO-specific marker:
decode the first JWE segment (header) using base64url, parse JSON, and return
True only if it contains a designated OSMO claim/header (e.g.,
"osmo_version":"v1" or "osmo_encrypted":true); if decoding/parsing fails return
False. Update decrypt_credential() and DynamicConfig.deserialize() to call this
revised checker (or a new _is_osmo_encrypted helper) so only OSMO-tagged JWEs
trigger the hard-fail decrypt path; leave other JWE-shaped/plain values to be
treated as plaintext and handled via lazy encryption.
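A marker-based check of the kind this prompt proposes could look roughly like the following (the `osmo_encrypted` header name is the prompt's own example, not a confirmed field):

```python
import base64
import json

def is_osmo_encrypted(value: str) -> bool:
    # Require an OSMO-specific header claim rather than just JWE shape,
    # to avoid false positives on values that merely look like JWE.
    parts = value.split('.')
    if len(parts) != 5:
        return False
    try:
        header_b64 = parts[0]
        padded = header_b64 + '=' * (-len(header_b64) % 4)  # restore base64 padding
        header = json.loads(base64.urlsafe_b64decode(padded))
    except (ValueError, UnicodeDecodeError):
        return False
    return isinstance(header, dict) and header.get('osmo_encrypted') is True

# Build a JWE-shaped value whose header carries the marker.
tagged = base64.urlsafe_b64encode(
    json.dumps({'alg': 'A256KW', 'osmo_encrypted': True}).encode()
).rstrip(b'=').decode()
print(is_osmo_encrypted(f'{tagged}.k.i.c.t'))               # True
print(is_osmo_encrypted('eyJhbGciOiJBMjU2S1cifQ.k.i.c.t'))  # False (no marker)
```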
---
Nitpick comments:
In `@src/utils/connectors/postgres.py`:
- Around line 637-645: The exception handlers that catch (JWException,
osmo_errors.OSMONotFoundError) currently raise a new OSMOServerError but drop
the original exception; update both spots (the handler inside the credential
decrypt logic and the separate _decrypt() method) to capture the caught
exception (e.g., "except (JWException, osmo_errors.OSMONotFoundError) as err")
and re-raise OSMOServerError using "from err" so the original cause (missing kid
vs unwrap/decrypt failure) is preserved in tracebacks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 102ee310-3a04-4416-87b7-88cbc233c314
📒 Files selected for processing (4)
- deployments/charts/quick-start/templates/mek-configmap.yaml
- deployments/scripts/deploy-k8s.sh
- run/start_service_kind.py
- src/utils/connectors/postgres.py
- Replace openssl with POSIX-safe /dev/urandom+od in Helm hook (alpine/k8s image lacks openssl)
- Separate local declaration from assignment in deploy-k8s.sh to preserve exit codes under set -euo pipefail (SC2155)
- Preserve original exception cause with 'from error' in decrypt handlers
- Document _is_jwe_compact shape-based heuristic limitation
Two improvements discovered during production testing:

1. WebSocket exception handler (service.py): The generic top_level_exception_handler returns JSONResponse on ALL exceptions, but WebSocket connections cannot receive HTTP responses. This triggers an ASGI protocol error: "Expected 'websocket.send' but got 'websocket.http.response.start'". Fix: detect WebSocket scope and close gracefully with code 1011 instead.

2. Graceful MEK mismatch (postgres.py): Changed from raising OSMOServerError (which crashes the service) to logging an error and returning an empty string. This keeps the service alive after MEK rotation: affected config fields will be empty, but the service continues to function. The previous behavior crashed the entire service on startup when ANY config had encrypted values from a previous MEK, preventing WebSocket connections from being established.

Both issues were identified via production testing on EKS where:

- The MEK ConfigMap was recreated with new key material
- Existing DB configs had values encrypted with the old MEK
- The service crashed on startup, sending the backend-listener into a WebSocket reconnect loop
- Even when the service partially started, WebSocket handler failures produced ASGI protocol errors instead of clean disconnects
🧹 Nitpick comments (1)
src/service/core/service.py (1)
235-241: Good approach for WebSocket-safe exception handling. The detection of WebSocket scope and graceful close with code 1011 is the correct pattern to avoid ASGI protocol errors. The context snippets confirm that most WebSocket endpoints have internal exception handling, so this serves as a useful safety net.

Two minor concerns:

- Silent exception swallowing (lines 239-240): per static analysis (S110/BLE001), consider logging at `DEBUG` level to aid troubleshooting when the close fails unexpectedly.
- Byte vs character truncation: `str(error)[:123]` slices by character count, but RFC 6455 limits the close reason to 123 bytes. Multi-byte UTF-8 characters could still exceed the limit. Consider encoding and truncating by bytes.

🔧 Suggested improvement

```diff
 if request.scope.get('type') == 'websocket':
     try:
         websocket = fastapi.WebSocket(request.scope, request.receive, request._send)
-        await websocket.close(code=1011, reason=str(error)[:123])
-    except Exception:
-        pass  # Connection may already be closed
+        # Truncate by bytes to respect RFC 6455 limit
+        reason = str(error).encode('utf-8')[:123].decode('utf-8', errors='ignore')
+        await websocket.close(code=1011, reason=reason)
+    except Exception as close_err:  # pylint: disable=broad-except
+        logging.debug('Failed to close WebSocket gracefully: %s', close_err)
 return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/service/core/service.py` around lines 235 - 241, Replace the silent except with a debug log and ensure the close reason is truncated to 123 bytes (not characters): when detecting WebSocket scope (request.scope.get('type') == 'websocket') keep creating fastapi.WebSocket(request.scope, request.receive, request._send) and call websocket.close(code=1011, reason=...) but build the reason by encoding str(error) to UTF-8, truncating to 123 bytes and decoding back to a valid string (e.g., trim bytes and decode with errors='ignore') so you never exceed the RFC6455 byte limit; in the except block catch Exception as exc and emit a process- or module-level logger.debug (or similar) referencing websocket.close and the exception to avoid silently swallowing failures.
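The byte-safe truncation this prompt describes can be sketched as follows (the helper name is illustrative):

```python
def safe_close_reason(error: Exception, limit: int = 123) -> str:
    # RFC 6455 caps the close-frame reason at 123 UTF-8 *bytes*; slicing by
    # characters can overflow that limit for multi-byte text. Encoding,
    # truncating, then decoding with errors='ignore' drops any character
    # that the byte cut split in half.
    return str(error).encode('utf-8')[:limit].decode('utf-8', errors='ignore')

reason = safe_close_reason(RuntimeError('é' * 200))  # 400 UTF-8 bytes in
print(len(reason.encode('utf-8')))  # stays <= 123
```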
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/service/core/service.py`:
- Around line 235-241: Replace the silent except with a debug log and ensure the
close reason is truncated to 123 bytes (not characters): when detecting
WebSocket scope (request.scope.get('type') == 'websocket') keep creating
fastapi.WebSocket(request.scope, request.receive, request._send) and call
websocket.close(code=1011, reason=...) but build the reason by encoding
str(error) to UTF-8, truncating to 123 bytes and decoding back to a valid string
(e.g., trim bytes and decode with errors='ignore') so you never exceed the
RFC6455 byte limit; in the except block catch Exception as exc and emit a
process- or module-level logger.debug (or similar) referencing websocket.close
and the exception to avoid silently swallowing failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0d7a6328-fa58-4aea-89b5-b74c0ffad8cb
📒 Files selected for processing (2)
- src/service/core/service.py
- src/utils/connectors/postgres.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/utils/connectors/postgres.py
The previous approach (return '', secret) kept the undecryptable JWE in the DB permanently. The service never self-healed because:

- _init_configs uses INSERT ON CONFLICT DO NOTHING
- The row exists (even with empty fields), so the insert is skipped
- service_auth with an empty private_key means JWT signing fails and auth returns 500

New approach: when _is_jwe_compact() detects a value encrypted with a stale MEK, DELETE the config row from the DB. On next startup, _init_configs() re-inserts the default (e.g., a fresh 4096-bit RSA keypair for service_auth) because the row is now absent. The DELETE uses the composite primary key (key, type) to avoid affecting configs of other types.

Credentials (decrypt_credential) are NOT deleted: they are user-supplied with no default_factory regeneration. The stale JWE is preserved so the user can re-enter the value after MEK rotation.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/utils/job/jobs.py (2)
1443-1467: ⚠️ Potential issue | 🔴 Critical: Inconsistent error-log key format will cause migration to fail.

Lines 1373-1377 use `get_error_log_key()`, which returns keys with hash tags (e.g., `{workflow_id}-{task_uuid}-{retry_id}-error-logs`), but line 1467 constructs error-log keys using the old format without hash tags (`f'{prefix}-error-logs'`).

This means:

- `xadd`/`expire` write to `{workflow_id}-{task_uuid}-{retry_id}-error-logs`
- `migrate_logs_concurrently` reads from `workflow_id-task_uuid-retry_id-error-logs`

The migration will read from non-existent keys and error logs will not be uploaded to S3.

🐛 Proposed fix

```diff
 if task_obj.status.has_error_logs():
-    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
     task_error_log_name = task_obj.name
     if task_obj.retry_id > 0:
         task_error_log_name += f'_{task_obj.retry_id}'
     task_error_log_name += common.ERROR_LOGS_SUFFIX_FILE_NAME
     task_parameters.append(
-        (workflow_obj.logs, f'{prefix}-error-logs', task_error_log_name))
+        (workflow_obj.logs,
+         common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id),
+         task_error_log_name))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/job/jobs.py` around lines 1443 - 1467, The error-log Redis key is constructed using the old format f'{prefix}-error-logs' when appending to task_parameters, causing a mismatch with code that writes keys via common.get_error_log_key(...); replace the hardcoded f'{prefix}-error-logs' with the canonical key by calling common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id) (or otherwise use the same helper used elsewhere) when building the tuple for error logs so xadd/expire and migrate_logs_concurrently use the same hashed-key format; adjust the tuple append in the loop that populates task_parameters accordingly.
1500-1502: ⚠️ Potential issue | 🔴 Critical: Inconsistent error-log key format will cause deletion to miss keys.

Similar to the migration issue, the deletion constructs keys using the old format without hash tags, but the data was written using `get_error_log_key()` with hash tags.

🐛 Proposed fix

```diff
 if task_obj.status.has_error_logs():
-    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
-    redis_keys_to_delete.append(f'{prefix}-error-logs')
+    redis_keys_to_delete.append(
+        common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/job/jobs.py` around lines 1500 - 1502, The deletion builds Redis error-log keys using an old prefix format while writes use get_error_log_key(...) (which includes hash tags), so replace the manual prefix-based key construction in the block where task_obj.status.has_error_logs() is checked with a call to the same key generator used for writes (i.e., call get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id) or the existing helper that accepts task_obj) and append that returned key to redis_keys_to_delete so the deletion matches the stored keys exactly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/utils/common.py`:
- Around line 1122-1124: Replace the assert validation in the function that
builds the pod-conditions key for workflow_uuid: instead of using "assert
len(workflow_uuid) == 32 and all(...)" raise a ValueError with a clear message
when workflow_uuid is not 32 hex characters (e.g., check length and hex chars
and raise ValueError('Input must be workflow UUID (32 hex characters)') ), then
return the formatted string f'{{{workflow_uuid}}}-pod-conditions'; locate the
check around the variable workflow_uuid in common.py and update it to use
explicit validation + ValueError rather than assert.
- Around line 1127-1144: Multiple places build Redis log keys with hardcoded
formats and missing hash tags, causing mismatch with the new helpers; update
those callers to use get_workflow_log_key(workflow_id) and
get_error_log_key(workflow_id, task_uuid, retry_id). Specifically, in the reader
endpoint (where redis_name = f'{name}-logs') replace the hardcoded construction
with get_workflow_log_key(name); in agent/objects.py (where the writer builds
f'{self._current_job.workflow.workflow_id}-{message_option.pod_log.task}-{message_option.pod_log.retry_id}-error-logs')
use get_error_log_key(self._current_job.workflow.workflow_id,
message_option.pod_log.task_uuid or the correct task UUID field,
message_option.pod_log.retry_id); and in jobs.py cleanup (where keys are built
as f'{prefix}-error-logs') call get_error_log_key(prefix_workflow_id, task_uuid,
retry_id) or get_workflow_log_key(prefix) as appropriate—ensure you pass the
workflow_id and task_uuid values the helpers expect so produced keys include the
Redis hash tag.
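The hash-tagged helpers referred to above presumably look something like this (a sketch inferred from the key formats quoted in the review, not the actual common.py source):

```python
def get_workflow_log_key(workflow_id: str) -> str:
    # The braces form a Redis Cluster hash tag: only '{workflow_id}' is
    # hashed, so every log key for one workflow maps to the same slot.
    return f'{{{workflow_id}}}-logs'

def get_error_log_key(workflow_id: str, task_uuid: str, retry_id: int) -> str:
    return f'{{{workflow_id}}}-{task_uuid}-{retry_id}-error-logs'

print(get_workflow_log_key('wf42'))          # {wf42}-logs
print(get_error_log_key('wf42', 'abcd', 1))  # {wf42}-abcd-1-error-logs
```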
---
Outside diff comments:
In `@src/utils/job/jobs.py`:
- Around line 1443-1467: The error-log Redis key is constructed using the old
format f'{prefix}-error-logs' when appending to task_parameters, causing a
mismatch with code that writes keys via common.get_error_log_key(...); replace
the hardcoded f'{prefix}-error-logs' with the canonical key by calling
common.get_error_log_key(self.workflow_id, task_obj.task_uuid,
task_obj.retry_id) (or otherwise use the same helper used elsewhere) when
building the tuple for error logs so xadd/expire and migrate_logs_concurrently
use the same hashed-key format; adjust the tuple append in the loop that
populates task_parameters accordingly.
- Around line 1500-1502: The deletion builds Redis error-log keys using an old
prefix format while writes use get_error_log_key(...) (which includes hash
tags), so replace the manual prefix-based key construction in the block where
task_obj.status.has_error_logs() is checked with a call to the same key
generator used for writes (i.e., call get_error_log_key(self.workflow_id,
task_obj.task_uuid, task_obj.retry_id) or the existing helper that accepts
task_obj) and append that returned key to redis_keys_to_delete so the deletion
matches the stored keys exactly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3345a590-82ce-442f-bea2-1b0284726fac
📒 Files selected for processing (3)
- src/lib/utils/common.py
- src/service/logger/ctrl_websocket.py
- src/utils/job/jobs.py
54d123e to 46f557e
Diff context from the deploy script:

```shell
local encoded_jwk=$(echo -n "$jwk_json" | base64 | tr -d '\n')

local mek_manifest="apiVersion: v1

# Generate and create MEK (skip if already exists to avoid key material mismatch)
```
Take a look at deploy_service.rst to see if there are updates to the commands there
Updated both docs/deployment_guide/getting_started/deploy_service.rst and docs/deployment_guide/appendix/deploy_minimal.rst to match the deploy script behavior:

- Replaced all hardcoded `"kid":"key1"` with dynamic `KID="key-$(openssl rand -hex 8)"` generation
- Updated ConfigMap YAML examples to use the `$KID` variable
- Added a `.. note::` explaining why unique key IDs matter (detecting key material mismatches)
- Added a `.. tip::` warning not to re-create the MEK ConfigMap if it already exists

See commit ebd427b.
Diff context from src/utils/connectors/postgres.py:

```python
# Must include type in WHERE clause — configs PK is (key, type).
config_type = dynamic_config.get_type().value
for key in delete_keys:
    cmd = 'DELETE FROM configs WHERE key = %s AND type = %s;'
```
Consider batching the keys into one SQL call, instead of making a call per key
Batched into a single `ANY(%s)` call, matching the existing pattern in `_batch_fetch_external_roles`:

```python
if delete_keys:
    config_type = dynamic_config.get_type().value
    cmd = 'DELETE FROM configs WHERE key = ANY(%s) AND type = %s;'
    postgres.execute_commit_command(cmd, (list(delete_keys), config_type))
```

The `encrypt_keys` UPDATE loop above is left as-is since each key has a different `old_value` for optimistic concurrency, making it unsuitable for simple batching.

See commit ebd427b.
Address ethany-nv's review comments on PR NVIDIA#732:

1. Batch the per-key DELETE loop into a single `ANY(%s)` SQL call, matching the existing pattern in `_batch_fetch_external_roles`. This reduces DB round-trips and makes the operation atomic.
2. Update deployment docs (`deploy_service.rst`, `deploy_minimal.rst`) to use dynamic `kid` generation instead of the hardcoded `key1`, matching the deploy script and Helm chart. Added notes explaining why unique key IDs matter and a tip about MEK idempotency.
Fixes #731
Summary
When the MEK ConfigMap is regenerated with new key material (via ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the `_decrypt()` except handler in `postgres.py` treats undecryptable JWE ciphertext as "plaintext" and wraps it in another JWE layer. This causes config values to grow ~1.33× per MEK regeneration, eventually breaking JWT auth (500), workflow submission, and K8s Secret creation (422 when >1MB).

Changes
1. Detect JWE before re-encrypting (`postgres.py`)

Added a `_is_jwe_compact()` heuristic that checks for an `eyJ` prefix plus exactly 4 dots (5 JWE compact segments). This distinguishes JWE from JWS/JWT (3 dots) and plain JSON/strings (0 dots).

When a MEK mismatch is detected, the fix raises `OSMOServerError` with a clear recovery message instead of silently double-wrapping or leaking ciphertext to application code.

Both affected sites are fixed:

- `_decrypt()` L2697–2701 (config deserialization)
- `decrypt_credential()` L624–626 (credential storage)

2. Unique `kid` per MEK generation

Changed the static `kid="key1"` to `kid="key-$(openssl rand -hex 8)"` in:

- `mek-configmap.yaml` L57 (Helm hook)
- `deploy-k8s.sh` L276 (deploy script)
- `start_service_kind.py` L137 (local dev)

This makes key material mismatches detectable: `get_mek(old_kid)` raises `OSMONotFoundError` instead of silently returning the wrong key.

3. Existence check in `deploy-k8s.sh`

Added a `kubectl get configmap mek-config` check before generation to prevent `kubectl apply` from overwriting an existing MEK on re-runs.
Tested the core fix logic using the reproducer from #731.
Also validated on a live EKS 1.31 cluster with OSMO v6.2-rc6 + ArgoCD where the bug was originally discovered.
Backward Compatibility
- Existing MEKs with `kid="key1"` continue to work; the fix only affects new MEK generation
- The `_is_jwe_compact()` check only triggers in the except handler, so the normal encryption/decryption path is unchanged