fix: prevent MEK re-encryption from double-wrapping JWE config values#732

Open
KeitaW wants to merge 5 commits into NVIDIA:main from KeitaW:fix/mek-reencryption-bloat

Conversation


@KeitaW KeitaW commented Mar 23, 2026

Fixes #731

Summary

When the MEK ConfigMap is regenerated with new key material (via ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the _decrypt() except handler in postgres.py treats undecryptable JWE ciphertext as "plaintext" and wraps it in another JWE layer. This causes config values to grow ~1.33× per MEK regeneration, eventually breaking JWT auth (500), workflow submission, and K8s Secret creation (422 when >1MB).
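A back-of-the-envelope model shows where the growth comes from: each extra JWE layer base64url-encodes the previous ciphertext (a 4/3 inflation) plus a roughly fixed per-token cost. This sketch is illustrative only; the 223-byte overhead constant is fitted to the sizes measured in #731, not derived from the JWE spec.

```python
import math

def wrapped_size(n: int, overhead: int = 223) -> int:
    # One extra JWE layer base64url-encodes the previous ciphertext
    # (a 4/3 inflation) and adds a roughly fixed header/IV/tag cost.
    # The 223-byte overhead is fitted to the measurements in #731.
    return overhead + math.ceil(n * 4 / 3)

size = 267
for i in range(1, 6):
    size = wrapped_size(size)
    print(f'After regen {i}: {size} bytes')
```

With that fitted constant, the model reproduces the measured sequence 579, 995, 1550, 2290, 3277 exactly, converging toward the ~1.33× per-regeneration factor as the fixed overhead becomes negligible.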

Changes

1. Detect JWE before re-encrypting (postgres.py)

Added _is_jwe_compact() heuristic that checks for eyJ prefix + exactly 4 dots (5 JWE compact segments). This distinguishes JWE from JWS/JWT (3 dots) and plain JSON/strings (0 dots).

When a MEK mismatch is detected, the fix raises OSMOServerError with a clear recovery message instead of:

  • Double-wrapping the value (the original bug)
  • Silently returning ciphertext to application code (which would cause downstream failures)

Both affected sites (the credential decrypt handler and the _decrypt() method) are fixed.

2. Unique kid per MEK generation

Changed the static kid="key1" to kid="key-$(openssl rand -hex 8)" in mek-configmap.yaml, deploy-k8s.sh, and start_service_kind.py.

This makes key material mismatches detectable: get_mek(old_kid) raises OSMONotFoundError instead of silently returning the wrong key.
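The mechanism can be sketched in a few lines of Python (OSMONotFoundError is stood in by a local class, and get_mek is simplified to a dict lookup; both are illustrative, not the real implementation):

```python
import secrets

class OSMONotFoundError(LookupError):
    """Stand-in for osmo_errors.OSMONotFoundError (illustrative only)."""

def new_kid() -> str:
    # Python counterpart of kid="key-$(openssl rand -hex 8)".
    return f'key-{secrets.token_hex(8)}'

def get_mek(jwks: dict, kid: str) -> bytes:
    # With the old static kid="key1", a regenerated JWK set still
    # contained "key1", so this lookup silently returned the wrong
    # key. Unique per-generation kids turn that into a loud failure.
    try:
        return jwks[kid]
    except KeyError as err:
        raise OSMONotFoundError(f'MEK with kid={kid!r} not found') from err
```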

3. Existence check in deploy-k8s.sh

Added kubectl get configmap mek-config check before generation to prevent kubectl apply from overwriting an existing MEK on re-runs.

Validation

Tested the core fix logic using the reproducer from #731:

=== Old behavior (bug) ===
Initial: 267 bytes
After regen 1: 579 bytes
After regen 2: 995 bytes
After regen 3: 1550 bytes
After regen 4: 2290 bytes
After regen 5: 3277 bytes

=== New behavior (fix) ===
Initial: 267 bytes
After regen 1: DETECTED MEK mismatch — raising OSMOServerError (size unchanged: 267 bytes)

=== Unique kid detection ===
Encrypted with kid=key-abc123
Current MEK kid=key-def456
Kid mismatch detectable: True

Also validated on a live EKS 1.31 cluster with OSMO v6.2-rc6 + ArgoCD where the bug was originally discovered.

Backward Compatibility

  • Existing deployments with kid="key1" continue to work — the fix only affects new MEK generation
  • Same-key pod restarts (no MEK ConfigMap deletion) are unaffected — the correct key is used
  • The _is_jwe_compact() check only triggers in the except handler, so the normal encryption/decryption path is unchanged

Summary by CodeRabbit

  • Bug Fixes

    • Detect and clear credentials that fail to decrypt due to key mismatches to prevent silent re-encryption and config growth
    • WebSocket errors now close connections with an appropriate error code instead of returning an HTTP error response
  • New Features

    • New helpers and key formats for Redis workflow and error log stream names
  • Chores

    • Deployment now skips regenerating encryption key config if present
    • Key identifiers are now generated uniquely per instance/run

@KeitaW KeitaW requested a review from a team as a code owner March 23, 2026 00:56
When the MEK ConfigMap is regenerated with new key material (e.g. via
ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the
_decrypt() except handler in postgres.py treats undecryptable JWE
ciphertext as "plaintext" and wraps it in another JWE layer. This
causes config values to grow ~1.33x per MEK regeneration, eventually
breaking JWT auth (500), workflow submission, and K8s Secret creation
(422 when backend_images exceeds the 1MB limit).

Root cause: two interacting issues:
1. Static kid="key1" in MEK generation makes key material mismatches
   invisible — get_mek("key1") returns the new (wrong) key without error
2. The except handler at postgres.py L2697 does not distinguish between
   genuinely unencrypted plaintext and JWE encrypted with a lost key

Fix:
- Add _is_jwe_compact() heuristic (eyJ prefix + 4 dots = 5 JWE segments)
  to detect existing JWE before the re-encryption path
- Raise OSMOServerError on MEK mismatch instead of silently double-wrapping
  or leaking ciphertext to application code
- Generate unique kid per MEK (key-<random hex>) in mek-configmap.yaml,
  deploy-k8s.sh, and start_service_kind.py
- Add existence check in deploy-k8s.sh to skip MEK generation if ConfigMap
  already exists

Fixes: NVIDIA#731
@KeitaW KeitaW force-pushed the fix/mek-reencryption-bloat branch from c02821e to c77a4e0 on March 23, 2026 01:07

coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The PR gives each MEK generation a unique per-generation kid, avoids overwriting an existing MEK ConfigMap during deploy, detects JWE compact ciphertext on decrypt failures to avoid double-wrapping, deletes stale config rows for undecryptable secrets, updates Redis log key formatting, and closes WebSocket connections with code 1011 on unhandled exceptions.

Changes

  • MEK generation & deploy (deployments/charts/quick-start/templates/mek-configmap.yaml, deployments/scripts/deploy-k8s.sh, run/start_service_kind.py): Generate a per-generation kid (key-<hex>) in JWKs and use it as the currentMek/map key; the deployment script skips generating/applying if mek-config exists; the local dev helper emits a dynamic kid.
  • DB connector, JWE handling (src/utils/connectors/postgres.py): Added PostgresConnector._is_jwe_compact() heuristic; on JWException/OSMONotFoundError for JWE-like values, log the MEK mismatch, preserve the ciphertext (avoid re-encrypting), mark keys for deletion, and delete stale configs rows so values can be regenerated later.
  • Web error handling, WebSockets (src/service/core/service.py): top_level_exception_handler detects WebSocket scopes and closes the WebSocket with code 1011 and a trimmed reason instead of returning an HTTP JSONResponse.
  • Redis key naming & log helpers (src/lib/utils/common.py, src/service/logger/ctrl_websocket.py, src/utils/job/jobs.py): Wrap workflow IDs in Redis hash tags for keys; added helpers get_workflow_log_key and get_error_log_key; updated usages across the websocket logger and job workflows to use the new helpers for workflow and error log streams.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through keys both old and new,
I named each MEK with a fresh-born hue.
I sniff the JWE and warn with care,
I leave the bytes when keys aren't there,
No double-wrap — a lighter share. 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Out of Scope Changes check (⚠️ Warning): Redis key hash-tag changes in common.py, ctrl_websocket.py, and jobs.py are related to preventing config bloat and ensuring safe recovery, but are not directly tied to the #731 MEK re-encryption bug fix. Resolution: clarify whether the Redis hash-tag changes are necessary for the MEK fix or should be split into a separate PR, and document their relationship to the core fix if they are essential.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately describes the main change: preventing MEK re-encryption from double-wrapping JWE config values.
  • Linked Issues check (✅ Passed): All PR objectives from issue #731 are met: unique kid per MEK generation, JWE detection heuristic, double-wrapping prevention, ConfigMap existence check, and Redis key hash-tag improvements.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/utils/connectors/postgres.py (1)

637-645: Preserve the original decrypt failure when rethrowing.

Both blocks replace JWException / OSMONotFoundError with OSMOServerError but drop the original cause. Capture the exception and re-raise with from error so incident logs still show whether this was a missing kid or an unwrap/decrypt failure.

Minimal change
-            except (JWException, osmo_errors.OSMONotFoundError):
+            except (JWException, osmo_errors.OSMONotFoundError) as error:
                 if self._is_jwe_compact(value):
                     raise osmo_errors.OSMOServerError(
                         ...
-                    )
+                    ) from error

Apply the same pattern in _decrypt().

Also applies to: 2718-2734

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/connectors/postgres.py` around lines 637 - 645, The exception
handlers that catch (JWException, osmo_errors.OSMONotFoundError) currently raise
a new OSMOServerError but drop the original exception; update both spots (the
handler inside the credential decrypt logic and the separate _decrypt() method)
to capture the caught exception (e.g., "except (JWException,
osmo_errors.OSMONotFoundError) as err") and re-raise OSMOServerError using "from
err" so the original cause (missing kid vs unwrap/decrypt failure) is preserved
in tracebacks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deployments/charts/quick-start/templates/mek-configmap.yaml`:
- Around line 56-59: The hook uses KID="key-$(openssl rand -hex 8)" but the
alpine/k8s:1.28.4 hook image doesn't ship openssl; either ensure openssl is
installed in the hook runtime (add an apk add openssl step in the
pre-install/pre-upgrade hook) or replace the openssl call with a POSIX-safe
alternative that exists in the image (e.g., use /dev/urandom + od/xxd or a
shell-based hex generator). Update the template line that defines KID (the
KID="key-$(openssl rand -hex 8)" expression) and the hook commands so the chosen
tool is available before KID is evaluated.

In `@deployments/scripts/deploy-k8s.sh`:
- Around line 278-283: The current key-generation uses local var=$(...) which
masks command-substitution failures; change the pattern in deploy-k8s.sh so you
first declare the locals (e.g., local random_key; local kid; local jwk_json;
local encoded_jwk) and then assign using separate statements
(random_key="$(openssl rand -base64 32 | tr -d '\n')" etc.), so any failure in
openssl or base64 will cause the script to exit under set -euo pipefail; update
the assignments for random_key, kid, jwk_json and encoded_jwk accordingly and
keep the existing transformations (tr -d '\n', openssl rand -hex 8, base64).

In `@src/utils/connectors/postgres.py`:
- Around line 606-617: The current _is_jwe_compact(value: str) only checks shape
and causes false positives; change detection to require an OSMO-specific marker:
decode the first JWE segment (header) using base64url, parse JSON, and return
True only if it contains a designated OSMO claim/header (e.g.,
"osmo_version":"v1" or "osmo_encrypted":true); if decoding/parsing fails return
False. Update decrypt_credential() and DynamicConfig.deserialize() to call this
revised checker (or a new _is_osmo_encrypted helper) so only OSMO-tagged JWEs
trigger the hard-fail decrypt path; leave other JWE-shaped/plain values to be
treated as plaintext and handled via lazy encryption.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 102ee310-3a04-4416-87b7-88cbc233c314

📥 Commits

Reviewing files that changed from the base of the PR and between d1e8a17 and c77a4e0.

📒 Files selected for processing (4)
  • deployments/charts/quick-start/templates/mek-configmap.yaml
  • deployments/scripts/deploy-k8s.sh
  • run/start_service_kind.py
  • src/utils/connectors/postgres.py

KeitaW added 2 commits March 23, 2026 02:11
- Replace openssl with POSIX-safe /dev/urandom+od in Helm hook (alpine/k8s
  image lacks openssl)
- Separate local declaration from assignment in deploy-k8s.sh to preserve
  exit codes under set -euo pipefail (SC2155)
- Preserve original exception cause with 'from error' in decrypt handlers
- Document _is_jwe_compact shape-based heuristic limitation
Two improvements discovered during production testing:

1. WebSocket exception handler (service.py):
   The generic top_level_exception_handler returns JSONResponse on ALL
   exceptions, but WebSocket connections cannot receive HTTP responses.
   This triggers ASGI protocol error: "Expected 'websocket.send' but
   got 'websocket.http.response.start'". Fix: detect WebSocket scope
   and close gracefully with code 1011 instead.

2. Graceful MEK mismatch (postgres.py):
   Changed from raising OSMOServerError (which crashes the service) to
   logging an error and returning empty string. This keeps the service
   alive after MEK rotation — affected config fields will be empty but
   the service continues to function. The previous behavior crashed the
   entire service on startup when ANY config had encrypted values from
   a previous MEK, preventing WebSocket connections from being established.

Both issues were identified via production testing on EKS where:
- MEK ConfigMap was recreated with new key material
- Existing DB configs had values encrypted with the old MEK
- Service crashed on startup → backend-listener WebSocket reconnect loop
- Even when service partially started, WebSocket handler failures
  produced ASGI protocol errors instead of clean disconnects

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/service/core/service.py (1)

235-241: Good approach for WebSocket-safe exception handling.

The detection of WebSocket scope and graceful close with code 1011 is the correct pattern to avoid ASGI protocol errors. The context snippets confirm that most WebSocket endpoints have internal exception handling, so this serves as a useful safety net.

Two minor concerns:

  1. Silent exception swallowing (lines 239-240): Per static analysis (S110/BLE001), consider logging at DEBUG level to aid troubleshooting when the close fails unexpectedly.

  2. Byte vs character truncation: str(error)[:123] slices by character count, but RFC 6455 limits the close reason to 123 bytes. Multi-byte UTF-8 characters could still exceed the limit. Consider encoding and truncating by bytes.
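The byte-vs-character distinction is easy to demonstrate with stdlib only (the 'é' string is an illustrative value, not from the codebase):

```python
# A close reason of 100 two-byte UTF-8 characters: slicing 123
# *characters* keeps all 100 of them (200 bytes), exceeding
# RFC 6455's 123-byte limit on the close reason.
reason = '\u00e9' * 100
assert len(reason[:123].encode('utf-8')) == 200

# Truncating the *bytes* and decoding with errors='ignore' drops any
# character split at the boundary, staying within the limit.
safe = reason.encode('utf-8')[:123].decode('utf-8', errors='ignore')
assert len(safe.encode('utf-8')) <= 123
```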

🔧 Suggested improvement
     if request.scope.get('type') == 'websocket':
         try:
             websocket = fastapi.WebSocket(request.scope, request.receive, request._send)
-            await websocket.close(code=1011, reason=str(error)[:123])
-        except Exception:
-            pass  # Connection may already be closed
+            # Truncate by bytes to respect RFC 6455 limit
+            reason = str(error).encode('utf-8')[:123].decode('utf-8', errors='ignore')
+            await websocket.close(code=1011, reason=reason)
+        except Exception as close_err:  # pylint: disable=broad-except
+            logging.debug('Failed to close WebSocket gracefully: %s', close_err)
         return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/service.py` around lines 235 - 241, Replace the silent
except with a debug log and ensure the close reason is truncated to 123 bytes
(not characters): when detecting WebSocket scope (request.scope.get('type') ==
'websocket') keep creating fastapi.WebSocket(request.scope, request.receive,
request._send) and call websocket.close(code=1011, reason=...) but build the
reason by encoding str(error) to UTF-8, truncating to 123 bytes and decoding
back to a valid string (e.g., trim bytes and decode with errors='ignore') so you
never exceed the RFC6455 byte limit; in the except block catch Exception as exc
and emit a process- or module-level logger.debug (or similar) referencing
websocket.close and the exception to avoid silently swallowing failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d7a6328-fa58-4aea-89b5-b74c0ffad8cb

📥 Commits

Reviewing files that changed from the base of the PR and between ad47c7b and 222d2c7.

📒 Files selected for processing (2)
  • src/service/core/service.py
  • src/utils/connectors/postgres.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/utils/connectors/postgres.py

The previous approach (return '', secret) kept the undecryptable JWE
in the DB permanently. The service never self-healed because:
- _init_configs uses INSERT ON CONFLICT DO NOTHING
- The row exists (even with empty fields) → insert skipped
- service_auth with empty private_key → JWT signing fails → auth 500

New approach: when _is_jwe_compact() detects a value encrypted with
a stale MEK, DELETE the config row from the DB. On next startup,
_init_configs() re-inserts the default (e.g., fresh 4096-bit RSA
keypair for service_auth) because the row is now absent.

The DELETE uses the composite primary key (key, type) to avoid
affecting configs of other types.

Credentials (decrypt_credential) are NOT deleted — they are
user-supplied with no default_factory regeneration. The stale JWE
is preserved so the user can re-enter after MEK rotation.
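The self-heal mechanism described above can be sketched with SQLite standing in for Postgres (table layout and the 'system' type value are illustrative; the real code lives in _init_configs and the _decrypt handler):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE configs (key TEXT, type TEXT, value TEXT, '
             'PRIMARY KEY (key, type))')

def init_configs(default_value: str) -> None:
    # Mirrors _init_configs: insert the default only if no row exists.
    conn.execute(
        'INSERT INTO configs VALUES (?, ?, ?) ON CONFLICT (key, type) DO NOTHING',
        ('service_auth', 'system', default_value))

init_configs('fresh-keypair-v1')
# A value encrypted with a lost MEK: the row still exists, so re-running
# startup init is a no-op and the service never self-heals.
conn.execute("UPDATE configs SET value = 'stale-jwe' WHERE key = 'service_auth'")
init_configs('fresh-keypair-v2')
assert conn.execute('SELECT value FROM configs').fetchone()[0] == 'stale-jwe'

# The fix: delete the stale row by its composite primary key, after
# which the next startup re-inserts a freshly generated default.
conn.execute("DELETE FROM configs WHERE key = 'service_auth' AND type = 'system'")
init_configs('fresh-keypair-v2')
assert conn.execute('SELECT value FROM configs').fetchone()[0] == 'fresh-keypair-v2'
```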

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/utils/job/jobs.py (2)

1443-1467: ⚠️ Potential issue | 🔴 Critical

Inconsistent error-log key format will cause migration to fail.

Lines 1373-1377 use get_error_log_key() which returns keys with hash tags (e.g., {workflow_id}-{task_uuid}-{retry_id}-error-logs), but line 1467 constructs error-log keys using the old format without hash tags (f'{prefix}-error-logs').

This means:

  • xadd/expire write to {workflow_id}-{task_uuid}-{retry_id}-error-logs
  • migrate_logs_concurrently reads from workflow_id-task_uuid-retry_id-error-logs

The migration will read from non-existent keys and error logs will not be uploaded to S3.

🐛 Proposed fix
                 if task_obj.status.has_error_logs():
-                    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
                     task_error_log_name = task_obj.name
                     if task_obj.retry_id > 0:
                         task_error_log_name += f'_{task_obj.retry_id}'
                     task_error_log_name += common.ERROR_LOGS_SUFFIX_FILE_NAME
                     task_parameters.append(
-                        (workflow_obj.logs, f'{prefix}-error-logs', task_error_log_name))
+                        (workflow_obj.logs,
+                         common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id),
+                         task_error_log_name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/job/jobs.py` around lines 1443 - 1467, The error-log Redis key is
constructed using the old format f'{prefix}-error-logs' when appending to
task_parameters, causing a mismatch with code that writes keys via
common.get_error_log_key(...); replace the hardcoded f'{prefix}-error-logs' with
the canonical key by calling common.get_error_log_key(self.workflow_id,
task_obj.task_uuid, task_obj.retry_id) (or otherwise use the same helper used
elsewhere) when building the tuple for error logs so xadd/expire and
migrate_logs_concurrently use the same hashed-key format; adjust the tuple
append in the loop that populates task_parameters accordingly.

1500-1502: ⚠️ Potential issue | 🔴 Critical

Inconsistent error-log key format will cause deletion to miss keys.

Similar to the migration issue, the deletion constructs keys using the old format without hash tags, but the data was written using get_error_log_key() with hash tags.

🐛 Proposed fix
                 if task_obj.status.has_error_logs():
-                    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
-                    redis_keys_to_delete.append(f'{prefix}-error-logs')
+                    redis_keys_to_delete.append(
+                        common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/job/jobs.py` around lines 1500 - 1502, The deletion builds Redis
error-log keys using an old prefix format while writes use
get_error_log_key(...) (which includes hash tags), so replace the manual
prefix-based key construction in the block where
task_obj.status.has_error_logs() is checked with a call to the same key
generator used for writes (i.e., call get_error_log_key(self.workflow_id,
task_obj.task_uuid, task_obj.retry_id) or the existing helper that accepts
task_obj) and append that returned key to redis_keys_to_delete so the deletion
matches the stored keys exactly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/utils/common.py`:
- Around line 1122-1124: Replace the assert validation in the function that
builds the pod-conditions key for workflow_uuid: instead of using "assert
len(workflow_uuid) == 32 and all(...)" raise a ValueError with a clear message
when workflow_uuid is not 32 hex characters (e.g., check length and hex chars
and raise ValueError('Input must be workflow UUID (32 hex characters)') ), then
return the formatted string f'{{{workflow_uuid}}}-pod-conditions'; locate the
check around the variable workflow_uuid in common.py and update it to use
explicit validation + ValueError rather than assert.
- Around line 1127-1144: Multiple places build Redis log keys with hardcoded
formats and missing hash tags, causing mismatch with the new helpers; update
those callers to use get_workflow_log_key(workflow_id) and
get_error_log_key(workflow_id, task_uuid, retry_id). Specifically, in the reader
endpoint (where redis_name = f'{name}-logs') replace the hardcoded construction
with get_workflow_log_key(name); in agent/objects.py (where the writer builds
f'{self._current_job.workflow.workflow_id}-{message_option.pod_log.task}-{message_option.pod_log.retry_id}-error-logs')
use get_error_log_key(self._current_job.workflow.workflow_id,
message_option.pod_log.task_uuid or the correct task UUID field,
message_option.pod_log.retry_id); and in jobs.py cleanup (where keys are built
as f'{prefix}-error-logs') call get_error_log_key(prefix_workflow_id, task_uuid,
retry_id) or get_workflow_log_key(prefix) as appropriate—ensure you pass the
workflow_id and task_uuid values the helpers expect so produced keys include the
Redis hash tag.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3345a590-82ce-442f-bea2-1b0284726fac

📥 Commits

Reviewing files that changed from the base of the PR and between 46f557e and 54d123e.

📒 Files selected for processing (3)
  • src/lib/utils/common.py
  • src/service/logger/ctrl_websocket.py
  • src/utils/job/jobs.py

@KeitaW KeitaW force-pushed the fix/mek-reencryption-bloat branch from 54d123e to 46f557e on March 23, 2026 10:20
local encoded_jwk=$(echo -n "$jwk_json" | base64 | tr -d '\n')

local mek_manifest="apiVersion: v1
# Generate and create MEK (skip if already exists to avoid key material mismatch)
Collaborator


Take a look at deploy_service.rst to see if there are updates to the commands there

Author


Updated both docs/deployment_guide/getting_started/deploy_service.rst and docs/deployment_guide/appendix/deploy_minimal.rst to match the deploy script behavior:

  • Replaced all hardcoded "kid":"key1" with dynamic KID="key-$(openssl rand -hex 8)" generation
  • Updated ConfigMap YAML examples to use $KID variable
  • Added a .. note:: explaining why unique key IDs matter (detecting key material mismatches)
  • Added a .. tip:: warning not to re-create the MEK ConfigMap if it already exists

See commit ebd427b.

# Must include type in WHERE clause — configs PK is (key, type).
config_type = dynamic_config.get_type().value
for key in delete_keys:
cmd = 'DELETE FROM configs WHERE key = %s AND type = %s;'
Collaborator


Consider batching the keys into one SQL call, instead of making a call per key

Author


Batched into a single ANY(%s) call, matching the existing pattern in _batch_fetch_external_roles:

if delete_keys:
    config_type = dynamic_config.get_type().value
    cmd = 'DELETE FROM configs WHERE key = ANY(%s) AND type = %s;'
    postgres.execute_commit_command(cmd, (list(delete_keys), config_type))

The encrypt_keys UPDATE loop above is left as-is since each key has a different old_value for optimistic concurrency, making it unsuitable for simple batching.

See commit ebd427b.

KeitaW added a commit to KeitaW/OSMO that referenced this pull request Mar 25, 2026
Address ethany-nv's review comments on PR NVIDIA#732:

1. Batch per-key DELETE loop into single `ANY(%s)` SQL call, matching
   the existing pattern in `_batch_fetch_external_roles`. This reduces
   DB round-trips and makes the operation atomic.

2. Update deployment docs (`deploy_service.rst`, `deploy_minimal.rst`)
   to use dynamic `kid` generation instead of hardcoded `key1`, matching
   the deploy script and Helm chart. Added notes explaining why unique
   key IDs matter and a tip about MEK idempotency.


Development

Successfully merging this pull request may close these issues.

MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat
