fix: prevent MEK re-encryption from double-wrapping JWE config values #732
KeitaW wants to merge 5 commits into NVIDIA:main
Conversation
When the MEK ConfigMap is regenerated with new key material (e.g. via
ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the
_decrypt() except handler in postgres.py treats undecryptable JWE
ciphertext as "plaintext" and wraps it in another JWE layer. This
causes config values to grow ~1.33x per MEK regeneration, eventually
breaking JWT auth (500), workflow submission, and K8s Secret creation
(422 when backend_images exceeds the 1MB limit).
Root cause: two interacting issues:
1. Static kid="key1" in MEK generation makes key material mismatches
invisible — get_mek("key1") returns the new (wrong) key without error
2. The except handler at postgres.py L2697 does not distinguish between
genuinely unencrypted plaintext and JWE encrypted with a lost key
Fix:
- Add _is_jwe_compact() heuristic (eyJ prefix + 4 dots = 5 JWE segments)
to detect existing JWE before the re-encryption path
- Raise OSMOServerError on MEK mismatch instead of silently double-wrapping
or leaking ciphertext to application code
- Generate unique kid per MEK (key-<random hex>) in mek-configmap.yaml,
deploy-k8s.sh, and start_service_kind.py
- Add existence check in deploy-k8s.sh to skip MEK generation if ConfigMap
already exists
Fixes: NVIDIA#731
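As a sketch of the heuristic described above (a standalone illustration of the described check, not the patch itself):

```python
def is_jwe_compact(value: str) -> bool:
    # JWE compact serialization has 5 base64url segments separated by 4 dots:
    # header.encrypted_key.iv.ciphertext.tag. The base64url encoding of a
    # JSON header ('{"...') always starts with 'eyJ'.
    return value.startswith('eyJ') and value.count('.') == 4

# A JWS/JWT has only 3 segments (2 dots), so it is not mistaken for JWE.
print(is_jwe_compact('eyJhbGciOiJBMjU2S1cifQ.aa.bb.cc.dd'))  # True
print(is_jwe_compact('eyJhbGciOiJIUzI1NiJ9.payload.sig'))    # False
print(is_jwe_compact('plain config value'))                  # False
```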
c02821e to c77a4e0
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: The PR gives each MEK generation a unique per-generation `kid`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 warnings
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/utils/connectors/postgres.py (1)
637-645: Preserve the original decrypt failure when rethrowing. Both blocks replace `JWException`/`OSMONotFoundError` with `OSMOServerError` but drop the original cause. Capture the exception and re-raise with `from error` so incident logs still show whether this was a missing `kid` or an unwrap/decrypt failure.

Minimal change:

```diff
-        except (JWException, osmo_errors.OSMONotFoundError):
+        except (JWException, osmo_errors.OSMONotFoundError) as error:
             if self._is_jwe_compact(value):
                 raise osmo_errors.OSMOServerError(
                     ...
-                )
+                ) from error
```

Apply the same pattern in `_decrypt()`.

Also applies to: 2718-2734
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/connectors/postgres.py` around lines 637 - 645, The exception handlers that catch (JWException, osmo_errors.OSMONotFoundError) currently raise a new OSMOServerError but drop the original exception; update both spots (the handler inside the credential decrypt logic and the separate _decrypt() method) to capture the caught exception (e.g., "except (JWException, osmo_errors.OSMONotFoundError) as err") and re-raise OSMOServerError using "from err" so the original cause (missing kid vs unwrap/decrypt failure) is preserved in tracebacks.
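The `from err` chaining this prompt asks for can be sketched as follows (the exception classes here are local stand-ins, not the project's real `osmo_errors` imports):

```python
class OSMOServerError(Exception):
    """Local stand-in for osmo_errors.OSMOServerError."""

def decrypt(value: str) -> str:
    try:
        raise KeyError('kid key-deadbeef not found')  # simulated MEK mismatch
    except KeyError as err:
        # 'from err' records the original failure as __cause__, so tracebacks
        # still show whether this was a missing kid or an unwrap failure.
        raise OSMOServerError('MEK mismatch; restore the original key') from err

try:
    decrypt('eyJ...')
except OSMOServerError as exc:
    print(type(exc.__cause__).__name__)  # KeyError
```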
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deployments/charts/quick-start/templates/mek-configmap.yaml`:
- Around line 56-59: The hook uses KID="key-$(openssl rand -hex 8)" but the
alpine/k8s:1.28.4 hook image doesn't ship openssl; either ensure openssl is
installed in the hook runtime (add an apk add openssl step in the
pre-install/pre-upgrade hook) or replace the openssl call with a POSIX-safe
alternative that exists in the image (e.g., use /dev/urandom + od/xxd or a
shell-based hex generator). Update the template line that defines KID (the
KID="key-$(openssl rand -hex 8)" expression) and the hook commands so the chosen
tool is available before KID is evaluated.
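One POSIX-safe alternative along the lines this comment suggests, sketched with `od` (no `openssl` dependency; the variable name matches the template, the exact flags are this sketch's choice):

```shell
# Generate 16 hex characters from /dev/urandom without openssl:
# -An suppresses the address column, -N8 reads 8 bytes, -tx1 prints hex bytes.
KID="key-$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
echo "$KID"
```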
In `@deployments/scripts/deploy-k8s.sh`:
- Around line 278-283: The current key-generation uses local var=$(...) which
masks command-substitution failures; change the pattern in deploy-k8s.sh so you
first declare the locals (e.g., local random_key; local kid; local jwk_json;
local encoded_jwk) and then assign using separate statements
(random_key="$(openssl rand -base64 32 | tr -d '\n')" etc.), so any failure in
openssl or base64 will cause the script to exit under set -euo pipefail; update
the assignments for random_key, kid, jwk_json and encoded_jwk accordingly and
keep the existing transformations (tr -d '\n', openssl rand -hex 8, base64).
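The declare-then-assign pattern described here can be sketched like this (using `head`/`od` in place of `openssl` purely so the sketch is self-contained; the function name is illustrative):

```shell
set -euo pipefail

generate_mek_env() {
    # 'local var=$(cmd)' would mask cmd's exit status (ShellCheck SC2155);
    # declaring first lets 'set -e' catch a failing command substitution.
    local random_key
    local kid
    random_key="$(head -c 32 /dev/urandom | base64 | tr -d '\n')"
    kid="key-$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
    printf '%s\n' "$kid"
}

generate_mek_env
```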
In `@src/utils/connectors/postgres.py`:
- Around line 606-617: The current _is_jwe_compact(value: str) only checks shape
and causes false positives; change detection to require an OSMO-specific marker:
decode the first JWE segment (header) using base64url, parse JSON, and return
True only if it contains a designated OSMO claim/header (e.g.,
"osmo_version":"v1" or "osmo_encrypted":true); if decoding/parsing fails return
False. Update decrypt_credential() and DynamicConfig.deserialize() to call this
revised checker (or a new _is_osmo_encrypted helper) so only OSMO-tagged JWEs
trigger the hard-fail decrypt path; leave other JWE-shaped/plain values to be
treated as plaintext and handled via lazy encryption.
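A marker-based check of the kind this prompt proposes could look roughly like the following (the `osmo_encrypted` header name is the prompt's own example, not a confirmed field):

```python
import base64
import json

def is_osmo_encrypted(value: str) -> bool:
    # Require an OSMO-specific header claim rather than just JWE shape,
    # to avoid false positives on values that merely look like JWE.
    parts = value.split('.')
    if len(parts) != 5:
        return False
    try:
        header_b64 = parts[0]
        padded = header_b64 + '=' * (-len(header_b64) % 4)  # restore base64 padding
        header = json.loads(base64.urlsafe_b64decode(padded))
    except (ValueError, UnicodeDecodeError):
        return False
    return isinstance(header, dict) and header.get('osmo_encrypted') is True

# Build a JWE-shaped value whose header carries the marker.
tagged = base64.urlsafe_b64encode(
    json.dumps({'alg': 'A256KW', 'osmo_encrypted': True}).encode()
).rstrip(b'=').decode()
print(is_osmo_encrypted(f'{tagged}.k.i.c.t'))               # True
print(is_osmo_encrypted('eyJhbGciOiJBMjU2S1cifQ.k.i.c.t'))  # False (no marker)
```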
---
Nitpick comments:
In `@src/utils/connectors/postgres.py`:
- Around line 637-645: The exception handlers that catch (JWException,
osmo_errors.OSMONotFoundError) currently raise a new OSMOServerError but drop
the original exception; update both spots (the handler inside the credential
decrypt logic and the separate _decrypt() method) to capture the caught
exception (e.g., "except (JWException, osmo_errors.OSMONotFoundError) as err")
and re-raise OSMOServerError using "from err" so the original cause (missing kid
vs unwrap/decrypt failure) is preserved in tracebacks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 102ee310-3a04-4416-87b7-88cbc233c314
📒 Files selected for processing (4)
- deployments/charts/quick-start/templates/mek-configmap.yaml
- deployments/scripts/deploy-k8s.sh
- run/start_service_kind.py
- src/utils/connectors/postgres.py
- Replace openssl with POSIX-safe /dev/urandom+od in Helm hook (alpine/k8s image lacks openssl)
- Separate local declaration from assignment in deploy-k8s.sh to preserve exit codes under set -euo pipefail (SC2155)
- Preserve original exception cause with 'from error' in decrypt handlers
- Document _is_jwe_compact shape-based heuristic limitation
Two improvements discovered during production testing:

1. WebSocket exception handler (service.py): The generic top_level_exception_handler returns JSONResponse on ALL exceptions, but WebSocket connections cannot receive HTTP responses. This triggers an ASGI protocol error: "Expected 'websocket.send' but got 'websocket.http.response.start'". Fix: detect WebSocket scope and close gracefully with code 1011 instead.

2. Graceful MEK mismatch (postgres.py): Changed from raising OSMOServerError (which crashes the service) to logging an error and returning an empty string. This keeps the service alive after MEK rotation: affected config fields will be empty, but the service continues to function. The previous behavior crashed the entire service on startup when ANY config had encrypted values from a previous MEK, preventing WebSocket connections from being established.

Both issues were identified via production testing on EKS where:

- The MEK ConfigMap was recreated with new key material
- Existing DB configs had values encrypted with the old MEK
- The service crashed on startup, sending the backend-listener into a WebSocket reconnect loop
- Even when the service partially started, WebSocket handler failures produced ASGI protocol errors instead of clean disconnects
🧹 Nitpick comments (1)
src/service/core/service.py (1)
235-241: Good approach for WebSocket-safe exception handling. The detection of WebSocket scope and graceful close with code 1011 is the correct pattern to avoid ASGI protocol errors. The context snippets confirm that most WebSocket endpoints have internal exception handling, so this serves as a useful safety net.

Two minor concerns:

- Silent exception swallowing (lines 239-240): per static analysis (S110/BLE001), consider logging at `DEBUG` level to aid troubleshooting when the close fails unexpectedly.
- Byte vs character truncation: `str(error)[:123]` slices by character count, but RFC 6455 limits the close reason to 123 bytes. Multi-byte UTF-8 characters could still exceed the limit. Consider encoding and truncating by bytes.

🔧 Suggested improvement

```diff
 if request.scope.get('type') == 'websocket':
     try:
         websocket = fastapi.WebSocket(request.scope, request.receive, request._send)
-        await websocket.close(code=1011, reason=str(error)[:123])
-    except Exception:
-        pass  # Connection may already be closed
+        # Truncate by bytes to respect RFC 6455 limit
+        reason = str(error).encode('utf-8')[:123].decode('utf-8', errors='ignore')
+        await websocket.close(code=1011, reason=reason)
+    except Exception as close_err:  # pylint: disable=broad-except
+        logging.debug('Failed to close WebSocket gracefully: %s', close_err)
 return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/service/core/service.py` around lines 235 - 241, Replace the silent except with a debug log and ensure the close reason is truncated to 123 bytes (not characters): when detecting WebSocket scope (request.scope.get('type') == 'websocket') keep creating fastapi.WebSocket(request.scope, request.receive, request._send) and call websocket.close(code=1011, reason=...) but build the reason by encoding str(error) to UTF-8, truncating to 123 bytes and decoding back to a valid string (e.g., trim bytes and decode with errors='ignore') so you never exceed the RFC6455 byte limit; in the except block catch Exception as exc and emit a process- or module-level logger.debug (or similar) referencing websocket.close and the exception to avoid silently swallowing failures.
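The byte-safe truncation this prompt describes can be sketched as follows (the helper name is illustrative):

```python
def safe_close_reason(error: Exception, limit: int = 123) -> str:
    # RFC 6455 caps the close-frame reason at 123 UTF-8 *bytes*; slicing by
    # characters can overflow that limit for multi-byte text. Encoding,
    # truncating, then decoding with errors='ignore' drops any character
    # that the byte cut split in half.
    return str(error).encode('utf-8')[:limit].decode('utf-8', errors='ignore')

reason = safe_close_reason(RuntimeError('é' * 200))  # 400 UTF-8 bytes in
print(len(reason.encode('utf-8')))  # stays <= 123
```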
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/service/core/service.py`:
- Around line 235-241: Replace the silent except with a debug log and ensure the
close reason is truncated to 123 bytes (not characters): when detecting
WebSocket scope (request.scope.get('type') == 'websocket') keep creating
fastapi.WebSocket(request.scope, request.receive, request._send) and call
websocket.close(code=1011, reason=...) but build the reason by encoding
str(error) to UTF-8, truncating to 123 bytes and decoding back to a valid string
(e.g., trim bytes and decode with errors='ignore') so you never exceed the
RFC6455 byte limit; in the except block catch Exception as exc and emit a
process- or module-level logger.debug (or similar) referencing websocket.close
and the exception to avoid silently swallowing failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0d7a6328-fa58-4aea-89b5-b74c0ffad8cb
📒 Files selected for processing (2)
- src/service/core/service.py
- src/utils/connectors/postgres.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/utils/connectors/postgres.py
The previous approach (return '', secret) kept the undecryptable JWE in the DB permanently. The service never self-healed because:

- _init_configs uses INSERT ON CONFLICT DO NOTHING
- The row exists (even with empty fields), so the insert is skipped
- service_auth with an empty private_key means JWT signing fails and auth returns 500

New approach: when _is_jwe_compact() detects a value encrypted with a stale MEK, DELETE the config row from the DB. On next startup, _init_configs() re-inserts the default (e.g., a fresh 4096-bit RSA keypair for service_auth) because the row is now absent. The DELETE uses the composite primary key (key, type) to avoid affecting configs of other types.

Credentials (decrypt_credential) are NOT deleted: they are user-supplied with no default_factory regeneration. The stale JWE is preserved so the user can re-enter the value after MEK rotation.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/utils/job/jobs.py (2)
1443-1467: ⚠️ Potential issue | 🔴 Critical: Inconsistent error-log key format will cause migration to fail.

Lines 1373-1377 use `get_error_log_key()`, which returns keys with hash tags (e.g., `{workflow_id}-{task_uuid}-{retry_id}-error-logs`), but line 1467 constructs error-log keys using the old format without hash tags (`f'{prefix}-error-logs'`).

This means:

- `xadd`/`expire` write to `{workflow_id}-{task_uuid}-{retry_id}-error-logs`
- `migrate_logs_concurrently` reads from `workflow_id-task_uuid-retry_id-error-logs`

The migration will read from non-existent keys and error logs will not be uploaded to S3.

🐛 Proposed fix

```diff
 if task_obj.status.has_error_logs():
-    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
     task_error_log_name = task_obj.name
     if task_obj.retry_id > 0:
         task_error_log_name += f'_{task_obj.retry_id}'
     task_error_log_name += common.ERROR_LOGS_SUFFIX_FILE_NAME
     task_parameters.append(
-        (workflow_obj.logs, f'{prefix}-error-logs', task_error_log_name))
+        (workflow_obj.logs,
+         common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id),
+         task_error_log_name))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/job/jobs.py` around lines 1443 - 1467, The error-log Redis key is constructed using the old format f'{prefix}-error-logs' when appending to task_parameters, causing a mismatch with code that writes keys via common.get_error_log_key(...); replace the hardcoded f'{prefix}-error-logs' with the canonical key by calling common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id) (or otherwise use the same helper used elsewhere) when building the tuple for error logs so xadd/expire and migrate_logs_concurrently use the same hashed-key format; adjust the tuple append in the loop that populates task_parameters accordingly.
1500-1502: ⚠️ Potential issue | 🔴 Critical: Inconsistent error-log key format will cause deletion to miss keys.

Similar to the migration issue, the deletion constructs keys using the old format without hash tags, but the data was written using `get_error_log_key()` with hash tags.

🐛 Proposed fix

```diff
 if task_obj.status.has_error_logs():
-    prefix = f'{self.workflow_id}-{task_obj.task_uuid}-{task_obj.retry_id}'
-    redis_keys_to_delete.append(f'{prefix}-error-logs')
+    redis_keys_to_delete.append(
+        common.get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/job/jobs.py` around lines 1500 - 1502, The deletion builds Redis error-log keys using an old prefix format while writes use get_error_log_key(...) (which includes hash tags), so replace the manual prefix-based key construction in the block where task_obj.status.has_error_logs() is checked with a call to the same key generator used for writes (i.e., call get_error_log_key(self.workflow_id, task_obj.task_uuid, task_obj.retry_id) or the existing helper that accepts task_obj) and append that returned key to redis_keys_to_delete so the deletion matches the stored keys exactly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/utils/common.py`:
- Around line 1122-1124: Replace the assert validation in the function that
builds the pod-conditions key for workflow_uuid: instead of using "assert
len(workflow_uuid) == 32 and all(...)" raise a ValueError with a clear message
when workflow_uuid is not 32 hex characters (e.g., check length and hex chars
and raise ValueError('Input must be workflow UUID (32 hex characters)') ), then
return the formatted string f'{{{workflow_uuid}}}-pod-conditions'; locate the
check around the variable workflow_uuid in common.py and update it to use
explicit validation + ValueError rather than assert.
- Around line 1127-1144: Multiple places build Redis log keys with hardcoded
formats and missing hash tags, causing mismatch with the new helpers; update
those callers to use get_workflow_log_key(workflow_id) and
get_error_log_key(workflow_id, task_uuid, retry_id). Specifically, in the reader
endpoint (where redis_name = f'{name}-logs') replace the hardcoded construction
with get_workflow_log_key(name); in agent/objects.py (where the writer builds
f'{self._current_job.workflow.workflow_id}-{message_option.pod_log.task}-{message_option.pod_log.retry_id}-error-logs')
use get_error_log_key(self._current_job.workflow.workflow_id,
message_option.pod_log.task_uuid or the correct task UUID field,
message_option.pod_log.retry_id); and in jobs.py cleanup (where keys are built
as f'{prefix}-error-logs') call get_error_log_key(prefix_workflow_id, task_uuid,
retry_id) or get_workflow_log_key(prefix) as appropriate—ensure you pass the
workflow_id and task_uuid values the helpers expect so produced keys include the
Redis hash tag.
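The hash-tagged helpers referred to above presumably look something like this (a sketch inferred from the key formats quoted in the review, not the actual common.py source):

```python
def get_workflow_log_key(workflow_id: str) -> str:
    # The braces form a Redis Cluster hash tag: only '{workflow_id}' is
    # hashed, so every log key for one workflow maps to the same slot.
    return f'{{{workflow_id}}}-logs'

def get_error_log_key(workflow_id: str, task_uuid: str, retry_id: int) -> str:
    return f'{{{workflow_id}}}-{task_uuid}-{retry_id}-error-logs'

print(get_workflow_log_key('wf42'))          # {wf42}-logs
print(get_error_log_key('wf42', 'abcd', 1))  # {wf42}-abcd-1-error-logs
```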
---
Outside diff comments:
In `@src/utils/job/jobs.py`:
- Around line 1443-1467: The error-log Redis key is constructed using the old
format f'{prefix}-error-logs' when appending to task_parameters, causing a
mismatch with code that writes keys via common.get_error_log_key(...); replace
the hardcoded f'{prefix}-error-logs' with the canonical key by calling
common.get_error_log_key(self.workflow_id, task_obj.task_uuid,
task_obj.retry_id) (or otherwise use the same helper used elsewhere) when
building the tuple for error logs so xadd/expire and migrate_logs_concurrently
use the same hashed-key format; adjust the tuple append in the loop that
populates task_parameters accordingly.
- Around line 1500-1502: The deletion builds Redis error-log keys using an old
prefix format while writes use get_error_log_key(...) (which includes hash
tags), so replace the manual prefix-based key construction in the block where
task_obj.status.has_error_logs() is checked with a call to the same key
generator used for writes (i.e., call get_error_log_key(self.workflow_id,
task_obj.task_uuid, task_obj.retry_id) or the existing helper that accepts
task_obj) and append that returned key to redis_keys_to_delete so the deletion
matches the stored keys exactly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3345a590-82ce-442f-bea2-1b0284726fac
📒 Files selected for processing (3)
- src/lib/utils/common.py
- src/service/logger/ctrl_websocket.py
- src/utils/job/jobs.py
54d123e to 46f557e
Diff context from the deploy script:

```shell
local encoded_jwk=$(echo -n "$jwk_json" | base64 | tr -d '\n')

local mek_manifest="apiVersion: v1

# Generate and create MEK (skip if already exists to avoid key material mismatch)
```
Take a look at deploy_service.rst to see if there are updates to the commands there
Updated both docs/deployment_guide/getting_started/deploy_service.rst and docs/deployment_guide/appendix/deploy_minimal.rst to match the deploy script behavior:

- Replaced all hardcoded `"kid":"key1"` with dynamic `KID="key-$(openssl rand -hex 8)"` generation
- Updated ConfigMap YAML examples to use the `$KID` variable
- Added a `.. note::` explaining why unique key IDs matter (detecting key material mismatches)
- Added a `.. tip::` warning not to re-create the MEK ConfigMap if it already exists

See commit ebd427b.
Diff context from src/utils/connectors/postgres.py:

```python
# Must include type in WHERE clause — configs PK is (key, type).
config_type = dynamic_config.get_type().value
for key in delete_keys:
    cmd = 'DELETE FROM configs WHERE key = %s AND type = %s;'
```
Consider batching the keys into one SQL call, instead of making a call per key
Batched into a single `ANY(%s)` call, matching the existing pattern in `_batch_fetch_external_roles`:

```python
if delete_keys:
    config_type = dynamic_config.get_type().value
    cmd = 'DELETE FROM configs WHERE key = ANY(%s) AND type = %s;'
    postgres.execute_commit_command(cmd, (list(delete_keys), config_type))
```

The `encrypt_keys` UPDATE loop above is left as-is since each key has a different `old_value` for optimistic concurrency, making it unsuitable for simple batching.

See commit ebd427b.
Address ethany-nv's review comments on PR NVIDIA#732:

1. Batch the per-key DELETE loop into a single `ANY(%s)` SQL call, matching the existing pattern in `_batch_fetch_external_roles`. This reduces DB round-trips and makes the operation atomic.
2. Update deployment docs (`deploy_service.rst`, `deploy_minimal.rst`) to use dynamic `kid` generation instead of the hardcoded `key1`, matching the deploy script and Helm chart. Added notes explaining why unique key IDs matter and a tip about MEK idempotency.
Fixes #731
Summary
When the MEK ConfigMap is regenerated with new key material (via ArgoCD prune/recreate, Helm reinstall, or namespace cleanup), the `_decrypt()` except handler in `postgres.py` treats undecryptable JWE ciphertext as "plaintext" and wraps it in another JWE layer. This causes config values to grow ~1.33× per MEK regeneration, eventually breaking JWT auth (500), workflow submission, and K8s Secret creation (422 when >1MB).

Changes
1. Detect JWE before re-encrypting (`postgres.py`)

Added a `_is_jwe_compact()` heuristic that checks for an `eyJ` prefix plus exactly 4 dots (5 JWE compact segments). This distinguishes JWE from JWS/JWT (3 dots) and plain JSON/strings (0 dots).

When a MEK mismatch is detected, the fix raises `OSMOServerError` with a clear recovery message instead of silently double-wrapping or leaking ciphertext to application code.

Both affected sites are fixed:

- `_decrypt()` L2697–2701 (config deserialization)
- `decrypt_credential()` L624–626 (credential storage)

2. Unique `kid` per MEK generation

Changed the static `kid="key1"` to `kid="key-$(openssl rand -hex 8)"` in:

- `mek-configmap.yaml` L57 (Helm hook)
- `deploy-k8s.sh` L276 (deploy script)
- `start_service_kind.py` L137 (local dev)

This makes key material mismatches detectable: `get_mek(old_kid)` raises `OSMONotFoundError` instead of silently returning the wrong key.

3. Existence check in `deploy-k8s.sh`

Added a `kubectl get configmap mek-config` check before generation to prevent `kubectl apply` from overwriting an existing MEK on re-runs.
Tested the core fix logic using the reproducer from #731.
Also validated on a live EKS 1.31 cluster with OSMO v6.2-rc6 + ArgoCD where the bug was originally discovered.
Backward Compatibility
- Existing MEKs with `kid="key1"` continue to work; the fix only affects new MEK generation
- The `_is_jwe_compact()` check only triggers in the except handler, so the normal encryption/decryption path is unchanged