[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard) by ak684 · Pull Request #651 · OpenHands/OpenHands-Cloud

ak684 · 2026-05-21T19:39:11Z

What

Make a change to any secret in the KOTS config actually take effect on Deploy, instead of silently waiting for a manual pod restart.

Add a checksum/config-secrets annotation to every long-running deployment that reads the secret env (openhands, openhands-integrations, openhands-mcp), set to a KOTS-rendered sha256 of every secret (password) config value (secretsChecksum in replicated/openhands.yaml). When a secret changes, the hash changes, the pod templates change, and the pods roll so the new value is loaded. CronJobs are not annotated since each run is a fresh pod that re-reads the secret. Only the hash crosses into the chart, never the secret values.
Add a CI check (scripts/check_secret_checksum.py + workflow) that fails if a password config field is not in secretsChecksum, so a new secret can't be added without wiring it into the restart.

Background: the situation that drove this

While wiring up an optional Bitbucket Data Center bot token (OpenHands-Cloud #649, code in OpenHands #14517), I set the token in the admin console, clicked Deploy, and nothing happened. The webhook event reached the app, but the bot-token env var on the running pod was still empty, so OpenHands kept posting as the previous identity.

Root cause: secret-backed env vars (sourced via secretKeyRef) are read only at pod start, and Kubernetes does not restart a pod when a referenced Secret changes. The openhands deployment already carries checksum/* annotations for keycloak, litellm, and the CA bundle, but none for the provider secrets, so a token change did not roll the pod. The only fix at the time was a manual kubectl rollout restart.

This is not specific to BBDC. The same class of bug is already filed for LiteLLM (PLTF-42) and for the image pull secret (PLTF-153), both still in Triage. The BBDC bot token is a third instance of one root cause, which is why a general fix is worth considering rather than another one-off.

Validation

KOTS sha256sum renders a real hash, and a config change rolls the pod: verified on a live Replicated install (replicated-01) using the BBDC bot token. Changing the token rolled the openhands pod and the new token loaded; restoring it rolled back. Same mechanism, more inputs, here.
CI guard: passes when all 33 password fields are covered, and fails (naming the missing field) when one is dropped from the checksum.
helm template renders the checksum/config-secrets annotation.

Tradeoff: this is one option, not necessarily the option

The checksum is coarse. It restarts the openhands pod on any secret change, including secrets that only affect other components (for example an LLM key used only by LiteLLM). It also relies on the explicit field list staying in sync, which the CI guard enforces but does not remove.

Stakater Reloader watches the actual Secret objects and rolls only the workloads that reference them, with no list to maintain, at the cost of a new in-cluster controller dependency. The tracking issue compares the two approaches. This PR is the concrete checksum option referenced there.

Merge order note

bitbucket_data_center_bot_token lands in #649, not main. Once #649 merges, that field must be added to secretsChecksum here, and the CI guard will flag it if it is missing.

Enforcement: this guard only blocks if it is made a required check

The workflow makes the guard run, but running is not blocking. Main's rulesets are deletion, non_fast_forward, and pull_request, with no required_status_checks rule, so CI checks here are advisory: a red guard is visible but does not block merge on its own. (Classic branch protection could not be read with my token, so an admin should confirm in repo settings.)

To actually prevent a bad merge, an admin needs to add check-secret-checksum to the branch's required status checks, ideally strict (require up to date) so it re-evaluates against latest main before merge.

This extra setup (a workflow plus a required, strict check) is part of the cost of the checksum approach, and a point in favor of Reloader, which needs none of it. See PLTF-2701.

Linear: PLTF-2701

AI-generated metadata update by OpenHands on behalf of ak684.

Secret-backed env vars on the openhands deployment (sourced via secretKeyRef) are read only at pod start, and Kubernetes does not restart a pod when a referenced Secret changes. So changing a secret in the KOTS admin console and clicking Deploy silently has no effect until a manual pod restart. Add a checksum/config-secrets pod-template annotation whose value is a KOTS-rendered sha256 of every secret (password) config value. When any secret changes, the hash changes, the pod template changes, and the pod rolls so the new value is loaded. Only the hash crosses into the chart, never the secrets. Tradeoff: this restarts the openhands pod on any secret change, including ones that only affect other components (for example an LLM key used only by LiteLLM). A controller like Reloader would restart only the affected workload.

Adds scripts/check_secret_checksum.py and a workflow that fails when a password-type field in config.yaml is not part of secretsChecksum in openhands.yaml. Without this, a new secret could be added without wiring it into the restart checksum, reintroducing the silent no-restart bug. Same idea as the chart-version-bump check: keep two related things in sync.

all-hands-bot

🟡 Acceptable - Solves a real production problem with a pragmatic approach, though with some complexity tradeoffs.

[IMPROVEMENT OPPORTUNITIES]

The coarse-grained checksum restarts the openhands pod for any secret change, including secrets that only affect other components (like LiteLLM keys). The PR description acknowledges this tradeoff vs. Stakater Reloader. Consider documenting this behavior in the chart README or values.yaml comments so operators understand the restart scope.

The 656-character single-line secretsChecksum could benefit from structure (logical grouping or multi-line formatting in the comment above it) to make future additions clearer, even though the CI check enforces correctness.

[RISK ASSESSMENT]
🟡 MEDIUM - The change is additive and low-risk from a breakage perspective. The main operational risk is unexpected pod restarts when changing unrelated secrets. The CI guard prevents silent failures (the original bug), which significantly reduces the risk of reintroducing the problem.

[VERDICT]
✅ Worth merging - The CI guard is excellent defensive engineering. The checksum approach is simpler than adding a controller dependency, and the tradeoff is reasonable for the deployment scale.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/OpenHands-Cloud/actions/runs/26248820483

all-hands-bot · 2026-05-21T19:44:00Z

+    # annotation (checksum/config-secrets) and forces a rollout so secret-backed
+    # env vars are reloaded. Keep this in sync with the password fields in
+    # config.yaml; CI (check-secret-checksum) fails if a password field is missing here.
+    secretsChecksum: 'repl{{ sha256sum (printf "%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s" (ConfigOption "anthropic_api_key") (ConfigOption "openai_api_key") (ConfigOption "google_gemini_api_key") (ConfigOption "deepseek_api_key") (ConfigOption "mistral_api_key") (ConfigOption "azure_api_key") (ConfigOption "azure_client_secret") (ConfigOption "groq_api_key") (ConfigOption "openrouter_api_key") (ConfigOption "aws_secret_access_key") (ConfigOption "custom_api_key") (ConfigOption "admin_password") (ConfigOption "postgres_password") (ConfigOption "redis_password") (ConfigOption "jwt_secret") (ConfigOption "keycloak_admin_password") (ConfigOption "keycloak_client_secret") (ConfigOption "litellm_api_key") (ConfigOption "default_api_key") (ConfigOption "sandbox_api_key") (ConfigOption "keycloak_smtp_password") (ConfigOption "plugin_directory_identity_shared_secret") (ConfigOption "plugin_directory_session_secret") (ConfigOption "automation_service_key") (ConfigOption "automation_webhook_secret") (ConfigOption "bitbucket_data_center_client_secret") (ConfigOption "github_oauth_client_secret") (ConfigOption "github_app_webhook_secret") (ConfigOption "gitlab_oauth_client_secret") (ConfigOption "slack_client_secret") (ConfigOption "slack_signing_secret") (ConfigOption "external_postgres_password") (ConfigOption "laminar_project_api_key")) }}'


🟡 Suggestion: This 656-character line with 33 ConfigOption calls is functionally correct but hard to scan. Consider adding a structured comment above it grouping secrets by category (LLM keys, auth secrets, infra passwords, etc.) to make future additions easier to review, even though the CI check enforces completeness.

all-hands-bot · 2026-05-21T19:44:00Z

+last_name = None
+for line in config_lines:
+    name_match = re.match(r"\s*- name:\s*(\S+)", line)
+    if name_match:
+        last_name = name_match.group(1).strip("\"'")
+    elif re.match(r"\s*type:\s*password\b", line) and last_name:
+        password_fields.append(last_name)


🟡 Suggestion: The parser assumes name: and type: appear on separate lines with type: coming after name:. If someone formats config.yaml inline (e.g., name: foo, type: password), this could miss fields. Consider adding a test case with different YAML formatting, or use a proper YAML parser like ruamel.yaml to be more robust.

all-hands-bot · 2026-05-21T19:44:00Z

@@ -95,6 +95,12 @@ bitbucket:
 enterpriseSSO:
  enabled: false



🟡 Suggestion: The comment explains the mechanism well, but consider adding a note about the restart scope: 'Changing any secret in the KOTS config restarts the openhands pod, even if that secret only affects other components (e.g., LiteLLM keys).' This helps operators understand the blast radius when rotating credentials.

- check_secret_checksum.py: parse config.yaml and openhands.yaml with a real YAML parser instead of line matching, so reformatting cannot make the guard silently miss a field. Adds pyyaml to the workflow. - workflow: also run on push to main, so drift is surfaced right away even though this is not a required status check. - values.yaml: note the restart blast radius (any secret change restarts the openhands pod, including secrets that only affect other components). - openhands.yaml: group the hashed fields by category in the comment.

main is now at 0.7.23 (from #649), so this PR's 0.7.23 would collide with the chart-version check. Bump to 0.7.24. Also remove the push:main trigger on the secret-checksum guard: a non-required check failing on main is noise, the real enforcement is making it a required (ideally strict) status check.

#649 merged the bot_token config field to main, so it has to be in the secret checksum, otherwise a bot-token change would not restart the pod (and the guard would flag it against current main). Adds it to the hash.

…checksum # Conflicts: # charts/openhands/Chart.yaml

jlav

Can you just double check that the checksum is applied anywhere that references these secrets? There may be some other deployments besides openhands that use them.

…main Per review: openhands-integrations and openhands-mcp also read the secret env (via the shared openhands.env include), so they need the same checksum/config-secrets annotation to roll when a secret changes. CronJobs are skipped: each run is a fresh pod that re-reads the secret, so they pick up changes on their next run. Also restore the push:main trigger so the guard runs on main after merge, matching the chart-version checks.

ak684 added 2 commits May 21, 2026 15:38

all-hands-bot reviewed May 21, 2026

View reviewed changes

ak684 added 4 commits May 21, 2026 16:02

Add bitbucket_data_center_bot_token to secretsChecksum

3b1ed1e

#649 merged the bot_token config field to main, so it has to be in the secret checksum, otherwise a bot-token change would not restart the pod (and the guard would flag it against current main). Adds it to the hash.

Merge remote-tracking branch 'origin/main' into alona/secret-restart-…

c8d8473

…checksum # Conflicts: # charts/openhands/Chart.yaml

jlav approved these changes May 21, 2026

View reviewed changes

ak684 added 2 commits May 21, 2026 17:19

Merge branch 'main' into alona/secret-restart-checksum

84b66ce

ak684 merged commit 9a25c87 into main May 21, 2026
6 checks passed

ak684 deleted the alona/secret-restart-checksum branch May 21, 2026 21:47

ak684 added the customer-support Issues and PRs related to customer support requests label May 26, 2026

ak684 changed the title ~~Restart openhands on secret config changes (checksum + CI guard)~~ [PLTF-2701] Restart openhands on secret config changes (checksum + CI guard) May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard)#651

[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard)#651
ak684 merged 8 commits into
mainfrom
alona/secret-restart-checksum

ak684 commented May 21, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot May 21, 2026

Uh oh!

all-hands-bot May 21, 2026

Uh oh!

all-hands-bot May 21, 2026

Uh oh!

jlav left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ak684 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Background: the situation that drove this

Validation

Tradeoff: this is one option, not necessarily the option

Merge order note

Enforcement: this guard only blocks if it is made a required check

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

jlav left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ak684 commented May 21, 2026 •

edited

Loading