[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard)#651
Conversation
Secret-backed env vars on the openhands deployment (sourced via secretKeyRef) are read only at pod start, and Kubernetes does not restart a pod when a referenced Secret changes. So changing a secret in the KOTS admin console and clicking Deploy silently has no effect until a manual pod restart. Add a checksum/config-secrets pod-template annotation whose value is a KOTS-rendered sha256 of every secret (password) config value. When any secret changes, the hash changes, the pod template changes, and the pod rolls so the new value is loaded. Only the hash crosses into the chart, never the secrets. Tradeoff: this restarts the openhands pod on any secret change, including ones that only affect other components (for example an LLM key used only by LiteLLM). A controller like Reloader would restart only the affected workload.
Adds scripts/check_secret_checksum.py and a workflow that fails when a password-type field in config.yaml is not part of secretsChecksum in openhands.yaml. Without this, a new secret could be added without wiring it into the restart checksum, reintroducing the silent no-restart bug. Same idea as the chart-version-bump check: keep two related things in sync.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Solves a real production problem with a pragmatic approach, though with some complexity tradeoffs.
[IMPROVEMENT OPPORTUNITIES]
The coarse-grained checksum restarts the openhands pod for any secret change, including secrets that only affect other components (like LiteLLM keys). The PR description acknowledges this tradeoff vs. Stakater Reloader. Consider documenting this behavior in the chart README or values.yaml comments so operators understand the restart scope.
The 656-character single-line secretsChecksum could benefit from structure (logical grouping or multi-line formatting in the comment above it) to make future additions clearer, even though the CI check enforces correctness.
[RISK ASSESSMENT]
🟡 MEDIUM - The change is additive and low-risk from a breakage perspective. The main operational risk is unexpected pod restarts when changing unrelated secrets. The CI guard prevents silent failures (the original bug), which significantly reduces the risk of reintroducing the problem.
[VERDICT]
✅ Worth merging - The CI guard is excellent defensive engineering. The checksum approach is simpler than adding a controller dependency, and the tradeoff is reasonable for the deployment scale.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/OpenHands-Cloud/actions/runs/26248820483
| # annotation (checksum/config-secrets) and forces a rollout so secret-backed | ||
| # env vars are reloaded. Keep this in sync with the password fields in | ||
| # config.yaml; CI (check-secret-checksum) fails if a password field is missing here. | ||
| secretsChecksum: 'repl{{ sha256sum (printf "%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s" (ConfigOption "anthropic_api_key") (ConfigOption "openai_api_key") (ConfigOption "google_gemini_api_key") (ConfigOption "deepseek_api_key") (ConfigOption "mistral_api_key") (ConfigOption "azure_api_key") (ConfigOption "azure_client_secret") (ConfigOption "groq_api_key") (ConfigOption "openrouter_api_key") (ConfigOption "aws_secret_access_key") (ConfigOption "custom_api_key") (ConfigOption "admin_password") (ConfigOption "postgres_password") (ConfigOption "redis_password") (ConfigOption "jwt_secret") (ConfigOption "keycloak_admin_password") (ConfigOption "keycloak_client_secret") (ConfigOption "litellm_api_key") (ConfigOption "default_api_key") (ConfigOption "sandbox_api_key") (ConfigOption "keycloak_smtp_password") (ConfigOption "plugin_directory_identity_shared_secret") (ConfigOption "plugin_directory_session_secret") (ConfigOption "automation_service_key") (ConfigOption "automation_webhook_secret") (ConfigOption "bitbucket_data_center_client_secret") (ConfigOption "github_oauth_client_secret") (ConfigOption "github_app_webhook_secret") (ConfigOption "gitlab_oauth_client_secret") (ConfigOption "slack_client_secret") (ConfigOption "slack_signing_secret") (ConfigOption "external_postgres_password") (ConfigOption "laminar_project_api_key")) }}' |
There was a problem hiding this comment.
🟡 Suggestion: This 656-character line with 33 ConfigOption calls is functionally correct but hard to scan. Consider adding a structured comment above it grouping secrets by category (LLM keys, auth secrets, infra passwords, etc.) to make future additions easier to review, even though the CI check enforces completeness.
| last_name = None | ||
| for line in config_lines: | ||
| name_match = re.match(r"\s*- name:\s*(\S+)", line) | ||
| if name_match: | ||
| last_name = name_match.group(1).strip("\"'") | ||
| elif re.match(r"\s*type:\s*password\b", line) and last_name: | ||
| password_fields.append(last_name) |
There was a problem hiding this comment.
🟡 Suggestion: The parser assumes name: and type: appear on separate lines with type: coming after name:. If someone formats config.yaml inline (e.g., name: foo, type: password), this could miss fields. Consider adding a test case with different YAML formatting, or use a proper YAML parser like ruamel.yaml to be more robust.
| @@ -95,6 +95,12 @@ bitbucket: | |||
| enterpriseSSO: | |||
| enabled: false | |||
|
|
|||
There was a problem hiding this comment.
🟡 Suggestion: The comment explains the mechanism well, but consider adding a note about the restart scope: 'Changing any secret in the KOTS config restarts the openhands pod, even if that secret only affects other components (e.g., LiteLLM keys).' This helps operators understand the blast radius when rotating credentials.
- check_secret_checksum.py: parse config.yaml and openhands.yaml with a real YAML parser instead of line matching, so reformatting cannot make the guard silently miss a field. Adds pyyaml to the workflow. - workflow: also run on push to main, so drift is surfaced right away even though this is not a required status check. - values.yaml: note the restart blast radius (any secret change restarts the openhands pod, including secrets that only affect other components). - openhands.yaml: group the hashed fields by category in the comment.
main is now at 0.7.23 (from #649), so this PR's 0.7.23 would collide with the chart-version check. Bump to 0.7.24. Also remove the push:main trigger on the secret-checksum guard: a non-required check failing on main is noise, the real enforcement is making it a required (ideally strict) status check.
#649 merged the bot_token config field to main, so it has to be in the secret checksum, otherwise a bot-token change would not restart the pod (and the guard would flag it against current main). Adds it to the hash.
…checksum # Conflicts: # charts/openhands/Chart.yaml
jlav
left a comment
There was a problem hiding this comment.
Can you just double check that the checksum is applied anywhere that references these secrets? There may be some other deployments besides openhands that use them.
…main Per review: openhands-integrations and openhands-mcp also read the secret env (via the shared openhands.env include), so they need the same checksum/config-secrets annotation to roll when a secret changes. CronJobs are skipped: each run is a fresh pod that re-reads the secret, so they pick up changes on their next run. Also restore the push:main trigger so the guard runs on main after merge, matching the chart-version checks.
What
Make a change to any secret in the KOTS config actually take effect on Deploy, instead of silently waiting for a manual pod restart.
checksum/config-secretsannotation to every long-running deployment that reads the secret env (openhands, openhands-integrations, openhands-mcp), set to a KOTS-rendered sha256 of every secret (password) config value (secretsChecksuminreplicated/openhands.yaml). When a secret changes, the hash changes, the pod templates change, and the pods roll so the new value is loaded. CronJobs are not annotated since each run is a fresh pod that re-reads the secret. Only the hash crosses into the chart, never the secret values.scripts/check_secret_checksum.py+ workflow) that fails if a password config field is not insecretsChecksum, so a new secret can't be added without wiring it into the restart.Background: the situation that drove this
While wiring up an optional Bitbucket Data Center bot token (OpenHands-Cloud #649, code in OpenHands #14517), I set the token in the admin console, clicked Deploy, and nothing happened. The webhook event reached the app, but the bot-token env var on the running pod was still empty, so OpenHands kept posting as the previous identity.
Root cause: secret-backed env vars (sourced via
secretKeyRef) are read only at pod start, and Kubernetes does not restart a pod when a referenced Secret changes. The openhands deployment already carrieschecksum/*annotations for keycloak, litellm, and the CA bundle, but none for the provider secrets, so a token change did not roll the pod. The only fix at the time was a manualkubectl rollout restart.This is not specific to BBDC. The same class of bug is already filed for LiteLLM (PLTF-42) and for the image pull secret (PLTF-153), both still in Triage. The BBDC bot token is a third instance of one root cause, which is why a general fix is worth considering rather than another one-off.
Validation
sha256sumrenders a real hash, and a config change rolls the pod: verified on a live Replicated install (replicated-01) using the BBDC bot token. Changing the token rolled the openhands pod and the new token loaded; restoring it rolled back. Same mechanism, more inputs, here.helm templaterenders thechecksum/config-secretsannotation.Tradeoff: this is one option, not necessarily the option
The checksum is coarse. It restarts the openhands pod on any secret change, including secrets that only affect other components (for example an LLM key used only by LiteLLM). It also relies on the explicit field list staying in sync, which the CI guard enforces but does not remove.
Stakater Reloader watches the actual Secret objects and rolls only the workloads that reference them, with no list to maintain, at the cost of a new in-cluster controller dependency. The tracking issue compares the two approaches. This PR is the concrete checksum option referenced there.
Merge order note
bitbucket_data_center_bot_tokenlands in #649, not main. Once #649 merges, that field must be added tosecretsChecksumhere, and the CI guard will flag it if it is missing.Enforcement: this guard only blocks if it is made a required check
The workflow makes the guard run, but running is not blocking. Main's rulesets are
deletion,non_fast_forward, andpull_request, with norequired_status_checksrule, so CI checks here are advisory: a red guard is visible but does not block merge on its own. (Classic branch protection could not be read with my token, so an admin should confirm in repo settings.)To actually prevent a bad merge, an admin needs to add
check-secret-checksumto the branch's required status checks, ideally strict (require up to date) so it re-evaluates against latest main before merge.This extra setup (a workflow plus a required, strict check) is part of the cost of the checksum approach, and a point in favor of Reloader, which needs none of it. See PLTF-2701.
Linear: PLTF-2701
AI-generated metadata update by OpenHands on behalf of ak684.