Skip to content

[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard)#651

Merged
ak684 merged 8 commits into
mainfrom
alona/secret-restart-checksum
May 21, 2026
Merged

[PLTF-2701] Restart openhands on secret config changes (checksum + CI guard)#651
ak684 merged 8 commits into
mainfrom
alona/secret-restart-checksum

Conversation

@ak684

@ak684 ak684 commented May 21, 2026

Copy link
Copy Markdown
Contributor

What

Make a change to any secret in the KOTS config actually take effect on Deploy, instead of silently waiting for a manual pod restart.

  • Add a checksum/config-secrets annotation to every long-running deployment that reads the secret env (openhands, openhands-integrations, openhands-mcp), set to a KOTS-rendered sha256 of every secret (password) config value (secretsChecksum in replicated/openhands.yaml). When a secret changes, the hash changes, the pod templates change, and the pods roll so the new value is loaded. CronJobs are not annotated since each run is a fresh pod that re-reads the secret. Only the hash crosses into the chart, never the secret values.
  • Add a CI check (scripts/check_secret_checksum.py + workflow) that fails if a password config field is not in secretsChecksum, so a new secret can't be added without wiring it into the restart.

Background: the situation that drove this

While wiring up an optional Bitbucket Data Center bot token (OpenHands-Cloud #649, code in OpenHands #14517), I set the token in the admin console, clicked Deploy, and nothing happened. The webhook event reached the app, but the bot-token env var on the running pod was still empty, so OpenHands kept posting as the previous identity.

Root cause: secret-backed env vars (sourced via secretKeyRef) are read only at pod start, and Kubernetes does not restart a pod when a referenced Secret changes. The openhands deployment already carries checksum/* annotations for keycloak, litellm, and the CA bundle, but none for the provider secrets, so a token change did not roll the pod. The only fix at the time was a manual kubectl rollout restart.

This is not specific to BBDC. The same class of bug is already filed for LiteLLM (PLTF-42) and for the image pull secret (PLTF-153), both still in Triage. The BBDC bot token is a third instance of one root cause, which is why a general fix is worth considering rather than another one-off.

Validation

  • KOTS sha256sum renders a real hash, and a config change rolls the pod: verified on a live Replicated install (replicated-01) using the BBDC bot token. Changing the token rolled the openhands pod and the new token loaded; restoring it rolled back. Same mechanism, more inputs, here.
  • CI guard: passes when all 33 password fields are covered, and fails (naming the missing field) when one is dropped from the checksum.
  • helm template renders the checksum/config-secrets annotation.

Tradeoff: this is one option, not necessarily the option

The checksum is coarse. It restarts the openhands pod on any secret change, including secrets that only affect other components (for example an LLM key used only by LiteLLM). It also relies on the explicit field list staying in sync, which the CI guard enforces but does not remove.

Stakater Reloader watches the actual Secret objects and rolls only the workloads that reference them, with no list to maintain, at the cost of a new in-cluster controller dependency. The tracking issue compares the two approaches. This PR is the concrete checksum option referenced there.

Merge order note

bitbucket_data_center_bot_token lands in #649, not main. Once #649 merges, that field must be added to secretsChecksum here, and the CI guard will flag it if it is missing.

Enforcement: this guard only blocks if it is made a required check

The workflow makes the guard run, but running is not blocking. Main's rulesets are deletion, non_fast_forward, and pull_request, with no required_status_checks rule, so CI checks here are advisory: a red guard is visible but does not block merge on its own. (Classic branch protection could not be read with my token, so an admin should confirm in repo settings.)

To actually prevent a bad merge, an admin needs to add check-secret-checksum to the branch's required status checks, ideally strict (require up to date) so it re-evaluates against latest main before merge.

This extra setup (a workflow plus a required, strict check) is part of the cost of the checksum approach, and a point in favor of Reloader, which needs none of it. See PLTF-2701.


Linear: PLTF-2701

AI-generated metadata update by OpenHands on behalf of ak684.

ak684 added 2 commits May 21, 2026 15:38
Secret-backed env vars on the openhands deployment (sourced via secretKeyRef)
are read only at pod start, and Kubernetes does not restart a pod when a
referenced Secret changes. So changing a secret in the KOTS admin console and
clicking Deploy silently has no effect until a manual pod restart.

Add a checksum/config-secrets pod-template annotation whose value is a
KOTS-rendered sha256 of every secret (password) config value. When any secret
changes, the hash changes, the pod template changes, and the pod rolls so the
new value is loaded. Only the hash crosses into the chart, never the secrets.

Tradeoff: this restarts the openhands pod on any secret change, including ones
that only affect other components (for example an LLM key used only by LiteLLM).
A controller like Reloader would restart only the affected workload.
Adds scripts/check_secret_checksum.py and a workflow that fails when a
password-type field in config.yaml is not part of secretsChecksum in
openhands.yaml. Without this, a new secret could be added without wiring it into
the restart checksum, reintroducing the silent no-restart bug. Same idea as the
chart-version-bump check: keep two related things in sync.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves a real production problem with a pragmatic approach, though with some complexity tradeoffs.

[IMPROVEMENT OPPORTUNITIES]

The coarse-grained checksum restarts the openhands pod for any secret change, including secrets that only affect other components (like LiteLLM keys). The PR description acknowledges this tradeoff vs. Stakater Reloader. Consider documenting this behavior in the chart README or values.yaml comments so operators understand the restart scope.

The 656-character single-line secretsChecksum could benefit from structure (logical grouping or multi-line formatting in the comment above it) to make future additions clearer, even though the CI check enforces correctness.

[RISK ASSESSMENT]
🟡 MEDIUM - The change is additive and low-risk from a breakage perspective. The main operational risk is unexpected pod restarts when changing unrelated secrets. The CI guard prevents silent failures (the original bug), which significantly reduces the risk of reintroducing the problem.

[VERDICT]
Worth merging - The CI guard is excellent defensive engineering. The checksum approach is simpler than adding a controller dependency, and the tradeoff is reasonable for the deployment scale.


Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/OpenHands-Cloud/actions/runs/26248820483

Comment thread replicated/openhands.yaml Outdated
# annotation (checksum/config-secrets) and forces a rollout so secret-backed
# env vars are reloaded. Keep this in sync with the password fields in
# config.yaml; CI (check-secret-checksum) fails if a password field is missing here.
secretsChecksum: 'repl{{ sha256sum (printf "%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s" (ConfigOption "anthropic_api_key") (ConfigOption "openai_api_key") (ConfigOption "google_gemini_api_key") (ConfigOption "deepseek_api_key") (ConfigOption "mistral_api_key") (ConfigOption "azure_api_key") (ConfigOption "azure_client_secret") (ConfigOption "groq_api_key") (ConfigOption "openrouter_api_key") (ConfigOption "aws_secret_access_key") (ConfigOption "custom_api_key") (ConfigOption "admin_password") (ConfigOption "postgres_password") (ConfigOption "redis_password") (ConfigOption "jwt_secret") (ConfigOption "keycloak_admin_password") (ConfigOption "keycloak_client_secret") (ConfigOption "litellm_api_key") (ConfigOption "default_api_key") (ConfigOption "sandbox_api_key") (ConfigOption "keycloak_smtp_password") (ConfigOption "plugin_directory_identity_shared_secret") (ConfigOption "plugin_directory_session_secret") (ConfigOption "automation_service_key") (ConfigOption "automation_webhook_secret") (ConfigOption "bitbucket_data_center_client_secret") (ConfigOption "github_oauth_client_secret") (ConfigOption "github_app_webhook_secret") (ConfigOption "gitlab_oauth_client_secret") (ConfigOption "slack_client_secret") (ConfigOption "slack_signing_secret") (ConfigOption "external_postgres_password") (ConfigOption "laminar_project_api_key")) }}'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: This 656-character line with 33 ConfigOption calls is functionally correct but hard to scan. Consider adding a structured comment above it grouping secrets by category (LLM keys, auth secrets, infra passwords, etc.) to make future additions easier to review, even though the CI check enforces completeness.

Comment thread scripts/check_secret_checksum.py Outdated
Comment on lines +29 to +35
last_name = None
for line in config_lines:
name_match = re.match(r"\s*- name:\s*(\S+)", line)
if name_match:
last_name = name_match.group(1).strip("\"'")
elif re.match(r"\s*type:\s*password\b", line) and last_name:
password_fields.append(last_name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The parser assumes name: and type: appear on separate lines with type: coming after name:. If someone formats config.yaml inline (e.g., name: foo, type: password), this could miss fields. Consider adding a test case with different YAML formatting, or use a proper YAML parser like ruamel.yaml to be more robust.

@@ -95,6 +95,12 @@ bitbucket:
enterpriseSSO:
enabled: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The comment explains the mechanism well, but consider adding a note about the restart scope: 'Changing any secret in the KOTS config restarts the openhands pod, even if that secret only affects other components (e.g., LiteLLM keys).' This helps operators understand the blast radius when rotating credentials.

ak684 added 4 commits May 21, 2026 16:02
- check_secret_checksum.py: parse config.yaml and openhands.yaml with a real
  YAML parser instead of line matching, so reformatting cannot make the guard
  silently miss a field. Adds pyyaml to the workflow.
- workflow: also run on push to main, so drift is surfaced right away even
  though this is not a required status check.
- values.yaml: note the restart blast radius (any secret change restarts the
  openhands pod, including secrets that only affect other components).
- openhands.yaml: group the hashed fields by category in the comment.
main is now at 0.7.23 (from #649), so this PR's 0.7.23 would collide with the
chart-version check. Bump to 0.7.24. Also remove the push:main trigger on the
secret-checksum guard: a non-required check failing on main is noise, the real
enforcement is making it a required (ideally strict) status check.
#649 merged the bot_token config field to main, so it has to be in the secret
checksum, otherwise a bot-token change would not restart the pod (and the guard
would flag it against current main). Adds it to the hash.
…checksum

# Conflicts:
#	charts/openhands/Chart.yaml

@jlav jlav left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just double check that the checksum is applied anywhere that references these secrets? There may be some other deployments besides openhands that use them.

ak684 added 2 commits May 21, 2026 17:19
…main

Per review: openhands-integrations and openhands-mcp also read the secret env
(via the shared openhands.env include), so they need the same
checksum/config-secrets annotation to roll when a secret changes. CronJobs are
skipped: each run is a fresh pod that re-reads the secret, so they pick up
changes on their next run. Also restore the push:main trigger so the guard runs
on main after merge, matching the chart-version checks.
@ak684 ak684 merged commit 9a25c87 into main May 21, 2026
6 checks passed
@ak684 ak684 deleted the alona/secret-restart-checksum branch May 21, 2026 21:47
@ak684 ak684 added the customer-support Issues and PRs related to customer support requests label May 26, 2026
@ak684 ak684 changed the title Restart openhands on secret config changes (checksum + CI guard) [PLTF-2701] Restart openhands on secret config changes (checksum + CI guard) May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

customer-support Issues and PRs related to customer support requests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants