Skip to content

Conversation

@abg
Copy link
Member

@abg abg commented Dec 23, 2025

Problem

When BOSH recreates a VM during cluster quorum loss, pre-start would fail before monit services were configured. This prevented the bootstrap errand from running (as bootstrap requires galera-agent, monit services available), creating a deadlock where operators couldn't restore quorum without manual recovery steps.

Solution

Move health validation from pre-start to post-start. This allows monit services to be configured even if MySQL can't join the cluster, enabling the bootstrap errand to run and restore quorum.

Testing

  • Added e2e test that validates the fix
  • Added unit test for related bug fix in StopService

abg added 5 commits December 23, 2025 17:31
StopService was writing SINGLE_NODE to the state file before stopping
the service. This is semantically incorrect - stopping a service should
not modify the recorded cluster state.

Added regression test to verify state file is not modified. Fixed typos
in test descriptions while editing the file.

[TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
- Add bosh.Recreate(deployment, instance) helper
- bosh.Instances now exposes ProcessState

[TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
- Minor refactoring to cleanup test setup

  Consolidates BeforeAll and AfterAll nodes, leveraging DeferCleanup to
  setup cleanup operations next to allocation operations.

[TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
Adds a more complex bootstrap scenario by forcing a cluster to lose
quorum and subsequently triggering a node recreate.  This emulates real
production scenarios where a cluster fails and an unresponsive node is
recreated.

The expectation is that after the work in the current story, "bootstrap"
can be trivially run to restore a working cluster.

As of this commit, this is a failing test because mysql fails in
pre-start and the bosh instance is left in a state that bootstrap cannot
trivially manage the mysql instances.

[TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
Problem:
When BOSH recreates a VM during cluster quorum loss, pre-start would
fail before monit services were configured. This prevented the bootstrap
errand from running (it requires galera-agent to be available), creating
a deadlock where operators couldn't restore quorum.

Solution:
Move health validation to post-start. Now pre-start completes
successfully, monit services are configured, and the bootstrap errand
can run even if post-start validation fails due to lack of quorum.

The post-start validation logic is identical to the previous pre-start
logic: poll port 8114 until galera-init reports healthy, or fail if the
BPM process dies.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant