Feature/improve bootstrap reliability #90

abg · 2025-12-23T23:35:41Z

Problem

When BOSH recreates a VM during cluster quorum loss, pre-start would fail before monit services were configured. This prevented the bootstrap errand from running (as bootstrap requires galera-agent, monit services available), creating a deadlock where operators couldn't restore quorum without manual recovery steps.

Solution

Move health validation from pre-start to post-start. This allows monit services to be configured even if MySQL can't join the cluster, enabling the bootstrap errand to run and restore quorum.

Testing

Added e2e test that validates the fix
Added unit test for related bug fix in StopService

StopService was writing SINGLE_NODE to the state file before stopping the service. This is semantically incorrect - stopping a service should not modify the recorded cluster state. Added regression test to verify state file is not modified. Fixed typos in test descriptions while editing the file. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

- Add bosh.Recreate(deployment, instance) helper - bosh.Instances now exposes ProcessState [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

- Minor refactoring to cleanup test setup Consolidates BeforeAll and AfterAll nodes, leveraging DeferCleanup to setup cleanup operations next to allocation operations. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

Adds a more complex bootstrap scenario by forcing a cluster to lose quorum and subsequently triggering a node recreate. This emulates real production scenarios where a cluster fails and an unresponsive node is recreated. The expectation is that after the work in the current story, "bootstrap" can be trivially run to restore a working cluster. As of this commit, this is a failing test because mysql fails in pre-start and the bosh instance is left in a state that bootstrap cannot trivially manage the mysql instances. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

Problem: When BOSH recreates a VM during cluster quorum loss, pre-start would fail before monit services were configured. This prevented the bootstrap errand from running (it requires galera-agent to be available), creating a deadlock where operators couldn't restore quorum. Solution: Move health validation to post-start. Now pre-start completes successfully, monit services are configured, and the bootstrap errand can run even if post-start validation fails due to lack of quorum. The post-start validation logic is identical to the previous pre-start logic: poll port 8114 until galera-init reports healthy, or fail if the BPM process dies.

abg added 5 commits December 23, 2025 17:31

e2e-tests: improve bosh utilities

9236fb2

- Add bosh.Recreate(deployment, instance) helper - bosh.Instances now exposes ProcessState [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

e2e-tests: refactor bootstrap test

515f90f

- Minor refactoring to cleanup test setup Consolidates BeforeAll and AfterAll nodes, leveraging DeferCleanup to setup cleanup operations next to allocation operations. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

cf-foundation-community-automation bot added this to Foundational Infrastructure Working Group Dec 24, 2025

cf-foundation-community-automation bot moved this to Inbox in Foundational Infrastructure Working Group Dec 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature/improve bootstrap reliability #90

Feature/improve bootstrap reliability #90

abg commented Dec 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Feature/improve bootstrap reliability #90

Are you sure you want to change the base?

Feature/improve bootstrap reliability #90

Conversation

abg commented Dec 23, 2025

Problem

Solution

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant