Automatic open #7003

cjen1-msft · 2025-05-09T18:12:44Z

cjen1-msft
May 9, 2025
Maintainer

Extending the discussion from #6985, another automation direction is to add the capacity for a CCF network to automatically recover from a disaster.

The aim is to give the system an automated process for picking the 'best' recovering node when multiple nodes recover, and then to have all the others try to join that node using the auto-join protocol described in #6985.

Assumptions

There is an external scheduler which can restart nodes, and this scheduler restarts the nodes on the same hardware as previously, allowing for local unsealing.

This scheduler is relatively simple and just restarts the cluster when a sufficient time has passed without the network being live.

Protocol

For a more precise specification of this protocol, see the TLA+ and Stateright models in the modelling-autoopen branch of cjen1-msft's CCF fork.

When nodes restart they try to Join the previous active network.
If they are unsuccessful they switch to Recovering and perform the following protocol.

Nodes periodically broadcast the highest transaction id (txid) in their ledger, effectively gossiping the state of their ledgers.

When a node has heard from each of the other nodes it expects to be in its cluster, or a timeout has fired, it sends a vote to the node with the highest txid, with ties broken by the node's identity.
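
As a minimal sketch of the voting rule above (not the CCF implementation), the following assumes txids are comparable integers as in the diagrams, that node identities are strings, and that a tie goes to the larger identity. The tie-break direction and the names `ready_to_vote` and `choose_vote_target` are illustrative assumptions.

```python
from typing import Dict, Iterable


def ready_to_vote(heard: Dict[str, int], expected: Iterable[str], timed_out: bool) -> bool:
    """A node votes once it has heard from every expected node, or on timeout."""
    return timed_out or set(expected) <= set(heard)


def choose_vote_target(heard: Dict[str, int]) -> str:
    """Pick the node to vote for from the gossip received so far.

    `heard` maps node id -> highest txid gossiped by that node and is
    assumed to include the voter's own entry.
    """
    if not heard:
        raise ValueError("a node always knows at least its own txid")
    # Highest txid wins. Ties go to the larger node id, which is an
    # assumed direction: the proposal only says ties are broken by identity.
    return max(heard, key=lambda node_id: (heard[node_id], node_id))
```

For example, with gossip `{"N1": 1, "N2": 2, "N3": 3}` every node would vote for `N3`, matching the successful path.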

If any node receives votes from a majority of nodes it notifies all other nodes that it is transitioning to Open, and then transitions to Open.
The other nodes, on receiving this notification, restart themselves using a temporary Join configuration file that points to the opened node's network.
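
The majority check could be sketched as follows. This is an illustration only: `elected_node` is a hypothetical helper, and it assumes the majority is taken over the full expected cluster size rather than over the nodes that happened to vote.

```python
from collections import Counter
from typing import Dict, Optional


def elected_node(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the node that may transition to Open, or None if no node can.

    `votes` maps voter id -> candidate id. A candidate needs votes from a
    strict majority of the whole cluster (assumed), not just of the voters.
    """
    if not votes:
        return None
    majority = cluster_size // 2 + 1
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= majority else None
```

With three nodes, two votes for `N3` are already enough for it to open, so the outcome does not depend on whether `N3` votes for itself.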

Successful path

sequenceDiagram
  participant N1
  participant N2
  participant N3
  
  Note over N1: Locally unseal secrets, txid = 1
  Note over N2: Locally unseal secrets, txid = 2
  Note over N3: Locally unseal secrets, txid = 3

  Note over N1, N3: Gossip

  N1 ->> N2: Gossip(txid=1)
  N1 ->> N3: Gossip(txid=1)

  N2 ->> N1: Gossip(txid=2)
  N2 ->> N3: Gossip(txid=2)

  N3 ->> N1: Gossip(txid=3)
  N3 ->> N2: Gossip(txid=3)

  Note over N1, N3: VoteTimeout
  
  N1 ->> N3: Vote
  N2 ->> N3: Vote

  Note over N1, N3: Open/Join
  
  Note over N3: Open
  N3 ->> N1: Open  
  Note over N1: Join N3

  N3 ->> N2: Open
  Note over N2: Join N3

Deadlock cases

So long as a majority of nodes vote for the same node, the system will open using that node and will not deadlock.
Equivalently, even if some nodes are dead, so that the timeout must fire, deadlock is avoided provided a majority of the nodes have gossiped among themselves before the timeout fires.

However, if the timeout triggers before the nodes have sufficiently gossiped, the network could deadlock as shown below.

sequenceDiagram
  participant N1
  participant N2
  participant N3
  participant N4
  participant N5
  
  Note over N1: Locally unseal secrets, txid = 1
  Note over N2: Locally unseal secrets, txid = 2
  Note over N3: Locally unseal secrets, txid = 3
  Note over N4: Locally unseal secrets, txid = 4
  Note over N5: Locally unseal secrets, txid = 5

  Note over N1, N5: Gossip
  N1 ->> N2: Gossip(txid=1)
  N2 ->> N1: Gossip(txid=2)

  N4 ->> N5: Gossip(txid=4)
  N5 ->> N4: Gossip(txid=5)

  Note over N1, N2: Timeout
  N1 ->> N2: Vote
  N2 ->> N2: Vote

  Note over N3: Timeout
  N3 ->> N3: Vote

  Note over N4, N5: Timeout
  N4 ->> N5: Vote
  N5 ->> N5: Vote

  Note over N1, N5: Deadlock
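
The deadlock scenario can be reproduced with a small simulation. This is a sketch under assumptions: integer txids taken from the unseal notes in the diagram, ties broken towards the larger node id, and a hypothetical `run_round` helper in which each node votes using only the gossip that reached it before its timeout.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def run_round(
    txids: Dict[str, int], heard_from: Dict[str, Set[str]]
) -> Tuple[Dict[str, str], Optional[str]]:
    """Simulate one vote round and return (votes, winner or None)."""
    votes = {}
    for node in txids:
        # A node always knows its own ledger, plus whatever gossip arrived.
        known = {node} | heard_from.get(node, set())
        votes[node] = max(known, key=lambda n: (txids[n], n))
    majority = len(txids) // 2 + 1
    winner, count = Counter(votes.values()).most_common(1)[0]
    return votes, (winner if count >= majority else None)


txids = {"N1": 1, "N2": 2, "N3": 3, "N4": 4, "N5": 5}

# Only N1<->N2 and N4<->N5 exchange gossip before the timeout fires.
partial = {"N1": {"N2"}, "N2": {"N1"}, "N4": {"N5"}, "N5": {"N4"}}
votes, winner = run_round(txids, partial)
print(votes, winner)  # N2 and N5 get two votes each, N3 one: no majority

# With complete gossip every node votes for N5 and it can open.
full = {n: set(txids) - {n} for n in txids}
print(run_round(txids, full)[1])
```

The first call yields no winner because three votes are needed in a five-node cluster, which is the deadlock shown in the diagram.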

Alternatives

The key tradeoff here is between the risk of a deadlock (no open networks) and the risk of a fork (multiple open networks).
There are several other points in this tradeoff space.

One alternative is to allow nodes to vote for every node that has a higher txid in its ledger than their own.
This will generally result in multiple nodes opening their networks.

Another alternative is to use several rounds of communication to try to reach consensus on the recovering node (similarly to one of the standard consensus primitives).
Because it uses a consensus primitive, this approach can avoid deadlock and ensure a single open network. However, the delay before recovery is only probabilistically bounded (livelock rather than deadlock), and the implementation complexity is higher.

cjen1-msft · 2025-05-13T10:44:03Z

cjen1-msft
May 13, 2025
Maintainer Author

This scheme still has a possible fork.
If a majority recover and choose a replica, but are then restarted while partitioned from the chosen replica, that replica's network can proceed without intervention, while the restarted replicas can start a fork of their own.
If we also require that the committing quorum of the recovered network is equal to that of the original network, then we should be fork-free, as there would need to be an intersecting node between the committing quorum of a fork and that of the DR'd network.
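
The intersection argument can be checked by brute force for a small cluster. This is only an illustration: the five node names and the helpers `majorities` and `all_pairs_intersect` are hypothetical, and the second check models the fork described above by letting each side commit with a quorum drawn from its own, smaller membership.

```python
from itertools import combinations
from typing import Iterable, List, Set


def majorities(nodes: Iterable[str]) -> List[Set[str]]:
    """All subsets of `nodes` that contain a strict majority."""
    nodes = list(nodes)
    need = len(nodes) // 2 + 1
    return [set(c) for k in range(need, len(nodes) + 1) for c in combinations(nodes, k)]


def all_pairs_intersect(quorums: List[Set[str]]) -> bool:
    """True if every pair of quorums shares at least one node."""
    return all(a & b for a, b in combinations(quorums, 2))


original = ["N1", "N2", "N3", "N4", "N5"]

# Quorums defined over the original membership always overlap.
print(all_pairs_intersect(majorities(original)))

# If the chosen replica N5 may commit alone while the restarted majority
# commits with a quorum of its own members, the two quorums are disjoint.
print(all_pairs_intersect([{"N5"}, {"N1", "N2"}]))
```

The first check holds because two strict majorities of the same set must share a member, which is the intersecting node the argument relies on.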

This all assumes that no manual DR by the operators takes place, that node identities are guaranteed to be unique, and that no logs are truncated.

2 replies

eddyashton May 14, 2025
Maintainer

To make sure I understand, this fork requires a quorum of the nodes to be restarted? So N nodes start in recovery mode, and then N-1 are started again in recovery mode? I think we'd consider that a second DR attempt, and we understand well that any DR attempt is a potential fork if we cannot guarantee that all predecessors are predeceased.

So equivalently: attempting a DR while the origin service survives creates a fork; attempting a second DR while a viable service exists from a first DR creates a fork.

I also think we could mitigate this by pre-populating the initial configuration from the gossiping quorum. If 3 nodes voted you to be their designated lead-recoverer, we expect them to shortly arrive as Joiners. If they don't (because they've been partitioned), then we could make it so the original recovery is unable to make progress?

cjen1-msft May 20, 2025
Maintainer Author

Yes, approximately.
I think the 'correct' fix is something like the following:

1. A node is chosen.
2. The other nodes restart and start to join it.
3. Once a majority of the original cluster are in 'joining', transition to open.
   - Pad the configuration with 'bad' nodes which are marked for retirement, ensuring that a majority of the 'intrinsic identities' of the configuration are required to commit, preventing a fork.
4. Now the remainder of the nodes join the new cluster.

I think this should also be fairly minimal as changes go.
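
One possible reading of the padding step, sketched with hypothetical names: the opened configuration is kept at the size of the original cluster by adding placeholder entries for nodes that have not yet joined, so the commit quorum stays a majority of the original size. How CCF would actually represent such retiring placeholders is not specified here, and `padded_configuration` is purely illustrative.

```python
from typing import Iterable, List, Tuple


def padded_configuration(original: Iterable[str], joined: Iterable[str]) -> Tuple[List[str], int]:
    """Return (configuration, quorum size) under the assumed padding scheme.

    Placeholders stand in for original nodes that have not joined yet. They
    can never acknowledge anything, so every commit needs real nodes only.
    """
    original, joined = set(original), set(joined)
    missing = sorted(original - joined)
    placeholders = [f"retiring-placeholder-for-{n}" for n in missing]
    config = sorted(joined) + placeholders
    quorum = len(config) // 2 + 1
    return config, quorum


config, quorum = padded_configuration(
    ["N1", "N2", "N3", "N4", "N5"], ["N3", "N4", "N5"]
)
print(config, quorum)
```

In this example three real nodes have joined a five-entry configuration, so the quorum of three requires all of them, and a partitioned minority cannot commit on its own.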
