Description:
I encountered an issue when using etcd as the service discovery mechanism for a ProtoActor cluster. When a node loses its connection to etcd (for example, due to network fluctuation or while paused at a breakpoint during debugging) and the network is later restored, a goroutine calls startKeepAlive to re-register the lease. Although the lease is renewed successfully, the current node does not seem to be re-added to the members list in etcd.Provider. As a result, the ActorSystem remains active, but the node is no longer part of the cluster.
Steps to Reproduce:
- Start a node and use etcd for cluster service discovery. By default, keepAliveTTL=3s and retryInterval=1s.
    provider, _ = etcd.NewWithConfig(b.Config.ClusterBaseKey, clientv3.Config{
        Endpoints: []string{"example.etcd.addr:2379"},
        Username:  "foo",
        Password:  "bar",
        //DialKeepAliveTime:    10 * time.Second,
        //DialKeepAliveTimeout: 10 * time.Second,
    })
- Disconnect the node from etcd, either through network fluctuation or by pausing it in a debugger.
- After the network recovers, the scheduled goroutine calls startKeepAlive and renews the lease.
- Notice that the members list in etcd.Provider does not include the current node, and as a result, the node's ActorSystem does not rejoin the cluster.
Expected Behavior:
After network recovery, the node should successfully re-register itself via startKeepAlive and be added back to the etcd.Provider members list. The node's ActorSystem should then rejoin the cluster and function normally.
Current Behavior:
The node fails to rejoin the cluster. Even though the lease is renewed, the members list in etcd.Provider is not updated to include the node, which causes the node's ActorSystem to no longer participate in the cluster.
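One mechanism that would be consistent with this symptom, though I have not confirmed it against the provider source, is that etcd deletes every key attached to a lease once that lease expires, and granting a new lease does not bring those keys back. The toy in-memory model below illustrates the effect. The `store` type and its methods are made up for illustration and are not the etcd client API.

```go
package main

import (
	"fmt"
	"sort"
)

// leaseID identifies a lease in the toy store.
type leaseID int

// store is a toy stand-in for etcd: every key is bound to a lease and
// disappears when that lease expires.
type store struct {
	nextLease leaseID
	leases    map[leaseID]bool   // live leases
	keys      map[string]leaseID // key -> owning lease
}

func newStore() *store {
	return &store{leases: map[leaseID]bool{}, keys: map[string]leaseID{}}
}

// grant creates a new live lease.
func (s *store) grant() leaseID {
	s.nextLease++
	s.leases[s.nextLease] = true
	return s.nextLease
}

// put writes a key bound to a lease; it fails if the lease is not live.
func (s *store) put(key string, id leaseID) error {
	if !s.leases[id] {
		return fmt.Errorf("lease %d not found", id)
	}
	s.keys[key] = id
	return nil
}

// expire drops a lease together with every key bound to it.
func (s *store) expire(id leaseID) {
	delete(s.leases, id)
	for k, owner := range s.keys {
		if owner == id {
			delete(s.keys, k)
		}
	}
}

// members lists the registered keys in sorted order.
func (s *store) members() []string {
	out := make([]string, 0, len(s.keys))
	for k := range s.keys {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	s := newStore()
	old := s.grant()
	_ = s.put("/cluster/node-a", old)

	s.expire(old)      // outage lasted longer than the TTL
	fresh := s.grant() // keep-alive loop obtains a new lease
	fmt.Println("after new lease only:", s.members())

	_ = s.put("/cluster/node-a", fresh) // explicit re-registration
	fmt.Println("after re-registration:", s.members())
}
```

The first line printed shows an empty member list even though a live lease exists. The node reappears only after its key is written again under the new lease.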
Environment:
ProtoActor-Goversion: v0.0.0-20240822202345-3c0e61ca19c9etcdversion: v3- Go version: go 1.22.7
Additional Information:
- Would it help if the keepAliveTTL configuration could be customized?
I would appreciate assistance on how to ensure the node can properly rejoin the cluster after network recovery.