Implement Circuit-breaker assistance for recovery of a failing Akka Cluster #13

**Implement Circuit-breaker at appropriate inter-service interactions**
The two microservices, the API and the Backend, talk over gRPC. In production, Thingverse would typically be installed on a Kubernetes cluster. We leverage the Linkerd service mesh for TLS offloading, retries, and load balancing.

As discussed in #12, a network partition is a dreaded situation where things can spiral out of control. So we need to be calculated and conservative with retry logic, so that retries do not further degrade an already precarious situation.
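To make "conservative" concrete, here is a minimal sketch of a bounded retry with exponential backoff and jitter. It is only an illustration in Python, not Thingverse code and not how Linkerd implements retries; the function name `retry_with_backoff` and all of its parameters and defaults are assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on failure at most max_attempts times in total.

    All names and defaults here are hypothetical. The delay before each
    retry is a random fraction of an exponentially growing, capped value,
    so clients do not hammer a struggling cluster in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # "full jitter": wait somewhere in [0, cap)
```

The hard cap on attempts is the important part: during a partition, unbounded retries multiply the load on the nodes that are still reachable.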

**What could be done**
I believe we could start with conservative retry mechanisms to deal with the situation. Beyond that, we should cut off inbound traffic to the failing nodes altogether and see whether the remaining nodes can handle requests, provided the rest of the cluster is still in a healthy state. At the moment, we are focusing on handling retries at the service mesh layer, but we need to design with the Omnibus release in mind, which would run outside of K8s (2.x release train?).
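For the Omnibus case, where there is no service mesh to lean on, the "cut off traffic to failing nodes" idea could be sketched as an application-level circuit breaker. The following is a minimal, single-threaded illustration in Python, not a proposed implementation; `CircuitBreaker`, `CircuitOpenError`, and the parameters `max_failures` and `reset_timeout` are hypothetical names for this example. (Akka also ships a circuit breaker in `akka.pattern`, which may be the more natural choice on the JVM side.)

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical per-node breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a probe call
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of sending more traffic to a failing node.
            raise CircuitOpenError("node is marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open and restart the timer
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would keep one breaker per backend node and route around any node whose breaker is open. After `reset_timeout` elapses, a single probe call decides whether the node rejoins the pool or stays cut off. A real version would need locking so that only one probe is in flight in the half-open state.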