Implement Circuit-breaker assistance for recovery of a failing Akka Cluster #13

@arunkpatra

Implement Circuit-breaker at appropriate inter-service interactions
The two microservices, the API and the Backend, talk over gRPC. In production, Thingverse would typically be installed on a Kubernetes cluster. We leverage the Linkerd service mesh for TLS offloading, retries, and load balancing.

As discussed in #12, a network partition is a dreaded situation in which things can spiral out of control. We therefore need to be calculated and conservative with retry logic, so that retries do not further degrade an already precarious situation.
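
To make "conservative" a bit more concrete, here is a minimal sketch of a bounded retry with capped exponential backoff and jitter. It is only an illustration: the class name `ConservativeRetry` and its parameters are hypothetical and not part of Thingverse, and it uses only the JDK rather than Linkerd or Akka APIs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not Thingverse code): bounded retry with capped backoff and jitter. */
final class ConservativeRetry {
    private final int maxAttempts;       // total attempts, including the first call
    private final long baseDelayMillis;  // backoff ceiling before the first retry
    private final long maxDelayMillis;   // hard cap on any single backoff ceiling

    ConservativeRetry(int maxAttempts, long baseDelayMillis, long maxDelayMillis) {
        if (maxAttempts < 1 || baseDelayMillis < 0 || maxDelayMillis < baseDelayMillis) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Backoff ceiling before the given retry (1-based): base * 2^(n-1), capped. */
    long backoffCeilingMillis(int retryNumber) {
        int shift = Math.min(retryNumber - 1, 20); // bound the shift to avoid overflow
        return Math.min(maxDelayMillis, baseDelayMillis << shift);
    }

    /** Runs the operation, retrying on any exception until attempts are exhausted. */
    <T> T call(Callable<T> operation) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // give up: do not keep hammering a struggling backend
                }
                // Full jitter: sleep a random time in [0, ceiling] so callers do not retry in lockstep.
                long ceiling = backoffCeilingMillis(attempt);
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
        throw last;
    }
}
```

The small attempt cap and the delay ceiling are the "conservative" part: the worst-case extra load per request is bounded and known in advance. A real implementation would probably retry only on errors known to be transient rather than on every exception.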

What could be done
I believe we could start with conservative retry mechanisms to deal with the situation. Beyond that, we should cut off inbound traffic to the failing nodes altogether and see whether the remaining nodes can handle requests, provided the cluster is still in a healthy state. At the moment, we are focusing on handling retries at the service mesh layer, but the design needs to keep in mind the Omnibus release, which would run outside of K8s (2.x release train?).
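
For the part that must also work outside of K8s, an application-level circuit breaker is one option. Below is a minimal, JDK-only sketch of the usual closed / open / half-open state machine. The class `SimpleCircuitBreaker` and its parameters are hypothetical and only illustrate the idea; Akka ships its own `akka.pattern.CircuitBreaker`, which would likely be the better starting point for a real implementation.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical sketch (not Thingverse code) of a closed / open / half-open circuit breaker. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;          // consecutive failures that trip the breaker
    private final long resetTimeoutMillis;  // how long to stay open before probing again
    private final LongSupplier clockMillis; // injectable clock, e.g. System::currentTimeMillis

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0L;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clockMillis) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clockMillis = clockMillis;
    }

    /** Current state; an open breaker becomes half-open once the reset timeout has elapsed. */
    synchronized State state() {
        if (state == State.OPEN
                && clockMillis.getAsLong() - openedAtMillis >= resetTimeoutMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the operation unless the breaker is open, in which case it fails fast. */
    <T> T call(Callable<T> operation) throws Exception {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        // Simplification: in HALF_OPEN, concurrent callers may all pass through as probes.
        try {
            T result = operation.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed probe in HALF_OPEN re-opens immediately; otherwise trip at the threshold.
        if (state == State.HALF_OPEN || consecutiveFailures >= maxFailures) {
            state = State.OPEN;
            openedAtMillis = clockMillis.getAsLong();
        }
    }
}
```

While the breaker is open, the API fails fast instead of sending more gRPC calls to a Backend that is already struggling, which is the "cut off inbound traffic" behaviour described above. A breaker created with, say, `new SimpleCircuitBreaker(5, 10_000L, System::currentTimeMillis)` would open after five consecutive failures and allow a probe call ten seconds later; those numbers are placeholders, not recommendations.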

Metadata

Assignees: No one assigned

Labels: enhancement (New feature or request), research (Requires specialized research)
