
A second health GRPC endpoint to differentiate liveness/readiness checks (e.g., Kubernetes) #53

@wekb

NOTE: I have a change I can submit for review, given branch access.

TL;DR: This PR proposes a second GRPC health endpoint for use with HTTP->GRPC transcoders running within Kubernetes. Kubernetes supports separate liveness and readiness probes, but in HTTP->GRPC transcoding use cases, two separate GRPC methods must be registered so that the two probes can behave differently (see the Kubernetes probe definitions).

With grpc-gateway and Envoy, for example, transcoding maps each HTTP endpoint 1:1 to a GRPC method; it cannot multiplex two HTTP endpoints onto a single GRPC method.
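To illustrate the 1:1 constraint, here is a minimal, stdlib-only Go sketch that models a transcoder as a route table from each HTTP path to exactly one GRPC method name. It does not use grpc-gateway or Envoy; the `/healthz` and `/readyz` paths and the `/grpc.health.v1.Readiness/Check` method name are hypothetical (only `/grpc.health.v1.Health/Check` is the standard health method). In this model the method sees only the decoded request, never the HTTP path, so the two probes can disagree only when they are bound to two different methods.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// rpcMethod stands in for a unary GRPC method. In this model the method
// receives only the decoded request (here, the "service" field), not the
// original HTTP path.
type rpcMethod func(service string) string

// routes models a 1:1 transcoding table: each HTTP path is bound to exactly
// one GRPC full method name. Paths and the Readiness method are hypothetical.
var routes = map[string]string{
	"/healthz": "/grpc.health.v1.Health/Check",
	"/readyz":  "/grpc.health.v1.Readiness/Check",
}

// newTranscoder forwards each request to the single method bound to its
// path and maps the returned status to an HTTP status code.
func newTranscoder(methods map[string]rpcMethod) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		name, ok := routes[r.URL.Path]
		if !ok {
			http.NotFound(w, r)
			return
		}
		method, ok := methods[name]
		if !ok {
			http.Error(w, "unimplemented", http.StatusNotImplemented)
			return
		}
		status := method(r.URL.Query().Get("service"))
		if status != "SERVING" {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		fmt.Fprint(w, status)
	})
}

// probe issues an in-memory GET against h and returns the code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	// Two distinct methods let the probes disagree: the process is alive,
	// but (say) a dependency is down, so it is not ready.
	h := newTranscoder(map[string]rpcMethod{
		"/grpc.health.v1.Health/Check":    func(string) string { return "SERVING" },
		"/grpc.health.v1.Readiness/Check": func(string) string { return "NOT_SERVING" },
	})
	for _, p := range []string{"/healthz", "/readyz"} {
		code, body := probe(h, p)
		fmt.Println(p, code, body)
	}
}
```

If only the Health method were registered, `/readyz` would have nowhere distinct to go, which is the gap the proposal fills.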

One alternative is to add a stub HTTP service to an otherwise pure GRPC service; however, the liveness vs. readiness business logic would then have to be duplicated in the HTTP layer rather than living in the parallel GRPC layer.

The proposed change essentially duplicates the Health GRPC service and exposes it under the new "Readiness" name. This approach was chosen for its simplicity and small change footprint.

An alternative approach would be to add two new endpoints, one for Liveness and one for Readiness, each perhaps with its own (duplicate) Request and Response messages. That offers a certain naming cleanliness, but it seems overly complex.

Feedback welcome!

Quick summary of liveness vs readiness use cases:

  • A liveness check should be used to verify the application/service itself
    is up and serving GRPC.
    IMPORTANT: Kubernetes will restart containers that fail a liveness check.
  • A readiness check should be used to verify that the application/service
    has completed loading any prerequisite data, and that the service's
    upstream dependencies (e.g., MySQL, Kafka) are available and working.
    IMPORTANT: Kubernetes will remove the Pod from the Service's
    load-balanced endpoints, but will not restart the container.

In other words, a failed liveness check signals a permanent failure, while a
failed readiness check signals a transient one, and the application-side
implementations of these checks should differ accordingly.
