
Health check / liveness endpoint / IsAlive function does not support multiple concurrent callers #40

@james-johnston-thumbtack

Description

If two clients call the /liveness route on the REST API concurrently, one of them times out. This is easy to reproduce from the command line. Note the & after the first curl command, which runs it in the background alongside the second curl command. (For me, localhost:17303 is the REST API for kafka scheduler.)

% ( time curl -i http://localhost:17303/liveness & ; time curl -i http://localhost:17303/liveness )
HTTP/1.1 200 OK
Date: Thu, 15 Sep 2022 03:02:51 GMT
Content-Length: 0

curl -i http://localhost:17303/liveness  0.00s user 0.01s system 0% cpu 2.351 total
HTTP/1.1 500 Internal Server Error
Date: Thu, 15 Sep 2022 03:02:53 GMT
Content-Length: 0

curl -i http://localhost:17303/liveness  0.00s user 0.01s system 0% cpu 5.024 total

The first request completes successfully, as expected. The second one times out after about five seconds and returns a 500. The server logs show a line like:

[00] ERRO[2022-09-15T03:02:53Z] timeout for isalive probe from liveness channel

This is a problem when several components of a distributed system check health at the same time. For example, the AWS target health checks documentation for Network Load Balancers points out that "Health checks for a Network Load Balancer are distributed and use a consensus mechanism to determine target health. Therefore, targets receive more than the configured number of health checks."

As best I can tell, the issue is that the IsAlive function fakes a scheduled message from Kafka here:

case storeEvents <- isAliveSchedule(epoch):

That fake schedule has a hard-coded ID of ||is-alive||:
const isAliveID string = "||is-alive||"

Two concurrent calls to IsAlive will result in two timers with the same ID being added. But the timer code deduplicates those using the ID:
// if found, we stop the existing timer
if t, ok := ts.items[s.ID()]; ok {
Since the ID is hard-coded, the second IsAlive call stomps on the first call's entry in timers, so only one timer event is ever delivered on livenessChan and the other caller waits until it times out.
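To illustrate the suspected mechanism, here is a minimal, self-contained sketch. It is not the scheduler's actual code: timerStore, add, fire, and eventsDelivered are hypothetical names, and the store is reduced to a map keyed by schedule ID. It only shows that two probes sharing one ID produce a single event, while probes with distinct IDs (one possible direction for a fix, also an assumption) produce one event each.

```go
package main

import "fmt"

// timerStore is a hypothetical stand-in for the scheduler's timer collection.
// Like the snippet quoted in this issue, it keys pending entries by schedule ID.
type timerStore struct {
	items map[string]struct{}
}

func newTimerStore() *timerStore {
	return &timerStore{items: make(map[string]struct{})}
}

// add registers a pending probe. An entry with the same ID replaces the
// existing one, mirroring the deduplication described above.
func (ts *timerStore) add(id string) {
	ts.items[id] = struct{}{}
}

// fire emits one event per surviving entry on out and clears the store.
func (ts *timerStore) fire(out chan<- string) {
	for id := range ts.items {
		out <- id
		delete(ts.items, id)
	}
}

// eventsDelivered simulates one probe per ID and reports how many events
// reach the (simulated) liveness channel.
func eventsDelivered(ids []string) int {
	ts := newTimerStore()
	liveness := make(chan string, len(ids)) // buffered so fire never blocks
	for _, id := range ids {
		ts.add(id)
	}
	ts.fire(liveness)
	return len(liveness)
}

func main() {
	// Two concurrent probes sharing the hard-coded ID: one event for two waiters.
	fmt.Println(eventsDelivered([]string{"||is-alive||", "||is-alive||"}))

	// Hypothetical per-call IDs: each waiter gets its own event.
	fmt.Println(eventsDelivered([]string{"||is-alive||#1", "||is-alive||#2"}))
}
```

With the shared ID, only one event is available for two waiting callers, which matches the observed timeout of the second request. Whether per-call IDs are the right fix depends on how the rest of the scheduler treats the ||is-alive|| ID, which this sketch does not model.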
