Description
Currently, the liveness and readiness probes are configured with initialDelaySeconds set to 30s, which is a fairly high value. However, after a container crash, the Pyroscope server may need even longer to recover the storage (the duration is hard to estimate, but a minute or two is what we may expect).
A proper solution would be to implement the readiness and liveness checks separately:
- the liveness probe starts serving requests at the very beginning of server initialisation (before any other component)
- the readiness probe starts serving requests only once all the components have finished initialising
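The split described above could look roughly like the following Go sketch. The `Health` type, its methods, and the `/livez` and `/readyz` paths are hypothetical names chosen for illustration, not Pyroscope's actual code. The only point is that liveness answers as soon as the listener is up, while readiness is gated on a flag that is set after all components have initialised.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// Health tracks whether server initialisation has finished.
// Hypothetical type: not part of Pyroscope's API.
type Health struct {
	ready atomic.Bool
}

// MarkReady is meant to be called once every component has initialised.
func (h *Health) MarkReady() { h.ready.Store(true) }

// LivenessStatus is always 200: the process is up and able to answer.
func (h *Health) LivenessStatus() int { return http.StatusOK }

// ReadinessStatus is 503 until MarkReady has been called, then 200.
func (h *Health) ReadinessStatus() int {
	if h.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// Liveness serves the liveness check.
func (h *Health) Liveness(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(h.LivenessStatus())
}

// Readiness serves the readiness check.
func (h *Health) Readiness(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(h.ReadinessStatus())
}

// NewHealthMux registers both checks on hypothetical paths. The mux would
// have to start listening before storage recovery begins.
func NewHealthMux(h *Health) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/livez", h.Liveness)
	mux.HandleFunc("/readyz", h.Readiness)
	return mux
}
```

With something like this in place, the liveness probe would not need a long initial delay at all, and the readiness probe could poll from the start without routing traffic to a server that is still recovering.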
Increasing initialDelaySeconds further by default for the readiness probe might be undesirable, because it would introduce a noticeable unconditional delay between server start and the moment the server actually starts serving requests.
As a workaround, I think we may adjust the default settings so that the pod has at least 90s to finish initialisation, without preventing the server from handling requests if it manages to complete initialisation sooner:
readinessProbe:
  enabled: true
  httpGet:
    path: /healthz
    port: 4040
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 30
  failureThreshold: 10
  successThreshold: 1
# Despite the fact that the initial delay is 60 seconds, if the pod crashes
# after initialisation (this is the only realistic reason why the probe may fail),
# it will be restarted.
#
# Note that livenessProbe does not wait for readinessProbe to succeed.
livenessProbe:
  enabled: true
  httpGet:
    path: /healthz
    port: 4040
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 30
  failureThreshold: 3
  successThreshold: 1
The current config:
readinessProbe:
  # -- Enable Pyroscope server readiness
  enabled: true
  httpGet:
    # -- Pyroscope server readiness check path
    path: /healthz
    # -- Pyroscope server readiness check port
    port: 4040
  # -- Pyroscope server readiness initial delay in seconds
  initialDelaySeconds: 30
  # -- Pyroscope server readiness check frequency in seconds
  periodSeconds: 5
  # -- Pyroscope server readiness check request timeout
  timeoutSeconds: 30
  # -- Pyroscope server readiness check failure threshold count
  failureThreshold: 3
  # -- Pyroscope server readiness check success threshold count
  successThreshold: 1
livenessProbe:
  # -- Enable Pyroscope server liveness
  enabled: true
  httpGet:
    # -- Pyroscope server liveness check path
    path: /healthz
    # -- Pyroscope server liveness check port
    port: 4040
  # -- Pyroscope server liveness check initial delay in seconds
  initialDelaySeconds: 30
  # -- Pyroscope server liveness check frequency in seconds
  periodSeconds: 15
  # -- Pyroscope server liveness check request timeout
  timeoutSeconds: 30
  # -- Pyroscope server liveness check failure threshold
  failureThreshold: 3
  # -- Pyroscope server liveness check success threshold
  successThreshold: 1
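
To compare the two configurations, here is a small Go sketch that estimates how long a container that never passes a probe is tolerated, using initialDelaySeconds + failureThreshold × periodSeconds. This formula is an assumption: it reproduces the 90s figure for the proposed liveness probe, but actual kubelet timing can differ by roughly one period and also depends on probe timeouts. The `Probe` type and its methods are hypothetical helpers, not part of any Kubernetes library.

```go
package main

// Probe mirrors the timing fields of a probe definition (all in seconds).
// Hypothetical helper type for illustration only.
type Probe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// Budget is a rough estimate of how long a never-passing container is
// tolerated before the probe is considered failed (assumed formula).
func (p Probe) Budget() int {
	return p.InitialDelaySeconds + p.FailureThreshold*p.PeriodSeconds
}

// EarliestSuccess is the earliest moment the probe can first succeed.
func (p Probe) EarliestSuccess() int {
	return p.InitialDelaySeconds
}

// Values copied from the proposed and current configurations above.
var (
	ProposedReadiness = Probe{InitialDelaySeconds: 5, PeriodSeconds: 10, FailureThreshold: 10}
	ProposedLiveness  = Probe{InitialDelaySeconds: 60, PeriodSeconds: 10, FailureThreshold: 3}
	CurrentReadiness  = Probe{InitialDelaySeconds: 30, PeriodSeconds: 5, FailureThreshold: 3}
	CurrentLiveness   = Probe{InitialDelaySeconds: 30, PeriodSeconds: 15, FailureThreshold: 3}
)
```

Under this estimate the liveness budget grows from 75s to 90s, while the earliest point at which the pod can become ready drops from 30s to 5s.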