Description
This is a feature request.
As EG becomes one of the core microservices in a scalable Jupyter deployment, reliability should be a requirement for EG.
There are many ways to improve reliability, such as HA support and session persistence, but I think the easiest is to recover the desired state by restarting EG when it crashes.
If EG exposes its liveness status via a `/healthz` endpoint, we can easily diagnose the status of EG and restart it when it is not healthy.
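To make the idea concrete, here is a minimal sketch of what a `/healthz` endpoint could look like. It uses only the Python standard library so it runs on its own; it is not EG's actual code, and a real implementation would presumably be a handler inside EG's existing web server. `check_health`, `HealthzHandler`, and `start_health_server` are hypothetical names, and the 200/503 convention is an assumption about how healthy and unhealthy states could be reported.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def check_health():
    """Hypothetical placeholder for the real checks (to be discussed).

    Returns a (healthy, details) tuple.
    """
    return True, {"status": "ok"}


class HealthzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy, details = check_health()
        body = json.dumps(details).encode("utf-8")
        # Assumption: 200 means healthy, 503 means the process should be restarted.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep frequent probe requests out of the log.
        pass


def start_health_server(host="127.0.0.1", port=0):
    """Serve /healthz on a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), HealthzHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this in place, `curl http://<host>:<port>/healthz` would return a small JSON body, and the status code alone is enough for an external checker to decide whether a restart is needed.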
Of course, enterprise-grade cluster platforms already provide good automation for this kind of recovery, such as Kubernetes Container Probes.
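For environments without such automation, the same check can be approximated by a small watchdog. The sketch below is only an illustration of the probing logic, not a replacement for Kubernetes probes: `probe_once` and `wait_until_unhealthy` are hypothetical helpers, the parameter names loosely mirror the Kubernetes probe fields `failureThreshold` and `periodSeconds`, and the "2xx or 3xx counts as success" rule follows how Kubernetes HTTP probes interpret status codes. What to do once the function returns (for example, restarting the EG process) is left to the caller.

```python
import time
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen


def probe_once(url, timeout=1.0):
    """Return True if the endpoint answers with a 2xx or 3xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, HTTPException, OSError):
        # Connection refused, timeout, or a 4xx/5xx response all count as failure.
        return False


def wait_until_unhealthy(url, failure_threshold=3, period_seconds=10.0,
                         probe=probe_once, sleep=time.sleep):
    """Block until `failure_threshold` consecutive probes have failed."""
    failures = 0
    while True:
        # A single success resets the counter, so only consecutive failures count.
        failures = 0 if probe(url) else failures + 1
        if failures >= failure_threshold:
            return
        sleep(period_seconds)
```

Requiring several consecutive failures avoids restarting EG on a single slow response.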
If folks give this idea a thumbs up, I'd like to discuss which unhealthy conditions can be tracked in EG, and how.