Conversation
It feels a little janky to return a 500 status code purely as a function of the response content here. Could the health check tool use other information from the response instead? E.g.
Hey @jackie-ob, setting the HTTP status code >= 400 is the standard way to signal that a service is unhealthy (in AWS ELB and k8s):
In the case of ELB I don't think the response body can be evaluated to determine the service health. You are also already using this technique in other places, see https://github.com/Netflix/metaflow-service/blob/master/services/metadata_service/api/admin.py#L59. Please let me know if this PR makes sense or if I should change something.
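For illustration, a Kubernetes liveness probe keyed on the status code might look like the sketch below. The `/ping` path and port are placeholders for whatever the service actually exposes; the kubelet treats any HTTP status outside 200-399 as a failed probe and restarts the container after `failureThreshold` consecutive failures.

```yaml
# Sketch only: path and port are placeholders, not the service's real config.
livenessProbe:
  httpGet:
    path: /ping        # hypothetical health-check endpoint
    port: 8080         # placeholder service port
  periodSeconds: 10    # probe every 10s
  failureThreshold: 3  # restart after 3 consecutive failures
```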
What issues were you seeing in the cache? I believe they may be addressed in #327, which should keep the caches alive.
@ruial we are planning to ship #327 today, as @romain-intel mentioned. Would you like to check whether that makes the need for this PR go away altogether? If it is still a problem, our preference would be to create a new endpoint (similar to the one you referenced).
Sure, I can try the new version, though I think the fix from https://github.com/Netflix/metaflow-service/pull/292/files#diff-17bee700a8b2eb91cc38dcb3d7e30cd88769966fdb21c724393a5746f664d3cf is not included. A new endpoint in the future for health checking the cache (and maybe the database) would also work. The current issue I have is that all the cache worker subprocesses get terminated and don't restart, so everything in the UI errors out. Everything else with Metaflow works well; the only issues we have currently are with the cache. Thank you :)
This endpoint always returns status 200, even if one or more caches are not alive. With this change, if one of the caches is dead, the endpoint returns status 500, which health-checking tools can then use to restart the service.
I had previously monkey-patched #292 into the service I maintain, but I still saw cases where one or more caches were not restarted correctly. This is hard to reproduce and happens from time to time. With this change, I no longer need to restart the Metaflow service manually.
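The status-code logic described above can be sketched as a small helper. This is illustrative only: `health_status` and the `caches` mapping are hypothetical names, not the service's actual API; the real endpoint would inspect its cache worker subprocesses and return the status through its web framework's response object.

```python
def health_status(caches):
    """Map cache liveness to an HTTP status for a health-check endpoint.

    `caches` is a hypothetical mapping of cache name -> alive (bool);
    the real service would check its cache worker subprocesses instead.
    """
    dead = sorted(name for name, alive in caches.items() if not alive)
    if dead:
        # One or more dead caches: return 500 so an external health
        # checker (ELB, k8s) sees the service as unhealthy and restarts it.
        return 500, {"status": "unhealthy", "dead_caches": dead}
    # All caches alive: healthy.
    return 200, {"status": "ok"}
```

The key design point from the discussion is that the signal lives in the status code, not the body, since tools like ELB cannot evaluate the response body.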