Description
Is your feature request related to a problem? Please describe.
We have found that periodically (for reasons we still need to root cause) our workers run into a series of errors like:

`Failed running eviction job for run ID 0196d798-a08b-7a00-9082-353865f449b4, continually retrying eviction. Since eviction could not be processed, this worker may not complete and the slot may remain forever used unless it eventually completes.`

Then, hours later, when the pod containing the worker is terminated, we see this log:

`Shutting down workflow worker, but 46 workflow(s) could not be evicted previously, so the shutdown may hang`

This particular worker runs 50 concurrent workflows, which, if I interpret things correctly, means that for several hours the worker was stuck in an infinite loop trying to evict 46 workflows while only able to process 4 workflow tasks at a time.
We would like to be able to detect and alert on these situations more proactively. Usually we end up finding out about them because the worker set scales up to the maximum number of replicas for an extended period of time.
Describe the solution you'd like
Since the code already keeps track of when it is stuck in this eviction-retry loop, I think it would be useful to expose that information as a metric, so that alerting tools can fire when pods have been in that state for whatever the team monitoring the metric determines to be "too long".
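To make the ask concrete, here is a minimal sketch of the kind of setup where such a metric would land, assuming it were exported through the Prometheus endpoint the SDK runtime already supports. The metric name in the comment is purely hypothetical (it's the thing being requested); everything else uses existing `temporalio` APIs as I understand them.

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig
from temporalio.worker import Worker


@workflow.defn
class ExampleWorkflow:
    @workflow.run
    async def run(self) -> str:
        return "done"


async def main() -> None:
    # The SDK already exposes its metrics over Prometheus via the runtime. If
    # the eviction-retry state were published as a gauge, e.g.
    # "temporal_worker_evictions_pending" (hypothetical name), it would show
    # up on this endpoint alongside the existing worker metrics, and teams
    # could alert on something like "value > 0 for 15 minutes".
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
    )
    client = await Client.connect("localhost:7233", runtime=runtime)
    worker = Worker(
        client,
        task_queue="example-task-queue",
        workflows=[ExampleWorkflow],
        max_concurrent_workflow_tasks=50,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```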
Additional context
If the team is bold enough, it could also be nice to do one or more of the following:
- Provide a setting that forces the worker to shut down if it has been in an eviction loop for too long (a rough sketch of what this might look like follows this list).
- Provide more threads than `max_concurrent_workflow_tasks` so that the ability to process workflows isn't as likely to be impeded by the infinite eviction loop.
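As a sketch of what these could look like from the caller's side: the `eviction_loop_shutdown_timeout` parameter below is invented here purely for illustration and does not exist today, while `workflow_task_executor` and `max_concurrent_workflow_tasks` are existing `Worker` parameters, and passing an oversized executor is one possible way to realize the second item.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class ExampleWorkflow:
    @workflow.run
    async def run(self) -> str:
        return "done"


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="example-task-queue",
        workflows=[ExampleWorkflow],
        max_concurrent_workflow_tasks=50,
        # Existing parameter: an executor sized above
        # max_concurrent_workflow_tasks would leave headroom for workflow
        # tasks even while stuck eviction jobs occupy threads.
        workflow_task_executor=ThreadPoolExecutor(max_workers=64),
        # Hypothetical parameter (not in the current API): force the worker
        # to shut down if it has been retrying evictions for this long, e.g.
        # eviction_loop_shutdown_timeout=timedelta(hours=1),
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```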