[Feature Request] Expose SDK metric for worker._could_not_evict_count #875

Open
@millerick

Is your feature request related to a problem? Please describe.

We have found that periodically (for reasons that we still need to root-cause) our workers run into a series of errors like this:

    Failed running eviction job for run ID 0196d798-a08b-7a00-9082-353865f449b4, continually retrying eviction. Since eviction could not be processed, this worker may not complete and the slot may remain forever used unless it eventually completes.

Then, hours later, when the pod containing the worker is terminated, we see this log:

    Shutting down workflow worker, but 46 workflow(s) could not be evicted previously, so the shutdown may hang.

For this particular worker, we run 50 concurrent workflows. If I interpret things correctly, that means that for several hours the worker was in an infinite loop trying to evict 46 workflows and was only able to process 4 workflow tasks at a time.

We would like to be able to detect and alert on these situations more proactively. Usually we end up finding out about them because the worker set scales up to the maximum number of replicas for an extended period of time.

Describe the solution you'd like

The worker already keeps track of how many workflows it is stuck retrying eviction for (the shutdown log above reports that count). I think it would be useful to expose that number as a metric, so that alerting tools can fire when a pod has been in that state for whatever the team monitoring the metric determines to be "too long".
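To make the request concrete, here is a minimal sketch of the idea, not the SDK's actual implementation. The names `Gauge`, `InMemoryGauge`, and `EvictionTracker` are hypothetical and exist only for illustration. The sketch assumes the worker can publish to some gauge-like object every time its stuck-eviction count changes.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Gauge(Protocol):
    """Hypothetical gauge interface: anything with a set(value) method."""

    def set(self, value: int) -> None: ...


@dataclass
class InMemoryGauge:
    """Stand-in metric sink that records every value it is given."""

    values: list[int] = field(default_factory=list)

    def set(self, value: int) -> None:
        self.values.append(value)


class EvictionTracker:
    """Hypothetical wrapper around a stuck-eviction counter.

    Every change to the counter is mirrored to the gauge, so a metrics
    backend always sees the current number of workflows that could not
    be evicted.
    """

    def __init__(self, gauge: Gauge) -> None:
        self._gauge = gauge
        self._stuck = 0
        # Publish an initial 0 so the series exists before any failure.
        self._gauge.set(0)

    def eviction_failed(self) -> None:
        # Called when a workflow enters the eviction retry loop.
        self._stuck += 1
        self._gauge.set(self._stuck)

    def eviction_recovered(self) -> None:
        # Called when a previously stuck eviction finally completes.
        self._stuck = max(0, self._stuck - 1)
        self._gauge.set(self._stuck)


gauge = InMemoryGauge()
tracker = EvictionTracker(gauge)
tracker.eviction_failed()
tracker.eviction_failed()
tracker.eviction_recovered()
print(gauge.values)
```

With a gauge like this, an alert rule could be as simple as "value greater than 0 for longer than N minutes", with N chosen by the team that owns the worker.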

Additional context

If the team is bold enough, it could also be nice to do one or more of the following:

  1. Provide a setting that forces the worker to shut down if it has been in an eviction loop for too long.
  2. Provide more threads than max_concurrent_workflow_tasks so that the ability to process workflow tasks isn't as likely to be impeded by the infinite eviction loop.
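Option 1 could be approximated with a small watchdog. The following is only a sketch under stated assumptions: `EvictionWatchdog`, `max_stuck_seconds`, and `observe` are hypothetical names, not existing SDK settings, and the caller is assumed to poll the stuck-eviction count periodically and decide what "shut down" means (for example, exiting the process so the pod is restarted).

```python
import time
from typing import Callable, Optional


class EvictionWatchdog:
    """Hypothetical helper that reports when evictions have been stuck too long."""

    def __init__(
        self,
        max_stuck_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_stuck_seconds = max_stuck_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._stuck_since: Optional[float] = None

    def observe(self, stuck_evictions: int) -> bool:
        """Record the current count; return True if shutdown should be forced."""
        now = self._clock()
        if stuck_evictions == 0:
            # Healthy again: reset the timer.
            self._stuck_since = None
            return False
        if self._stuck_since is None:
            # First observation of a stuck state: start timing.
            self._stuck_since = now
        return now - self._stuck_since >= self._max_stuck_seconds


# Usage with a fake clock, so the example does not need to sleep.
fake_now = [0.0]
watchdog = EvictionWatchdog(max_stuck_seconds=600.0, clock=lambda: fake_now[0])
watchdog.observe(46)          # starts the timer
fake_now[0] = 601.0
print(watchdog.observe(46))   # stuck for more than 600 seconds
```

Timing from the first non-zero observation, rather than from each individual failure, keeps the check cheap and matches the "has been in that state for too long" framing of the request.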


Labels: enhancement (New feature or request)
