Idle runner scaledown causing existing jobs to be canceled

Hello, for the past couple of months we have been seeing an intermittent issue with our GARM service. GARM runs an `idle scaledown` either because there is a surplus of idle runners (during a large wave of incoming queued jobs, GARM sometimes requests more runners than it needs) or because the configured min-idle-runner count has changed (we have a process in place that adjusts the min-idle amount by time of day, based on forecast data).

This races with GitHub's webhook reporting of workflow job status: a runner may already have picked up a job while GARM doesn't know about it yet.

GARM then proceeds to scale down a runner it believes is idle but that is no longer idle, which cancels whatever job is currently in progress on that runner.

Ideally, GARM should refresh the runner's status from the GitHub API before deciding to scale down an idle runner (or before taking any other action that would delete a runner).

P.S. I had not reported this earlier because I was planning (and hoping) that switching to the new scaleset feature would fix the issue. However, since I started testing the scaleset feature, I have found that it doesn't support multiple labels, which makes it impossible for us to use it as a drop-in replacement for our pools.

Idle runner scaledown causing existing jobs to be canceled #455
