Idle runner scaledown causing existing jobs to be canceled #455

@Hdom

Hello. For the past couple of months we have been hitting an intermittent issue with our GARM service: GARM runs an idle scaledown, triggered either by a surplus of idle runners (during a large wave of incoming queued jobs, GARM sometimes requests more runners than it needs) or by a change to the configured min-idle runner count (we have a process that adjusts the min-idle amount by time of day, based on forecasted data).

This causes a race condition with GitHub's webhook reporting of workflow status (i.e. a runner picked up a job but GARM doesn't know about it yet).

GARM then proceeds to scale down an "idle" runner that is no longer idle, which cancels whatever job is currently in progress on it.

Ideally GARM should do a check to update the runner status from GitHub API before making the decision to scale down an idle runner (or any action that would delete a runner).
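To illustrate the idea, here is a rough sketch of the check I have in mind. It is not GARM's actual code: `Runner`, `cached_status`, `select_runners_to_remove`, and the `is_busy_on_github` callback are all hypothetical names. In a real implementation the callback could read the `busy` field that the GitHub REST API returns for a self-hosted runner; here it is injected so the logic can be shown without network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Runner:
    name: str
    # Status as last reported via webhook (hypothetical field): "idle" or "active".
    cached_status: str


def select_runners_to_remove(
    runners: Iterable[Runner],
    excess: int,
    is_busy_on_github: Callable[[str], bool],
) -> List[Runner]:
    """Pick up to `excess` runners that are idle both locally and on GitHub."""
    selected: List[Runner] = []
    for runner in runners:
        if len(selected) >= excess:
            break
        if runner.cached_status != "idle":
            continue
        # Re-check with GitHub right before deciding to delete, because the
        # webhook saying "job started" may not have arrived yet.
        if is_busy_on_github(runner.name):
            runner.cached_status = "active"  # correct the stale local state
            continue
        selected.append(runner)
    return selected


# Example: r1 looks idle locally but has already picked up a job on GitHub.
pool = [Runner("r1", "idle"), Runner("r2", "idle"), Runner("r3", "idle")]
busy_on_github = {"r1"}
to_remove = select_runners_to_remove(pool, 2, lambda name: name in busy_on_github)
print([r.name for r in to_remove])  # → ['r2', 'r3']
```

A check like this would only narrow the window rather than close it completely, since a runner could still pick up a job between the check and the delete, but it should avoid most of the cancellations we are seeing.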

P.S. I had not reported this earlier because I was planning (and hoping) that switching to the new scaleset feature would fix the issue. However, since I started testing scalesets I have found that they don't support multiple labels, which makes it impossible for us to use them as a drop-in replacement for our pools.
