Description
Hello. For the past couple of months we have been encountering an intermittent issue with our GARM service. GARM runs an idle scale-down either because of an overflow of idle runners (sometimes, during a large wave of incoming queued jobs, GARM requests more runners than it needs) or because of a change in the configured min-idle count (we have a process that adjusts the min-idle value by time of day, based on forecast data).
This causes a race condition with GitHub's webhook reporting of workflow status (i.e. a runner picked up a job but GARM doesn't know about it yet).
GARM then proceeds to scale down an idle runner that is no longer idle, cancelling whatever job is currently in progress on it.
Ideally, GARM should refresh the runner's status from the GitHub API before deciding to scale down an idle runner (or taking any other action that would delete a runner).
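To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind. It is written in Python purely for illustration and is not GARM's actual code: `safe_to_remove` and `github_runner_is_busy` are hypothetical names, and the only external piece it relies on is the `busy` field that GitHub's "get a self-hosted runner for a repository" REST endpoint returns. It assumes repository-level runners; organization or enterprise runners would use the corresponding endpoints.

```python
import json
import urllib.request
from typing import Callable, Iterable, List


def github_runner_is_busy(owner: str, repo: str, runner_id: int, token: str) -> bool:
    """Ask GitHub whether a repository-level self-hosted runner is running a job.

    Hypothetical helper: it reads the `busy` field of
    GET /repos/{owner}/{repo}/actions/runners/{runner_id}.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/{runner_id}"
    req = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return bool(json.load(resp)["busy"])


def safe_to_remove(candidates: Iterable[int], is_busy: Callable[[int], bool]) -> List[int]:
    """Keep only the scale-down candidates that GitHub confirms are not busy.

    Fails closed: if the status lookup raises, the runner is kept rather than
    deleted, since deleting a busy runner cancels its job.
    """
    removable = []
    for runner_id in candidates:
        try:
            if not is_busy(runner_id):
                removable.append(runner_id)
        except Exception:
            # Status unknown: leave the runner alone for this scale-down pass.
            continue
    return removable
```

The scale-down pass would then only delete the runners returned by something like `safe_to_remove(idle_ids, lambda rid: github_runner_is_busy(owner, repo, rid, token))`. This narrows the race window rather than closing it, because a job can still land between the check and the delete, but it should cover the case where the webhook is simply late.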
P.S. I had not reported this earlier because I was planning (and hoping) that switching to the new scaleset feature would fix the issue. Since I started testing the scaleset feature, however, I have found that it doesn't support multiple labels, which makes it impossible for us to use it as a drop-in replacement for our pools.