Fix winner selection, checks outside schedule window should not participate in the winner selection process by hashi-divyansh · Pull Request #1270 · hashicorp/nomad-autoscaler

hashi-divyansh · 2026-04-08T10:08:41Z

Description:

This PR updates check participation handling so checks that should be skipped (outside schedule window or canceled evaluation) are represented with explicit sentinel errors and excluded from winner selection.

Updated check runner behavior
- Added sentinel error:
  - errCheckOutsideSchedule
runCheckAndCapCount now returns sentinels for non-participation cases.
- Treat sentinel errors as skip signals (do not include those checks in winner selection).
Refactor: Centralise the schedule validation logic in sdk pkg.

Ticket

Testing

Example job file which can be used for testing.

job "scheduled-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "scheduled-job" {
    # Start from zero and let autoscaler drive count changes.
    count = 0

    network {
      port "http" {
        to = 5678
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo:0.2.3"
        args  = ["-text=scheduled-job"]
        ports = ["http"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }

    scaling {
      enabled = true
      min     = 0
      max     = 10

      policy {
        evaluation_interval = "15s"
        cooldown            = "30s"

        # scale case
        check "scheduled-job-scale-to-two" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "30 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 2
          }
        }

       # descale case
        check "scheduled-job-descale-to-one" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "40 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 1
          }
        }

        # again scale
        check "scheduled-job-scale-three" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "36 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 3
          }
        }

        # Always (except schedule window)
        check "scheduled-job-always-active" {
          source = "prometheus"
          query  = "vector(1)"

          strategy "fixed-value" {
            value = 0
          }
        }
        target "nomad-target" {
          job   = "scheduled-job"
          group = "scheduled-job"
        }
      }
    }
  }
}

If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

…winner selection

hashi-divyansh · 2026-04-08T10:19:20Z

+	if ctx.Err() != nil {
+		return sdk.ScalingAction{}, false, nil
 	}


We should check the context as well; since queryMetrics() involves an external API call, the context might be canceled and in that case we should not allow it to participate in the winner selection.

hashi-divyansh · 2026-04-09T05:46:59Z

 	ch.log.Debug("received policy check for evaluation")

 	if err := ch.ensureCompiledSchedule(); err != nil {
-		return sdk.ScalingAction{}, fmt.Errorf("failed to compile check schedule: %w", err)


Earlier, this was returning a 'no action' state, which was interpreted as a formal decision. If other checks returned 'descale,' the logic was choosing the 'no action' over the 'descale.' This was the issue: since this was outside the schedule window, it should not have participated in the decision-making process at all.

jrasell

Looking great @hashi-divyansh; I've added some minor inline comments that'll be good to look at before merging.

jrasell · 2026-04-14T10:42:37Z

 	}
 }

+func TestHandler_calculateNewCount_SkipsNonParticipatingChecks(t *testing.T) {


I think it would be nice to expand the tests, so we confirm that if not checks are participating, we get a no action result.

Added a new test TestHandler_calculateNewCount_NoParticipatingChecks_ReturnsNoAction in commit which will test the case when no checks participated.

mismithhisler · 2026-04-14T12:48:04Z

+// participating reports whether the check took part in winner selection for
+// this evaluation cycle. It is false when the check is skipped or canceled,
+// such as when it falls outside its schedule window.
+func (ch *checkRunner) runCheckAndCapCount(ctx context.Context, currentCount int64, cache *queryMetricsCache) (action sdk.ScalingAction, participating bool, err error) {


So the only time this actually returns a populated scaling action is when "true" is returned? Could we use something like the scale direction of None to indicate we are not doing anything here? This feels like it's bleeding scheduling logic into the main control loop.

No. it won't solve the problem. Returning Direction: None means we have evaluated and suggesting no action. However, a policy can have multiple checks; in that case it will be missleading.

ex: One check suggests descale and another check (outside its schedule window) returns scale direction None (or NoAction) then a/c to current winner selection logic it will choose NoAction.

Please let me know, if you think I didn't get your point 🙂.

I think that is even more reason to look at where we are evaluating schedules. Right now, an empty policy basically means Direction: None. In this code, when we evaluate a schedule and it is outside of our current window, we produce Direction: None. Because of that, we have to add another return value to this function, which is effectively bleeding schedule logic into the main loop that evaluates policies.

What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?

What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?

ok. so you mean change this method's return type to a pointer type and return nil when it is outside the schedule window right?

I looked at this a bit more and have mixed thoughts here. It's perfectly valid to return a bool to indicate, There was no error, but I've returned an empty value that is not valid. You'll usually return this when "looking" for something. In which case, I think calling this ok is more idiomatic, whereas participating is a bit confusing.

Returning a nil pointer is another option, although you risk that someone is going to call this function in the future and do a nil check, and get a panic.

There's also examples in golang code like the sql package where you can return a specific error and check for that. Like here. Here you might return ErrNotInSchedule = errors.New("check is not within defined schedule")

And finally, you could have a method on the checkRunner that validates whether you should even call run, which might make writing tests all that much easier.

if !ch.isWithinSchedule { // or ch.shouldRun if there's ever other logic to add here continue } action, err := ch.run()

I explored your suggestions

Changing method return type to (s*dk.ScalingAction, error)
The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that. Bigger issue is the method calculateNewCount() which calls this runCheckAndCapCount() it's return type is (sdk.ScalingAction, error) so even if we return nil from runCheckAndCapCount() there we have to return an empty value again. I think it will take us to the direction where we will endup refactoring too much code.

Custom error like ErrNotInSchedule
Outside schedule is not an error condition. Treating it as one forces normal control flow through the error path, which makes logs, tests, and caller behavior noisier and less honest. It also does not cover the other non-participation case ex: context canceled.

The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that

Please keep in mind, you're the only one calling this method right now. A community member or other team member might make a contribution to this and assume "well I checked the error, so the object probably isn't nil". So just document this in the function docs.

Outside schedule is not an error condition

We're bikeshedding this a bit. You've run a check runner at a time when it shouldn't actually be run. Is that an error? Maybe, maybe not. It's really up to you. In golang, if you open a file that doesn't exist, you get an error, not a nil error and nil file handle. So my point is, I think it could go either way.

You've run a check runner at a time when it shouldn't actually be run.

We don't have any other option; we're not maintaining any state. We have to run it, to decide whether we're in/out of the schedule window.

I changes the implementation a/c to your 2nd recommendation (custom error) and document about the sentinal error return.

…w-should-not-participate-in-the-leader-selection

mismithhisler · 2026-04-23T17:11:59Z

+				h.log.Debug("skipping check, outside schedule window")
+				continue
+			}
+			if errors.Is(err, errCheckEvaluationCanceled) {


Instead of having a special error for the context being cancelled, could we just check it here?

mismithhisler · 2026-04-23T17:28:28Z

 }

+// validateSchedule validates a policy/check schedule definition.
+func validateSchedule(s *ScalingPolicySchedule, path string) error {


This is a super shallow function. By that I mean, it performs no logic and is simply a passthrough. There are use cases for these, but I'm not sure what this is providing. I think I'd rather just see a single ValidateSchedule(s *ScalingPolicySchedule) function.

Another thing to point out here is that we are complicating this function a good bit by adding a path parameter that is really only used for error formatting, and is already known by the caller! So really if the caller gets an error, it can do the error formatting itself with the known path.

I removed the shallow wrapper, but I’d keep path on the ValidateScalingPolicySchedule(). The function is doing field-level validation, so keeping contextual error formatting inside it avoids duplicating wrapping logic at every call site and keeps error messages consistent across callers.

I added a comment doc to ValidateScalingPolicySchedule() about path parameter.

Adding contextual information to an error is the exact reason error wrapping exists in Golang.

I guess if you had some vendetta against error wrapping that would be one thing, but you are already doing it for the rest of compileSchedule, but just decided for cron validation to now do error formatting within this function.

I guess this leads to me another question though, which is why are we validating in the compile? Shouldn't the policy source have already validated the policy?

Policy-source validation is the primary gate, yes.

Source validation: user-facing/config-time feedback.

Compile validation: runtime safety and predictable behavior for all callers.

mismithhisler · 2026-04-27T16:27:21Z

+			return sdk.ScalingAction{}, fmt.Errorf("failed to query source: %w", err)
 		}
 	}
+	if ctx.Err() != nil {


Do you think there is a better way to do this? (hint maybe look where we are checking the context chan)

I removed context error case handling from this method.

queryMetrics() and runStrategy() will return the context error and runCheckAndCapCount() will return that error to the caller method.

More details on this comment

mismithhisler · 2026-04-27T16:30:31Z

+				h.log.Debug("skipping check, outside schedule window")
+				continue
+			}
+			if errors.Is(err, context.Canceled) {


Do we want to be checking cancelled, or deadline exceeded?

I removed specific context error handling from this method. With the exception of errCheckOutsideSchedule, all errors are now returned to the caller for handling.

Previously, context errors triggered a debug log and allowed the next checkRunner to proceed. Now, an error will fail the entire check for the current evaluation; however, it will be retried during the next evaluation cycle.

I think this is better than the previous approach. Please let me know what you think.

policy: exclude skipped(in case of schedule) or canceled checks from …

4ee10f4

…winner selection

hashi-divyansh commented Apr 8, 2026

View reviewed changes

chore: add debug log in case of context cancellation/error

cf0910e

hashi-divyansh self-assigned this Apr 8, 2026

hashi-divyansh marked this pull request as ready for review April 8, 2026 10:49

hashi-divyansh requested review from a team as code owners April 8, 2026 10:49

hashi-divyansh requested review from gulducat and jrasell April 8, 2026 11:31

hashi-divyansh added the schedule Autoscale based on a predefined schedule label Apr 8, 2026

hashi-divyansh commented Apr 9, 2026

View reviewed changes

jrasell reviewed Apr 14, 2026

View reviewed changes

test(policy): add no-participation winner-selection regression tests

e805db1

hashi-divyansh requested a review from jrasell April 14, 2026 12:29

mismithhisler reviewed Apr 14, 2026

View reviewed changes

hashi-divyansh requested review from mismithhisler and removed request for gulducat April 14, 2026 19:39

hashi-divyansh added 4 commits April 16, 2026 12:55

refactor: centralize schedule validation in sdk

4ea8cd6

refactor(policy): use sentinel errors for check non-participation

26083a1

code-refactor: remove unnecessary changes to ease code review

ea0128b

Merge branch 'main' into b-NMD-1376-checks-outside-the-schedule-windo…

058c583

…w-should-not-participate-in-the-leader-selection

mismithhisler reviewed Apr 23, 2026

View reviewed changes

chore: removed custom error type for context cancelation

b34a6e4

hashi-divyansh requested a review from mismithhisler April 23, 2026 20:22

hashi-divyansh added 2 commits April 27, 2026 21:01

sdk: wrap cron parse errors in schedule validation

f5b0945

chore: added comment for ValidateScalingPolicySchedule

74b865f

mismithhisler reviewed Apr 27, 2026

View reviewed changes

policy: propagate context errors and clean up schedule validation

7b50367

sdk: remove path param from ValidateScalingPolicySchedule

755312f

Conversation

hashi-divyansh commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes to Security Controls

Uh oh!

hashi-divyansh Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jrasell left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hashi-divyansh Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mismithhisler Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hashi-divyansh Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hashi-divyansh Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hashi-divyansh Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hashi-divyansh commented Apr 8, 2026 •

edited

Loading

hashi-divyansh Apr 8, 2026 •

edited

Loading

hashi-divyansh Apr 15, 2026 •

edited

Loading

mismithhisler Apr 15, 2026 •

edited

Loading

hashi-divyansh Apr 16, 2026 •

edited

Loading

hashi-divyansh Apr 17, 2026 •

edited

Loading

hashi-divyansh Apr 23, 2026 •

edited

Loading