Skip to content

Fix winner selection, checks outside schedule window should not participate in the winner selection process#1270

Open
hashi-divyansh wants to merge 12 commits intomainfrom
b-NMD-1376-checks-outside-the-schedule-window-should-not-participate-in-the-leader-selection
Open

Fix winner selection, checks outside schedule window should not participate in the winner selection process#1270
hashi-divyansh wants to merge 12 commits intomainfrom
b-NMD-1376-checks-outside-the-schedule-window-should-not-participate-in-the-leader-selection

Conversation

@hashi-divyansh
Copy link
Copy Markdown
Member

@hashi-divyansh hashi-divyansh commented Apr 8, 2026

Description:

This PR updates check participation handling so checks that should be skipped (outside schedule window or canceled evaluation) are represented with explicit sentinel errors and excluded from winner selection.

  • Updated check runner behavior
    • Added sentinel error:
      • errCheckOutsideSchedule
  • runCheckAndCapCount now returns sentinels for non-participation cases.
    • Treat sentinel errors as skip signals (do not include those checks in winner selection).
  • Refactor: Centralise the schedule validation logic in sdk pkg.

Ticket

Testing

Example job file which can be used for testing.

job "scheduled-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "scheduled-job" {
    # Start from zero and let autoscaler drive count changes.
    count = 0

    network {
      port "http" {
        to = 5678
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo:0.2.3"
        args  = ["-text=scheduled-job"]
        ports = ["http"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }

    scaling {
      enabled = true
      min     = 0
      max     = 10

      policy {
        evaluation_interval = "15s"
        cooldown            = "30s"

        # scale case
        check "scheduled-job-scale-to-two" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "30 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 2
          }
        }

       # descale case
        check "scheduled-job-descale-to-one" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "40 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 1
          }
        }

        # again scale
        check "scheduled-job-scale-three" {
          source = "prometheus"
          query  = "vector(1)"

          schedule {
            start    = "36 10 * * *"
            duration = "1m"
          }

          strategy "fixed-value" {
            value = 3
          }
        }

        # Always (except schedule window)
        check "scheduled-job-always-active" {
          source = "prometheus"
          query  = "vector(1)"

          strategy "fixed-value" {
            value = 0
          }
        }
        target "nomad-target" {
          job   = "scheduled-job"
          group = "scheduled-job"
        }
      }
    }
  }
}
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

Comment thread policy/check.go Outdated
Comment on lines 277 to 279
if ctx.Err() != nil {
return sdk.ScalingAction{}, false, nil
}
Copy link
Copy Markdown
Member Author

@hashi-divyansh hashi-divyansh Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check the context as well; since queryMetrics() involves an external API call, the context might be canceled and in that case we should not allow it to participate in the winner selection.

@hashi-divyansh hashi-divyansh self-assigned this Apr 8, 2026
@hashi-divyansh hashi-divyansh marked this pull request as ready for review April 8, 2026 10:49
@hashi-divyansh hashi-divyansh requested review from a team as code owners April 8, 2026 10:49
@hashi-divyansh hashi-divyansh added the schedule Autoscale based on a predefined schedule label Apr 8, 2026
Comment thread policy/check.go
ch.log.Debug("received policy check for evaluation")

if err := ch.ensureCompiledSchedule(); err != nil {
return sdk.ScalingAction{}, fmt.Errorf("failed to compile check schedule: %w", err)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, this was returning a 'no action' state, which was interpreted as a formal decision. If other checks returned 'descale,' the logic was choosing the 'no action' over the 'descale.' This was the issue: since this was outside the schedule window, it should not have participated in the decision-making process at all.

Copy link
Copy Markdown
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great @hashi-divyansh; I've added some minor inline comments that'll be good to look at before merging.

Comment thread policy/check.go Outdated
Comment thread policy/check.go Outdated
Comment thread policy/handler_test.go Outdated
}
}

func TestHandler_calculateNewCount_SkipsNonParticipatingChecks(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to expand the tests, so we confirm that if not checks are participating, we get a no action result.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test TestHandler_calculateNewCount_NoParticipatingChecks_ReturnsNoAction in commit which will test the case when no checks participated.

Comment thread policy/handler.go Outdated
@hashi-divyansh hashi-divyansh requested a review from jrasell April 14, 2026 12:29
Comment thread policy/check.go Outdated
// participating reports whether the check took part in winner selection for
// this evaluation cycle. It is false when the check is skipped or canceled,
// such as when it falls outside its schedule window.
func (ch *checkRunner) runCheckAndCapCount(ctx context.Context, currentCount int64, cache *queryMetricsCache) (action sdk.ScalingAction, participating bool, err error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only time this actually returns a populated scaling action is when "true" is returned? Could we use something like the scale direction of None to indicate we are not doing anything here? This feels like it's bleeding scheduling logic into the main control loop.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. it won't solve the problem. Returning Direction: None means we have evaluated and suggesting no action. However, a policy can have multiple checks; in that case it will be missleading.

ex: One check suggests descale and another check (outside its schedule window) returns scale direction None (or NoAction) then a/c to current winner selection logic it will choose NoAction.

Please let me know, if you think I didn't get your point 🙂.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is even more reason to look at where we are evaluating schedules. Right now, an empty policy basically means Direction: None. In this code, when we evaluate a schedule and it is outside of our current window, we produce Direction: None. Because of that, we have to add another return value to this function, which is effectively bleeding schedule logic into the main loop that evaluates policies.

What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?

Copy link
Copy Markdown
Member Author

@hashi-divyansh hashi-divyansh Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?

ok. so you mean change this method's return type to a pointer type and return nil when it is outside the schedule window right?

Copy link
Copy Markdown
Member

@mismithhisler mismithhisler Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this a bit more and have mixed thoughts here. It's perfectly valid to return a bool to indicate, There was no error, but I've returned an empty value that is not valid. You'll usually return this when "looking" for something. In which case, I think calling this ok is more idiomatic, whereas participating is a bit confusing.

Returning a nil pointer is another option, although you risk that someone is going to call this function in the future and do a nil check, and get a panic.

There's also examples in golang code like the sql package where you can return a specific error and check for that. Like here. Here you might return ErrNotInSchedule = errors.New("check is not within defined schedule")

And finally, you could have a method on the checkRunner that validates whether you should even call run, which might make writing tests all that much easier.

if !ch.isWithinSchedule { // or ch.shouldRun if there's ever other logic to add here
  continue
}
action, err := ch.run()

Copy link
Copy Markdown
Member Author

@hashi-divyansh hashi-divyansh Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored your suggestions

  1. Changing method return type to (s*dk.ScalingAction, error)
    The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that. Bigger issue is the method calculateNewCount() which calls this runCheckAndCapCount() it's return type is (sdk.ScalingAction, error) so even if we return nil from runCheckAndCapCount() there we have to return an empty value again. I think it will take us to the direction where we will endup refactoring too much code.

  2. Custom error like ErrNotInSchedule
    Outside schedule is not an error condition. Treating it as one forces normal control flow through the error path, which makes logs, tests, and caller behavior noisier and less honest. It also does not cover the other non-participation case ex: context canceled.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that

Please keep in mind, you're the only one calling this method right now. A community member or other team member might make a contribution to this and assume "well I checked the error, so the object probably isn't nil". So just document this in the function docs.

Outside schedule is not an error condition

We're bikeshedding this a bit. You've run a check runner at a time when it shouldn't actually be run. Is that an error? Maybe, maybe not. It's really up to you. In golang, if you open a file that doesn't exist, you get an error, not a nil error and nil file handle. So my point is, I think it could go either way.

Copy link
Copy Markdown
Member Author

@hashi-divyansh hashi-divyansh Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've run a check runner at a time when it shouldn't actually be run.

We don't have any other option; we're not maintaining any state. We have to run it, to decide whether we're in/out of the schedule window.

I changes the implementation a/c to your 2nd recommendation (custom error) and document about the sentinal error return.

@hashi-divyansh hashi-divyansh requested review from mismithhisler and removed request for gulducat April 14, 2026 19:39
Comment thread policy/handler.go Outdated
h.log.Debug("skipping check, outside schedule window")
continue
}
if errors.Is(err, errCheckEvaluationCanceled) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a special error for the context being cancelled, could we just check it here?

Comment thread sdk/schedule.go Outdated
}

// validateSchedule validates a policy/check schedule definition.
func validateSchedule(s *ScalingPolicySchedule, path string) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super shallow function. By that I mean, it performs no logic and is simply a passthrough. There are use cases for these, but I'm not sure what this is providing. I think I'd rather just see a single ValidateSchedule(s *ScalingPolicySchedule) function.

Another thing to point out here is that we are complicating this function a good bit by adding a path parameter that is really only used for error formatting, and is already known by the caller! So really if the caller gets an error, it can do the error formatting itself with the known path.

Copy link
Copy Markdown
Member Author

@hashi-divyansh hashi-divyansh Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the shallow wrapper, but I’d keep path on the ValidateScalingPolicySchedule(). The function is doing field-level validation, so keeping contextual error formatting inside it avoids duplicating wrapping logic at every call site and keeps error messages consistent across callers.

I added a comment doc to ValidateScalingPolicySchedule() about path parameter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding contextual information to an error is the exact reason error wrapping exists in Golang.

I guess if you had some vendetta against error wrapping that would be one thing, but you are already doing it for the rest of compileSchedule, but just decided for cron validation to now do error formatting within this function.

I guess this leads to me another question though, which is why are we validating in the compile? Shouldn't the policy source have already validated the policy?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Policy-source validation is the primary gate, yes.

  • Source validation: user-facing/config-time feedback.
  • Compile validation: runtime safety and predictable behavior for all callers.

Comment thread policy/check.go Outdated
return sdk.ScalingAction{}, fmt.Errorf("failed to query source: %w", err)
}
}
if ctx.Err() != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there is a better way to do this? (hint maybe look where we are checking the context chan)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed context error case handling from this method.

queryMetrics() and runStrategy() will return the context error and runCheckAndCapCount() will return that error to the caller method.

More details on this comment

Comment thread policy/handler.go Outdated
h.log.Debug("skipping check, outside schedule window")
continue
}
if errors.Is(err, context.Canceled) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to be checking cancelled, or deadline exceeded?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed specific context error handling from this method. With the exception of errCheckOutsideSchedule, all errors are now returned to the caller for handling.

Previously, context errors triggered a debug log and allowed the next checkRunner to proceed. Now, an error will fail the entire check for the current evaluation; however, it will be retried during the next evaluation cycle.

I think this is better than the previous approach. Please let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

schedule Autoscale based on a predefined schedule

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants