Fix winner selection, checks outside schedule window should not participate in the winner selection process#1270
Conversation
| if ctx.Err() != nil { | ||
| return sdk.ScalingAction{}, false, nil | ||
| } |
There was a problem hiding this comment.
We should check the context as well; since queryMetrics() involves an external API call, the context might be canceled and in that case we should not allow it to participate in the winner selection.
| ch.log.Debug("received policy check for evaluation") | ||
|
|
||
| if err := ch.ensureCompiledSchedule(); err != nil { | ||
| return sdk.ScalingAction{}, fmt.Errorf("failed to compile check schedule: %w", err) |
There was a problem hiding this comment.
Earlier, this was returning a 'no action' state, which was interpreted as a formal decision. If other checks returned 'descale,' the logic was choosing the 'no action' over the 'descale.' This was the issue: since this was outside the schedule window, it should not have participated in the decision-making process at all.
jrasell
left a comment
There was a problem hiding this comment.
Looking great @hashi-divyansh; I've added some minor inline comments that'll be good to look at before merging.
| } | ||
| } | ||
|
|
||
| func TestHandler_calculateNewCount_SkipsNonParticipatingChecks(t *testing.T) { |
There was a problem hiding this comment.
I think it would be nice to expand the tests, so we confirm that if not checks are participating, we get a no action result.
There was a problem hiding this comment.
Added a new test TestHandler_calculateNewCount_NoParticipatingChecks_ReturnsNoAction in commit which will test the case when no checks participated.
| // participating reports whether the check took part in winner selection for | ||
| // this evaluation cycle. It is false when the check is skipped or canceled, | ||
| // such as when it falls outside its schedule window. | ||
| func (ch *checkRunner) runCheckAndCapCount(ctx context.Context, currentCount int64, cache *queryMetricsCache) (action sdk.ScalingAction, participating bool, err error) { |
There was a problem hiding this comment.
So the only time this actually returns a populated scaling action is when "true" is returned? Could we use something like the scale direction of None to indicate we are not doing anything here? This feels like it's bleeding scheduling logic into the main control loop.
There was a problem hiding this comment.
No. it won't solve the problem. Returning Direction: None means we have evaluated and suggesting no action. However, a policy can have multiple checks; in that case it will be missleading.
ex: One check suggests descale and another check (outside its schedule window) returns scale direction None (or NoAction) then a/c to current winner selection logic it will choose NoAction.
Please let me know, if you think I didn't get your point 🙂.
There was a problem hiding this comment.
I think that is even more reason to look at where we are evaluating schedules. Right now, an empty policy basically means Direction: None. In this code, when we evaluate a schedule and it is outside of our current window, we produce Direction: None. Because of that, we have to add another return value to this function, which is effectively bleeding schedule logic into the main loop that evaluates policies.
What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?
There was a problem hiding this comment.
What if, instead of producing Direction: None when we evaluate a schedule, we produce no policy at all?
ok. so you mean change this method's return type to a pointer type and return nil when it is outside the schedule window right?
There was a problem hiding this comment.
I looked at this a bit more and have mixed thoughts here. It's perfectly valid to return a bool to indicate, There was no error, but I've returned an empty value that is not valid. You'll usually return this when "looking" for something. In which case, I think calling this ok is more idiomatic, whereas participating is a bit confusing.
Returning a nil pointer is another option, although you risk that someone is going to call this function in the future and do a nil check, and get a panic.
There's also examples in golang code like the sql package where you can return a specific error and check for that. Like here. Here you might return ErrNotInSchedule = errors.New("check is not within defined schedule")
And finally, you could have a method on the checkRunner that validates whether you should even call run, which might make writing tests all that much easier.
if !ch.isWithinSchedule { // or ch.shouldRun if there's ever other logic to add here
continue
}
action, err := ch.run()
There was a problem hiding this comment.
I explored your suggestions
-
Changing method return type to
(s*dk.ScalingAction, error)
The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that. Bigger issue is the methodcalculateNewCount()which calls thisrunCheckAndCapCount()it's return type is(sdk.ScalingAction, error)so even if we return nil fromrunCheckAndCapCount()there we have to return an empty value again. I think it will take us to the direction where we will endup refactoring too much code. -
Custom error like
ErrNotInSchedule
Outside schedule is not an error condition. Treating it as one forces normal control flow through the error path, which makes logs, tests, and caller behavior noisier and less honest. It also does not cover the other non-participation case ex:context canceled.
There was a problem hiding this comment.
The panic risk is not the main problem; because it is a private method and used at only one place, we can handle that
Please keep in mind, you're the only one calling this method right now. A community member or other team member might make a contribution to this and assume "well I checked the error, so the object probably isn't nil". So just document this in the function docs.
Outside schedule is not an error condition
We're bikeshedding this a bit. You've run a check runner at a time when it shouldn't actually be run. Is that an error? Maybe, maybe not. It's really up to you. In golang, if you open a file that doesn't exist, you get an error, not a nil error and nil file handle. So my point is, I think it could go either way.
There was a problem hiding this comment.
You've run a check runner at a time when it shouldn't actually be run.
We don't have any other option; we're not maintaining any state. We have to run it, to decide whether we're in/out of the schedule window.
I changes the implementation a/c to your 2nd recommendation (custom error) and document about the sentinal error return.
…w-should-not-participate-in-the-leader-selection
| h.log.Debug("skipping check, outside schedule window") | ||
| continue | ||
| } | ||
| if errors.Is(err, errCheckEvaluationCanceled) { |
There was a problem hiding this comment.
Instead of having a special error for the context being cancelled, could we just check it here?
| } | ||
|
|
||
| // validateSchedule validates a policy/check schedule definition. | ||
| func validateSchedule(s *ScalingPolicySchedule, path string) error { |
There was a problem hiding this comment.
This is a super shallow function. By that I mean, it performs no logic and is simply a passthrough. There are use cases for these, but I'm not sure what this is providing. I think I'd rather just see a single ValidateSchedule(s *ScalingPolicySchedule) function.
Another thing to point out here is that we are complicating this function a good bit by adding a path parameter that is really only used for error formatting, and is already known by the caller! So really if the caller gets an error, it can do the error formatting itself with the known path.
There was a problem hiding this comment.
I removed the shallow wrapper, but I’d keep path on the ValidateScalingPolicySchedule(). The function is doing field-level validation, so keeping contextual error formatting inside it avoids duplicating wrapping logic at every call site and keeps error messages consistent across callers.
I added a comment doc to ValidateScalingPolicySchedule() about path parameter.
There was a problem hiding this comment.
Adding contextual information to an error is the exact reason error wrapping exists in Golang.
I guess if you had some vendetta against error wrapping that would be one thing, but you are already doing it for the rest of compileSchedule, but just decided for cron validation to now do error formatting within this function.
I guess this leads to me another question though, which is why are we validating in the compile? Shouldn't the policy source have already validated the policy?
There was a problem hiding this comment.
Policy-source validation is the primary gate, yes.
- Source validation: user-facing/config-time feedback.
- Compile validation: runtime safety and predictable behavior for all callers.
| return sdk.ScalingAction{}, fmt.Errorf("failed to query source: %w", err) | ||
| } | ||
| } | ||
| if ctx.Err() != nil { |
There was a problem hiding this comment.
Do you think there is a better way to do this? (hint maybe look where we are checking the context chan)
There was a problem hiding this comment.
I removed context error case handling from this method.
queryMetrics() and runStrategy() will return the context error and runCheckAndCapCount() will return that error to the caller method.
More details on this comment
| h.log.Debug("skipping check, outside schedule window") | ||
| continue | ||
| } | ||
| if errors.Is(err, context.Canceled) { |
There was a problem hiding this comment.
Do we want to be checking cancelled, or deadline exceeded?
There was a problem hiding this comment.
I removed specific context error handling from this method. With the exception of errCheckOutsideSchedule, all errors are now returned to the caller for handling.
Previously, context errors triggered a debug log and allowed the next checkRunner to proceed. Now, an error will fail the entire check for the current evaluation; however, it will be retried during the next evaluation cycle.
I think this is better than the previous approach. Please let me know what you think.
Description:
This PR updates check participation handling so checks that should be skipped (outside schedule window or canceled evaluation) are represented with explicit sentinel errors and excluded from winner selection.
errCheckOutsideSchedulerunCheckAndCapCountnow returns sentinels for non-participation cases.Ticket
Testing
Example job file which can be used for testing.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.