-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: Added UnavailableOfferingsTTL exposed as the flag & env variable in the karpenter #8013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for karpenter-docs-prod ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
911cce0
to
e5eff64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some preliminary style+structure feedback, will need to discuss with the provider working group to determine if there are any unintended consequences\product reasons to push back on accepting this PR. Would love if you could attend and share your use case directly!
@@ -53,6 +54,7 @@ func (o *Options) AddFlags(fs *coreoptions.FlagSet) { | |||
fs.Float64Var(&o.VMMemoryOverheadPercent, "vm-memory-overhead-percent", utils.WithDefaultFloat64("VM_MEMORY_OVERHEAD_PERCENT", 0.075), "The VM memory overhead as a percent that will be subtracted from the total memory for all instance types when cached information is unavailable.") | |||
fs.StringVar(&o.InterruptionQueue, "interruption-queue", env.WithDefaultString("INTERRUPTION_QUEUE", ""), "Interruption queue is the name of the SQS queue used for processing interruption events from EC2. Interruption handling is disabled if not specified. Enabling interruption handling may require additional permissions on the controller service account. Additional permissions are outlined in the docs.") | |||
fs.IntVar(&o.ReservedENIs, "reserved-enis", env.WithDefaultInt("RESERVED_ENIS", 0), "Reserved ENIs are not included in the calculations for max-pods or kube-reserved. This is most often used in the VPC CNI custom networking setup https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html.") | |||
fs.IntVar(&o.UnavailableOfferingsTTL, "unavailable-offerings-ttl", env.WithDefaultInt("UNAVAILABLE_OFFERINGS_TTL", 180), "The Unavailable offerings TTL is the time before offerings that were marked as unavailable are removed from the cache and are available for launch again.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason to switch this TTL to seconds over minutes? Either way, the description should include a '... in minutes ...' or an '... in seconds ...' to indicate to the user the correct unit.
@@ -37,7 +39,8 @@ type UnavailableOfferings struct { | |||
SeqNum uint64 | |||
} | |||
|
|||
func NewUnavailableOfferings() *UnavailableOfferings { | |||
func NewUnavailableOfferings(ctx context.Context) *UnavailableOfferings { | |||
UnavailableOfferingsTTL := time.Duration(int64(options.FromContext(ctx).UnavailableOfferingsTTL)) * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'd prefer to pass in only the TTL as an argument than the whole ctx object just to keep the function footprint small. Additionally, why store the option as an int
just to cast it to an int64
just to convert it to a time.Duration
?
Are you able to elaborate on this? Your workloads shouldn't have any impact on what this value should be since they don't affect how long it will take for a spot pool to become available again. Generally speaking, I'm not convinced there is a use-case for tuning this value, but would like to know more about your use-case. |
Fixes #N/A
Description
Exposed UnavailableOfferingsTTL as both a flag and an environment variable in Karpenter. The default value of 3 minutes was too high for our environment, which operates with highly volatile workloads.
How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.