deployer: Optimize default gateway probe timings #12784
base: main
```diff
@@ -177,7 +177,9 @@ func defaultGatewayParameters(imageInfo *ImageInfo, omitDefaultSecurityContext b
 				},
 			},
 			InitialDelaySeconds: 0,
-			PeriodSeconds:       10,
+			PeriodSeconds:       5,
+			TimeoutSeconds:      1,
+			FailureThreshold:    3,
 		},
 		StartupProbe: &corev1.Probe{
 			ProbeHandler: corev1.ProbeHandler{
```
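The readiness change both halves the probe period and makes the previously implicit defaults explicit: Kubernetes already defaults TimeoutSeconds to 1 and FailureThreshold to 3, so the effective behavior change is that detection drops from roughly 3 × 10s = 30s to 3 × 5s = 15s of consecutive failures before the pod is pulled from endpoints. A sketch of the resulting probe; the HTTP handler here is a placeholder, since the hunk does not show it:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch of the readiness probe after this change. The timing fields match
// the diff; the handler is a placeholder, as the hunk does not include it.
var readinessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/ready",             // placeholder path
			Port: intstr.FromInt(8082), // placeholder port
		},
	},
	InitialDelaySeconds: 0,
	PeriodSeconds:       5, // was 10: unready pods leave endpoints ~2x sooner
	TimeoutSeconds:      1, // previously the implicit Kubernetes default
	FailureThreshold:    3, // previously the implicit Kubernetes default
}
```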
```diff
@@ -189,7 +191,7 @@ func defaultGatewayParameters(imageInfo *ImageInfo, omitDefaultSecurityContext b
 			InitialDelaySeconds: 0,
 			PeriodSeconds:       1,
 			TimeoutSeconds:      2,
-			FailureThreshold:    60,
+			FailureThreshold:    30,
 			SuccessThreshold:    1,
 		},
 	},
```
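With the startup probe's period unchanged at 1s, lowering FailureThreshold from 60 to 30 halves the startup budget: the container now has roughly 30 × 1s = 30s of failed attempts (somewhat longer if attempts run into the 2s timeout) to begin answering before the kubelet restarts it, down from roughly 60s.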
|
Review comment on the FailureThreshold change:

IMO, the startup, liveness, and readiness probes should be optimized for "least likely to cause an outage" and "lowest expected value of outage duration". This leads to a few guidelines: […]

For these reasons, it's much better, once you've passed startupProbe, to use […]

Can we/should we wait on initial xDS updates to finish before passing startupProbe?

Why would you decrease how long startupProbe waits? In an outage, you badly want a new pod to come up.

Why not make periodSeconds for readiness and liveness '1'? If /ready never fails after it succeeds, […]

Follow-up from the same reviewer:

Envoy is in C++, not Go, so my GOMAXPROCS comment is wrong but illustrative -- do not expect a highly loaded Envoy (such as what would result if you, e.g., lost several k8s nodes unexpectedly) to keep responding to /ready with the same latency as before. If you must use HTTP probes for liveness or readiness, give them generous timeouts to avoid removing servers and cascading from high latency to a 100% outage.
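To make the follow-up concrete, here is a sketch (mine, not code from this PR, and one possible reading of the truncated suggestion above) of a liveness probe along the lines the reviewer describes: TCP rather than HTTP, so a stalled main thread cannot fail it, with generous timings so high latency does not cascade into restarts. The port and all values are hypothetical:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical sketch, not from this PR: a liveness probe that avoids HTTP
// so a busy main thread cannot fail it, with generous timings so high
// latency does not cascade into restarts of still-alive proxies.
var livenessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		TCPSocket: &corev1.TCPSocketAction{
			// Placeholder: any listener the proxy keeps open would work.
			Port: intstr.FromInt(19000),
		},
	},
	InitialDelaySeconds: 0,
	PeriodSeconds:       10,
	TimeoutSeconds:      5, // generous: a loaded proxy may respond slowly
	FailureThreshold:    6, // ~60s of sustained failure before a restart
}
```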
fwiw, I have seen Envoy take more than 1s under load due to Prometheus scrapes or large xDS pushes locking up the main thread. But with failureThreshold 3 that is probably mitigated.
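(With these numbers, a single stall longer than the 1s timeout costs only one failed probe; it takes three consecutive failures, about 15s of sustained stalling at the 5s period, before the pod is marked unready.)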
I originally had this set to 2 before deciding to have an agent double-check these values. Down to add a second of buffer here.
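If that buffer lands, the readiness probe's TimeoutSeconds would go from 1 to 2, matching the startup probe's existing 2s timeout (a suggested tweak in the discussion, not a change committed in this PR).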