feat: modify startup probe with INFO and final sync failure error #605
Instead of rate-limiting them, what about turning all the logs into INFO and doing something like this:

    logger.InfoContext(ctx, "starting manager")
    if err = ctrlMgr.Start(ctx); err != nil {
        if !resolver.IsNRISynchronized() || !wpHandler.IsSynchronized() {
            logger.ErrorContext(ctx, "agent terminated before startup synchronization completed",
                "nriSynchronized", resolver.IsNRISynchronized(),
                "workloadPoliciesSynchronized", wpHandler.IsSynchronized(),
            )
        }
        return fmt.Errorf("failed to start manager: %w", err)
    }

This way we don't log warnings or errors, because this is expected behavior, but we still return an error if the startup fails. WDYT?
@Andreagit97, thanks for the suggestion.
Yep, this was the idea. WDYT?
In addition to what @Andreagit97 said, I want to point out that some of these errors could still be valuable for us to detect performance issues and for users to take action. Currently, the initial delay of the agent's startupProbe is hardcoded to 5 seconds. Perhaps we can increase it to around 10 seconds, so we get less noise in the default scenario while keeping the warning/error log level for later. Then we can add an e2e test case to make sure that no error logs are generated by default. WDYT? On the other hand, it's probably a separate topic, but the
For the e2e test part, I created #606 to follow up.
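As a rough illustration of what such an e2e check could look like, the sketch below counts JSON log lines per level so a test can assert that no WARN or ERROR entries appear by default. It assumes the agent logs one JSON object per line with a `level` field, as in the log samples quoted in this thread; `countByLevel` and `sampleLogs` are hypothetical names and are not taken from #606.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// sampleLogs is illustrative input; the messages are borrowed from this
// thread, and the last line is deliberately not JSON.
const sampleLogs = `{"level":"INFO","msg":"starting manager"}
{"level":"WARN","msg":"NRI handler has not yet synchronized"}
{"level":"ERROR","msg":"WorkloadPolicy handler is not synced"}
not a JSON line`

// countByLevel counts JSON log lines per level. Lines that cannot be decoded
// as a JSON object are skipped rather than treated as failures.
func countByLevel(logs string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var entry struct {
			Level string `json:"level"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue
		}
		counts[entry.Level]++
	}
	return counts
}

func main() {
	counts := countByLevel(sampleLogs)
	fmt.Println("WARN:", counts["WARN"], "ERROR:", counts["ERROR"])
}
```

An e2e test could feed the agent's captured pod logs into a helper like this and fail when the WARN or ERROR count is non-zero in the default scenario.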
After thinking about it some more, I think we can combine our ideas.
Please note that here I'm just referring to these 2 logs:

    {"time":"2026-04-23T14:26:07.927529398Z","level":"WARN","msg":"NRI handler has not yet synchronized","component":"agent","component":"resolver"}
    {"time":"2026-04-23T14:26:07.934065725Z","level":"ERROR","msg":"WorkloadPolicy handler is not synced","component":"agent","error":"failed to list WorkloadPolicies during HasSynced check: Timeout: failed waiting for *v1alpha1.WorkloadPolicy Informer to sync"}

IMO they should both become INFO and probably not be rate-limited (we can probably raise the delay to 10s). We could add rate-limiting, but that seems a bit too much for now if we turn these logs into INFO. One log every 10s means 6 logs in 1 minute, which seems reasonable. WDYT?
I agree that rate-limiting only those 2 logs is a bit too much. But I think turning them into INFO makes regressions easier to miss.
It's not clear to me what you mean here 🤔
@Andreagit97 I mean that if we change WARN to INFO, a startup that takes longer than usual might be acceptable to users. However, from our QA or performance point of view, I think WARN is still valuable and we shouldn't silence it.
Yeah, the point is that it really depends on the number of pods/policies in the cluster, so it's hard to say it won't happen by default in some customer setup. If possible, I would avoid alerting users where it's not necessary 🤔
dottorblaster left a comment
Plus one on @Andreagit97's comments, then we're ready to merge @kyledong-suse, thanks!
Signed-off-by: Kyle Dong <kyle.dong@suse.com>
What this PR does / why we need it:
modify startup probe with INFO and final sync failure error
Which issue(s) this PR fixes
fixes #582
Special notes for your reviewer:
Checklist: