Skip to content

feat: modify startup probe with INFO and final sync failure error#605

Merged
kyledong-suse merged 1 commit into
rancher-sandbox:mainfrom
kyledong-suse:feature
May 8, 2026
Merged

feat: modify startup probe with INFO and final sync failure error#605
kyledong-suse merged 1 commit into
rancher-sandbox:mainfrom
kyledong-suse:feature

Conversation

@kyledong-suse
Copy link
Copy Markdown
Collaborator

@kyledong-suse kyledong-suse commented May 5, 2026

What this PR does / why we need it:

modify startup probe with INFO and final sync failure error

Which issue(s) this PR fixes

fixes #582

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@Andreagit97
Copy link
Copy Markdown
Collaborator

Andreagit97 commented May 5, 2026

Instead of rate-limting them, what about turning all the logs into INFO (so no more errors) and then when the ctrlmgr stops we simply log a final error

	logger.InfoContext(ctx, "starting manager")
	if err = ctrlMgr.Start(ctx); err != nil {
		if !resolver.IsNRISynchronized() || !wpHandler.IsSynchronized() {
			logger.ErrorContext(ctx, "agent terminated before startup synchronization completed",
				"nriSynchronized", resolver.IsNRISynchronized(),
				"workloadPoliciesSynchronized", wpHandler.IsSynchronized(),
			)
		}
		return fmt.Errorf("failed to start manager: %w", err)
	}

In this way we don't log warn or errors because it is an expected behavior but we still return an error in case the startup failed. WDYT?

@kyledong-suse
Copy link
Copy Markdown
Collaborator Author

@Andreagit97, thanks for the suggestion.
That's a good alternative, but my concern is for those warnings like "failed to add pod container from NRI" and "NRI handler has not yet synchronized", do we still want to keep them as WARN or INFO? If we put them into INFO, I'll suggest to rephrase it to make it sounds like an info message instead of a warning.

@Andreagit97
Copy link
Copy Markdown
Collaborator

If we put them into INFO, I'll suggest to rephrase it to make it sounds like an info message instead of a warning.

yep this was the idea. WDYT?

@holyspectral
Copy link
Copy Markdown
Collaborator

In addition to what @Andreagit97 said, I want to point out that some of these errors could still be valuable for us to detect performance issues and for users to take actions.

Currently the initial delay of agent's startupProbe is hardcoded as 5 seconds. Perhaps we can increase it to like 10 seconds, so we have less noise in the default scenario while keeping the warning/error log level for later.

Then we can add a case in e2e tests to make sure that no error logs generated by default.

WDYT?

On the other hand, it's probably a separate topic, but the startupProbe is not configurable right now. I think maybe it's a good timing to make it configurable, so that users having these error/warning logs, e.g., from a slow cluster or a big cluster, can change it to reduce the noise.

@holyspectral
Copy link
Copy Markdown
Collaborator

For the e2e test part, I created #606 to follow up.

@kyledong-suse
Copy link
Copy Markdown
Collaborator Author

After another thinking, I think we can combine our ideas.
I propose we make the per-probe logs rate-limited only for WARN(no errors), so we don’t spam logs on startups. Add a single ERROR when ctrlMgr.Start() returns and startup sync was not completed.
I don't have strong opinions but we can also additionally bump the startupProbe.initialDelaySeconds to 10s or something else.
WDYT? @Andreagit97 @holyspectral

@Andreagit97
Copy link
Copy Markdown
Collaborator

Please note that here i'm just referring to these 2 logs

{"time":"2026-04-23T14:26:07.927529398Z","level":"WARN","msg":"NRI handler has not yet synchronized","component":"agent","component":"resolver"}
{"time":"2026-04-23T14:26:07.934065725Z","level":"ERROR","msg":"WorkloadPolicy handler is not synced","component":"agent","error":"failed to list WorkloadPolicies during HasSynced check: Timeout: failed waiting for *v1alpha1.WorkloadPolicy Informer to sync"}

IMO they should both become INFO and probably not rate-limited (we can probably raise the delay to 10s). We could add the rate-limiting but it seems probably a little bit too much for now if we turn these logs to info. One log every 10s it means 6 logs in 1 minutes, that seems reasonable, WDYT?

@kyledong-suse
Copy link
Copy Markdown
Collaborator Author

I agree that with rate-limited only for those 2 logs is a bit too much. But I think turning it into INFO makes regressions easier to miss.
What about we change the ERROR to WARN and keeps those logs as WARN, then increase the delay to 10s?

@Andreagit97
Copy link
Copy Markdown
Collaborator

But I think turning it into INFO makes regressions easier to miss.

it's not clear to me what you mean here 🤔

@kyledong-suse
Copy link
Copy Markdown
Collaborator Author

@Andreagit97 I mean if we modify WARN to INFO, if a startup which takes longer than usual, it might be acceptable for user. However, for our QA or performance point of view, I think WARN is still valuable and we shouldn't make it silent.

@Andreagit97
Copy link
Copy Markdown
Collaborator

I mean if we modify WARN to INFO, if a startup which takes longer than usual, it might be acceptable for user.

yeah the point it that really depends on the number of pods/policies in the cluster so it's hard to say it won't happen by default in some customer setup, if possible i would avoid to alert users where not necessary 🤔

@kyledong-suse kyledong-suse changed the title feat: rate limit repeated startup probe errors feat: modify startup probe with INFO and final sync failure error May 7, 2026
@kyledong-suse kyledong-suse marked this pull request as ready for review May 7, 2026 01:38
Copy link
Copy Markdown
Collaborator

@Andreagit97 Andreagit97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread cmd/agent/main.go Outdated
Comment thread internal/nri/plugin.go Outdated
Comment thread internal/resolver/nri_api.go Outdated
Copy link
Copy Markdown
Member

@dottorblaster dottorblaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus one on @Andreagit97's comments, then we're ready to merge @kyledong-suse, thanks!

Signed-off-by: Kyle Dong <kyle.dong@suse.com>
@kyledong-suse kyledong-suse merged commit 10f4a47 into rancher-sandbox:main May 8, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consistent error logs on startup

4 participants