feat: modify startup probe with INFO and final sync failure error by kyledong-suse · Pull Request #605 · rancher-sandbox/runtime-enforcer

kyledong-suse · 2026-05-05T01:33:23Z

What this PR does / why we need it:

modify startup probe with INFO and final sync failure error

Which issue(s) this PR fixes

fixes #582

Special notes for your reviewer:

Checklist:

squashed commits into logical changes
includes documentation
adds unit tests
adds or updates e2e tests

Andreagit97 · 2026-05-05T11:10:27Z

Instead of rate-limting them, what about turning all the logs into INFO (so no more errors) and then when the ctrlmgr stops we simply log a final error

	logger.InfoContext(ctx, "starting manager")
	if err = ctrlMgr.Start(ctx); err != nil {
		if !resolver.IsNRISynchronized() || !wpHandler.IsSynchronized() {
			logger.ErrorContext(ctx, "agent terminated before startup synchronization completed",
				"nriSynchronized", resolver.IsNRISynchronized(),
				"workloadPoliciesSynchronized", wpHandler.IsSynchronized(),
			)
		}
		return fmt.Errorf("failed to start manager: %w", err)
	}

In this way we don't log warn or errors because it is an expected behavior but we still return an error in case the startup failed. WDYT?

kyledong-suse · 2026-05-05T13:18:23Z

@Andreagit97, thanks for the suggestion.
That's a good alternative, but my concern is for those warnings like "failed to add pod container from NRI" and "NRI handler has not yet synchronized", do we still want to keep them as WARN or INFO? If we put them into INFO, I'll suggest to rephrase it to make it sounds like an info message instead of a warning.

Andreagit97 · 2026-05-05T13:24:06Z

If we put them into INFO, I'll suggest to rephrase it to make it sounds like an info message instead of a warning.

yep this was the idea. WDYT?

holyspectral · 2026-05-05T14:39:49Z

In addition to what @Andreagit97 said, I want to point out that some of these errors could still be valuable for us to detect performance issues and for users to take actions.

Currently the initial delay of agent's startupProbe is hardcoded as 5 seconds. Perhaps we can increase it to like 10 seconds, so we have less noise in the default scenario while keeping the warning/error log level for later.

Then we can add a case in e2e tests to make sure that no error logs generated by default.

WDYT?

On the other hand, it's probably a separate topic, but the startupProbe is not configurable right now. I think maybe it's a good timing to make it configurable, so that users having these error/warning logs, e.g., from a slow cluster or a big cluster, can change it to reduce the noise.

holyspectral · 2026-05-05T14:54:04Z

For the e2e test part, I created #606 to follow up.

kyledong-suse · 2026-05-05T15:24:59Z

After another thinking, I think we can combine our ideas.
I propose we make the per-probe logs rate-limited only for WARN(no errors), so we don’t spam logs on startups. Add a single ERROR when ctrlMgr.Start() returns and startup sync was not completed.
I don't have strong opinions but we can also additionally bump the startupProbe.initialDelaySeconds to 10s or something else.
WDYT? @Andreagit97 @holyspectral

Andreagit97 · 2026-05-05T16:02:04Z

Please note that here i'm just referring to these 2 logs

{"time":"2026-04-23T14:26:07.927529398Z","level":"WARN","msg":"NRI handler has not yet synchronized","component":"agent","component":"resolver"}
{"time":"2026-04-23T14:26:07.934065725Z","level":"ERROR","msg":"WorkloadPolicy handler is not synced","component":"agent","error":"failed to list WorkloadPolicies during HasSynced check: Timeout: failed waiting for *v1alpha1.WorkloadPolicy Informer to sync"}

IMO they should both become INFO and probably not rate-limited (we can probably raise the delay to 10s). We could add the rate-limiting but it seems probably a little bit too much for now if we turn these logs to info. One log every 10s it means 6 logs in 1 minutes, that seems reasonable, WDYT?

kyledong-suse · 2026-05-05T18:58:33Z

I agree that with rate-limited only for those 2 logs is a bit too much. But I think turning it into INFO makes regressions easier to miss.
What about we change the ERROR to WARN and keeps those logs as WARN, then increase the delay to 10s?

Andreagit97 · 2026-05-06T08:26:05Z

But I think turning it into INFO makes regressions easier to miss.

it's not clear to me what you mean here 🤔

kyledong-suse · 2026-05-06T14:52:18Z

@Andreagit97 I mean if we modify WARN to INFO, if a startup which takes longer than usual, it might be acceptable for user. However, for our QA or performance point of view, I think WARN is still valuable and we shouldn't make it silent.

Andreagit97 · 2026-05-06T15:51:47Z

I mean if we modify WARN to INFO, if a startup which takes longer than usual, it might be acceptable for user.

yeah the point it that really depends on the number of pods/policies in the cluster so it's hard to say it won't happen by default in some customer setup, if possible i would avoid to alert users where not necessary 🤔

Andreagit97

Thanks!

dottorblaster

Plus one on @Andreagit97's comments, then we're ready to merge @kyledong-suse, thanks!

Signed-off-by: Kyle Dong <kyle.dong@suse.com>

kyledong-suse self-assigned this May 5, 2026

kyledong-suse temporarily deployed to ci May 5, 2026 01:33 — with GitHub Actions Inactive

kyledong-suse force-pushed the feature branch from 20efa7a to 6d1fd1e Compare May 5, 2026 01:46

kyledong-suse temporarily deployed to ci May 5, 2026 01:46 — with GitHub Actions Inactive

kyledong-suse force-pushed the feature branch from 6d1fd1e to bc7c0fc Compare May 7, 2026 01:12

kyledong-suse temporarily deployed to ci May 7, 2026 01:12 — with GitHub Actions Inactive

kyledong-suse changed the title ~~feat: rate limit repeated startup probe errors~~ feat: modify startup probe with INFO and final sync failure error May 7, 2026

kyledong-suse marked this pull request as ready for review May 7, 2026 01:38

kyledong-suse force-pushed the feature branch from bc7c0fc to ec862f6 Compare May 7, 2026 18:30

kyledong-suse temporarily deployed to ci May 7, 2026 18:30 — with GitHub Actions Inactive

Andreagit97 reviewed May 8, 2026

View reviewed changes

Comment thread cmd/agent/main.go Outdated

Comment thread internal/nri/plugin.go Outdated

Comment thread internal/resolver/nri_api.go Outdated

dottorblaster approved these changes May 8, 2026

View reviewed changes

kyledong-suse temporarily deployed to ci May 8, 2026 13:18 — with GitHub Actions Inactive

kyledong-suse requested a review from Andreagit97 May 8, 2026 13:19

Andreagit97 approved these changes May 8, 2026

View reviewed changes

feat: modify startup probe with INFO and final sync failure error

0fc3a2c

Signed-off-by: Kyle Dong <kyle.dong@suse.com>

kyledong-suse force-pushed the feature branch from 783ef63 to 0fc3a2c Compare May 8, 2026 14:41

kyledong-suse temporarily deployed to ci May 8, 2026 14:41 — with GitHub Actions Inactive

kyledong-suse merged commit 10f4a47 into rancher-sandbox:main May 8, 2026
13 checks passed

Conversation

kyledong-suse commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Andreagit97 commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kyledong-suse commented May 5, 2026

Uh oh!

Andreagit97 commented May 5, 2026

Uh oh!

holyspectral commented May 5, 2026

Uh oh!

holyspectral commented May 5, 2026

Uh oh!

kyledong-suse commented May 5, 2026

Uh oh!

Andreagit97 commented May 5, 2026

Uh oh!

kyledong-suse commented May 5, 2026

Uh oh!

Andreagit97 commented May 6, 2026

Uh oh!

kyledong-suse commented May 6, 2026

Uh oh!

Andreagit97 commented May 6, 2026

Uh oh!

Andreagit97 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dottorblaster left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kyledong-suse commented May 5, 2026 •

edited

Loading

Andreagit97 commented May 5, 2026 •

edited

Loading