
feat: prevent containers from running in NRI if we fail to apply policy to them #392

Open
holyspectral wants to merge 1 commit into rancher-sandbox:main from holyspectral:prevent-container-from-running-nri

Conversation

@holyspectral
Collaborator

@holyspectral holyspectral commented Mar 11, 2026

What this PR does / why we need it:

As part of error handling, when we fail to apply protection to a container, we now fail the container creation flow by default. Users can override this behavior via the NRI_FAILOPEN environment variable.

When a container is prevented from starting, logs can be seen in these places:

In our log:

{"time":"2026-03-11T20:57:12.766551002Z","level":"ERROR","msg":"Runtime-enforcer has prevented the container from starting. To change this behavior, set environment variable NRI_FAILOPEN to true","component":"agent","component":"nri-handler","component":"nri-plugin","reason":"failed to add pod container from NRI","containerName":"ubuntu","podName":"ubuntu-deployment-595f9465f7-dnstl","error":"SOME ERROR"}

containerd:

Mar 11 20:59:31 kind-control-plane containerd[122398]: time="2026-03-11T20:59:31.776225421Z" level=error msg="NRI container start failed" error="rpc error: code = Unknown desc = failed to add pod container from NRI: SOME ERROR. Runtime-enforcer has prevented the container 'ubuntu-cronjob-29554377-5tzw4/ubuntu' from starting. To change this behavior, set environment variable NRI_FAILOPEN to true"

kubernetes:

    lastState:                                                          
      terminated:                                                       
        containerID: containerd://2fa1c89c84c448d36d878a44439cc2901bc2939b58be1bc5f3751be31ac4b23d
        exitCode: 128                                                   
        finishedAt: "2026-03-11T20:57:28Z"                              
        message: 'NRI container start failed: rpc error: code = Unknown desc = failed
          to add pod container from NRI: SOME ERROR. Runtime-enforcer has prevented
          the container ''ubuntu-deployment-595f9465f7-dnstl/ubuntu'' from starting.
          To change this behavior, set environment variable NRI_FAILOPEN to true'
        reason: StartError                                              
        startedAt: "1970-01-01T00:00:00Z"         

Which issue(s) this PR fixes

fixes #262

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@holyspectral holyspectral self-assigned this Mar 11, 2026
@holyspectral holyspectral added the enhancement New feature or request label Mar 11, 2026
@holyspectral holyspectral marked this pull request as draft March 11, 2026 20:28
@holyspectral holyspectral force-pushed the prevent-container-from-running-nri branch from e5c558c to 45cfd00 on March 11, 2026 20:49
@holyspectral holyspectral marked this pull request as ready for review March 11, 2026 21:02
As part of error handling, when we fail to apply protection to a
container, we fail the container creation flow by default.

Users can set the NRI_FAILOPEN environment variable to change this behavior.

Signed-off-by: Sam Wang (holyspectral) <sam.wang@suse.com>
@holyspectral holyspectral force-pushed the prevent-container-from-running-nri branch from 45cfd00 to 1f93cdd on March 11, 2026 22:44
Collaborator

@Andreagit97 Andreagit97 left a comment


Thank you!

"namespace", pod.GetNamespace(),
"error", err,
)
return nil, fmt.Errorf("failed to add pod container from NRI: %w", err)
Collaborator


If we return an error from StartContainer, the container won't start. What happens if we return an error here in Synchronize?

p := &plugin{
logger: logger.With("component", "nri-plugin"),
resolver: resolver,
failOpen: os.Getenv("NRI_FAILOPEN") == "true",
Collaborator


How should the end user set this environment variable? I would expect a Helm field in values.yaml, WDYT?

Collaborator Author


Yeah, this is something we can talk about. Today we can already specify the environment variable via agent.env, but we could also make it more explicit by providing a separate option. WDYT?

Collaborator


Yeah, this seems like an important feature for users, so I would probably add a new Helm field.
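Per the discussion above, the variable can already be injected through the chart's existing agent.env mechanism. A minimal sketch of what a values.yaml override might look like; the exact key structure is assumed from the comment above and not verified against the chart:

```yaml
# values.yaml (illustrative; exact structure depends on the chart's agent.env support)
agent:
  env:
    - name: NRI_FAILOPEN
      value: "true"   # fail open: containers start even if policy application fails
```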

"containerName", container.GetName(),
)

handleError := func(reason string, err error) error {
Collaborator


Can we simplify this a little bit to avoid duplication?

	handleError := func(reason string, err error) error {
		// fail open defaults
		var errNRI error
		msg := "container is starting WITHOUT enforcement due to NRI_FAILOPEN"
		if !p.failOpen {
			errNRI = fmt.Errorf("%s: %w. Runtime-enforcer has prevented the container '%s/%s' from starting. To change this behavior, set environment variable NRI_FAILOPEN to true",
				reason, err, pod.GetName(), container.GetName())
			msg = errNRI.Error()
		}
		p.logger.ErrorContext(
			ctx,
			msg,
			"containerID", container.GetId(),
			"containerName", container.GetName(),
			"podName", pod.GetName(),
			"podID", pod.GetUid(),
		)
		return errNRI
	}

Collaborator


In the end, we can log at error verbosity in both cases, since the container is not protected either way.

func (r *Resolver) applyPolicyToPodIfPresent(state *podEntry) error {
policyName := state.policyName()

// if the policy doesn't have the label we do nothing
Collaborator


Suggested change
// if the pod doesn't have the label we do nothing

state.podName(),
state.podNamespace(),
policyName,
// This can happen when the pod runs before the policy is created/reconciled when using GitOps to deploy.
Collaborator


It's really up to us what we want to do. I'm also fine with returning an error since the pod will start without protection, but the user might think this is protected...



Development

Successfully merging this pull request may close these issues.

Design: design a better management for policy failures
