feat: Add configurable failurePolicy and timeoutSeconds for webhooks #440

dpacheconr · 2025-12-01T12:45:42Z

Summary

This PR makes the MutatingWebhookConfiguration more flexible by exposing configuration options for failurePolicy and timeoutSeconds. Previously, these values were hardcoded, preventing users from customizing webhook behavior for different environments or requirements.

Changes

New Configuration Options in values.yaml

1. `admissionWebhooks.failurePolicy`

Purpose: Controls failure behavior for Instrumentation webhooks (v1alpha2, v1beta1, v1beta2)
Default: Fail
Valid values: Fail, Ignore
Applies to:
- Instrumentation v1beta2 webhook
- Instrumentation v1beta1 webhook
- Instrumentation v1alpha2 webhook

When to use Fail (default):

Enforces strict validation of Instrumentation resources
Rejects CREATE/UPDATE operations if the webhook is unavailable
Ensures all instrumentation configs are validated before deployment

When to use Ignore:

Provides resilience when the operator may be temporarily unavailable
Allows operations to proceed even if webhook validation fails
Potentially allows misconfigured Instrumentation resources

2. `admissionWebhooks.podFailurePolicy`

Purpose: Controls failure behavior for Pod mutation webhook
Default: Ignore
Valid values: Fail, Ignore
Applies to: Pod mutation webhook (mutates pods/v1)

Why separate from failurePolicy:

Pod mutations have different failure characteristics than Instrumentation resources
Default Ignore prevents blocking critical workloads if operator is down
Can be set to Fail if strict instrumentation enforcement is required

3. `admissionWebhooks.timeoutSeconds`

Purpose: Timeout for all webhook calls (all 4 webhooks)
Default: null (uses Kubernetes default, typically 10s)
Valid range: 1-30 seconds (enforced by Kubernetes and validated by Helm)
Applies to: All 4 webhooks in the MutatingWebhookConfiguration

When to adjust:

Increase for environments with high network latency
Increase if webhook responses are slow
Must stay within Kubernetes limits (1-30 seconds)

Webhook Structure

The MutatingWebhookConfiguration contains 4 webhooks:

Instrumentation v1beta2 (minstrumentation-v1beta2.kb.io) - path: /mutate-newrelic-com-v1beta2-instrumentation
Instrumentation v1beta1 (minstrumentation-v1beta1.kb.io) - path: /mutate-newrelic-com-v1beta1-instrumentation
Instrumentation v1alpha2 (minstrumentation-v1alpha2.kb.io) - path: /mutate-newrelic-com-v1alpha2-instrumentation
Pod mutation (mpod.kb.io) - path: /mutate-v1-pod

Template Changes

Updated charts/k8s-agents-operator/templates/instrumentation-crd.yaml:

Line 189, 215, 241: Use {{ .Values.admissionWebhooks.failurePolicy }} for Instrumentation webhooks
Line 267: Use {{ .Values.admissionWebhooks.podFailurePolicy }} for Pod webhook
Lines 202-204, 228-230, 254-256, 279-281: Conditionally add timeoutSeconds when configured

Validation

Added input validation to prevent misconfigurations:

timeoutSeconds: Must be between 1 and 30 seconds (Kubernetes requirement)
failurePolicy: Must be either 'Fail' or 'Ignore'
podFailurePolicy: Must be either 'Fail' or 'Ignore'

Validation errors are raised during Helm template rendering with clear error messages.

Testing

Added comprehensive Helm unit tests (charts/k8s-agents-operator/tests/webhook_configuration_test.yaml):

15 test cases covering:
- Default values behavior
- Custom configuration application
- Combined settings
- Validation edge cases (0, 31, invalid strings)
- Boundary values (1, 30)
- Webhook naming and paths

All tests pass successfully.

Use Cases

Use Case 1: Strict Instrumentation Enforcement

admissionWebhooks:
  failurePolicy: Fail
  podFailurePolicy: Fail
  timeoutSeconds: 10

Ensures all instrumentation is validated and applied, blocking deployments if operator is unavailable.

Use Case 2: High Availability / Resilient Deployments

admissionWebhooks:
  failurePolicy: Ignore
  podFailurePolicy: Ignore
  timeoutSeconds: 5

Allows deployments to proceed even if operator is temporarily down, prioritizing availability.

Use Case 3: High Latency Environments

admissionWebhooks:
  failurePolicy: Fail
  podFailurePolicy: Ignore
  timeoutSeconds: 20

Increases timeout to handle network latency while maintaining validation for Instrumentation resources.

Backward Compatibility

All changes are fully backward compatible:

Default values match previous hardcoded behavior
Existing deployments will continue to work without changes
No breaking changes to API or behavior
E2E tests pass with default values

Testing Checklist

Helm unit tests pass (15/15 tests)
Helm template rendering works with default values
Helm template rendering works with custom values
Validation rejects invalid timeoutSeconds (0, 31, out of range)
Validation rejects invalid policy values
Documentation added to values.yaml
E2E tests pass (to be run in CI)

Documentation

Updated values.yaml with detailed comments explaining each option
Listed all 4 webhooks and their purposes
Provided examples and use cases
Explained when to use each configuration option

This change makes the MutatingWebhookConfiguration more flexible by exposing configuration options for: - failurePolicy for Instrumentation webhooks (v1alpha2, v1beta1, v1beta2) Default: Fail - podFailurePolicy for Pod mutation webhook Default: Ignore - timeoutSeconds for all webhooks (optional) Default: null (uses Kubernetes API server default, typically 10s) The MutatingWebhookConfiguration contains 4 webhooks: 1. Instrumentation v1beta2 webhook (mutates instrumentations.newrelic.com/v1beta2) 2. Instrumentation v1beta1 webhook (mutates instrumentations.newrelic.com/v1beta1) 3. Instrumentation v1alpha2 webhook (mutates instrumentations.newrelic.com/v1alpha2) 4. Pod mutation webhook (mutates pods/v1) This allows users to: - Enforce strict instrumentation by keeping failurePolicy as "Fail" - Provide more resilience by setting failurePolicy to "Ignore" - Adjust timeouts to handle network latency issues - Configure different behavior for pod mutations vs instrumentation resources All changes maintain backward compatibility with existing deployments. Added comprehensive Helm unit tests to validate webhook configuration options.

- Validate timeoutSeconds is between 1 and 30 seconds (Kubernetes requirement) - Validate failurePolicy and podFailurePolicy are either 'Fail' or 'Ignore' - Add comprehensive unit tests for validation logic (15 tests total) - Handle edge cases like timeoutSeconds=0 correctly using hasKey check

dpacheconr added 2 commits December 1, 2025 12:42

dpacheconr requested a review from a team as a code owner December 1, 2025 12:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add configurable failurePolicy and timeoutSeconds for webhooks #440

feat: Add configurable failurePolicy and timeoutSeconds for webhooks #440

Uh oh!

dpacheconr commented Dec 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat: Add configurable failurePolicy and timeoutSeconds for webhooks #440

Are you sure you want to change the base?

feat: Add configurable failurePolicy and timeoutSeconds for webhooks #440

Uh oh!

Conversation

dpacheconr commented Dec 1, 2025

Summary

Changes

New Configuration Options in values.yaml

1. admissionWebhooks.failurePolicy

2. admissionWebhooks.podFailurePolicy

3. admissionWebhooks.timeoutSeconds

Webhook Structure

Template Changes

Validation

Testing

Use Cases

Use Case 1: Strict Instrumentation Enforcement

Use Case 2: High Availability / Resilient Deployments

Use Case 3: High Latency Environments

Backward Compatibility

Testing Checklist

Documentation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `admissionWebhooks.failurePolicy`

2. `admissionWebhooks.podFailurePolicy`

3. `admissionWebhooks.timeoutSeconds`