Skip to content

feat: introduce policy event logs scope#491

Open
m00lecule wants to merge 15 commits into
aws:mainfrom
m00lecule:feat-policy-event-logs-scope
Open

feat: introduce policy event logs scope#491
m00lecule wants to merge 15 commits into
aws:mainfrom
m00lecule:feat-policy-event-logs-scope

Conversation

@m00lecule

@m00lecule m00lecule commented Nov 30, 2025

Copy link
Copy Markdown
Contributor

Issues:
#482
#463
#464 (comment)
#427

Description of changes:
This feature allows to narrow the ebpf programs events scope (both ACCEPT and DENY or only DENY). To filter out the ACCEPT logs and metrics please specify --enable-policy-event-logs=true and --policy-event-logs-scope=deny.

The policy event logs volume can be overwhelming. The high CPU usage comes usually from userspace process processing the ebpf events published to RINGBUF, which are mostly the ACCEPT events.

Secondly the policy events logs are critical due to observability reasons. There is no command to confirm the NP is dropping traffic. The only way is to enable enable-policy-event-logs=true, which is quite resources heavy and might decrease environment performance.

Finally, by reducing the ACCEPT events we can get a guarantee that all DENY events are written to log files. The policy_events RINGBUF has a fixed size (https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/c/v4events.bpf.c#L53), when exceeding the buffer capacity some old events are dropped and eventually missing in the log files.

Alternatively the feature could be implemented as:

  • userspace logger logLevel, however it would still cause high CPU usage due to high volume processing caused by ACCEPT events,
  • two RING_BUFFERS storing respectively DENY and ACCEPT events separately (probably cleaner approach, less ebpf maps lookups)
  • configure the policy event logs scope per traffic type (egress, ingress) or even per replica set (configuration based on vpc-cni NP config and deployment annotations)

The goal is to discuss the overall architecture. Please consider this PR as prototype, not yet a final implementation.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@m00lecule m00lecule requested a review from a team as a code owner November 30, 2025 22:36
@m00lecule m00lecule force-pushed the feat-policy-event-logs-scope branch from 063e7f0 to 73e00bb Compare November 30, 2025 22:37
@m00lecule

Copy link
Copy Markdown
Contributor Author

@viveksb007 kindly asking for an initial review

@m00lecule m00lecule force-pushed the feat-policy-event-logs-scope branch 5 times, most recently from 47261be to 9f029ee Compare December 3, 2025 18:24
Adding Cluster Network Policy change (aws#496)
@m00lecule m00lecule force-pushed the feat-policy-event-logs-scope branch 3 times, most recently from b7f008a to 4649095 Compare December 16, 2025 21:17
@m00lecule m00lecule force-pushed the feat-policy-event-logs-scope branch from 4649095 to 3dfcb11 Compare December 16, 2025 21:18
@m00lecule m00lecule force-pushed the feat-policy-event-logs-scope branch from e3483fb to 63a4ba6 Compare December 19, 2025 22:57
@m00lecule

m00lecule commented Dec 28, 2025

Copy link
Copy Markdown
Contributor Author

@jayanthvn @viveksb007 May I ask You to find someone to review this PR?

@Pavani-Panakanti Pavani-Panakanti requested a review from Copilot June 23, 2026 23:25

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “policy event logs scope” control to reduce eBPF policy event volume by filtering which verdicts (ACCEPT+DENY vs DENY-only) are emitted into the shared policy_events ring buffer, addressing high CPU usage and log volume concerns when policy event logging is enabled.

Changes:

  • Introduces a new globally pinned eBPF map (policy_events_scope) to hold the desired event scope.
  • Updates TC ingress/egress programs (v4/v6) to consult the scope map before publishing to the policy_events ring buffer.
  • Adds a new controller flag and wires it through to the eBPF client to program the scope map at startup; updates cleanup test to allow the new pinned map.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test/agent/cmd/check-bpf-cleanup-agent/main.go Allows the new globally pinned global_policy_events_scope map during cleanup verification.
pkg/ebpf/c/v4events.bpf.c Adds the new global policy_events_scope map alongside existing global maps.
pkg/ebpf/c/v6events.bpf.c Same as v4: adds the new global policy_events_scope map.
pkg/ebpf/c/tc.v4ingress.bpf.c Gates ringbuf event emission via policy_events_scope.
pkg/ebpf/c/tc.v4egress.bpf.c Gates ringbuf event emission via policy_events_scope.
pkg/ebpf/c/tc.v6ingress.bpf.c Gates ringbuf event emission via policy_events_scope.
pkg/ebpf/c/tc.v6egress.bpf.c Gates ringbuf event emission via policy_events_scope.
pkg/ebpf/bpf_client.go Adds scope-map constants/types, recovery/install handling, and programs the scope map based on config.
pkg/config/controller_config.go Adds --policy-event-logs-scope flag and stores it in controller config.
main.go Passes the new scope setting into NewBpfClient.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/ebpf/bpf_client.go
Comment on lines +241 to +244
if mapName == AWS_EVENTS_SCOPE_MAP {
policyEventsScopeMap = mapInfo
isPolicyEventsScopeMapPresent = true
}
Comment thread pkg/ebpf/bpf_client.go
Comment on lines +265 to +273
recoveredPolicyEventsScopeMap, ok := ebpfClient.globalMaps.Load(POLICY_EVENTS_SCOPE_MAP_PIN_PATH)
if ok {
policyEventsScopeMap = recoveredPolicyEventsScopeMap.(goebpfmaps.BpfMap)
log().Info("Derived existing policyEventsScopeMap identifier")
} else {
log().Errorf("Unable to get policyEventsScopeMap post recovery..error: %v", err)
sdkAPIErr.WithLabelValues("RecoveryFailed").Inc()
return nil, err
}
Comment thread pkg/ebpf/bpf_client.go
Comment on lines +275 to +290
key := uint32(POLICY_EVENTS_SCOPE_MAP_KEY)
scope := uint8(utils.ACCEPT.Index())

if policyEventsLogsScope == "DENY" {
scope = uint8(utils.DENY.Index())
}

value := policy_scope{scope: scope}
log().Infof("Will update Policy Events Scope Map: key=%d value=%v", key, value)
err := policyEventsScopeMap.CreateUpdateMapEntry(uintptr(unsafe.Pointer(&key)), uintptr(unsafe.Pointer(&value)), 0)

if err != nil {
log().Errorf("Policy Events Scope Map update failed: %v", err)
sdkAPIErr.WithLabelValues("updateEbpfMap-policy-events-scope").Inc()
}
log().Infof("Updated Policy Events Scope Map: key=%d value=%v", key, value)
Comment on lines +87 to +88
fs.StringVar(&cfg.PolicyEventLogsScope, flagPolicyEventLogsScope, defaultPolicyEventLogsScope, ""+
"Set the policy event logs scope, if set to ACCEPT both ACCEPT and DENY events are generated, if set to DENY only DENY events are generated - ACCEPT, DENY")
Comment thread pkg/ebpf/c/v4events.bpf.c
Comment on lines +63 to +70
struct bpf_map_def_pvt SEC("maps") policy_events_scope = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__u32), // default key = 0. We are storing a single entry for both ingress and egress traffic
.value_size = sizeof(struct policy_scope),
.max_entries = 1,
.map_flags = BPF_F_NO_PREALLOC,
.pinning = PIN_GLOBAL_NS,
};
Comment thread pkg/ebpf/c/v6events.bpf.c
Comment on lines +66 to +73
struct bpf_map_def_pvt SEC("maps") policy_events_scope = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__u32), // default key = 0. We are storing a single entry for both ingress and egress traffic
.value_size = sizeof(struct policy_scope),
.max_entries = 1,
.map_flags = BPF_F_NO_PREALLOC,
.pinning = PIN_GLOBAL_NS,
};
Comment thread pkg/ebpf/bpf_client.go
}
if mapName == AWS_EVENTS_SCOPE_MAP {
policyEventsScopeMap = mapInfo
isPolicyEventsScopeMapPresent = true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this being set to true here ?

Comment thread pkg/ebpf/bpf_client.go
key := uint32(POLICY_EVENTS_SCOPE_MAP_KEY)
scope := uint8(utils.ACCEPT.Index())

if policyEventsLogsScope == "DENY" {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there could be misconfigurations like "deny" we should validate the flag and log error if misconfigured, behavior can be fallback to default accept scope


static void publishPolicyEvent(struct data_t *evt) {
__u32 plsc_key = 0;
struct policy_scope *plsc = bpf_map_lookup_elem(&policy_events_scope, &plsc_key);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are adding a new bpf map lookup in datapath. We should evaluate performance

@Pavani-Panakanti

Copy link
Copy Markdown
Collaborator

Hey @m00lecule, Thanks for these changes. Overall problem and direction of gating events before putting then in ring buffer looks good. We need to add some unit tests and integration tests for these changes. As we are adding a bpf map and making bpf file changes, the changes needs some extensive testing with upgrades, downgrades and recovery with restarts.
Can you also provide some scale numbers you are working with to see how much improvement this provides in cpu ?

@m00lecule

m00lecule commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Hey @m00lecule, Thanks for these changes. Overall problem and direction of gating events before putting then in ring buffer looks good. We need to add some unit tests and integration tests for these changes. As we are adding a bpf map and making bpf file changes, the changes needs some extensive testing with upgrades, downgrades and recovery with restarts. Can you also provide some scale numbers you are working with to see how much improvement this provides in cpu ?

cluster ~20 nodes and ~2000 pods. @Pavani-Panakanti will you takeover this PR (implement missing parts)? In general we don't want to see more than 300m cpu utilization when only DENY events are logged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants