Skip to content

Fix dynamic PID notification fan-out#2517

Open
MrAlias wants to merge 4 commits into
open-telemetry:mainfrom
MrAlias:fix-dynamic-pid-notification-fan-out
Open

Fix dynamic PID notification fan-out#2517
MrAlias wants to merge 4 commits into
open-telemetry:mainfrom
MrAlias:fix-dynamic-pid-notification-fan-out

Conversation

@MrAlias

@MrAlias MrAlias commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Motivation

  • The DynamicPIDSelector previously returned the same single-consumer channels from AddedPIDsNotify and RemovedNotify, making notification deliveries a work-queue instead of a broadcast and allowing competing consumers to steal add/remove batches.
  • New network/stat filters and DynamicAppIPs trackers introduced additional subscribers which created races where the application matcher could miss remove notifications, leaving deselected PIDs instrumented and leaking telemetry.

Description

  • Replaced single shared notify channels with per-subscriber channels tracked on the notifier and implemented addedNotify and removedNotify to create a new buffered channel for each caller.
  • Changed the drain loops to clone the subscriber list and fan out each PID batch to every subscriber, and to wait until at least one subscriber exists before sending.
  • Updated AddedPIDsNotify and RemovedNotify to return per-subscriber channels via the new notifier methods.
  • Updated DynamicSignalProcessEventGate to subscribe once before its select loop so repeated process events do not register orphaned notification subscribers.
  • Added regression tests for notification fan-out and stable gate subscriptions.

Testing

  • go test ./pkg/appolly/discover -run TestDynamicPIDSelector

@MrAlias MrAlias added this to the v0.10.0 milestone Jun 26, 2026
@MrAlias MrAlias requested a review from a team as a code owner June 26, 2026 16:36
@MrAlias MrAlias added bug Something isn't working go Related to Go code labels Jun 26, 2026
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.12422% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (914ffa4) to head (33c815d).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
pkg/appolly/discover/dynamic_pid_selector.go 88.18% 14 Missing and 1 partial ⚠️
pkg/appolly/dynamic_signal_gate.go 36.36% 5 Missing and 2 partials ⚠️
pkg/appolly/discover/matcher_dynamic.go 50.00% 3 Missing and 1 partial ⚠️
pkg/selection/dynamic_app_ips.go 0.00% 2 Missing ⚠️
pkg/selection/dynamic_flow_attrs.go 0.00% 2 Missing ⚠️
pkg/selection/pid_selector.go 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2517      +/-   ##
==========================================
- Coverage   69.24%   68.84%   -0.41%     
==========================================
  Files         345      349       +4     
  Lines       46747    47456     +709     
==========================================
+ Hits        32372    32672     +300     
- Misses      12330    12723     +393     
- Partials     2045     2061      +16     
Flag Coverage Δ
integration-test 50.02% <0.00%> (-0.90%) ⬇️
integration-test-arm 26.40% <0.00%> (-1.90%) ⬇️
integration-test-vm-5.15-lts ?
integration-test-vm-6.18-lts ?
k8s-integration-test 35.97% <0.00%> (+0.27%) ⬆️
oats-test 35.16% <0.00%> (-0.21%) ⬇️
unittests 63.63% <89.58%> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@MrAlias MrAlias marked this pull request as draft June 26, 2026 16:54
@MrAlias

This comment was marked as resolved.

* Fix dynamic PID notification fan-out stalls

* Preserve stalled PID notification subscribers

Keep dynamic PID notifier fan-out from blocking on a full subscriber while still preserving that subscriber for future deliveries. A slow consumer now drains from its own mailbox instead of being removed from the notifier permanently.

Update the full-subscriber tests to assert that active subscribers still receive updates and the temporarily full subscriber catches up once drained. Stub the ignored-path decorator test so it no longer depends on the host PID 1 executable.

* Bound PID notifier subscriber backlog

Bind internal dynamic PID notification subscriptions to their owning run context so shutdown removes subscribers from the notifier instead of leaving stale channels behind.

Cap each subscriber mailbox backlog so legacy non-draining subscribers cannot retain every future PID. Keep the old channel API as a compatibility fallback while routing in-repo consumers through context-aware notification helpers.

* Preserve live PID notifier backlogs

Keep bounded pending queues for legacy dynamic PID notification subscribers, which have no cancellation path and can otherwise retain every future PID event.

Allow context-bound subscribers to queue the full pending set until they drain or their context is canceled. The in-repo consumers use those subscriptions, so dynamic selector state no longer drops add or remove events during a temporary burst.

* Preserve duplicate PID notify edges

Keep every queued PID notification entry instead of deduplicating by PID inside the subscriber mailbox.

Dynamic consumers treat add and remove notifications as transitions, so repeated PID entries must survive while a subscriber is backlogged. Legacy subscribers remain bounded by pending entry count, and context-bound subscribers keep the full queued sequence until they drain or cancel.
@MrAlias MrAlias modified the milestones: v0.10.0, v0.11.0 Jun 26, 2026
MrAlias added 2 commits June 26, 2026 13:06
The repeated request-body recovery tests caused pkg/ebpf/common to fail at compile time, which also stopped the collectt lint analyzer before it could inspect the package.

Keep one copy of each test so the package builds while preserving the coverage added for request-body recovery behavior.
The lint workflow now reaches golangci-lint after the duplicate test removal and reports the length assertion in the dynamic PID selector tests.

Use testify's Len helper so the assertion matches the repository lint policy while keeping the same expected backlog bound.
@MrAlias MrAlias marked this pull request as ready for review June 26, 2026 20:51
@MrAlias

MrAlias commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

cc @damemi @grcevski

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working go Related to Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant