[AMCC-177][metric filterlist] Fix various bugs in filterlist component#47825
[AMCC-177][metric filterlist] Fix various bugs in filterlist component#47825olivielpeau wants to merge 7 commits intomainfrom
Conversation
DogStatsD workers use `flushFilterList` (a post-aggregation filter) only for histogram metrics, whose expanded names (`.count`, `.avg`, etc.) are only known after aggregation. Regular DogStatsD metrics are pre-filtered in the listeners before reaching workers. The workers were initialized with `GetMetricFilterList()` (the full filter list) but updated via `SetSamplersFilterList` with the histogram-specific subset. This mismatch caused workers to over-filter during the brief window between startup and the first `OnUpdateMetricFilterList` callback. Fix by adding `GetHistoFilterList()` to the `filterlist.Component` interface and using it during worker initialization, making it consistent with the value sent on every subsequent update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GetTagFilterList returned &fl.tagFilterList without a lock; callbacks in setTagFilterList and OnUpdateTagFilterList passed the same pointer to long-running workers. A concurrent setTagFilterList overwrite raced with any reader holding that pointer. Fix by changing ShouldStripTags to a value receiver, so tagMatcher satisfies filterlist.TagMatcher directly. GetTagFilterList now holds RLock and returns a value copy; callbacks receive a value copy too. BenchmarkTimeSampler (5 runs, -race): no statistically significant change (~76 ns/op, 0 allocs/op before and after). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…stration methods Registration callbacks now only register; callers initialize themselves directly via Get*() methods. Pass initial filter list values to NewBufferedAggregator so the aggregator is fully initialized at construction time, eliminating the post-constructor assignment pattern. Add nil-guard for TagMatcher in trackContext so nil is a valid no-op value for test callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a new test suite under test/new-e2e/tests/agent-metric-pipelines/metric-filterlist/ that verifies metrics listed in metric_filterlist are not forwarded to the intake, while unlisted metrics are forwarded normally. Uses the awshost provisioner with DogStatsD UDP delivery via /dev/udp on the remote host, and fakeintake for assertion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Various var renames/comments, all no-ops *except* the change of import in comp/filterlist/impl/tagfilterlist.go so that the correct logger is used (instead of the trace agent's logger)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 335412017e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fl.updateMetricMtx.Lock() | ||
| fl.metricFilterListUpdate = append(fl.metricFilterListUpdate, onUpdate) | ||
| fl.updateMetricMtx.Unlock() |
There was a problem hiding this comment.
Replay current metric filter list on callback registration
OnUpdateMetricFilterList now only stores the callback, so it no longer delivers the current matcher when a consumer subscribes. Both the DogStatsD server and demultiplexer subscribe after worker startup (comp/dogstatsd/server/server.go handleMessages, pkg/aggregator/demultiplexer_agent.go:303), so if an RC/local metric-filter update lands between worker construction and subscription, that update is never forwarded and workers can keep a stale filter list until a later update arrives.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Because of the order in which components are initialized and started, RC only starts pulling updates once the components that depend on it (e.g. dogstatsd, demultiplexer) have finished initialization + starting.
This explains why the scenario described above cannot happen, and therefore why delivering the current matcher in the callback is not necessary anymore.
| func (fl *FilterList) OnUpdateTagFilterList(onUpdate func(filterlistdef.TagMatcher)) { | ||
| fl.updateTagMtx.Lock() | ||
| fl.tagFilterListUpdate = append(fl.tagFilterListUpdate, onUpdate) | ||
| fl.updateTagMtx.Unlock() |
There was a problem hiding this comment.
Replay current tag filter list on callback registration
OnUpdateTagFilterList also became registration-only, which creates the same startup gap for tag filters: the demultiplexer subscribes in run (pkg/aggregator/demultiplexer_agent.go:304) after creating aggregator/sampler workers, so a tag-filter update that occurs before this subscription is dropped and tag stripping can remain stale for the process lifetime unless another update is pushed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
(same explanation as above)
Files inventory check summaryFile checks results against ancestor f833cb1c: Results for datadog-agent_7.78.0~devel.git.589.3354120.pipeline.102422554-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
21 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 9ebd458 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.12 | [-1.94, +4.18] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.12 | [-1.94, +4.18] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.55 | [+0.50, +0.60] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.55 | [+0.37, +0.72] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.28 | [+0.25, +0.32] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.04 | [-0.35, +0.43] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.03 | [-0.21, +0.26] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | +0.02 | [-0.14, +0.19] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.02 | [-0.04, +0.08] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.17, +0.20] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.11, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.00 | [-0.19, +0.18] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.03 | [-0.21, +0.14] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.06 | [-0.28, +0.17] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.06 | [-0.14, +0.02] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.08 | [-0.51, +0.35] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.10 | [-0.60, +0.40] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.14 | [-0.19, -0.09] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.21 | [-0.36, -0.05] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.29 | [-0.37, -0.21] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.30 | [-0.41, -0.20] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.42 | [-0.56, -0.28] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.55 | [-0.61, -0.49] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.79 | [-2.39, +0.80] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 705 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 275.95MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 670 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 173.71MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 494.67MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 203.56MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 346.30 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 408.27MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
What does this PR do?
This PR fixes several bugs in the
comp/filterlistcomponent that handles metric and tag filter lists (used by themetric_filterlistfeature), and adds an e2e test for metric filtering (in order to prevent any re-occurence of the regression fixed in #47604)Reviewer tips
Fixes
(mid/high impact) Data race in
GetTagFilterListand tag filter callbacks:GetTagFilterListpreviously returned&fl.tagFilterListwithout holding a lock; callbacks passed the same pointer to long-running workers, causing a race whensetTagFilterListoverwrote the value concurrently. Fixed by changingShouldStripTagsto a value receiver and makingGetTagFilterListreturn a copy underRLock.(low impact) Incorrect filter list used during DogStatsD worker initialization: Workers were initialized with
GetMetricFilterList()(the full filter list) but subsequently updated viaSetSamplersFilterListwith the histogram-specific subset. This caused over-filtering between startup and the firstOnUpdateMetricFilterListcallback. Fixed by addingGetHistoFilterList()to the interface and using it at worker initialization. Low impact because in practice this "over-filtering" should never actually impact which metrics are actually filtered or not (it only impacts performance slightly because a larger filter list is unnecessarily used).(low impact) Immediate callback invocation on registration (
OnUpdate*): Registration callbacks now only register; callers initialize viaGet*()methods directly. Initial filter list values are passed toNewBufferedAggregatorat construction time, eliminating the post-constructor assignment pattern. Added nil-guard forTagMatcherintrackContext.(low impact) Wrong logger used in
tagfilterlist.go: The file imported the trace agent logger instead of the correct one.(low impact) Warning log format in
tagfilterlist: Fixed format string.E2E Test
Added a new E2E test suite under
test/new-e2e/tests/agent-metric-pipelines/metric-filterlist/that verifies:metric_filterlistare not forwarded to the intake.Uses the
awshostprovisioner with DogStatsD UDP delivery andfakeintakefor assertions.Describe how you validated your changes
All changes should be covered by unit tests. No additional manual testing was performed.
Additional Notes