
fix(engine): centralize telemetry timer management in runtime manager #2804

Draft
cijothomas wants to merge 3 commits into open-telemetry:main from cijothomas:fix/centralize-telemetry-timers


Conversation


@cijothomas cijothomas commented May 1, 2026

Fixes #1305

Summary

Centralizes telemetry timer management in the runtime control manager.
Previously, every node independently called
start_periodic_telemetry(Duration::from_secs(1)) with a hardcoded
interval that was not configurable and not enforceable. The runtime
manager now registers telemetry timers for all nodes at pipeline
startup, using the existing engine.telemetry.reporting_interval
config (default: 1s).

This removes start_periodic_telemetry calls and cancel-handle
management from all 15 node implementations. Shutdown cancellation is
handled centrally via cancel_all().
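
For context, a minimal sketch of the per-node pattern this removes. The effect-handler and cancel-handle shapes below are simplified assumptions for illustration, not the engine's actual types:

```rust
// Illustrative sketch only: EffectHandler and CancelHandle are simplified
// stand-ins, not the real engine API.
use std::time::Duration;

struct CancelHandle;
struct EffectHandler;

impl EffectHandler {
    fn start_periodic_telemetry(&self, _interval: Duration) -> CancelHandle {
        CancelHandle
    }
}

struct ExampleExporter {
    // Every node kept a handle like this just to cancel its own timer.
    telemetry_cancel: Option<CancelHandle>,
}

impl ExampleExporter {
    fn start(&mut self, effect_handler: &EffectHandler) {
        // Before this PR: hardcoded 1s interval, chosen independently by
        // each of the 15 node implementations.
        self.telemetry_cancel =
            Some(effect_handler.start_periodic_telemetry(Duration::from_secs(1)));
    }

    fn shutdown(&mut self) {
        // Per-node cancel-handle bookkeeping, now replaced by a single
        // cancel_all() in the runtime manager.
        self.telemetry_cancel.take();
    }
}

fn main() {
    let handler = EffectHandler;
    let mut exporter = ExampleExporter { telemetry_cancel: None };
    exporter.start(&handler);
    exporter.shutdown();
}
```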

The idle perf test config is updated to use reporting_interval: 5s
with a matching 5s Prometheus scrape interval, so the idle benchmark
reflects a more realistic deployment configuration.

Notes

Perf exporter: Previously used a custom telemetry interval from its
config. This is now replaced by the engine-wide reporting_interval.
No sampling fidelity is lost — the custom interval only controlled
metric flush cadence.

start_periodic_telemetry API: The API and runtime control messages
(StartTelemetryTimer/CancelTelemetryTimer) are still present. A node
could still override the central timer. Whether to keep this override path
or remove it is still to be decided; a follow-up issue will track it.
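
A rough sketch, in terms of those messages, of how a node could still override the central timer. The enum name and variant payloads here are assumptions for illustration; the real control messages may carry different fields:

```rust
// Assumed, simplified message shapes -- the real runtime control messages
// may differ in payload and surrounding plumbing.
use std::time::Duration;

enum NodeControlMsg {
    StartTelemetryTimer { interval: Duration },
    CancelTelemetryTimer,
}

// A node wanting a different cadence could cancel the centrally registered
// timer and re-register its own, which is the escape hatch the follow-up
// issue will decide whether to keep.
fn override_central_timer(custom_interval: Duration) -> Vec<NodeControlMsg> {
    vec![
        NodeControlMsg::CancelTelemetryTimer,
        NodeControlMsg::StartTelemetryTimer { interval: custom_interval },
    ]
}

fn main() {
    let msgs = override_central_timer(Duration::from_millis(500));
    println!("would send {} control messages", msgs.len());
}
```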

Idle CPU results

Idle CPU dropped significantly with the 5s reporting interval in the
idle test config. For example, the 4-core idle benchmark went from
~0.95% to ~0.13% CPU (docker stats, 16-core host). CI perf runs will
provide authoritative numbers on dedicated hardware.

@github-actions Bot added the rust label (Pull requests that update Rust code) on May 1, 2026

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.01%. Comparing base (f018901) to head (f8ee74d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2804      +/-   ##
==========================================
- Coverage   86.03%   86.01%   -0.03%     
==========================================
  Files         704      704              
  Lines      264591   264491     -100     
==========================================
- Hits       227654   227514     -140     
- Misses      36413    36453      +40     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 86.95% <100.00%> (-0.03%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.76% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)

@cijothomas force-pushed the fix/centralize-telemetry-timers branch from b783a18 to 767831c on May 1, 2026 at 17:21, with the commit message:

Fixes open-telemetry#1305

Previously, every node (receiver, processor, exporter) independently
called effect_handler.start_periodic_telemetry(Duration::from_secs(1))
with a hardcoded 1-second interval. This was:
- Not configurable by operators
- Not enforceable (each node picked its own interval)
- A significant contributor to idle CPU (~50 millicores on 4 cores)

The runtime manager now registers telemetry timers for all nodes
centrally during pipeline startup, using the configured
engine.telemetry.reporting_interval. This:
- Removes start_periodic_telemetry calls from all 15 node files
- Eliminates per-node cancel handle management on shutdown
- Enforces a single, consistent collection cadence by construction
- Uses the existing configurable reporting_interval (default 1s)

The idle test configuration is updated to use reporting_interval: 5s
and a matching 5s Prometheus scrape interval, reducing idle CPU from
~0.9% to ~0.1% on 4 cores.

Also fixes the idle-state-template Prometheus endpoint URLs to use
the correct /api/v1 prefix.

@cijothomas force-pushed the fix/centralize-telemetry-timers branch from 767831c to dbdd6d7 on May 1, 2026 at 18:08

Copilot AI left a comment


Pull request overview

This PR centralizes periodic node telemetry scheduling inside the engine runtime-control manager so pipelines use the configured engine-wide reporting interval instead of having each node start and cancel its own telemetry timer. It also updates the idle perf test to use a slower 5s telemetry/scrape cadence that better matches the new centralized behavior.

Changes:

  • Pre-register telemetry timers for all nodes in the runtime-control manager and sync control-plane timer metrics immediately.
  • Remove per-node start_periodic_telemetry() / cancel-handle management from receivers, processors, exporters, and validation code.
  • Add a dedicated idle perf-test engine config with engine.telemetry.reporting_interval: 5s and align Prometheus scraping to 5s.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.

File Description
tools/pipeline_perf_test/test_suites/integration/templates/configs/engine/continuous/otlp-attr-otlp-idle.yaml New idle perf-test engine config with 5s telemetry interval.
tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2 Switches idle test to the new config and sets 5s scrape interval.
rust/otap-dataflow/crates/validation/src/validation_exporter.rs Removes exporter-local telemetry timer startup.
rust/otap-dataflow/crates/engine/src/processor.rs Removes processor wrapper telemetry timer lifecycle.
rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs Centralizes telemetry timer registration in the runtime manager and updates tests.
rust/otap-dataflow/crates/engine/src/control.rs Adds helper to enumerate registered node IDs.
rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs Removes topic receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs Removes syslog receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/otlp_receiver/mod.rs Removes OTLP receiver telemetry timer/cancel plumbing.
rust/otap-dataflow/crates/core-nodes/src/receivers/otap_receiver/mod.rs Removes OTAP receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/internal_telemetry_receiver/mod.rs Removes internal telemetry receiver timer startup.
rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs Removes fake generator telemetry timer startup.
rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs Removes topic exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/perf_exporter/mod.rs Removes perf exporter custom telemetry timer startup.
rust/otap-dataflow/crates/core-nodes/src/exporters/parquet_exporter/mod.rs Removes parquet exporter telemetry timer management and related test.
rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_http_exporter/mod.rs Removes OTLP/HTTP exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_grpc_exporter/mod.rs Removes OTLP/gRPC exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs Removes OTAP exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/contrib-nodes/src/exporters/geneva_exporter/mod.rs Removes Geneva exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/contrib-nodes/src/exporters/azure_monitor_exporter/exporter.rs Removes Azure Monitor exporter telemetry timer/cancel handling.


Comment on lines +324 to +342
// Register telemetry timers for all nodes centrally, using the
// configured reporting interval. This replaces per-node
// start_periodic_telemetry calls and ensures a single, consistent
// collection cadence across all nodes.
for node_id in result.control_senders.node_ids() {
    result
        .telemetry_timers
        .start(node_id, result.control_plane_metrics_flush_interval);
}

// Sync the metrics shadow with the pre-registered timers so the
// `telemetry_timers.active` gauge reflects reality before the first
// scheduler tick instead of reporting 0 for one full reporting interval.
result.runtime_control_metrics.set_timer_counts(
    result.tick_timers.timer_states.len(),
    result.telemetry_timers.timer_states.len(),
);

result
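
For the shutdown side described in the PR summary, a minimal sketch of why a central registry makes cancel_all() sufficient. The TelemetryTimers shape here is an assumption for illustration, not the engine's actual type:

```rust
// Assumed, simplified registry -- the engine's real TelemetryTimers type
// and timer_states representation are not shown here.
use std::collections::HashMap;
use std::time::Duration;

#[derive(Default)]
struct TelemetryTimers {
    timer_states: HashMap<String, Duration>,
}

impl TelemetryTimers {
    // Called once per node at pipeline startup, mirroring the loop in the
    // quoted diff above.
    fn start(&mut self, node_id: String, interval: Duration) {
        let _ = self.timer_states.insert(node_id, interval);
    }

    // Shutdown no longer visits each node: clearing the registry in one
    // place cancels every telemetry timer.
    fn cancel_all(&mut self) {
        self.timer_states.clear();
    }
}

fn main() {
    let mut timers = TelemetryTimers::default();
    for node_id in ["example_receiver", "example_processor", "example_exporter"] {
        timers.start(node_id.to_string(), Duration::from_secs(1));
    }
    assert_eq!(timers.timer_states.len(), 3);
    timers.cancel_all();
    assert!(timers.timer_states.is_empty());
}
```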

Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Periodic telemetry API improvement

2 participants