
fix(engine): centralize telemetry timer management in runtime manager #2804

Draft
cijothomas wants to merge 3 commits into open-telemetry:main from cijothomas:fix/centralize-telemetry-timers


Conversation


@cijothomas cijothomas commented May 1, 2026

Fixes #1305

Summary

Centralizes telemetry timer management in the runtime control manager.
Previously, every node independently called
start_periodic_telemetry(Duration::from_secs(1)) with a hardcoded
interval that was not configurable and not enforceable. The runtime
manager now registers telemetry timers for all nodes at pipeline
startup, using the existing engine.telemetry.reporting_interval
config (default: 1s).

This removes start_periodic_telemetry calls and cancel-handle
management from all 15 node implementations. Shutdown cancellation is
handled centrally via cancel_all().
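
For context, a minimal sketch of the per-node pattern this removes. The effect-handler and cancel-handle shapes below are simplified assumptions for illustration, not the engine's actual types:

```rust
// Illustrative sketch only: EffectHandler and CancelHandle are simplified
// stand-ins, not the real engine API.
use std::time::Duration;

struct CancelHandle;
struct EffectHandler;

impl EffectHandler {
    fn start_periodic_telemetry(&self, _interval: Duration) -> CancelHandle {
        CancelHandle
    }
}

struct ExampleExporter {
    // Every node kept a handle like this just to cancel its own timer.
    telemetry_cancel: Option<CancelHandle>,
}

impl ExampleExporter {
    fn start(&mut self, effect_handler: &EffectHandler) {
        // Before this PR: hardcoded 1s interval, chosen independently by
        // each of the 15 node implementations.
        self.telemetry_cancel =
            Some(effect_handler.start_periodic_telemetry(Duration::from_secs(1)));
    }

    fn shutdown(&mut self) {
        // Per-node cancel-handle bookkeeping, now replaced by a single
        // cancel_all() in the runtime manager.
        self.telemetry_cancel.take();
    }
}

fn main() {
    let handler = EffectHandler;
    let mut exporter = ExampleExporter { telemetry_cancel: None };
    exporter.start(&handler);
    exporter.shutdown();
}
```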

The idle perf test config is updated to use reporting_interval: 5s
with a matching 5s Prometheus scrape interval, so the idle benchmark
reflects a more realistic deployment configuration.

Notes

Perf exporter: Previously used a custom telemetry interval from its
config. This is now replaced by the engine-wide reporting_interval.
No sampling fidelity is lost — the custom interval only controlled
metric flush cadence.

start_periodic_telemetry API: The API and runtime control messages
(StartTelemetryTimer/CancelTelemetryTimer) are still present. A node
could still override the central timer. Whether to keep this override path
or remove it is still to be decided; a follow-up issue will track it.
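
A rough sketch, in terms of those messages, of how a node could still override the central timer. The enum name and variant payloads here are assumptions for illustration; the real control messages may carry different fields:

```rust
// Assumed, simplified message shapes -- the real runtime control messages
// may differ in payload and surrounding plumbing.
use std::time::Duration;

enum NodeControlMsg {
    StartTelemetryTimer { interval: Duration },
    CancelTelemetryTimer,
}

// A node wanting a different cadence could cancel the centrally registered
// timer and re-register its own, which is the escape hatch the follow-up
// issue will decide whether to keep.
fn override_central_timer(custom_interval: Duration) -> Vec<NodeControlMsg> {
    vec![
        NodeControlMsg::CancelTelemetryTimer,
        NodeControlMsg::StartTelemetryTimer { interval: custom_interval },
    ]
}

fn main() {
    let msgs = override_central_timer(Duration::from_millis(500));
    println!("would send {} control messages", msgs.len());
}
```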

Idle CPU results

Idle CPU dropped significantly with the 5s reporting interval in the
idle test config. For example, the 4-core idle benchmark went from
~0.95% to ~0.13% CPU (docker stats, 16-core host). CI perf runs will
provide authoritative numbers on dedicated hardware.

@github-actions Bot added the rust label (Pull requests that update Rust code) on May 1, 2026

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.01%. Comparing base (f018901) to head (f8ee74d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2804      +/-   ##
==========================================
- Coverage   86.03%   86.01%   -0.03%     
==========================================
  Files         704      704              
  Lines      264591   264491     -100     
==========================================
- Hits       227654   227514     -140     
- Misses      36413    36453      +40     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 86.95% <100.00%> (-0.03%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.76% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)

@cijothomas force-pushed the fix/centralize-telemetry-timers branch from b783a18 to 767831c on May 1, 2026 at 17:21, with the commit message:

Fixes open-telemetry#1305

Previously, every node (receiver, processor, exporter) independently
called effect_handler.start_periodic_telemetry(Duration::from_secs(1))
with a hardcoded 1-second interval. This was:
- Not configurable by operators
- Not enforceable (each node picked its own interval)
- A significant contributor to idle CPU (~50 millicores on 4 cores)

The runtime manager now registers telemetry timers for all nodes
centrally during pipeline startup, using the configured
engine.telemetry.reporting_interval. This:
- Removes start_periodic_telemetry calls from all 15 node files
- Eliminates per-node cancel handle management on shutdown
- Enforces a single, consistent collection cadence by construction
- Uses the existing configurable reporting_interval (default 1s)

The idle test configuration is updated to use reporting_interval: 5s
and a matching 5s Prometheus scrape interval, reducing idle CPU from
~0.9% to ~0.1% on 4 cores.

Also fixes the idle-state-template Prometheus endpoint URLs to use
the correct /api/v1 prefix.

@cijothomas force-pushed the fix/centralize-telemetry-timers branch from 767831c to dbdd6d7 on May 1, 2026 at 18:08

Copilot AI left a comment


Pull request overview

This PR centralizes periodic node telemetry scheduling inside the engine runtime-control manager so pipelines use the configured engine-wide reporting interval instead of having each node start and cancel its own telemetry timer. It also updates the idle perf test to use a slower 5s telemetry/scrape cadence that better matches the new centralized behavior.

Changes:

  • Pre-register telemetry timers for all nodes in the runtime-control manager and sync control-plane timer metrics immediately.
  • Remove per-node start_periodic_telemetry() / cancel-handle management from receivers, processors, exporters, and validation code.
  • Add a dedicated idle perf-test engine config with engine.telemetry.reporting_interval: 5s and align Prometheus scraping to 5s.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.

File Description
tools/pipeline_perf_test/test_suites/integration/templates/configs/engine/continuous/otlp-attr-otlp-idle.yaml New idle perf-test engine config with 5s telemetry interval.
tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2 Switches idle test to the new config and sets 5s scrape interval.
rust/otap-dataflow/crates/validation/src/validation_exporter.rs Removes exporter-local telemetry timer startup.
rust/otap-dataflow/crates/engine/src/processor.rs Removes processor wrapper telemetry timer lifecycle.
rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs Centralizes telemetry timer registration in the runtime manager and updates tests.
rust/otap-dataflow/crates/engine/src/control.rs Adds helper to enumerate registered node IDs.
rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs Removes topic receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs Removes syslog receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/otlp_receiver/mod.rs Removes OTLP receiver telemetry timer/cancel plumbing.
rust/otap-dataflow/crates/core-nodes/src/receivers/otap_receiver/mod.rs Removes OTAP receiver telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/receivers/internal_telemetry_receiver/mod.rs Removes internal telemetry receiver timer startup.
rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs Removes fake generator telemetry timer startup.
rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs Removes topic exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/perf_exporter/mod.rs Removes perf exporter custom telemetry timer startup.
rust/otap-dataflow/crates/core-nodes/src/exporters/parquet_exporter/mod.rs Removes parquet exporter telemetry timer management and related test.
rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_http_exporter/mod.rs Removes OTLP/HTTP exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_grpc_exporter/mod.rs Removes OTLP/gRPC exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs Removes OTAP exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/contrib-nodes/src/exporters/geneva_exporter/mod.rs Removes Geneva exporter telemetry timer/cancel handling.
rust/otap-dataflow/crates/contrib-nodes/src/exporters/azure_monitor_exporter/exporter.rs Removes Azure Monitor exporter telemetry timer/cancel handling.


Comment on lines +324 to +342
// Register telemetry timers for all nodes centrally, using the
// configured reporting interval. This replaces per-node
// start_periodic_telemetry calls and ensures a single, consistent
// collection cadence across all nodes.
for node_id in result.control_senders.node_ids() {
    result
        .telemetry_timers
        .start(node_id, result.control_plane_metrics_flush_interval);
}

// Sync the metrics shadow with the pre-registered timers so the
// `telemetry_timers.active` gauge reflects reality before the first
// scheduler tick instead of reporting 0 for one full reporting interval.
result.runtime_control_metrics.set_timer_counts(
    result.tick_timers.timer_states.len(),
    result.telemetry_timers.timer_states.len(),
);

result
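
For the shutdown side described in the PR summary, a minimal sketch of why a central registry makes cancel_all() sufficient. The TelemetryTimers shape here is an assumption for illustration, not the engine's actual type:

```rust
// Assumed, simplified registry -- the engine's real TelemetryTimers type
// and timer_states representation are not shown here.
use std::collections::HashMap;
use std::time::Duration;

#[derive(Default)]
struct TelemetryTimers {
    timer_states: HashMap<String, Duration>,
}

impl TelemetryTimers {
    // Called once per node at pipeline startup, mirroring the loop in the
    // quoted diff above.
    fn start(&mut self, node_id: String, interval: Duration) {
        let _ = self.timer_states.insert(node_id, interval);
    }

    // Shutdown no longer visits each node: clearing the registry in one
    // place cancels every telemetry timer.
    fn cancel_all(&mut self) {
        self.timer_states.clear();
    }
}

fn main() {
    let mut timers = TelemetryTimers::default();
    for node_id in ["example_receiver", "example_processor", "example_exporter"] {
        timers.start(node_id.to_string(), Duration::from_secs(1));
    }
    assert_eq!(timers.timer_states.len(), 3);
    timers.cancel_all();
    assert!(timers.timer_states.is_empty());
}
```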

Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Periodic telemetry API improvement

2 participants