Skip to content

Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325

Draft
cijothomas wants to merge 1 commit intoopen-telemetry:mainfrom
cijothomas:cijothomas/shutdown
Draft

Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325
cijothomas wants to merge 1 commit intoopen-telemetry:mainfrom
cijothomas:cijothomas/shutdown

Conversation

@cijothomas
Copy link
Copy Markdown
Member

The main executable has no signal handling today— when K8s sent SIGTERM (or a local user hit Ctrl+C), the process was killed immediately without draining in-flight data.

This PR adds OS signal handling that follows the same double-signal convention as the Go OTel Collector:

First SIGINT/SIGTERM → sends graceful shutdown messages to all pipelines with a 60s drain deadline
Second signal → forces immediate exit via process::exit(1)

@github-actions github-actions Bot added the rust Pull requests that update Rust code label Mar 14, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.55%. Comparing base (bde436e) to head (573aa7d).
⚠️ Report is 207 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2325      +/-   ##
==========================================
- Coverage   87.58%   87.55%   -0.03%     
==========================================
  Files         571      571              
  Lines      194095   194550     +455     
==========================================
+ Hits       169996   170339     +343     
- Misses      23573    23685     +112     
  Partials      526      526              
Components Coverage Δ
otap-dataflow 89.57% <14.28%> (-0.05%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.61% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.44% <ø> (ø)
quiver 91.91% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

// Give pipelines a generous deadline to drain (60 s by default —
// matches the default Kubernetes terminationGracePeriodSeconds).
let deadline =
std::time::Instant::now() + std::time::Duration::from_secs(60);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: named constant for 60 secs?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional nit: should be configurable


// ── Second signal: force exit ───────────────────────
let signal_name = Self::recv_termination_signal().await;
otel_error!(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it seems like it is sync, but if this ever uses a bufferred writer (which I doubt it will) this message may not get printed before exit. Maybe an eprintln! right under this for good measure?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. Will swap to eprintln.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the purpose of raw_error!
I would prefer to use it for consistency.

let mut errors = Vec::new();
for sender in &senders {
if let Err(e) = sender.try_send_shutdown(
deadline,
Copy link
Copy Markdown
Member

@lalitb lalitb Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One correctness issue here: try_send_shutdown() can drop the shutdown request if the pipeline control channel is full at signal time. That makes graceful shutdown best-effort under backpressure.

For this PR, I think the smallest fix could be a bounded retry inside the existing rt.block_on(async { ... }) block, e.g. retry try_send_shutdown() a few times with a short tokio::time::sleep(...).await between attempts before giving up and logging the error. That avoids trait changes and closes the immediate gap.

As a follow-up, we can either add a proper async shutdown send on the trait or move shutdown onto a dedicated out-of-band signal such as a watch channel.

Longer term, a dedicated out-of-band shutdown signal (e.g. watch channel per pipeline) also gives us a clean reusable ShutdownHandle that supervisor and OpAMP can call directly - same shutdown path regardless of trigger source.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, separate channel for control signals makes sense.

}

#[cfg(not(unix))]
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the windows equivalent that handles both ctrl_c, ctrl_break.
#[cfg(windows)]
{
use tokio::signal::windows::{ctrl_c, ctrl_break};

let mut sigint = ctrl_c()
    .expect("failed to register Ctrl-C handler");
let mut sigterm = ctrl_break()
    .expect("failed to register Ctrl-Break handler");

tokio::select! {
    _ = sigterm.recv() => "CTRL_BREAK (SIGTERM-equivalent)",
    _ = sigint.recv()  => "CTRL_C (SIGINT-equivalent)",
}

}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: On windows platform Ctrl+C can't be reliably sent to a process without console handle and it will be ignored. Safest option is to use CTRL_BREAK.

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to lack of recent activity. It will be closed in 30 days if no further activity occurs. If this PR is still relevant, please comment or push new commits to keep it active.

@github-actions github-actions Bot added the stale Not actively pursued label Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update Rust code stale Not actively pursued

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

6 participants