Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325
Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325cijothomas wants to merge 1 commit intoopen-telemetry:mainfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2325 +/- ##
==========================================
- Coverage 87.58% 87.55% -0.03%
==========================================
Files 571 571
Lines 194095 194550 +455
==========================================
+ Hits 169996 170339 +343
- Misses 23573 23685 +112
Partials 526 526
🚀 New features to boost your workflow:
|
| // Give pipelines a generous deadline to drain (60 s by default — | ||
| // matches the default Kubernetes terminationGracePeriodSeconds). | ||
| let deadline = | ||
| std::time::Instant::now() + std::time::Duration::from_secs(60); |
There was a problem hiding this comment.
nit: named constant for 60 secs?
There was a problem hiding this comment.
additional nit: should be configurable
|
|
||
| // ── Second signal: force exit ─────────────────────── | ||
| let signal_name = Self::recv_termination_signal().await; | ||
| otel_error!( |
There was a problem hiding this comment.
Currently it seems like it is sync, but if this ever uses a bufferred writer (which I doubt it will) this message may not get printed before exit. Maybe an eprintln! right under this for good measure?
There was a problem hiding this comment.
agree. Will swap to eprintln.
There was a problem hiding this comment.
This is the purpose of raw_error!
I would prefer to use it for consistency.
| let mut errors = Vec::new(); | ||
| for sender in &senders { | ||
| if let Err(e) = sender.try_send_shutdown( | ||
| deadline, |
There was a problem hiding this comment.
One correctness issue here: try_send_shutdown() can drop the shutdown request if the pipeline control channel is full at signal time. That makes graceful shutdown best-effort under backpressure.
For this PR, I think the smallest fix could be a bounded retry inside the existing rt.block_on(async { ... }) block, e.g. retry try_send_shutdown() a few times with a short tokio::time::sleep(...).await between attempts before giving up and logging the error. That avoids trait changes and closes the immediate gap.
As a follow-up, we can either add a proper async shutdown send on the trait or move shutdown onto a dedicated out-of-band signal such as a watch channel.
Longer term, a dedicated out-of-band shutdown signal (e.g. watch channel per pipeline) also gives us a clean reusable ShutdownHandle that supervisor and OpAMP can call directly - same shutdown path regardless of trigger source.
There was a problem hiding this comment.
+1, separate channel for control signals makes sense.
| } | ||
|
|
||
| #[cfg(not(unix))] | ||
| { |
There was a problem hiding this comment.
Here is the windows equivalent that handles both ctrl_c, ctrl_break.
#[cfg(windows)]
{
use tokio::signal::windows::{ctrl_c, ctrl_break};
let mut sigint = ctrl_c()
.expect("failed to register Ctrl-C handler");
let mut sigterm = ctrl_break()
.expect("failed to register Ctrl-Break handler");
tokio::select! {
_ = sigterm.recv() => "CTRL_BREAK (SIGTERM-equivalent)",
_ = sigint.recv() => "CTRL_C (SIGINT-equivalent)",
}
}
There was a problem hiding this comment.
FYI: On windows platform Ctrl+C can't be reliably sent to a process without console handle and it will be ignored. Safest option is to use CTRL_BREAK.
|
This pull request has been marked as stale due to lack of recent activity. It will be closed in 30 days if no further activity occurs. If this PR is still relevant, please comment or push new commits to keep it active. |
The main executable has no signal handling today— when K8s sent SIGTERM (or a local user hit Ctrl+C), the process was killed immediately without draining in-flight data.
This PR adds OS signal handling that follows the same double-signal convention as the Go OTel Collector:
First SIGINT/SIGTERM → sends graceful shutdown messages to all pipelines with a 60s drain deadline
Second signal → forces immediate exit via process::exit(1)