feat: wire SIGINT/SIGTERM into root context for graceful shutdown #946
OLake currently doesn't wire OS signals into its root context, so when
the OS delivers a clean-shutdown signal — Kubernetes pod eviction,
Karpenter consolidation, docker stop, Ctrl-C in a dev shell — the
process exits without giving the CDC, backfill or destination-writer
paths any chance to reach their existing ctx.Done() branches. The
cancellation machinery is already there: pkg/waljs/pgoutput.go's
receive loop already has case <-ctx.Done(): return nil, the
drivers/abstract/{backfill,cdc,incremental}.go dispatchers already
accept and propagate context.Context, and utils/cxgroup.go's CxGroup
propagates cancellation automatically once its parent is signal-aware.
Only the root-level signal wiring was missing.
Wraps the cobra root context with signal.NotifyContext(SIGINT, SIGTERM)
in protocol.CreateRootCommand before abstract.NewAbstractDriver
captures it, and calls RootCmd.SetContext(ctx) so every subcommand
inherits a signal-aware context. The wiring is split into a small
testable helper, signalAwareRootContext, which takes a parent context
and returns the signal-aware child.
A small watcher goroutine calls stop() after the first cancellation so
a subsequent SIGINT/SIGTERM falls through to the Go runtime default
(terminate), matching the behaviour an operator hitting Ctrl-C twice
expects.
Refs datazip-inc#926
Two unit tests for signalAwareRootContext:
- TestSignalAwareRootContextCancelsOnSIGTERM verifies that delivering SIGTERM to the process cancels the returned context. Uses the standard test-as-helper-process pattern (re-execs the test binary with an env var) so the signal is delivered to a sub-process whose only running test is this one. Skips on Windows where signal semantics differ.
- TestSignalAwareRootContextPreservesParentCancellation verifies that cancelling the supplied parent context also cancels the returned context, so the helper composes correctly with caller-supplied contexts (the path used by CreateRootCommand once cobra sets a context on RootCmd).
Please update the PR description using the repository PR template instead of a custom format. Include the issue reference, checklist updates, testing details, and full change summary. For reference, you can check existing PRs or follow: |
@Jonathan-7 Can you also upload a short video walking through your code changes and demonstrating that an OLake sync completes successfully with them? It helps the reviewer understand how you implemented the changes and, most importantly, helps prevent heavily AI-generated PRs.
|
Hey @nayanj98, on the AI concern: I'm running OLake as a Kubernetes CronJob and graceful shutdown was something I needed for my own deployment. I built the Postgres driver image locally and tested the SIGTERM path against a real Postgres with CDC traffic. The change is 88 lines across two files. Recording a video is more work than just reading the diff. Happy to answer specific questions about the implementation here. If a video is a hard requirement, close it; no hard feelings.
@Jonathan-7 Could you please help resolve the review comments within the next 1–2 days? We internally need this PR; otherwise, we may have to address the comments on our end. |
- root_test.go: refactor to table-driven test covering both SIGINT and SIGTERM (datazip-inc#3); add docstrings + inline comments on both test functions (datazip-inc#4, datazip-inc#5).
- root.go: document signalAwareRootContext as a pure cancellation propagator. Source/destination consistency on cancel remains a per-driver / per-writer responsibility: the wrapper only makes process signals visible through ctx.Done() and does not make source checkpoints and destination commits atomic. Kafka's existing 2PC TODO is referenced (datazip-inc#1, datazip-inc#2).
Hey @saksham-datazip, took a closer look at the comments.

On the Kafka duplicate-data point, I couldn't reproduce it. Ran against a real Kafka 3.9.1 broker with about half a million records, killed the pod mid-flush, re-ran. Destination cleanup ran under cancellation, the partial parquet file got removed, PostCDC saw

One issue I did find on Postgres though: if SIGTERM lands right after a pgoutput keepalive, the slot advances server-side while

In response I've pushed a table-driven test covering both SIGINT and SIGTERM as you suggested, comments on both test functions, and a corrected docstring on

For @vikaxsh on the
Signed-off-by: Ankit Sharma <111491139+hash-data@users.noreply.github.com>

Description
Wires SIGINT/SIGTERM into the cobra root context so the CDC, backfill and destination-writer paths reach their existing ctx.Done() branches on Kubernetes pod eviction, docker stop or Ctrl-C, instead of being killed mid-read.

Today there is no signal handling on master: git grep -nE 'signal\.Notify|os\.Interrupt|syscall\.SIG(INT|TERM|HUP)' returns no matches. The cancellation machinery downstream is already in place:
- pkg/waljs/pgoutput.go: pgoutput receive loop already has case <-ctx.Done(): return nil.
- drivers/abstract/{backfill,cdc,incremental}.go: dispatchers already accept and propagate context.Context.
- utils/cxgroup.go: CxGroup propagates cancellation automatically once its parent is signal-aware.

Only the root-level signal wiring was missing.

protocol.CreateRootCommand now wraps the cobra root context with signal.NotifyContext(ctx, SIGINT, SIGTERM) before abstract.NewAbstractDriver captures it, and calls RootCmd.SetContext(ctx) so every subcommand inherits a signal-aware context. The wiring is split into a small testable helper, signalAwareRootContext. A small watcher goroutine calls stop() after the first cancellation so a subsequent SIGINT/SIGTERM falls through to the Go runtime default (terminate), matching the behavior an operator hitting Ctrl-C twice expects.

Scope: signal wiring alone is not full graceful shutdown; it is the prerequisite. After this change, source/destination cancellation safety remains a per-driver/per-writer responsibility: each PostCDC and each destination writer.Close gates its final commit on ctx.Done(). Tested across drivers and destinations during review: Kafka + Parquet behaves consistently (offsets not committed, partial parquet files removed on cancel, retry reprocesses cleanly). For Postgres, the in-flight keepalive in pkg/waljs/pgoutput.go acknowledges the slot independently of the defer chain, so a SIGTERM landing between a keepalive and PostCDC can leave the slot ahead of state.json. This race exists with or without this PR's signal handling, worth its own follow-up ticket.

Fixes #926
Type of change
How Has This Been Tested?
- protocol/root_test.go: TestSignalAwareRootContextCancelsOnSignal (table-driven, re-execs a helper child per signal case covering both SIGINT and SIGTERM) and TestSignalAwareRootContextPreservesParentCancellation (asserts parent-cancel propagation through the wrapper).
- sync against a disposable Postgres with logical replication and a CDC-generating writer. docker stop -t 30 on the running container produced a non-143 exit and the process reached the deferred cleanup path rather than being killed mid-read.
- go build ./..., go vet ./..., gofmt, and golangci-lint are clean.

Screenshots or Recordings
N/A — no UI/output changes.
Documentation
Related PR's (If Any):
None.