Skip to content

feat: wire SIGINT/SIGTERM into root context for graceful shutdown#946

Merged
vikaxsh merged 9 commits into
datazip-inc:stagingfrom
Jonathan-7:feat/sigterm-context-clean
May 23, 2026
Merged

feat: wire SIGINT/SIGTERM into root context for graceful shutdown#946
vikaxsh merged 9 commits into
datazip-inc:stagingfrom
Jonathan-7:feat/sigterm-context-clean

Conversation

@Jonathan-7
Copy link
Copy Markdown
Contributor

@Jonathan-7 Jonathan-7 commented May 8, 2026

Description

Wires SIGINT/SIGTERM into the cobra root context so the CDC, backfill and destination-writer paths reach their existing ctx.Done() branches on Kubernetes pod eviction, docker stop or Ctrl-C, instead of being killed mid-read.

Today there is no signal handling on master — git grep -nE 'signal\.Notify|os\.Interrupt|syscall\.SIG(INT|TERM|HUP)' returns no matches. The cancellation machinery downstream is already in place:

  • pkg/waljs/pgoutput.go — pgoutput receive loop already has case <-ctx.Done(): return nil.
  • drivers/abstract/{backfill,cdc,incremental}.go — dispatchers already accept and propagate context.Context.
  • utils/cxgroup.goCxGroup propagates cancellation automatically once its parent is signal-aware.

Only the root-level signal wiring was missing.

protocol.CreateRootCommand now wraps the cobra root context with signal.NotifyContext(ctx, SIGINT, SIGTERM) before abstract.NewAbstractDriver captures it, and calls RootCmd.SetContext(ctx) so every subcommand inherits a signal-aware context. The wiring is split into a small testable helper, signalAwareRootContext. A small watcher goroutine calls stop() after the first cancellation so a subsequent SIGINT/SIGTERM falls through to the Go runtime default (terminate), matching the behavior an operator hitting Ctrl-C twice expects.

func CreateRootCommand(_ bool, driver any) *cobra.Command {
    RootCmd.AddCommand(commands...)
    ctx := signalAwareRootContext(RootCmd.Context())
    RootCmd.SetContext(ctx)
    connector = abstract.NewAbstractDriver(ctx, driver.(abstract.DriverInterface))
    return RootCmd
}

func signalAwareRootContext(parent context.Context) context.Context {
    ctx, stop := signal.NotifyContext(parent, syscall.SIGINT, syscall.SIGTERM)
    go func() { <-ctx.Done(); stop() }() // restore default after first cancellation
    return ctx
}

Scope — signal wiring alone is not full graceful shutdown; it is the prerequisite. After this change, source/destination cancellation safety remains a per-driver/per-writer responsibility: each PostCDC and each destination writer.Close gates its final commit on ctx.Done(). Tested across drivers and destinations during review: Kafka + Parquet behaves consistently (offsets not committed, partial parquet files removed on cancel, retry reprocesses cleanly). For Postgres, the in-flight keepalive in pkg/waljs/pgoutput.go acknowledges the slot independently of the defer chain, so a SIGTERM landing between a keepalive and PostCDC can leave the slot ahead of state.json. This race exists with or without this PR's signal handling, worth its own follow-up ticket.

Fixes #926

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • Unit tests in protocol/root_test.goTestSignalAwareRootContextCancelsOnSignal (table-driven, re-execs a helper child per signal case covering both SIGINT and SIGTERM) and TestSignalAwareRootContextPreservesParentCancellation (asserts parent-cancel propagation through the wrapper).
  • Manual: built the Postgres driver image locally and ran sync against a disposable Postgres with logical replication and a CDC-generating writer. docker stop -t 30 on the running container produced a non-143 exit and the process reached the deferred cleanup path rather than being killed mid-read.

go build ./..., go vet ./..., gofmt, and golangci-lint are clean.

Screenshots or Recordings

N/A — no UI/output changes.

Documentation

  • N/A (internal signal-handling improvement, no user-facing surface change)

Related PR's (If Any):

None.

Jonathan-7 added 2 commits May 8, 2026 18:00
OLake currently doesn't wire OS signals into its root context, so when
the OS delivers a clean-shutdown signal — Kubernetes pod eviction,
Karpenter consolidation, docker stop, Ctrl-C in a dev shell — the
process exits without giving the CDC, backfill or destination-writer
paths any chance to reach their existing ctx.Done() branches. The
cancellation machinery is already there: pkg/waljs/pgoutput.go's
receive loop already has case <-ctx.Done(): return nil, the
drivers/abstract/{backfill,cdc,incremental}.go dispatchers already
accept and propagate context.Context, and utils/cxgroup.go's CxGroup
propagates cancellation automatically once its parent is signal-aware.
Only the root-level signal wiring was missing.

Wraps the cobra root context with signal.NotifyContext(SIGINT, SIGTERM)
in protocol.CreateRootCommand before abstract.NewAbstractDriver
captures it, and calls RootCmd.SetContext(ctx) so every subcommand
inherits a signal-aware context. The wiring is split into a small
testable helper, signalAwareRootContext, which takes a parent context
and returns the signal-aware child.

A small watcher goroutine calls stop() after the first cancellation so
a subsequent SIGINT/SIGTERM falls through to the Go runtime default
(terminate), matching the behaviour an operator hitting Ctrl-C twice
expects.

Refs datazip-inc#926
Two unit tests for signalAwareRootContext:

- TestSignalAwareRootContextCancelsOnSIGTERM verifies that delivering
  SIGTERM to the process cancels the returned context. Uses the
  standard test-as-helper-process pattern (re-execs the test binary
  with an env var) so the signal is delivered to a sub-process whose
  only running test is this one. Skips on Windows where signal
  semantics differ.

- TestSignalAwareRootContextPreservesParentCancellation verifies that
  cancelling the supplied parent context also cancels the returned
  context, so the helper composes correctly with caller-supplied
  contexts (the path used by CreateRootCommand once cobra sets a
  context on RootCmd).
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@Jonathan-7 Jonathan-7 temporarily deployed to integration_tests May 10, 2026 16:25 — with GitHub Actions Inactive
@saksham-datazip
Copy link
Copy Markdown
Collaborator

Hey, please change the base branch of this PR from master to staging.
image

@saksham-datazip
Copy link
Copy Markdown
Collaborator

Please update the PR description using the repository PR template instead of a custom format. Include the issue reference, checklist updates, testing details, and full change summary.

For reference, you can check existing PRs or follow:
https://olake.io/docs/community/issues-and-prs/

@nayanj98
Copy link
Copy Markdown
Collaborator

nayanj98 commented May 11, 2026

@Jonathan-7 Can you also upload a short video of you showing your code fixes and demonstrating with your changes the olake sync happens successfully. It really helps for the reviewer to also understand how you have implemented your changes and most importantly to prevent heavily AI generated PR's.

@Jonathan-7 Jonathan-7 changed the base branch from master to staging May 14, 2026 13:29
@Jonathan-7
Copy link
Copy Markdown
Contributor Author

Hey @nayanj98 on the AI concern: I'm running OLake as a Kubernetes CronJob and graceful shutdown was something I needed for my own deployment. I built the Postgres driver image locally and tested the SIGTERM path against a real Postgres with CDC traffic. The docker stop -t 30 recipe in the body is what I ran.

88 lines across two files. Recording a video is more work than just reading the diff. Happy to answer specific questions about the implementation here. If a video is a hard requirement close it, no hard feelings.

@Jonathan-7 Jonathan-7 requested a deployment to integration_tests May 16, 2026 05:29 — with GitHub Actions Waiting
Comment thread protocol/root.go
Comment thread protocol/root.go
Comment thread protocol/root_test.go Outdated
Comment thread protocol/root_test.go
Comment thread protocol/root_test.go
@saksham-datazip
Copy link
Copy Markdown
Collaborator

@Jonathan-7 Could you please help resolve the review comments within the next 1–2 days? We internally need this PR; otherwise, we may have to address the comments on our end.

@Jonathan-7 Jonathan-7 temporarily deployed to integration_tests May 21, 2026 06:49 — with GitHub Actions Inactive
@Jonathan-7 Jonathan-7 requested a deployment to integration_tests May 21, 2026 06:49 — with GitHub Actions Waiting
Comment thread protocol/root.go
- root_test.go: refactor to table-driven test covering both SIGINT and
  SIGTERM (datazip-inc#3); add docstrings + inline comments on both test functions
  (datazip-inc#4, datazip-inc#5).
- root.go: document signalAwareRootContext as a pure cancellation
  propagator. Source/destination consistency on cancel remains a
  per-driver / per-writer responsibility — the wrapper only makes
  process signals visible through ctx.Done() and does not make source
  checkpoints and destination commits atomic. Kafka's existing 2PC TODO
  is referenced (datazip-inc#1, datazip-inc#2).
@Jonathan-7
Copy link
Copy Markdown
Contributor Author

Hey @saksham-datazip, took a closer look at the comments.

On the Kafka duplicate-data point, I couldn't reproduce it. Ran against a real Kafka 3.9.1 broker with about half a million records, killed the pod mid-flush, re-ran. Destination cleanup ran under cancellation, the partial parquet file got removed, PostCDC saw context canceled and didn't commit offsets. Retry reprocessed cleanly from the prior committed offset, no duplicates in the destination. Code-wise it lines up: Kafka's PostCDC short-circuits, parquet's closePqFiles deletes local files when ctx.Err() != nil.

One issue I did find on Postgres though: if SIGTERM lands right after a pgoutput keepalive, the slot advances server-side while state.json stays at the prior PostCDC checkpoint, and retry fails with lsn mismatch. The keepalive ack at pkg/waljs/pgoutput.go:97-101 goes out on the wire independently of whether this PR's signal handling runs, so the race is the same with or without this change. Worth its own ticket.

In response I've pushed a table-driven test covering both SIGINT and SIGTERM as you suggested, comments on both test functions, and a corrected docstring on signalAwareRootContext that drops the 2PC claim, now scoped down to "wrapper propagates cancellation, source/destination consistency stays a per-driver concern".

For @vikaxsh on the connector.go relocation, reasonable suggestion, but leaning toward keeping it in protocol/root.go for this PR since the wrap happens right before the context is consumed there.

Comment thread protocol/root_test.go Outdated
Signed-off-by: Ankit Sharma <111491139+hash-data@users.noreply.github.com>
@hash-data hash-data temporarily deployed to integration_tests May 22, 2026 09:46 — with GitHub Actions Inactive
@hash-data hash-data temporarily deployed to integration_tests May 22, 2026 09:46 — with GitHub Actions Inactive
@vikaxsh vikaxsh temporarily deployed to integration_tests May 23, 2026 06:17 — with GitHub Actions Inactive
@vikaxsh vikaxsh temporarily deployed to integration_tests May 23, 2026 06:17 — with GitHub Actions Inactive
@vikaxsh vikaxsh merged commit 07f190f into datazip-inc:staging May 23, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

improvement: wire SIGINT/SIGTERM into root context for graceful shutdown

6 participants