
SIGTERM does not drain in-flight records (invariant 7 violation) #2519

Description

@devarismeroxa

Summary

Conduit does not handle SIGTERM. pkg/conduit/entrypoint.go:70 registers only os.Interrupt (SIGINT):

signal.Notify(signalChan, os.Interrupt)

docker stop, kubectl delete pod, and systemctl stop all send SIGTERM. Because no handler is registered for it, the Go runtime's default behavior applies: the process exits immediately without draining. This violates documented data-integrity invariant 7 ("Shutdown is graceful by default. SIGTERM drains in-flight records and checkpoints before exit").
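A minimal sketch of the registration change, assuming the existing signalChan pattern is kept. The helper name shutdownSignals and the self-signal in main are hypothetical and exist only so the example terminates on its own; syscall.Kill is Unix-only. This is not the actual entrypoint.go code.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// shutdownSignals (hypothetical helper) lists the signals that should start
// a graceful drain. os.Interrupt is SIGINT; syscall.SIGTERM is what
// docker stop, kubectl delete pod, and systemctl stop send.
func shutdownSignals() []os.Signal {
	return []os.Signal{os.Interrupt, syscall.SIGTERM}
}

func main() {
	// Buffered so a signal that arrives before the receive is not dropped.
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, shutdownSignals()...)
	defer signal.Stop(signalChan)

	// Demo only (Unix): send SIGTERM to this process so the example exits.
	if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
		fmt.Println("could not self-signal:", err)
		return
	}

	select {
	case sig := <-signalChan:
		// In a real entrypoint this is where the drain path would start.
		fmt.Printf("received %v; starting graceful drain\n", sig)
	case <-time.After(2 * time.Second):
		fmt.Println("no signal observed")
	}
}
```

Once SIGTERM is passed to signal.Notify, the runtime relays it to the channel instead of exiting, so the drain path gets a chance to run.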

Related gaps (same fix)

  • A second SIGINT calls a bare os.Exit(exitCodeInterrupt) (entrypoint.go:77-78) that bypasses the graceful-stop path entirely.
  • The registerCleanupV2 force-stop escalation (pkg/conduit/runtime.go:499-519) fires after a fixed fraction of exitTimeout, whether or not the checkpoint has completed. ForceStop then cancels the connector context without verifying that no un-acked record has already been forwarded downstream (adjacent to invariant 1).
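One way to picture the intended escalation rule is a wait that ends only when the checkpoint completes or a hard deadline expires, rather than at a fixed fraction of the timeout. The sketch below is illustrative: awaitCheckpoint, checkpointDone, and hardDeadline are hypothetical names, not Conduit's API.

```go
package main

import (
	"fmt"
	"time"
)

const (
	stoppedClean  = "clean"  // checkpoint finished before the deadline
	stoppedForced = "forced" // hard deadline expired first
)

// awaitCheckpoint (hypothetical) blocks until checkpointDone is closed or
// hardDeadline elapses, and reports which one happened. A caller would
// force-stop only on stoppedForced.
func awaitCheckpoint(checkpointDone <-chan struct{}, hardDeadline time.Duration) string {
	timer := time.NewTimer(hardDeadline)
	defer timer.Stop()

	select {
	case <-checkpointDone:
		return stoppedClean
	case <-timer.C:
		return stoppedForced
	}
}

func main() {
	// Case 1: the checkpoint flush finishes well inside the deadline.
	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated checkpoint flush
		close(done)
	}()
	fmt.Println(awaitCheckpoint(done, time.Second))

	// Case 2: the checkpoint never completes, so the hard deadline wins.
	stuck := make(chan struct{}) // never closed
	fmt.Println(awaitCheckpoint(stuck, 50*time.Millisecond))
}
```

The sketch does not cover the mid-ack case; that would need the ack path itself to signal completion before checkpointDone is closed.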

Severity

With at-least-once (inv. 3) and crash-safe positions (inv. 2) holding, an un-drained SIGTERM behaves like a crash — recoverable on restart, not silent data loss. But it violates a documented invariant, produces duplicate storms and unclean checkpoints on every Kubernetes pod recycle, and undermines the container/12-factor deployment story. Pre-existing (not a v0.15.0 regression).

Fix (v0.15.1, Tier 1 — data path)

  • Design doc first (docs/design-documents/): signal set, drain sequence, grace deadline, force-stop-respects-checkpoint semantics.
  • Register SIGTERM; make the drain path the default; remove the second-signal os.Exit bypass; fix V2 force-stop to escalate only after checkpoint or a hard deadline, never mid-ack.
  • Regression + chaos test (seeds tests/chaos, which doesn't exist yet): SIGTERM/SIGKILL at random points under load → assert no double-delivery beyond at-least-once and no lost/torn checkpoint.
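The chaos-test assertion could be expressed as a checker over what the destination observed. The sketch below is one possible reading, under stated assumptions: records carry integer positions 0..produced-1, a single kill happens per run, and checkpoint is the last position persisted as acknowledged when the process died (-1 if none). verifyRun and its parameters are hypothetical, and torn-checkpoint detection is not covered.

```go
package main

import "fmt"

// verifyRun (hypothetical) checks one chaos run.
//
//	produced:   number of records the source emitted
//	delivered:  positions seen at the destination, across kill and restart
//	checkpoint: last position persisted as acknowledged at kill time
//
// It fails if a record was never delivered (loss), or if a record at or
// below the checkpoint was delivered more than once (a duplicate that
// at-least-once replay from the checkpoint does not explain).
func verifyRun(produced int, delivered []int, checkpoint int) error {
	seen := make(map[int]int, produced)
	for _, pos := range delivered {
		if pos < 0 || pos >= produced {
			return fmt.Errorf("unknown position %d delivered", pos)
		}
		seen[pos]++
	}
	for pos := 0; pos < produced; pos++ {
		if seen[pos] == 0 {
			return fmt.Errorf("record %d lost", pos)
		}
		if pos <= checkpoint && seen[pos] > 1 {
			return fmt.Errorf("record %d redelivered despite checkpoint %d", pos, checkpoint)
		}
	}
	return nil
}

func main() {
	// Records 2 and 3 were forwarded but not checkpointed before the kill,
	// so they are replayed after restart. With checkpoint 1 that is allowed.
	observed := []int{0, 1, 2, 3, 2, 3, 4}
	fmt.Println("checkpoint 1:", verifyRun(5, observed, 1))

	// The same deliveries with checkpoint 3 mean acknowledged records were
	// replayed, which the checker reports as an error.
	fmt.Println("checkpoint 3:", verifyRun(5, observed, 3))
}
```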

Tier

Tier 1 (data path) — requires human sign-off + failure-mode analysis per CLAUDE.md.

Scoped in the Phase 1 execution plan (#2518), §0.1.

Labels: bug
