PostHog
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 67 additions & 0 deletions
diff --git a/controlplane/control_cancel_test.go b/controlplane/control_cancel_test.go
Lines changed: 1 addition & 1 deletion
diff --git a/controlplane/flight_ingress_metrics.go b/controlplane/flight_ingress_metrics.go
Lines changed: 14 additions & 0 deletions
diff --git a/controlplane/k8s_pool.go b/controlplane/k8s_pool.go
Lines changed: 14 additions & 0 deletions
diff --git a/controlplane/k8s_pool_test.go b/controlplane/k8s_pool_test.go
Lines changed: 7 additions & 0 deletions
diff --git a/controlplane/org_acquire_gate.go b/controlplane/org_acquire_gate.go
Lines changed: 78 additions & 0 deletions
diff --git a/controlplane/org_acquire_gate_test.go b/controlplane/org_acquire_gate_test.go
Lines changed: 98 additions & 0 deletions
@@ -162,6 +162,73 @@ DML with RETURNING is rejected at extended-query Describe time with SQLSTATE `0A
 - LIMIT 0 does NOT prevent CTE side effects — Postgres CTEs are optimization fences, so writable CTEs execute even with LIMIT 0.
 - DuckDB does not currently support MERGE. If it adds MERGE RETURNING, add `MERGE` to the prefix check in `isDMLReturning`.
 
+## Worker Session Model (k8s / remote backend) — LOAD-BEARING CONTRACT
+
+In the **control-plane remote/k8s backend** a worker pod serves **exactly one
+client query session at a time**. This is deliberate: `workerDuckDBLimits`
+(`controlplane/control.go`) gives the single session ~75% of the *whole pod's*
+RAM + all CPU cores — it does NOT divide by session count. Two sessions on one
+pod would each believe they own 75% → ~150% overcommit → nondeterministic OOM /
+a heavy query killed by a co-resident one. Do not break the following:
+
+- **One session per worker is enforced, not emergent.** The CP spawns remote
+  worker pods with `DUCKGRES_DUCKDB_MAX_SESSIONS=1` (`k8s_pool.go::spawnWorker`).
+  A 2nd concurrent `CreateSession` on a worker is rejected, not silently
+  overcommitted. Internal control/maintenance work uses the worker's side
+  connections (`controlDB`/`warmupDB`), which are NOT counted sessions — so
+  cap=1 does not starve them. Do not raise this to >1 for k8s workers, and do not
+  route internal work through `CreateSession`.
+- **`OrgReservedPool` (remote/multitenant) must never co-assign.** It reuses only
+  idle (`activeSessions==0`, Hot, org-owned) workers via
+  `findIdleAssignedWorkerLocked`, or claims/spawns a fresh one. There is NO
+  least-loaded "share onto a busy worker" path (that exists only in the
+  single-tenant flat `K8sWorkerPool.AcquireWorker`, which is not used in remote
+  mode). Do NOT add one, and do not resurrect a `leastLoaded*` helper here.
+- **At org max workers + all busy → fail fast with the clear org-cap message**
+  (`WorkerClaimMissReasonOrgCap`, see `warm_capacity_policy.go`). Never
+  busy-wait at cap.
+- **Under cap + all busy → hold for a spawn** up to `warmAcquireTimeout` (bounded
+  by the client ctx). This applies to default/exclusive requests too, not just
+  colocated.
+- **FIFO anti-snatch:** the slow acquisition path is serialized per org by
+  `orgAcquireGate` (`org_acquire_gate.go`) so a worker the CP scaled up for an
+  earlier waiter cannot be snatched by a later connection. Keep the gate
+  cancel-safe (a queued waiter whose ctx is cancelled must be skipped, not
+  deadlock the gate).
+- **Destroy-before-reuse ordering:** `SessionManager.DestroySession`
+  (`session_mgr.go`) MUST await the worker-side `DestroySession` RPC *before*
+  `ReleaseWorker`, so a reused (hot-idle) worker's prior session is gone before
+  the next one is assigned (otherwise cap=1 spuriously rejects the reuse).
+- **Cap-drift is recovered, not fatal:** if a worker still rejects a CP-scheduled
+  session at its cap (CP↔worker accounting drift — should never happen),
+  `SessionManager.CreateSessionWithProtocol` does NOT fail the client: it logs
+  loudly (ERROR), bumps `duckgres_control_plane_worker_session_cap_drift_total`,
+  retires (recycles) the inconsistent worker, and re-acquires a fresh one
+  (bounded by `maxWorkerSessionCapDriftRetries`). Detection is
+  `isWorkerSessionCapError` (matches the worker's "max sessions reached"
+  message). A nonzero drift metric means the scheduling invariant is broken —
+  fix the root cause, don't just lean on the retry.
+
+Touching any of: `controlplane/org_reserved_pool.go`, `org_acquire_gate.go`,
+`k8s_pool.go::spawnWorker`/`AcquireWorker`, `control.go::workerDuckDBLimits`, or
+`duckdbservice` session counting → update the unit tests
+(`org_reserved_pool_test.go`, `org_acquire_gate_test.go`,
+`duckdbservice/service_test.go`) AND the `one_session_per_worker` assertion in
+`tests/e2e-mw-dev/harness.sh`.
+
+## Worker Drain Protocol (graceful shutdown, #690)
+
+Remote worker pods drain on SIGTERM (pod deletion): they reject new work, keep
+in-flight work alive, then exit; the CP marks them `Draining` (not crashed) and
+retires them cleanly. Drain readiness is tracked by a refcount (`activeWork` in
+`duckdbservice/service.go`) of "drain tokens" — one taken per unit of in-flight
+work (query, txn, metadata stream, COPY, activation), released when it finishes.
+Invariants: take exactly one token when work starts and release exactly one when
+it ends on **every** path (a leak hangs drain to the shutdown timeout, an early
+release lets shutdown kill live work); `reapIdle` releases tokens stranded by a
+`GetFlightInfo` whose `DoGet` never arrived. `terminationGracePeriodSeconds=3600`
+(`k8s_pool.go`) must stay above `workerShutdownDrainTime` (55m).
+
 ## TODO Reference
 
 See `TODO.md` for the full feature roadmap and known issues.
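The drain-token invariant from the Worker Drain Protocol section in this hunk (take one token per unit of in-flight work, release exactly one on every path, reject new work once draining) can be sketched roughly as follows. This is a minimal illustration, not the code in `duckdbservice/service.go`: `drainTracker`, `begin`, `startDrain`, and `isDrained` are hypothetical names, and the real `activeWork` refcount may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDraining is returned when new work arrives after drain has started.
var errDraining = errors.New("worker is draining")

// drainTracker is a hypothetical stand-in for the activeWork refcount.
type drainTracker struct {
	mu       sync.Mutex
	active   int           // outstanding drain tokens
	draining bool          // set once shutdown has begun
	drained  chan struct{} // closed when draining && active == 0
}

func newDrainTracker() *drainTracker {
	return &drainTracker{drained: make(chan struct{})}
}

// begin takes one drain token, or rejects the work if drain has started.
// The returned func releases the token; sync.Once turns an accidental second
// call into a no-op so a token is never released twice.
func (d *drainTracker) begin() (func(), error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.draining {
		return nil, errDraining
	}
	d.active++
	var once sync.Once
	return func() { once.Do(d.end) }, nil
}

// end releases one token and signals completion if this was the last one.
func (d *drainTracker) end() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.active--
	if d.draining && d.active == 0 {
		close(d.drained)
	}
}

// startDrain stops admitting work and returns a channel that is closed once
// all in-flight work has released its token.
func (d *drainTracker) startDrain() <-chan struct{} {
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.draining {
		d.draining = true
		if d.active == 0 {
			close(d.drained)
		}
	}
	return d.drained
}

// isDrained reports whether drain has completed, without blocking.
func (d *drainTracker) isDrained() bool {
	select {
	case <-d.drained:
		return true
	default:
		return false
	}
}

func main() {
	d := newDrainTracker()
	release, _ := d.begin() // one in-flight query
	d.startDrain()          // SIGTERM arrives
	_, err := d.begin()     // new work is rejected
	fmt.Println(err, d.isDrained()) // worker is draining false
	release()                       // the query finishes
	fmt.Println(d.isDrained())      // true
}
```

A leaked token in this model leaves `drained` open until the caller's shutdown timeout fires, and an early release closes it while work is still running, which are the two failure modes the section warns about.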
@@ -113,7 +113,7 @@ func TestSessionCreationErrorResponse(t *testing.T) {
 		{
 			name:    "org capacity exhausted",
 			reason:  configstore.WorkerClaimMissReasonOrgCap,
-			message: "Duckgres worker capacity for this organization is currently exhausted; retry later",
+			message: "your organization has reached its maximum number of concurrent Duckgres workers and they are all busy; retry once a query finishes",
 		},
 		{
 			name:    "global capacity exhausted",
 
@@ -88,6 +88,20 @@ func observeControlPlaneWorkerAcquireFailure(reason string) {
 	controlPlaneWorkerAcquireFailuresCounter.WithLabelValues(reason).Inc()
 }
 
+// controlPlaneWorkerSessionCapDriftCounter counts the times a worker rejected a
+// control-plane-scheduled CreateSession because it was already at its session
+// cap. That is CP↔worker accounting drift, which must never happen under the
+// one-session-per-worker contract. The counter should sit at 0; a sustained
+// nonzero rate means scheduling is double-assigning workers (alert on it).
+var controlPlaneWorkerSessionCapDriftCounter = promauto.NewCounter(prometheus.CounterOpts{
+	Name: "duckgres_control_plane_worker_session_cap_drift_total",
+	Help: "Times a worker rejected a CP-scheduled CreateSession at its session cap (CP↔worker accounting drift; recovered by recycling the worker and retrying).",
+})
+
+func observeWorkerSessionCapDrift() {
+	controlPlaneWorkerSessionCapDriftCounter.Inc()
+}
+
 func observeFlightSessionsReaped(trigger string, count int) {
 	if count <= 0 {
 		return
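The drift counter added in this hunk is bumped from the recovery path that the CLAUDE.md hunk describes: count the drift, retire the inconsistent worker, and re-acquire a fresh one, bounded by `maxWorkerSessionCapDriftRetries`. A rough sketch of that loop is below. The names `isWorkerSessionCapError` and `maxWorkerSessionCapDriftRetries` come from the diff, but their bodies and the value `2` are assumptions; `worker`, `createWithDriftRecovery`, and the callback parameters are hypothetical and stand in for `SessionManager.CreateSessionWithProtocol`, which is not shown in this PR excerpt.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Assumed value: the diff names this constant but does not show its value.
const maxWorkerSessionCapDriftRetries = 2

// errMaxSessions mimics the worker's "max sessions reached" rejection.
var errMaxSessions = errors.New("max sessions reached")

// isWorkerSessionCapError matches the worker's cap message by substring.
// The real detection may be stricter; this is only an illustration.
func isWorkerSessionCapError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "max sessions reached")
}

// worker is a hypothetical handle to a remote worker pod.
type worker struct {
	id            int
	createSession func() error
}

// createWithDriftRecovery acquires a worker and creates a session on it.
// On a cap rejection it records the drift, retires the worker, and retries
// with a fresh one, up to maxWorkerSessionCapDriftRetries extra attempts.
func createWithDriftRecovery(
	acquire func() (*worker, error),
	retire func(*worker),
	observeDrift func(),
) (*worker, error) {
	for attempt := 0; ; attempt++ {
		w, err := acquire()
		if err != nil {
			return nil, err
		}
		err = w.createSession()
		if err == nil {
			return w, nil
		}
		if !isWorkerSessionCapError(err) {
			return nil, err // unrelated failure: no drift handling
		}
		observeDrift() // the scheduling invariant was violated
		retire(w)      // never hand this worker out again
		if attempt >= maxWorkerSessionCapDriftRetries {
			return nil, fmt.Errorf("session cap drift persisted: %w", err)
		}
	}
}

func main() {
	pool := []*worker{
		{id: 1, createSession: func() error { return errMaxSessions }},
		{id: 2, createSession: func() error { return nil }},
	}
	next, drift := 0, 0
	acquire := func() (*worker, error) { w := pool[next]; next++; return w, nil }
	w, err := createWithDriftRecovery(acquire, func(*worker) {}, func() { drift++ })
	fmt.Println(w.id, drift, err) // 2 1 <nil>
}
```

The client sees a successful session on worker 2, while the counter still records that one scheduling decision was wrong, which matches the guidance to treat a nonzero metric as a bug to fix rather than as normal operation.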
 
@@ -775,6 +775,20 @@ func (p *K8sWorkerPool) spawnWorker(ctx context.Context, id int, image string, p
 							Name:  "DUCKGRES_KEY",
 							Value: workerRPCMountDir + "/" + workerRPCKeyKey,
 						},
+						{
+							// One client query session per worker pod: the pod's full
+							// resources (workerDuckDBLimits gives the session ~75% of pod
+							// RAM + all cores) belong to a single query, so queries never
+							// contend and a heavy query can't be OOM'd by a co-resident
+							// one. The CP scheduler (OrgReservedPool) already never
+							// co-assigns; this is the hard worker-side guarantee — a 2nd
+							// CreateSession is rejected rather than silently overcommitting.
+							// Internal control/maintenance work runs on the worker's side
+							// connections (controlDB/warmupDB), which are NOT counted
+							// sessions, so this does not starve them.
+							Name:  "DUCKGRES_DUCKDB_MAX_SESSIONS",
+							Value: "1",
+						},
 					},
 					SecurityContext: &corev1.SecurityContext{
 						AllowPrivilegeEscalation: boolPtr(false),
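On the worker side, the `DUCKGRES_DUCKDB_MAX_SESSIONS=1` variable set in this hunk is what turns a second concurrent `CreateSession` into a rejection. The worker implementation is not part of this excerpt, so the following is only a sketch of how such a cap could be enforced: `sessionLimiter`, `limiterFromEnv`, and the "unset or 0 means unlimited" rule are assumptions, and only the environment variable name and the "max sessions reached" message come from the diff.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"sync"
)

// errMaxSessions carries the message the control plane matches on.
var errMaxSessions = errors.New("max sessions reached")

// sessionLimiter is a hypothetical counter of client query sessions.
// Internal side connections would bypass it entirely, so they are not counted.
type sessionLimiter struct {
	mu     sync.Mutex
	max    int // 0 means unlimited (an assumption of this sketch)
	active int
}

// limiterFromEnv builds a limiter from DUCKGRES_DUCKDB_MAX_SESSIONS.
func limiterFromEnv() (*sessionLimiter, error) {
	raw := os.Getenv("DUCKGRES_DUCKDB_MAX_SESSIONS")
	if raw == "" {
		return &sessionLimiter{}, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return nil, fmt.Errorf("invalid DUCKGRES_DUCKDB_MAX_SESSIONS %q", raw)
	}
	return &sessionLimiter{max: n}, nil
}

// createSession admits a session or rejects it when the cap is reached.
func (l *sessionLimiter) createSession() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.max > 0 && l.active >= l.max {
		return errMaxSessions
	}
	l.active++
	return nil
}

// destroySession frees one slot; it must complete before the worker is reused.
func (l *sessionLimiter) destroySession() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active > 0 {
		l.active--
	}
}

func main() {
	os.Setenv("DUCKGRES_DUCKDB_MAX_SESSIONS", "1")
	l, err := limiterFromEnv()
	if err != nil {
		panic(err)
	}
	fmt.Println(l.createSession()) // <nil>
	fmt.Println(l.createSession()) // max sessions reached
	l.destroySession()
	fmt.Println(l.createSession()) // <nil>
}
```

The last three lines of `main` also show why the destroy-before-reuse ordering matters: if the control plane released the worker before the worker-side destroy had finished, the next `createSession` would hit the cap spuriously.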
 
@@ -3793,6 +3793,7 @@ func assertSpawnedWorkerPod(t *testing.T, pod *corev1.Pod) {
 	foundSharedWarmWorkerEnv := false
 	foundTLSCertEnv := false
 	foundTLSKeyEnv := false
+	foundMaxSessionsEnv := false
 	for _, env := range c.Env {
 		if env.Name == "DUCKGRES_DUCKDB_TOKEN" && env.ValueFrom != nil &&
 			env.ValueFrom.SecretKeyRef != nil &&
@@ -3808,6 +3809,9 @@ func assertSpawnedWorkerPod(t *testing.T, pod *corev1.Pod) {
 		if env.Name == "DUCKGRES_KEY" && env.Value == "/etc/duckgres/worker-rpc/tls.key" {
 			foundTLSKeyEnv = true
 		}
+		if env.Name == "DUCKGRES_DUCKDB_MAX_SESSIONS" && env.Value == "1" {
+			foundMaxSessionsEnv = true
+		}
 	}
 	if !foundEnv {
 		t.Fatal("bearer token env var not found or incorrect")
@@ -3818,6 +3822,9 @@ func assertSpawnedWorkerPod(t *testing.T, pod *corev1.Pod) {
 	if !foundTLSCertEnv || !foundTLSKeyEnv {
 		t.Fatal("expected worker RPC TLS env vars to be present")
 	}
+	if !foundMaxSessionsEnv {
+		t.Fatal("expected DUCKGRES_DUCKDB_MAX_SESSIONS=1 (one query session per worker)")
+	}
 
 	if len(pod.Spec.Volumes) == 0 {
 		t.Fatal("expected configmap volume")
 
@@ -0,0 +1,78 @@
+//go:build kubernetes
+
+package controlplane
+
+import (
+	"context"
+	"sync"
+)
+
+// orgAcquireGate is a cancellable FIFO turnstile. It serializes the slow
+// worker-acquisition path (no idle worker → claim/spawn) for one org so that a
+// newly-spawned or freed worker is handed to the EARLIEST waiting connection,
+// and a later-arriving connection cannot snatch it. Plain sync.Mutex is
+// unsuitable: it is not strictly FIFO and a goroutine blocked in Lock() cannot
+// abort when its request context is cancelled (client disconnect / deadline).
+//
+// Holders must call release() exactly once (defer) after acquire() returns nil.
+type orgAcquireGate struct {
+	mu    sync.Mutex
+	held  bool
+	queue []*gateWaiter
+}
+
+type gateWaiter struct {
+	ready    chan struct{} // closed when the gate is granted to this waiter
+	canceled bool          // set under mu when the waiter abandoned before grant
+}
+
+func newOrgAcquireGate() *orgAcquireGate { return &orgAcquireGate{} }
+
+// acquire blocks until this caller owns the gate (FIFO) or ctx is done. On a nil
+// return the caller owns the gate and MUST call release().
+func (g *orgAcquireGate) acquire(ctx context.Context) error {
+	g.mu.Lock()
+	if !g.held {
+		g.held = true
+		g.mu.Unlock()
+		return nil
+	}
+	w := &gateWaiter{ready: make(chan struct{})}
+	g.queue = append(g.queue, w)
+	g.mu.Unlock()
+
+	select {
+	case <-w.ready:
+		return nil
+	case <-ctx.Done():
+		g.mu.Lock()
+		select {
+		case <-w.ready:
+			// Granted concurrently with cancellation: we now own the gate, so we
+			// must pass it on rather than leak it.
+			g.mu.Unlock()
+			g.release()
+		default:
+			w.canceled = true
+			g.mu.Unlock()
+		}
+		return ctx.Err()
+	}
+}
+
+// release hands the gate to the next live waiter (FIFO), or marks it free.
+func (g *orgAcquireGate) release() {
+	g.mu.Lock()
+	for len(g.queue) > 0 {
+		w := g.queue[0]
+		g.queue = g.queue[1:]
+		if w.canceled {
+			continue // waiter gave up; skip it
+		}
+		close(w.ready) // grant; held stays true (ownership transfers)
+		g.mu.Unlock()
+		return
+	}
+	g.held = false
+	g.mu.Unlock()
+}
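The call site for the gate is not included in this excerpt. As a hedged illustration of the contract in the type's doc comment (call `release()` exactly once, via `defer`, after `acquire()` returns nil), a caller might look like the sketch below. Everything here is hypothetical: `gate`, `gateRegistry`, `slowAcquire`, and `runDemo` are illustrative names, and `chanGate` is a stand-in that is cancellable but, unlike `orgAcquireGate`, does not guarantee FIFO order. It exists only so the example compiles on its own.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// gate is the subset of orgAcquireGate's behaviour a caller relies on.
type gate interface {
	acquire(ctx context.Context) error
	release()
}

// chanGate is a stand-in gate: cancellable, but NOT strictly FIFO.
type chanGate chan struct{}

func newChanGate() chanGate { return make(chanGate, 1) }

func (g chanGate) acquire(ctx context.Context) error {
	select {
	case g <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (g chanGate) release() { <-g }

// gateRegistry lazily creates one gate per org, so orgs never block each other.
type gateRegistry struct {
	mu    sync.Mutex
	gates map[string]gate
}

func (r *gateRegistry) forOrg(org string) gate {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.gates == nil {
		r.gates = make(map[string]gate)
	}
	g, ok := r.gates[org]
	if !ok {
		g = newChanGate()
		r.gates[org] = g
	}
	return g
}

// slowAcquire runs the claim/spawn step while holding the org's gate, and
// gives up without running it if ctx ends while the caller is still queued.
func slowAcquire(ctx context.Context, g gate, claimOrSpawn func(context.Context) (string, error)) (string, error) {
	if err := g.acquire(ctx); err != nil {
		return "", err
	}
	defer g.release() // exactly one release per successful acquire
	return claimOrSpawn(ctx)
}

// runDemo exercises a cancelled waiter, an unrelated org, and a later retry.
func runDemo() []string {
	reg := &gateRegistry{}
	spawn := func(context.Context) (string, error) { return "worker-1", nil }

	held := reg.forOrg("org-a")
	_ = held.acquire(context.Background()) // an earlier waiter owns org-a's gate

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the client disconnected while queued
	_, errA := slowAcquire(ctx, reg.forOrg("org-a"), spawn)

	wB, _ := slowAcquire(context.Background(), reg.forOrg("org-b"), spawn)

	held.release()
	wA, _ := slowAcquire(context.Background(), reg.forOrg("org-a"), spawn)
	return []string{fmt.Sprint(errA), wB, wA}
}

func main() {
	fmt.Println(runDemo()) // [context canceled worker-1 worker-1]
}
```

Keying the registry by org keeps the serialization scoped the way the doc comment describes ("for one org"): a backlog in one organization does not delay acquisitions for another.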
@@ -0,0 +1,98 @@
+//go:build kubernetes
+
+package controlplane
+
+import (
+	"context"
+	"sync"
+	"testing"
+	"time"
+)
+
+// The gate must grant ownership to one holder at a time and release to waiters
+// in FIFO arrival order — this is what stops a later connection from snatching a
+// worker a longer-waiting one is owed.
+func TestOrgAcquireGateFIFOOrder(t *testing.T) {
+	g := newOrgAcquireGate()
+
+	// First acquire wins immediately.
+	if err := g.acquire(context.Background()); err != nil {
+		t.Fatalf("first acquire: %v", err)
+	}
+
+	const n = 5
+	entered := make([]int, 0, n)
+	var mu sync.Mutex
+	var wg sync.WaitGroup
+	starts := make([]chan struct{}, n)
+
+	for i := 0; i < n; i++ {
+		starts[i] = make(chan struct{})
+		wg.Add(1)
+		go func(idx int) {
+			defer wg.Done()
+			<-starts[idx] // queue in deterministic order
+			if err := g.acquire(context.Background()); err != nil {
+				t.Errorf("waiter %d acquire: %v", idx, err)
+				return
+			}
+			mu.Lock()
+			entered = append(entered, idx)
+			mu.Unlock()
+			g.release()
+		}(i)
+	}
+
+	// Release each waiter onto the queue one at a time so arrival order is 0..n-1.
+	for i := 0; i < n; i++ {
+		close(starts[i])
+		time.Sleep(10 * time.Millisecond)
+	}
+
+	g.release() // hand the gate to the FIFO queue head
+	wg.Wait()
+
+	for i := 0; i < n; i++ {
+		if entered[i] != i {
+			t.Fatalf("gate granted out of FIFO order: got %v want [0 1 2 3 4]", entered)
+		}
+	}
+}
+
+// A waiter whose context is cancelled while queued must not deadlock the gate:
+// release() skips it and grants to the next live waiter.
+func TestOrgAcquireGateCancelledWaiterIsSkipped(t *testing.T) {
+	g := newOrgAcquireGate()
+	if err := g.acquire(context.Background()); err != nil {
+		t.Fatalf("first acquire: %v", err)
+	}
+
+	// Waiter A queues, then cancels.
+	ctxA, cancelA := context.WithCancel(context.Background())
+	aDone := make(chan error, 1)
+	go func() { aDone <- g.acquire(ctxA) }()
+	time.Sleep(20 * time.Millisecond)
+
+	// Waiter B queues behind A.
+	bGot := make(chan struct{}, 1)
+	go func() {
+		if err := g.acquire(context.Background()); err == nil {
+			bGot <- struct{}{}
+		}
+	}()
+	time.Sleep(20 * time.Millisecond)
+
+	cancelA()
+	if err := <-aDone; err == nil {
+		t.Fatal("expected cancelled waiter A to return an error")
+	}
+
+	// Hand off the gate: A is cancelled, so B must get it.
+	g.release()
+	select {
+	case <-bGot:
+	case <-time.After(2 * time.Second):
+		t.Fatal("waiter B did not acquire the gate after A cancelled (gate leaked)")
+	}
+	g.release()
+}