Status: Future planning (v3.0)
Depends on: v2.0 (job-centric document model), Phase 10 (multi-job threading)
Key insight: ALL worker state lives in Couchbase Lite collections. CBL has built-in WebSocket replication. Replicate the CBL database = replicate the entire worker.
Related docs:
DESIGN_2_0.md– v2.0 architecture (job-centric document model)CBL_DATABASE.md– CBL collection layoutMULTI_PIPELINE_PLAN.md– Multi-pipeline threading design
Today, if the changes_worker process stops — crash, OOM, host failure, deployment — processing stops. Period. There is no standby, no failover, no handoff. The DLQ catches output failures, but it doesn't help when the entire worker is down.
This is acceptable for dev/staging but not for production workloads where _changes feed latency must stay under SLA (e.g., "orders must appear in PostgreSQL within 60 seconds of mutation in Couchbase").
After the v2.0 redesign, every piece of worker state lives in Couchbase Lite collections:
| What | Collection | Without it the worker... |
|---|---|---|
| Where it left off | checkpoints |
...re-processes from since=0 (full re-sync) |
| What to process | jobs |
...has no jobs to run |
| Where to read from | inputs_changes |
...doesn't know the source |
| Where to write to | outputs_rdbms/http/cloud |
...doesn't know the destination |
| Processing rules | jobs.schema_mapping |
...can't map JSON → output |
| Infrastructure config | config |
...uses defaults |
| Failed documents | dlq |
...loses failure history |
| Data coercions | data_quality |
...loses coercion audit trail |
| ML/AI results | enrichments |
...loses enrichment data |
| Auth sessions | sessions |
...needs fresh auth |
CBL has built-in WebSocket replication to:
- Another Couchbase Lite instance (peer-to-peer)
- Sync Gateway (cloud/on-prem)
- Edge Server
So: replicate the CBL database → a second worker has everything it needs to take over.
CBL WebSocket Replication
─────────────────────────
(continuous)
┌────────────────────┐ ┌────────────────────┐
│ PRIMARY WORKER │ ───── replicate ────► │ STANDBY WORKER │
│ │ │ │
│ CBL: changes_worker_db │ CBL: changes_worker_db
│ ├── jobs │ │ ├── jobs │
│ ├── checkpoints │ ◄──── replicate ──── │ ├── checkpoints │
│ ├── inputs_changes│ │ ├── inputs_changes│
│ ├── outputs_* │ │ ├── outputs_* │
│ ├── dlq │ │ ├── dlq │
│ ├── config │ │ ├── config │
│ └── ... │ │ └── ... │
│ │ │ │
│ Status: ACTIVE │ │ Status: STANDBY │
│ Processing jobs ✅│ │ Idle, watching 👀 │
└────────────────────┘ └────────────────────┘
│ │
│ health check fails │
▼ ▼
┌─────────┐ ┌───────────┐
│ DOWN │ ────── lease expires ──────────► │ PROMOTE │
└─────────┘ │ to ACTIVE│
└───────────┘
How it works:
- Primary runs all jobs normally. CBL continuously replicates to standby via WebSocket.
- Standby has a copy of the entire CBL database (jobs, checkpoints, config, DLQ, everything). It does NOT process jobs — it just receives replicated data.
- Heartbeat: Primary writes a heartbeat document to CBL every N seconds (e.g.,
heartbeat::primary,time: <epoch>). This replicates to standby. - Failure detection: Standby watches the heartbeat. If
now() - heartbeat.time > failover_timeout_seconds, the primary is presumed dead. - Promotion: Standby promotes itself to active, loads jobs from its local CBL, and starts processing from the last replicated checkpoint.
- Recovery gap: The gap between the primary's last checkpoint save and its death is the "replay window." At worst, the standby re-processes
every_n_docsdocuments (duplicates, but the pipeline is designed for at-least-once delivery).
Pros:
- Simple. One primary, one standby.
- Zero data loss for config/jobs/DLQ (replicated continuously).
- Minimal replay window for checkpoints (depends on checkpoint frequency).
- Standby can serve read-only dashboard/metrics while in standby mode.
Cons:
- Standby is idle (wasted resources when primary is healthy).
- Single standby — no protection against double failure.
- Failover is not instant (detection timeout + job startup time).
┌─────────────────────────────────────────────────┐
│ Sync Gateway / Edge Server │
│ (central CBL replication hub) │
│ │
│ All workers replicate TO and FROM here │
└──────────┬──────────────────────┬────────────────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ WORKER A │ │ WORKER B │
│ │ │ │
│ Owns: │ │ Owns: │
│ job::aaa │ │ job::ccc │
│ job::bbb │ │ job::ddd │
│ │ │ │
│ CBL (local)│ │ CBL (local)│
└─────────────┘ └─────────────┘
How it works:
- Job assignment: Each job document has an
ownerfield:{"owner": "worker-a", "owner_lease_expires": <epoch>}. - Lease-based ownership: A worker claims a job by writing its ID and a lease expiry to the job document. The lease is renewed periodically (e.g., every 30s with a 90s TTL).
- Replication hub: All workers replicate their CBL to a central Sync Gateway (or Edge Server). This is the single source of truth for job assignments.
- Failover: If Worker A's lease on
job::aaaexpires (Worker A is dead), Worker B sees the expired lease via replication and claims the job. - Checkpoint continuity: Worker B reads
checkpoint::aaafrom its replicated CBL and resumes fromlast_seq. - Conflict resolution: CBL uses last-write-wins for the
ownerfield. Sync Gateway channels can enforce that only the owner can write to a job's checkpoint/DLQ.
Pros:
- No idle standby — all workers are productive.
- Horizontal scaling — add workers, they claim unclaimed jobs.
- Automatic rebalancing when a worker joins/leaves.
- Each worker is a full, independent changes_worker process.
Cons:
- Requires Sync Gateway or Edge Server as the replication hub.
- Lease management adds complexity (renewal, expiry, race conditions).
- Conflict resolution for job ownership is non-trivial.
- Network partition can cause split-brain (two workers claim the same job).
┌────────────────────┐ ┌──────────────────┐ ┌────────────────────┐
│ PRIMARY WORKER │ │ Sync Gateway │ │ STANDBY WORKER │
│ │ │ (or Edge Server)│ │ │
│ CBL ─── push ────►│ │ │◄──── pull ─── CBL │
│ pull ◄────│ │ Bucket: │ │ │
│ │ │ worker_state │ │ │
│ Status: ACTIVE │ │ │ │ Status: STANDBY │
└────────────────────┘ └──────────────────┘ └────────────────────┘
How it works:
- Primary pushes CBL changes to Sync Gateway. Standby pulls from Sync Gateway.
- This is the same as Option A, but uses Sync Gateway as the replication transport instead of peer-to-peer CBL replication.
- Advantage: Sync Gateway handles auth, channels, conflict resolution, and access control. You can use Sync Gateway channels to scope replication per job.
- Advantage: Multiple standbys can pull from the same Sync Gateway — no extra config per standby.
- Advantage: Sync Gateway is already in the architecture (it's the
_changessource). You're already running it.
| Collection | Replication Direction | Conflict Strategy | Notes |
|---|---|---|---|
jobs |
Bidirectional | Last-write-wins | Job config rarely changes. Owner field resolves via lease. |
checkpoints |
Primary → Standby (push) | Primary always wins | Standby should never write checkpoints while in standby mode. |
inputs_changes |
Bidirectional | Last-write-wins | Admin edits from either node. |
outputs_* |
Bidirectional | Last-write-wins | Same as inputs. |
config |
Bidirectional | Last-write-wins | Infrastructure config. |
dlq |
Primary → Standby (push) | No conflict (unique doc IDs) | DLQ doc IDs include timestamp — no collision. |
data_quality |
Primary → Standby (push) | No conflict (unique doc IDs) | Same — timestamp in doc ID. |
enrichments |
Primary → Standby (push) | No conflict (unique doc IDs) | Same. |
sessions |
Bidirectional | Last-write-wins | Session refresh can happen on either node. |
users |
Bidirectional | Last-write-wins | Rare changes. |
audit_log |
Primary → Standby (push) | No conflict (unique doc IDs) | Append-only. |
New section in the config document:
{
"data": {
"replication": {
"enabled": false,
"mode": "active_passive",
"role": "primary",
"target": {
"type": "sync_gateway",
"url": "wss://sg.example.com:4984",
"database": "worker_state",
"auth": {
"method": "basic",
"username": "replicator",
"password": "secret"
}
},
"continuous": true,
"heartbeat_interval_seconds": 10,
"failover_timeout_seconds": 30,
"collections": [
"inputs_changes", "outputs_rdbms", "outputs_http", "outputs_cloud",
"jobs", "checkpoints", "dlq", "data_quality", "enrichments",
"config", "sessions", "users", "audit_log"
],
"conflict_resolution": "last_write_wins",
"push_filter": null,
"pull_filter": null
}
}
}The active worker writes a heartbeat document to the config collection:
{
"type": "heartbeat",
"worker_id": "worker-a",
"hostname": "prod-worker-01.example.com",
"pid": 12345,
"time": 1768521600,
"jobs_running": ["job::aaa", "job::bbb"],
"uptime_seconds": 86400,
"version": "2.0.0"
}Doc ID: heartbeat::worker-a
The standby checks: if now() - heartbeat.time > failover_timeout_seconds → promote.
Time Primary Standby
──── ─────── ───────
T+0 Processing jobs, writing Receiving replicated data,
heartbeats every 10s monitoring heartbeat
T+10 Heartbeat: time=T+10 Sees heartbeat, all good
T+20 Heartbeat: time=T+20 Sees heartbeat, all good
T+25 💥 CRASH / OOM / network down Last heartbeat was T+20
T+30 (dead) now() - T+20 = 10s < 30s timeout
→ still waiting...
T+50 (dead) now() - T+20 = 30s = timeout!
→ PROMOTE TO ACTIVE
T+51 (dead) Load jobs from local CBL
Load checkpoints from local CBL
Start PipelineManager
Resume from last replicated seq
T+55 (dead) Processing jobs ✅
Writing own heartbeats
Checkpoint gap: ~25s of data
may be re-processed (at-least-once)
| Scenario | Replay window | Data guarantee |
|---|---|---|
| Checkpoint saves every batch | 0–1 batch of docs re-processed | At-least-once (some docs delivered twice) |
every_n_docs: 1000 |
Up to 1000 docs re-processed | At-least-once |
| Continuous feed mode | Up to 1 doc re-processed (checkpoints per row) | Effectively exactly-once |
| DLQ entries | None — DLQ is replicated continuously | No data loss |
| Config/job changes | None — replicated continuously | No data loss |
At-least-once is the guarantee. The system is designed for idempotent outputs (UPSERT, PUT). Duplicate deliveries during failover are safe if:
- RDBMS uses
ON CONFLICT DO UPDATE(already the default) - HTTP output uses
PUT(already the default) - Cloud output uses
key_templatewithdoc_id(already the default — same key = overwrite)
A new collection to track replication status (v3.0):
CBL Database: changes_worker_db
│
└── Scope: "changes-worker"
└── Collection: replication_state ← replication metadata
{
"type": "replication_state",
"id": "repl::sg-prod",
"target_url": "wss://sg.example.com:4984/worker_state",
"status": "active",
"direction": "push_and_pull",
"last_push_seq": "12345",
"last_pull_seq": "12340",
"push_pending": 5,
"pull_pending": 0,
"last_error": null,
"connected_at": 1768521600,
"total_bytes_pushed": 1048576,
"total_bytes_pulled": 1048000,
"total_docs_pushed": 5000,
"total_docs_pulled": 4990
}The most dangerous failure mode: both workers think they're the active primary.
┌──────────┐ network ┌──────────┐
│ Worker A │ ──── partition ──── ✂️ ────│ Worker B │
│ ACTIVE │ (no replication) │ ACTIVE │ ← promoted due to missed heartbeats
│ │ │ │
│ Processing job::aaa │ Processing job::aaa ← DUPLICATE!
└──────────┘ └──────────┘
Mitigations:
| Strategy | How | Trade-off |
|---|---|---|
| Fencing via SG | Before processing, check a "lock" doc on Sync Gateway. If another worker holds the lock, stand down. | Requires SG reachability. If SG is also partitioned, deadlock. |
| Lease-based ownership | Job ownership has a TTL. Only the lease holder processes. If lease can't be renewed (no SG), stop processing after lease expires. | Worker stops if it can't reach SG — may be overly conservative. |
| Accept duplicates | Both workers process the same feed. Output is idempotent (UPSERT). Checkpoints diverge, but data converges. | Wastes resources. Some outputs are not idempotent (POST). |
| STONITH (Shoot The Other Node In The Head) | Worker A detects partition and kills Worker B via out-of-band mechanism (API call, cloud instance termination). | Complex, fragile, cloud-provider-specific. |
Recommended for v3.0: Lease-based ownership + accept duplicates as a safety net. If the worker can't renew its lease within lease_ttl_seconds, it pauses all jobs and waits. This is "safe but conservative" — better to stop processing than to create conflicts.
- Add
replicationconfig toconfigdocument - Create
replication_statecollection - Implement CBL replicator setup using CFFI bindings (
CBLReplicator_New,CBLReplicator_Start) - Configure push/pull for all collections
- Add replication status to
/_metricsendpoint - Add replication status to admin UI dashboard
- Implement heartbeat writer (active worker)
- Implement heartbeat monitor (standby worker)
- Implement promotion logic (standby → active)
- Implement demotion logic (active → standby when another active is detected)
- Add
rolefield to config:"primary","standby","auto" - Add
GET /_ha/statusendpoint (role, last heartbeat, replication lag)
- Add
owner,owner_lease_expiresfields to job documents - Implement lease acquisition (claim unowned/expired jobs)
- Implement lease renewal (periodic, in pipeline thread)
- Implement lease release (on graceful shutdown)
- Implement lease expiry detection (claim jobs from dead workers)
- Split-brain protection: pause jobs if lease can't be renewed
- Multi-worker job partitioning
- Dynamic rebalancing when workers join/leave
- Per-worker metrics with
worker_idlabel - Admin UI: multi-worker view
- Sync Gateway channel-based access control per worker
| Metric | Type | Description |
|---|---|---|
worker_role |
gauge (0=standby, 1=active) | Current HA role |
worker_heartbeat_age_seconds |
gauge | Time since last heartbeat from the other worker |
worker_failovers_total |
counter | Number of standby→active promotions |
replication_push_pending |
gauge | Docs waiting to be pushed |
replication_pull_pending |
gauge | Docs waiting to be pulled |
replication_push_bytes_total |
counter | Total bytes pushed |
replication_pull_bytes_total |
counter | Total bytes pulled |
replication_errors_total |
counter | Replication errors |
replication_connected |
gauge (0/1) | Replication connection status |
job_lease_renewals_total |
counter | Lease renewal attempts |
job_lease_expired_total |
counter | Leases that expired (claimed from dead workers) |
┌─────────────┐ ┌──────────────┐
│ Worker A │ ──push──► Sync Gateway │
│ (active) │ ◄─pull── │ │
└─────────────┘ └──────────────┘
▲
┌─────────────┐ │
│ Worker B │ ──push/pull───┘
│ (standby) │
└─────────────┘
Failover time: ~30 seconds
Replay window: < 1000 docs
┌─────────────┐
│ Worker A │──── owns job::1, job::2, job::3
│ (active) │
└─────────────┘ ┌──────────────┐
│ Sync Gateway │ ← replication hub
┌─────────────┐ │ │
│ Worker B │──── owns job::4, job::5
│ (active) │ └──────────────┘
└─────────────┘
▲
┌─────────────┐ │
│ Worker C │──── standby (claims jobs if A or B dies)
│ (standby) │───────────────┘
└─────────────┘
Active/active for throughput, standby for safety
Consider v3.1 Active/Active with dynamic rebalancing,
or run multiple independent worker clusters partitioned
by source (one cluster per Sync Gateway / Edge Server).
Q: Can I run HA without Sync Gateway?
A: Yes — CBL supports peer-to-peer replication between two CBL databases directly. But Sync Gateway gives you auth, channels, conflict resolution, and a central hub for N>2 workers.
Q: What if both workers are down?
A: When any worker starts, it reads jobs and checkpoints from its local CBL and resumes. No data is lost (CBL is on-disk). You just have a processing gap.
Q: Does replication add latency to the pipeline?
A: No. Replication is asynchronous and continuous. The pipeline writes to local CBL (microseconds), and CBL replicates in the background. The only latency is the replication lag (typically <1 second on a LAN).
Q: What about the _changes feed itself — is that replicated?
A: No. The _changes feed is consumed from Sync Gateway / Edge Server directly. Only the worker's internal state (checkpoints, config, DLQ) is replicated. Both workers consume from the same source.
Q: Can I use Edge Server instead of Sync Gateway as the replication hub?
A: Yes. Edge Server supports CBL replication over WebSocket. It's lighter than Sync Gateway and doesn't require a Couchbase Server bucket.