|
| 1 | +# SignalCli Receive Liveness — Active Heartbeat Design |
| 2 | + |
| 3 | +> **Status**: Proposed — design only, no code beyond the passive watchdog (`ReceiveStalenessTimeoutMs`) has been written. This document weighs whether to build an active heartbeat and how. |
| 4 | +
|
| 5 | +## Overview |
| 6 | + |
| 7 | +The [`CasCap.Api.SignalCli`](../../src/CasCap.Api.SignalCli) JSON-RPC transport ([`SignalCliJsonRpcClientService`](../../src/CasCap.Api.SignalCli/Services/SignalCliJsonRpcClientService.cs)) holds a long-lived WebSocket to the signal-cli REST API, which in turn bridges to the Java `signal-cli` daemon. A known failure mode (see [001](001-signalcli-audit-remediation.md) and the upstream issue drafts) leaves **outbound sending healthy while inbound delivery silently stops** — e.g. a poisoned `msg-cache` envelope kills only the daemon's receive thread. The WebSocket stays open, the process stays alive, and nothing detects the outage. |
| 8 | + |
| 9 | +A **passive** watchdog (`ReceiveStalenessTimeoutMs`) was added as a cheap backstop: it forces a reconnect when no inbound frame arrives within a timeout. Its fatal limitation is that it cannot distinguish *"the receive path is dead"* from *"nobody has messaged this account."* For a low-traffic account (SmartHaus is queried every 3–5 days) the timeout would have to exceed the longest legitimate quiet period (~7 days), making detection so slow it is nearly worthless. An account with no inbound integration at all has no organic traffic, so the passive watchdog is permanently disabled there. |
| 10 | + |
| 11 | +An **active heartbeat** removes this ambiguity by *generating* inbound traffic on a schedule and verifying it round-trips, giving minutes-level detection latency regardless of organic traffic — and working identically for quiet accounts and zero-inbound accounts. |
| 12 | + |
| 13 | +## The Critical Constraint: No Visible Messages |
| 14 | + |
| 15 | +> **Will my Android Signal app show a "hello world" ping every hour? — No. It must not, and it does not have to.** |
| 16 | +
|
| 17 | +This is the dealbreaker requirement: a heartbeat that posts a visible message to **Note to Self** (or any chat) every N minutes is unacceptable. Fortunately, Signal's protocol gives us several inbound-generating actions that produce **no chat artifact** on the primary (Android) device: |
| 18 | + |
| 19 | +| Mechanism | Visible on Android? | Generates inbound frame? | Notes | |
| 20 | +| --- | --- | --- | --- | |
| 21 | +| Message to *Note to Self* | **Yes** (chat entry) | Yes (sync transcript) | ❌ Unacceptable — rejected | |
| 22 | +| **Typing indicator to self** | No | Yes (sync transcript) | Ephemeral; never stored | |
| 23 | +| **Receipt to self** ([`SendReceipt`](../../src/CasCap.Api.SignalCli/Services/SignalCliRestClientService.cs)) | No | Yes (sync transcript) | Already implemented | |
| 24 | +| **Reaction add+remove** ([`SendReaction`](../../src/CasCap.Api.SignalCli/Services/SignalCliRestClientService.cs) / `RemoveReaction`) | Briefly | Yes (sync transcript) | Visible flicker — avoid | |
| 25 | + |
| 26 | +### Why a self-action loops back as inbound |
| 27 | + |
| 28 | +`signal-cli` is a **linked device**, not the primary. When *any* linked device sends *anything* (even an ephemeral typing indicator), the Signal server fans out a **sync transcript message** to all of the account's other linked devices — including `signal-cli` itself. That transcript arrives on the **same WebSocket receive stream** we are trying to prove is alive. So the heartbeat does not require a second account or a real recipient: signal-cli pinging *itself* via a non-visible action is sufficient to exercise the full inbound pipeline. |
| 29 | + |
| 30 | +The recommended primitive is a **typing indicator addressed to our own number** (or to a self-owned group), because it is the only option that is guaranteed ephemeral end-to-end — never persisted, never rendered, never notified. |
| 31 | + |
| 32 | +> **Open verification item**: confirm signal-cli/REST-API exposes a `sendTyping`/typing endpoint and that the resulting sync transcript is delivered back over `/v1/receive`. If typing indicators are not surfaced on the receive stream, fall back to a self-`SendReceipt`, which the codebase already supports. This must be validated against a live linked device before implementation (see Phase 0). |
| 33 | +
|
| 34 | +## Reasons to Build It (and Reasons Not To) |
| 35 | + |
| 36 | +**For:** |
| 37 | + |
| 38 | +- Detects the silent-inbound-failure symptom in minutes, not days — the only fault we currently cannot observe. |
| 39 | +- Account-traffic-agnostic: works for quiet accounts and zero-inbound accounts alike. |
| 40 | +- Reuses existing send primitives and the existing reconnect machinery. |
| 41 | + |
| 42 | +**Against / cost:** |
| 43 | + |
| 44 | +- Adds periodic outbound traffic and a self-message loop — small but non-zero load on the REST API and daemon. |
| 45 | +- Requires careful non-visibility verification (Phase 0) to avoid the unacceptable Android-notification outcome. |
| 46 | +- The passive watchdog already covers high-traffic accounts; the heartbeat only earns its keep on low/zero-traffic accounts. |
| 47 | + |
| 48 | +**Recommendation:** Build it, but gate it behind config that is **off by default**, and ship Phase 0 (non-visibility proof) before any production rollout. |
| 49 | + |
| 50 | +## Reusing Existing Health-Check Infrastructure |
| 51 | + |
| 52 | +The heartbeat should mirror the established **staleness-health-check** pattern rather than inventing a new observability surface. |
| 53 | + |
| 54 | +### 1. The staleness-health-check pattern — the closest precedent |
| 55 | + |
| 56 | +The ideal shape is a stream-staleness health check that reports `Healthy`/`Degraded`/`Unhealthy` based on **how long since the last inbound item**, with severity modulated by whether traffic is *expected* right now. The SignalCli analogue: |
| 57 | + |
| 58 | +- "last item received" timestamp → `_lastFrameTicks` (already stamped by the passive watchdog). |
| 59 | +- "no traffic expected right now" → "heartbeat disabled / account legitimately idle." |
| 60 | +- Severity tiers map cleanly: frames flowing → `Healthy`; heartbeat overdue but reconnect in progress → `Degraded`; heartbeat round-trip failed N times → `Unhealthy`. |
| 61 | + |
| 62 | +This means the heartbeat's *result* should be surfaced as an `IHealthCheck` (e.g. `SignalCliReceiveHeartbeatHealthCheck`) using `TimeProvider`-driven elapsed-time logic, so it plugs into the existing `/healthz` endpoints and alerting with zero new plumbing. |
| 63 | + |
| 64 | +### 2. `KubernetesProbeTypes` enum — config wiring |
| 65 | + |
| 66 | +[`KubernetesProbeTypes`](../../../CasCap.Common/src/CasCap.Common.Abstractions/_Enums.cs) (`None`/`Readiness`/`Liveness`/`Startup`, `[Flags]`) plus the [`GetTags()`](../../../CasCap.Common/src/CasCap.Common.Extensions.Diagnostics.HealthChecks/Extensions/KubernetesExtensions.cs) extension is the established pattern every SmartHaus feature uses to register a health check (see [`BuderusServiceCollectionExtensions`](../../src/CasCap.Api.Buderus/Extensions/ServiceCollectionExtensions.cs)). The heartbeat health check registers identically: |
| 67 | + |
| 68 | +```csharp |
| 69 | +if (config.HeartbeatHealthCheck != KubernetesProbeTypes.None) |
| 70 | + services.AddHealthChecks() |
| 71 | + .AddCheck<SignalCliReceiveHeartbeatHealthCheck>( |
| 72 | + SignalCliReceiveHeartbeatHealthCheck.Name, |
| 73 | + tags: config.HeartbeatHealthCheck.GetTags()); |
| 74 | +``` |
| 75 | + |
| 76 | +A dead receive thread is a **liveness** failure (the pod should be restarted), so the recommended default tag once enabled is `Liveness` — but it stays `None` (disabled) until Phase 0 validation passes. |
| 77 | + |
| 78 | +### 3. `SignalCliConnectionHealthCheck` / `HttpEndpointCheckBase` — not a fit |
| 79 | + |
| 80 | +The existing [`SignalCliConnectionHealthCheck`](../../src/CasCap.Api.SignalCli/HealthChecks/SignalCliConnectionHealthCheck.cs) only probes the REST API's `/v1/about` endpoint via [`HttpEndpointCheckBase`](../../../CasCap.Common/src/CasCap.Common.Extensions.Diagnostics.HealthChecks/Diagnostics/HealthChecks/HttpEndpointCheckBase.cs). That proves the **Go REST API** is reachable — it says nothing about the **Java daemon's receive thread**, which is precisely the layer that fails silently. The heartbeat is complementary, not a replacement. |
| 81 | + |
| 82 | +## Proposed Implementation |
| 83 | + |
| 84 | +### Components |
| 85 | + |
| 86 | +```mermaid |
| 87 | +flowchart LR |
| 88 | + Timer[Heartbeat timer\nHeartbeatIntervalMs] -->|self typing indicator| REST[signal-cli REST API] |
| 89 | + REST --> Daemon[Java signal-cli daemon] |
| 90 | + Daemon -->|sync transcript| WS[WebSocket receive loop] |
| 91 | + WS -->|stamp _lastHeartbeatSeenTicks| State[(Liveness state)] |
| 92 | + Timer -->|expects echo within HeartbeatTimeoutMs| State |
| 93 | + State -->|overdue x N| HC[SignalCliReceiveHeartbeatHealthCheck] |
| 94 | + State -->|overdue| Reconnect[Abort WebSocket -> reconnect] |
| 95 | +``` |
| 96 | + |
| 97 | +1. **Heartbeat sender** — a `PeriodicTimer` loop (mirroring the existing watchdog loop in `SignalCliJsonRpcClientService`) sends a self-addressed **typing indicator** every `HeartbeatIntervalMs`, tagging each ping with the send timestamp. |
| 98 | +2. **Echo detector** — the receive loop already deserializes every inbound frame; extend it to recognise the self-sync transcript and stamp `_lastHeartbeatSeenTicks`. (If the transcript can't be correlated precisely, treat *any* inbound frame after a ping as proof of life — the goal is liveness, not exactly-once accounting.) |
| 99 | +3. **Failure action** — if a ping is not echoed within `HeartbeatTimeoutMs`, log `LogError`, increment a consecutive-miss counter, and `Abort()` the WebSocket to force the existing reconnect path. A reconnect alone will **not** clear a poisoned `msg-cache`; the loud error is the actionable signal, and the health check escalates to `Unhealthy` after `HeartbeatFailureThreshold` consecutive misses so Kubernetes restarts the pod. |
| 100 | +4. **Health check** — `SignalCliReceiveHeartbeatHealthCheck` reports status from the consecutive-miss counter and `_lastHeartbeatSeenTicks` elapsed time, using `TimeProvider` per the staleness-health-check precedent. |
| 101 | + |
| 102 | +### Configuration (additions to `SignalCliConfig`) |
| 103 | + |
| 104 | +| Property | Type | Default | Purpose | |
| 105 | +| --- | --- | --- | --- | |
| 106 | +| `HeartbeatIntervalMs` | `int` | `0` (disabled) | How often to send the self-ping. Suggested production value ~30 min. `0` disables the heartbeat entirely. | |
| 107 | +| `HeartbeatTimeoutMs` | `int` | `300000` | Max wait for a ping to echo back before counting a miss (~5 min). | |
| 108 | +| `HeartbeatFailureThreshold` | `int` | `3` | Consecutive missed echoes before the health check reports `Unhealthy`. | |
| 109 | +| `HeartbeatHealthCheck` | `KubernetesProbeTypes` | `None` | Probe tags for the heartbeat health check; `None` until Phase 0 passes, then `Liveness`. | |
| 110 | + |
| 111 | +All four follow the existing `IAppConfig` conventions (validation attributes, `<see cref>` deep links to the consuming service) and must be synced across the five config layers (`appsettings.json`, `appsettings.Development.json`, the gitignored Local tiers, and the prod ConfigMap [`haus-appsettings.yaml`](../../../KNX_K8S/src/workloads/configmaps/prd-k3s/haus-appsettings.yaml)). The existing `ReceiveStalenessTimeoutMs` passive watchdog remains and can coexist (belt-and-braces for high-traffic accounts). |
| 112 | + |
| 113 | +### Recommended Settings by Account |
| 114 | + |
| 115 | +| Account | `ReceiveStalenessTimeoutMs` | `HeartbeatIntervalMs` | Rationale | |
| 116 | +| --- | --- | --- | --- | |
| 117 | +| **SmartHaus** | `0` (passive watchdog useless at 3–5 day cadence) | ~`1800000` (30 min) once Phase 0 passes | Active heartbeat is the only viable detector for a quiet account. | |
| 118 | +| **Zero-inbound account** | `0` | `0` for now (no inbound integration) → enable if/when inbound is added | Nothing receives inbound yet; revisit when integration lands. | |
| 119 | +| **High-traffic account** (hypothetical) | non-zero (e.g. a few hours) | `0` | Organic traffic makes the cheap passive watchdog sufficient. | |
| 120 | + |
| 121 | +## Phased Plan |
| 122 | + |
| 123 | +- [ ] **`HB-0` (Blocking)** — *Non-visibility proof.* Manually send a self typing indicator (and a self receipt as fallback) via the REST API against a real linked device. Confirm: (a) **nothing** appears in the Android chat list or notifications, and (b) the action produces an inbound frame on `/v1/receive`. Do not proceed unless both hold. Record findings here. |
| 124 | +- [ ] **`HB-1`** — Add the four config properties to `SignalCliConfig` + sync all config layers + README. |
| 125 | +- [ ] **`HB-2`** — Implement the heartbeat sender loop and echo detection in `SignalCliJsonRpcClientService`, reusing the existing `PeriodicTimer`/abort-reconnect pattern. |
| 126 | +- [ ] **`HB-3`** — Implement `SignalCliReceiveHeartbeatHealthCheck` (model on the staleness-health-check pattern) and wire it via `KubernetesProbeTypes`/`GetTags()`. |
| 127 | +- [ ] **`HB-4`** — Unit tests (fake `TimeProvider`, simulated missed/late/on-time echoes) + a manually-run integration test against the live demo daemon. |
| 128 | +- [ ] **`HB-5`** — Enable on SmartHaus (`HeartbeatIntervalMs` ~30 min, `HeartbeatHealthCheck = Liveness`); leave zero-inbound accounts disabled. |
| 129 | + |
| 130 | +## Open Questions |
| 131 | + |
| 132 | +1. Does signal-cli's REST API expose typing indicators, and do they round-trip on `/v1/receive`? (Phase 0 — if not, use self-`SendReceipt`.) |
| 133 | +2. Can we correlate the echoed sync transcript back to a specific ping timestamp, or do we accept "any inbound after a ping = alive"? The latter is simpler and sufficient for liveness. |
| 134 | +3. Should a sustained heartbeat failure (poisoned cache that survives reconnects) escalate beyond pod restart — e.g. an out-of-band alert via a *different* notification channel, since the Signal path itself is the thing that's broken? |
0 commit comments