Fix race between publisher connected / subscriber waiting.#1263
Fix race between publisher connected / subscriber waiting.#1263fancycode wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1263 +/- ##
==========================================
+ Coverage 62.80% 62.89% +0.08%
==========================================
Files 129 129
Lines 18810 18861 +51
==========================================
+ Hits 11814 11863 +49
Misses 5897 5897
- Partials 1099 1101 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
3c822ee to
c21c347
Compare
|
Tested in production on our deployment (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, ~30 concurrent users, Talk Desktop 2.1.x on Windows/macOS). Built from the Monitoring for regressions — will report back after a few days of normal usage. |
|
Completed testing over ~1 hour with 3 test calls (2 participants each, one on internal network + Talk Desktop, one on iOS v23.0.2 via mobile/external network). Results: The Both subscribers successfully received Comparison with #1254: Same observable outcome — zero race-condition errors under both patches. The reference-counting approach in #1263 is architecturally sounder (no TTL expiry risk) and more transparent in logs. No regressions found. Will continue monitoring production over the next few days and report back. |
|
Production incident with 1263-test: "No such feed (1)" cascade not prevented We have been running However, we later observed a production incident that PR#1263 did not prevent. What happened At T+0: three VideoRoom sessions closed simultaneously, including one with a 43 MB recording. This created a burst of cleanup events on the single WebSocket connection between HPB and Janus. At T+37s: a publisher (internal client, Talk Desktop v2.0.5) joined the room. Janus immediately logged: At T+67s: a subscriber joined (external client using TURN/TLS relay). At T+68s: Janus logged: Note the feed ID: 1 — a default/uninitialized value, not the publisher's actual Janus feed ID. This indicates HPB initiated the subscriber join before it had received and processed the publisher's The cascade 49 total At peak (90/min) Janus's message queue backed up enough that HPB's The cascade self-resolved after ~17 minutes. The subscriber eventually connected successfully. Root cause analysis There are two overlapping issues, both rooted in the same architectural constraint. The underlying constraint: single WebSocket, single Janus message processing thread All HPB↔Janus communication goes through one persistent WebSocket connection. Janus processes messages from that connection in a single thread. This means any burst of events — even from routine activity like three simultaneous session closures — creates a processing queue. Messages behind that queue are delayed regardless of server hardware capacity. Three simultaneous closures is already enough to open a race window; five or ten would make it proportionally wider and more likely to affect multiple subscribers simultaneously. Issue 1 — feed=1 race (root cause) HPB initiates the subscriber This is distinct from the Issue 2 — tight retry loop (amplifier) Once Suggestions Fix 1 — wait for publisher feed ID before initiating subscriber join Before sending the subscriber Fix 2 — exponential backoff in the retry loop In addition, adding backoff between case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
retryCount++
backoff := time.Duration(retryCount*retryCount) * 100 * time.Millisecond
if backoff > 2*time.Second {
backoff = 2*time.Second
}
waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
if err := waiter.Wait(ctx); err != nil {
stop()
...
}
stop()
select {
case <-ctx.Done():
callback(ctx.Err(), nil)
return
case <-time.After(backoff):
}
goto retryThis does not affect the happy path (first retry succeeds immediately when publisher is ready) but prevents the retry loop from amplifying a temporary Janus queue backup into a 17-minute ICE storm. |
|
3-week production update (June 22) We have been running the Original race condition: resolved Zero occurrences of "No such feed" with random feed IDs (7, 15, 43, etc.) since June 3. The Revisiting the June 6 incident On reflection, the "No such feed (1)" cascade we reported on June 6 more likely originated from stale subscriber handles held by outdated Talk clients (v1.1.7) rather than HPB sending an uninitialized feed ID. We observe a persistent baseline of "No such feed" with feed IDs specifically equal to 1 or 2 across all our production logs — always correlated with the older client version, not with publisher join timing. We may have misattributed causation in that report. That said, the backoff suggestion from our June 6 comment remains valid independently: a tight retry loop amplifies any "No such feed" storm regardless of its origin, and adding backoff would limit the blast radius. Summary PR #1263 is stable and effective on our deployment. Happy to retest against any updated revision if you push changes. |
Alternative implementation to #1254, fixes #1257.