Fix race between publisher connected / subscriber waiting. by fancycode · Pull Request #1263 · strukturag/nextcloud-spreed-signaling

fancycode · 2026-06-03T18:53:46Z

Alternative implementation to #1254, fixes #1257.

codecov · 2026-06-03T19:01:12Z

Codecov Report

❌ Patch coverage is 88.29787% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.89%. Comparing base (d82f7f2) to head (c21c347).

Files with missing lines	Patch %	Lines
sfu/janus/test/janus.go	0.00%	8 Missing ⚠️
async/notifier.go	97.36%	1 Missing and 1 partial ⚠️
sfu/janus/remote_publisher.go	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1263      +/-   ##
==========================================
+ Coverage   62.80%   62.89%   +0.08%     
==========================================
  Files         129      129              
  Lines       18810    18861      +51     
==========================================
+ Hits        11814    11863      +49     
  Misses       5897     5897              
- Partials     1099     1101       +2

Flag	Coverage Δ
go-1.25	`62.83% <88.29%> (+0.07%)`	⬆️
go-1.26	`62.87% <88.29%> (+0.14%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
root	`∅ <ø> (∅)`
api	`81.70% <ø> (ø)`
async	`87.47% <97.36%> (+0.70%)`	⬆️
client	`70.38% <ø> (ø)`
cmd/client	`6.34% <ø> (ø)`
cmd/proxy	`60.59% <ø> (ø)`
cmd/server	`0.00% <ø> (ø)`
config	`82.14% <ø> (ø)`
container	`93.79% <ø> (ø)`
dns	`90.82% <ø> (ø)`
etcd	`81.96% <ø> (ø)`
geoip	`68.09% <ø> (ø)`
grpc	`75.56% <ø> (ø)`
internal	`96.12% <ø> (ø)`
log	`94.44% <ø> (ø)`
metrics	`90.00% <ø> (ø)`
mock	`∅ <ø> (∅)`
nats	`73.55% <ø> (ø)`
pool	`81.91% <ø> (ø)`
proxy	`81.35% <ø> (ø)`
security	`77.50% <ø> (-1.25%)`	⬇️
server	`72.17% <ø> (+0.03%)`	⬆️
session	`78.57% <ø> (ø)`
sfu	`43.71% <50.00%> (+0.05%)`	⬆️
talk	`69.27% <ø> (+0.28%)`	⬆️
test	`86.04% <ø> (ø)`

Files with missing lines	Coverage Δ
sfu/janus/janus.go	`57.66% <100.00%> (+0.18%)`	⬆️
sfu/janus/publisher.go	`45.42% <100.00%> (+0.20%)`	⬆️
sfu/janus/remote_publisher.go	`34.61% <50.00%> (+0.84%)`	⬆️
async/notifier.go	`97.89% <97.36%> (-2.11%)`	⬇️
sfu/janus/test/janus.go	`0.00% <0.00%> (ø)`

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

vladopol · 2026-06-03T21:05:30Z

Tested in production on our deployment (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, ~30 concurrent users, Talk Desktop 2.1.x on Windows/macOS).

Built from the race-wait-for-publisher branch and deployed replacing our previous patch (#1254). HPB started cleanly, all clients reconnected within seconds, no errors in logs.

Monitoring for regressions — will report back after a few days of normal usage.

vladopol · 2026-06-03T21:56:18Z

Completed testing over ~1 hour with 3 test calls (2 participants each, one on internal network + Talk Desktop, one on iOS v23.0.2 via mobile/external network).

Results:

The wait and retry mechanism fired in every call and worked correctly:

subscriber.go:248: Publisher ... not sending yet for video, wait and retry
subscriber.go:261: Retry subscribing video
subscriber.go:104: Subscriber ... received connected  ✓

Both subscribers successfully received connected in each call. No race-condition "No such feed" errors with random feed IDs observed (only feed=1 noise from outdated Talk v1.1.7 clients — unrelated to this fix).

Comparison with #1254: Same observable outcome — zero race-condition errors under both patches. The reference-counting approach in #1263 is architecturally sounder (no TTL expiry risk) and more transparent in logs.

No regressions found. Will continue monitoring production over the next few days and report back.

vladopol · 2026-06-06T17:04:45Z

Production incident with 1263-test: "No such feed (1)" cascade not prevented

We have been running 1263-test in production since early June (Nextcloud Talk 23.0.5, spreed-signaling 2.1.1, Janus 1.4.0). Initial testing looked good — three back-to-back calls, zero No such feed errors, both participants received Subscriber received connected. We posted that result earlier in this thread.

However, we later observed a production incident that PR#1263 did not prevent.

What happened

At T+0: three VideoRoom sessions closed simultaneously, including one with a 43 MB recording. This created a burst of cleanup events on the single WebSocket connection between HPB and Janus.

At T+37s: a publisher (internal client, Talk Desktop v2.0.5) joined the room. Janus immediately logged:

[WARN] Didn't receive video (substream #1) for more than 1 second(s)
[WARN] Didn't receive video for more than 1 second(s)

At T+67s: a subscriber joined (external client using TURN/TLS relay).

At T+68s: Janus logged:

[ERR] [plugins/janus_videoroom.c:janus_videoroom_handler:10208] No such feed (1)

Note the feed ID: 1 — a default/uninitialized value, not the publisher's actual Janus feed ID. This indicates HPB initiated the subscriber join before it had received and processed the publisher's joinroom response from Janus.

The cascade

49 total No such feed errors over the following 10 minutes. Each error destroyed the subscriber handle and triggered a retry, which created a new ICE agent with new candidates, which then failed:

ICE failures/min: 16 → 47 → 32 → 48 → 49 → 78 → 90 → 31 → 64 → 8

At peak (90/min) Janus's message queue backed up enough that HPB's requestoffer calls hit context timeout:

Could not send MCU message {type:requestoffer}: context deadline exceeded
Could not send MCU message {type:requestoffer}: not connected

The cascade self-resolved after ~17 minutes. The subscriber eventually connected successfully.

Root cause analysis

There are two overlapping issues, both rooted in the same architectural constraint.

The underlying constraint: single WebSocket, single Janus message processing thread

All HPB↔Janus communication goes through one persistent WebSocket connection. Janus processes messages from that connection in a single thread. This means any burst of events — even from routine activity like three simultaneous session closures — creates a processing queue. Messages behind that queue are delayed regardless of server hardware capacity. Three simultaneous closures is already enough to open a race window; five or ten would make it proportionally wider and more likely to affect multiple subscribers simultaneously.

Issue 1 — feed=1 race (root cause)

HPB initiates the subscriber join before the publisher's joinroom response has arrived from Janus, sending feed=1 (uninitialized) instead of the actual feed ID. Under normal load this window is negligible. Under any Janus processing pressure — even three simultaneous session closures — the joinroom response is delayed in the queue long enough for the race to trigger.

This is distinct from the webrtcup race addressed by this PR. The publisher is already registered in the VideoRoom but HPB does not yet know its feed ID.

Issue 2 — tight retry loop (amplifier)

Once No such feed fires, the retry loop runs without any delay between iterations. If the notification/waiter resolves immediately (publisher already connected), the loop becomes tight — retrying as fast as Janus can respond with another error. Each iteration creates a new ICE agent, generating candidates that then fail, flooding Janus further and worsening the queue backup that caused the problem in the first place.

Suggestions

Fix 1 — wait for publisher feed ID before initiating subscriber join

Before sending the subscriber join message, HPB should verify that the publisher's actual Janus feed ID has been received (i.e. the publisher's joinroom response from Janus has been processed). If not yet available, HPB should block with a waiter — analogous to the existing webrtcup waiter — rather than proceeding with an uninitialized value of 1. This addresses the root cause directly.

Fix 2 — exponential backoff in the retry loop

In addition, adding backoff between No such feed retries in subscriber.go would limit the ICE agent storm when the error persists across multiple iterations:

case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
    retryCount++
    backoff := time.Duration(retryCount*retryCount) * 100 * time.Millisecond
    if backoff > 2*time.Second {
        backoff = 2*time.Second
    }
    waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
    if err := waiter.Wait(ctx); err != nil {
        stop()
        ...
    }
    stop()
    select {
    case <-ctx.Done():
        callback(ctx.Err(), nil)
        return
    case <-time.After(backoff):
    }
    goto retry

This does not affect the happy path (first retry succeeds immediately when publisher is ready) but prevents the retry loop from amplifying a temporary Janus queue backup into a 17-minute ICE storm.

vladopol · 2026-06-22T17:53:19Z

3-week production update (June 22)

We have been running the 1263-test build (c21c347) in production for three weeks now (~30 concurrent users, Nextcloud Talk 23.0.5, Janus 1.4.0).

Original race condition: resolved

Zero occurrences of "No such feed" with random feed IDs (7, 15, 43, etc.) since June 3. The wait and retry mechanism continues to fire and recover correctly on every call. We consider the original race fully addressed by this PR.

Revisiting the June 6 incident

On reflection, the "No such feed (1)" cascade we reported on June 6 more likely originated from stale subscriber handles held by outdated Talk clients (v1.1.7) rather than HPB sending an uninitialized feed ID. We observe a persistent baseline of "No such feed" with feed IDs specifically equal to 1 or 2 across all our production logs — always correlated with the older client version, not with publisher join timing. We may have misattributed causation in that report.

That said, the backoff suggestion from our June 6 comment remains valid independently: a tight retry loop amplifies any "No such feed" storm regardless of its origin, and adding backoff would limit the blast radius.

Summary

PR #1263 is stable and effective on our deployment. Happy to retest against any updated revision if you push changes.

fancycode mentioned this pull request Jun 3, 2026

Fix race condition: subscriber misses publisher-connected notification (No such feed) #1254

Open

Fix race between publisher connected / subscriber waiting.

c21c347

fancycode force-pushed the race-wait-for-publisher branch from 3c822ee to c21c347 Compare June 3, 2026 19:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race between publisher connected / subscriber waiting.#1263

Fix race between publisher connected / subscriber waiting.#1263
fancycode wants to merge 1 commit into
masterfrom
race-wait-for-publisher

fancycode commented Jun 3, 2026

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

vladopol commented Jun 3, 2026

Uh oh!

vladopol commented Jun 3, 2026

Uh oh!

vladopol commented Jun 6, 2026

Uh oh!

vladopol commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

fancycode commented Jun 3, 2026

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vladopol commented Jun 3, 2026

Uh oh!

vladopol commented Jun 3, 2026

Uh oh!

vladopol commented Jun 6, 2026

Uh oh!

vladopol commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 3, 2026 •

edited

Loading