Skip to content

Fix race between publisher connected / subscriber waiting.#1263

Open
fancycode wants to merge 1 commit into
masterfrom
race-wait-for-publisher
Open

Fix race between publisher connected / subscriber waiting.#1263
fancycode wants to merge 1 commit into
masterfrom
race-wait-for-publisher

Conversation

@fancycode

Copy link
Copy Markdown
Member

Alternative implementation to #1254, fixes #1257.

@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.29787% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.89%. Comparing base (d82f7f2) to head (c21c347).

Files with missing lines Patch % Lines
sfu/janus/test/janus.go 0.00% 8 Missing ⚠️
async/notifier.go 97.36% 1 Missing and 1 partial ⚠️
sfu/janus/remote_publisher.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1263      +/-   ##
==========================================
+ Coverage   62.80%   62.89%   +0.08%     
==========================================
  Files         129      129              
  Lines       18810    18861      +51     
==========================================
+ Hits        11814    11863      +49     
  Misses       5897     5897              
- Partials     1099     1101       +2     
Flag Coverage Δ
go-1.25 62.83% <88.29%> (+0.07%) ⬆️
go-1.26 62.87% <88.29%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
root ∅ <ø> (∅)
api 81.70% <ø> (ø)
async 87.47% <97.36%> (+0.70%) ⬆️
client 70.38% <ø> (ø)
cmd/client 6.34% <ø> (ø)
cmd/proxy 60.59% <ø> (ø)
cmd/server 0.00% <ø> (ø)
config 82.14% <ø> (ø)
container 93.79% <ø> (ø)
dns 90.82% <ø> (ø)
etcd 81.96% <ø> (ø)
geoip 68.09% <ø> (ø)
grpc 75.56% <ø> (ø)
internal 96.12% <ø> (ø)
log 94.44% <ø> (ø)
metrics 90.00% <ø> (ø)
mock ∅ <ø> (∅)
nats 73.55% <ø> (ø)
pool 81.91% <ø> (ø)
proxy 81.35% <ø> (ø)
security 77.50% <ø> (-1.25%) ⬇️
server 72.17% <ø> (+0.03%) ⬆️
session 78.57% <ø> (ø)
sfu 43.71% <50.00%> (+0.05%) ⬆️
talk 69.27% <ø> (+0.28%) ⬆️
test 86.04% <ø> (ø)
Files with missing lines Coverage Δ
sfu/janus/janus.go 57.66% <100.00%> (+0.18%) ⬆️
sfu/janus/publisher.go 45.42% <100.00%> (+0.20%) ⬆️
sfu/janus/remote_publisher.go 34.61% <50.00%> (+0.84%) ⬆️
async/notifier.go 97.89% <97.36%> (-2.11%) ⬇️
sfu/janus/test/janus.go 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fancycode fancycode force-pushed the race-wait-for-publisher branch from 3c822ee to c21c347 Compare June 3, 2026 19:04
@vladopol

vladopol commented Jun 3, 2026

Copy link
Copy Markdown

Tested in production on our deployment (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, ~30 concurrent users, Talk Desktop 2.1.x on Windows/macOS).

Built from the race-wait-for-publisher branch and deployed replacing our previous patch (#1254). HPB started cleanly, all clients reconnected within seconds, no errors in logs.

Monitoring for regressions — will report back after a few days of normal usage.

@vladopol

vladopol commented Jun 3, 2026

Copy link
Copy Markdown

Completed testing over ~1 hour with 3 test calls (2 participants each, one on internal network + Talk Desktop, one on iOS v23.0.2 via mobile/external network).

Results:

The wait and retry mechanism fired in every call and worked correctly:

subscriber.go:248: Publisher ... not sending yet for video, wait and retry
subscriber.go:261: Retry subscribing video
subscriber.go:104: Subscriber ... received connected  ✓

Both subscribers successfully received connected in each call. No race-condition "No such feed" errors with random feed IDs observed (only feed=1 noise from outdated Talk v1.1.7 clients — unrelated to this fix).

Comparison with #1254: Same observable outcome — zero race-condition errors under both patches. The reference-counting approach in #1263 is architecturally sounder (no TTL expiry risk) and more transparent in logs.

No regressions found. Will continue monitoring production over the next few days and report back.

@vladopol

vladopol commented Jun 6, 2026

Copy link
Copy Markdown

Production incident with 1263-test: "No such feed (1)" cascade not prevented

We have been running 1263-test in production since early June (Nextcloud Talk 23.0.5, spreed-signaling 2.1.1, Janus 1.4.0). Initial testing looked good — three back-to-back calls, zero No such feed errors, both participants received Subscriber received connected. We posted that result earlier in this thread.

However, we later observed a production incident that PR#1263 did not prevent.

What happened

At T+0: three VideoRoom sessions closed simultaneously, including one with a 43 MB recording. This created a burst of cleanup events on the single WebSocket connection between HPB and Janus.

At T+37s: a publisher (internal client, Talk Desktop v2.0.5) joined the room. Janus immediately logged:

[WARN] Didn't receive video (substream #1) for more than 1 second(s)
[WARN] Didn't receive video for more than 1 second(s)

At T+67s: a subscriber joined (external client using TURN/TLS relay).

At T+68s: Janus logged:

[ERR] [plugins/janus_videoroom.c:janus_videoroom_handler:10208] No such feed (1)

Note the feed ID: 1 — a default/uninitialized value, not the publisher's actual Janus feed ID. This indicates HPB initiated the subscriber join before it had received and processed the publisher's joinroom response from Janus.

The cascade

49 total No such feed errors over the following 10 minutes. Each error destroyed the subscriber handle and triggered a retry, which created a new ICE agent with new candidates, which then failed:

ICE failures/min: 16 → 47 → 32 → 48 → 49 → 78 → 90 → 31 → 64 → 8

At peak (90/min) Janus's message queue backed up enough that HPB's requestoffer calls hit context timeout:

Could not send MCU message {type:requestoffer}: context deadline exceeded
Could not send MCU message {type:requestoffer}: not connected

The cascade self-resolved after ~17 minutes. The subscriber eventually connected successfully.

Root cause analysis

There are two overlapping issues, both rooted in the same architectural constraint.

The underlying constraint: single WebSocket, single Janus message processing thread

All HPB↔Janus communication goes through one persistent WebSocket connection. Janus processes messages from that connection in a single thread. This means any burst of events — even from routine activity like three simultaneous session closures — creates a processing queue. Messages behind that queue are delayed regardless of server hardware capacity. Three simultaneous closures is already enough to open a race window; five or ten would make it proportionally wider and more likely to affect multiple subscribers simultaneously.

Issue 1 — feed=1 race (root cause)

HPB initiates the subscriber join before the publisher's joinroom response has arrived from Janus, sending feed=1 (uninitialized) instead of the actual feed ID. Under normal load this window is negligible. Under any Janus processing pressure — even three simultaneous session closures — the joinroom response is delayed in the queue long enough for the race to trigger.

This is distinct from the webrtcup race addressed by this PR. The publisher is already registered in the VideoRoom but HPB does not yet know its feed ID.

Issue 2 — tight retry loop (amplifier)

Once No such feed fires, the retry loop runs without any delay between iterations. If the notification/waiter resolves immediately (publisher already connected), the loop becomes tight — retrying as fast as Janus can respond with another error. Each iteration creates a new ICE agent, generating candidates that then fail, flooding Janus further and worsening the queue backup that caused the problem in the first place.

Suggestions

Fix 1 — wait for publisher feed ID before initiating subscriber join

Before sending the subscriber join message, HPB should verify that the publisher's actual Janus feed ID has been received (i.e. the publisher's joinroom response from Janus has been processed). If not yet available, HPB should block with a waiter — analogous to the existing webrtcup waiter — rather than proceeding with an uninitialized value of 1. This addresses the root cause directly.

Fix 2 — exponential backoff in the retry loop

In addition, adding backoff between No such feed retries in subscriber.go would limit the ICE agent storm when the error persists across multiple iterations:

case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
    retryCount++
    backoff := time.Duration(retryCount*retryCount) * 100 * time.Millisecond
    if backoff > 2*time.Second {
        backoff = 2*time.Second
    }
    waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
    if err := waiter.Wait(ctx); err != nil {
        stop()
        ...
    }
    stop()
    select {
    case <-ctx.Done():
        callback(ctx.Err(), nil)
        return
    case <-time.After(backoff):
    }
    goto retry

This does not affect the happy path (first retry succeeds immediately when publisher is ready) but prevents the retry loop from amplifying a temporary Janus queue backup into a 17-minute ICE storm.

@vladopol

Copy link
Copy Markdown

3-week production update (June 22)

We have been running the 1263-test build (c21c347) in production for three weeks now (~30 concurrent users, Nextcloud Talk 23.0.5, Janus 1.4.0).

Original race condition: resolved

Zero occurrences of "No such feed" with random feed IDs (7, 15, 43, etc.) since June 3. The wait and retry mechanism continues to fire and recover correctly on every call. We consider the original race fully addressed by this PR.

Revisiting the June 6 incident

On reflection, the "No such feed (1)" cascade we reported on June 6 more likely originated from stale subscriber handles held by outdated Talk clients (v1.1.7) rather than HPB sending an uninitialized feed ID. We observe a persistent baseline of "No such feed" with feed IDs specifically equal to 1 or 2 across all our production logs — always correlated with the older client version, not with publisher join timing. We may have misattributed causation in that report.

That said, the backoff suggestion from our June 6 comment remains valid independently: a tight retry loop amplifies any "No such feed" storm regardless of its origin, and adding backoff would limit the blast radius.

Summary

PR #1263 is stable and effective on our deployment. Happy to retest against any updated revision if you push changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition: subscriber misses publisher-connected notification → "No such feed" on reconnect

2 participants