
feat(spf): multi cdn failover#1671

Draft
cjpillsbury wants to merge 11 commits into feat/spf-multi-cdn from feat/spf-multi-cdn-failover


cjpillsbury commented Jun 9, 2026

Collaborator

Base branch: targets feat/spf-multi-cdn (the sticky-CDN-pick PR), not
main — will repoint to main once that lands. The Files changed tab is
already scoped to just this PR's content.

TL;DR

Multi-CDN failover for HLS sources that publish the same content on more than
one host (Mux Video's ?redundant_streams=true). The redundant variants
already parse as ordinary per-CDN candidate tracks, so this is modeled as
selecting a CDN and keeping the whole presentation on it inside the
track-switching rule model — a scope rule (sticky pick) + a constraint
(failover) — rather than URL rewriting at fetch time. Validated by engine
integration tests and a gated live round-trip against a real Mux source.
Internal SPF (alpha); engine config + state additions are additive.

For reviewers — how to read this PR

The single biggest churn is the design-doc rewrite (multi-cdn-failover.md,
+186/−143, docs-only). The runtime change is one new behavior
(setup-failover-monitor), the deriveCdnPriority rename, and fetch-site
decorators threaded into existing behaviors/engines.

Runtime (focus here)

Read fully: setup-failover-monitor.ts (new) — owns failedCdns;
per-source cooldown timers that expire a tripped CDN; clears the set + timers
on src unload (no cross-source leak); trips on a single terminal failure
(retry/back-off is a future layer below the trip, not here)

Targeted careful read:

  • track-switching.ts (+134/−37) — preferActiveCdn scope (active CDN =
    highest-priority cdnPriority entry with surviving tracks; falls through
    when a CDN is empty) + excludeFailedCdns constraint, shared by the video +
    audio chains via one CdnRuleConfig; cross-type coherence (both chains read
    the same definition → agree on the CDN); applyConstraints pre-pass runs
    before the rule chain and should be order-independent; SwitchableTrack
    gains url
  • resolve-track.ts (+76/−10) — failoverFetch decorates the media-playlist
    fetch; one catch covers both a thrown fetch and a non-OK status
    (fetchResolvableText rejects on non-OK); the failing CDN is keyed off the
    track URL via the configured getCdnId
  • setup-buffer-actors.ts (+43/−5) — failoverFetchBytes trips on a
    segment-fetch failure the same way; the decorator wraps
    trackedFetch/fetchStream without touching buffer-actor lifecycle
  • cdn.ts (+28/−10) — addFailedCdn is pure + idempotent (returns the same
    array reference when the CDN is already present — the monitor's
    set-membership watch relies on this to not reschedule); getCdnId origin
    default + unparseable-URL fallback
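
To make the cdn.ts and resolve-track.ts items above concrete, here is a minimal sketch of how the pieces could fit together. The names getCdnId, addFailedCdn, failoverFetch and FetchText come from this PR; the bodies, the FailedCdnsSlot holder (a stand-in for the engine's failedCdns signal), the raw-string fallback for unparseable URLs, and the example.com URLs are all assumptions for illustration.

```typescript
// Illustrative sketch only; names mirror the PR, bodies are assumptions.
type GetCdnId = (url: string) => string;
type FetchText = (url: string) => Promise<string>;

// Default identity: the URL's origin. Returning the raw string for an
// unparseable URL is an assumed fallback, not necessarily what cdn.ts does.
const getCdnId: GetCdnId = (url) => {
  try {
    return new URL(url).origin;
  } catch {
    return url;
  }
};

// Pure and idempotent: a repeat trip returns the same array reference, so a
// watcher that compares references sees no change and schedules nothing new.
function addFailedCdn(failed: readonly string[], cdn: string): readonly string[] {
  return failed.includes(cdn) ? failed : [...failed, cdn];
}

// Hypothetical stand-in for the engine's failedCdns signal.
interface FailedCdnsSlot {
  value: readonly string[];
}

// Decorates a FetchText: any rejection other than an abort trips the URL's
// CDN, then the error is rethrown so the caller still sees the failure.
function failoverFetch(
  fetchText: FetchText,
  slot: FailedCdnsSlot,
  cdnId: GetCdnId = getCdnId,
): FetchText {
  return async (url) => {
    try {
      return await fetchText(url);
    } catch (error) {
      const aborted = error instanceof Error && error.name === 'AbortError';
      if (!aborted) slot.value = addFailedCdn(slot.value, cdnId(url));
      throw error;
    }
  };
}

const slot: FailedCdnsSlot = { value: [] };
const once = addFailedCdn(slot.value, getCdnId('https://cdn-a.example.com/v/720p.m3u8'));
const twice = addFailedCdn(once, 'https://cdn-a.example.com');
console.log(once, twice === once);

// A stub fetch that always rejects, to exercise the decorator.
const failing: FetchText = () => Promise.reject(new Error('503'));
failoverFetch(failing, slot)('https://cdn-a.example.com/v/720p.m3u8').catch(() => {
  console.log(slot.value);
});
```

Because the fetch helper is described as rejecting on non-OK, a single catch in the decorator is enough to cover both thrown fetches and bad statuses.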

Skim structure only: engine.ts (+38/−6) / engine-audio-only.ts
(+28/−5) — compose deriveCdnPriority + setupFailoverMonitor after
resolvePresentation, add failover?/getCdnId? config + cdnPriority?/
failedCdns? state; audio-only mirrors main. derive-cdn-priority.ts
(+10/−7) — the resolveCdnPriority → deriveCdnPriority rename +
getCdnId threading, no behavior change. network/fetch.ts (new) — FetchText
+ fetchResolvableText (fetch → reject on non-OK → text), the text analog of
FetchBytes

Skim file: Tests (6 files, +411) — engine.test.ts integration
(manifest-order cdnPriority + pick on primary; reorder re-narrows;
auto-failover when a media-playlist fetch fails; custom getCdnId honored
end-to-end); setup-failover-monitor.test.ts (cooldown expiry, per-CDN
independence, per-source clear); track-switching.test.ts (scope/constraint +
failover-via-constraint); cdn.test.ts / derive-cdn-priority.test.ts /
resolve-track.test.ts. failover-smoke.test.ts is the gated live round-trip
(see Smoke test)

Design doc (skim)

Skim file: internal/design/spf/features/multi-cdn-failover.md (+186/−143)
— the feature spec, now definition: implemented. Carries the "constraint +
scope, not URL rewriting" rationale and the site-adds / behavior-expires
failover split; the Follow-up candidates section enumerates the deliberately
deferred gaps.

Smoke test

Sticky CDN pick is observable through the SPF segment-loading harness against a
real Mux redundant source. preload=auto is required — the harness defaults to
preload=none, which never activates the manifest fetch.

Sandbox: /spf-segment-loading/?preload=auto&muted=true&autoplay=true&src=https://stream.mux.com/s41JYeqIpBMBzE4OzxDyGR2yrp2hD1CQ6gJN9SlVGDQ.m3u8?redundant_streams=true

  • Video plays (time advances, buffer grows) — the CDN pick doesn't break
    playback.
  • The rendition picker lists each rendition as "N rendition(s) across CDNs",
    which confirms the source really is multi-CDN (redundant variants parse as
    per-CDN tracks).
  • Console: window.state().cdnPriority → two hosts in manifest order
    (…edgemv.mux.com, then …fastly.mux.com).
  • The selected video and audio tracks resolve from the same host (the
    primary, edgemv) — the whole presentation sticks to one CDN (cross-type
    coherence), not a per-type host split.

Failover / recovery has no built-in harness control (it needs a CDN to actually
fail). That round-trip — trip → failover to backup → recovery to primary —
is the gated live test:

VITE_FAILOVER_SMOKE=1 pnpm -F @videojs/spf test \
  src/playback/engines/hls/tests/failover-smoke.test.ts

What changed — by surface

Sticky CDN pick (scope). deriveCdnPriority publishes a per-presentation
ordered cdnPriority list (the manifest's distinct CDN hosts, most-preferred
first — mirroring HLS content steering's PATHWAY-PRIORITY), and clears it on
src unload. A shared preferActiveCdn scope rule narrows every track type's
candidates to the highest-priority CDN that still has tracks, so video / audio
/ text all resolve from one host. The active CDN is derived (first-with-
survivors), never stored.
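
As a rough illustration of the sticky pick, the sketch below derives an ordered CDN list from track URLs and narrows each track type to the first CDN that still has candidates. deriveCdnPriority and preferActiveCdn are the PR's names, but the signatures, the Track shape, the origin-based key, and the example.com URLs are assumptions made for the example.

```typescript
// Illustrative sketch; Track and both function bodies are assumptions.
interface Track {
  id: string;
  url: string;
}

const cdnOf = (url: string): string => new URL(url).origin;

// Distinct CDN ids in first-seen order, standing in for manifest order.
function deriveCdnPriority(tracks: readonly Track[]): string[] {
  return Array.from(new Set(tracks.map((track) => cdnOf(track.url))));
}

// Scope rule: keep only the candidates on the highest-priority CDN that still
// has any. The active CDN is computed on every call, never stored.
function preferActiveCdn(
  candidates: readonly Track[],
  cdnPriority: readonly string[],
): readonly Track[] {
  for (const cdn of cdnPriority) {
    const onCdn = candidates.filter((track) => cdnOf(track.url) === cdn);
    if (onCdn.length > 0) return onCdn;
  }
  return candidates; // Assumed: with no matching CDN, leave the set untouched.
}

const video: Track[] = [
  { id: 'v720-a', url: 'https://cdn-a.example.com/720p.m3u8' },
  { id: 'v720-b', url: 'https://cdn-b.example.com/720p.m3u8' },
];
const audio: Track[] = [
  { id: 'aac-a', url: 'https://cdn-a.example.com/audio.m3u8' },
  { id: 'aac-b', url: 'https://cdn-b.example.com/audio.m3u8' },
];

const cdnPriority = deriveCdnPriority([...video, ...audio]);
const pickedVideo = preferActiveCdn(video, cdnPriority);
const pickedAudio = preferActiveCdn(audio, cdnPriority);
console.log(cdnPriority, pickedVideo.map((t) => t.id), pickedAudio.map((t) => t.id));
```

Both chains read the same cdnPriority, which is what keeps video and audio on one host without any shared "active CDN" value.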

Failover (constraint + trip/expiry). Site-adds, behavior-expires. Fetch
sites add the failing CDN's origin to a failedCdns set on a terminal fetch
failure — failoverFetch for media playlists in resolve-track,
failoverFetchBytes for segments in setup-buffer-actors. A shared
excludeFailedCdns constraint (in setupTrackSwitching's new applyConstraints
pre-pass) prunes that CDN's tracks, so the scope's first-with-survivors falls
to the next CDN automatically. setupFailoverMonitor removes the CDN once a
cooldown lapses, and the scope snaps back — no reactive "active CDN" rewrite at
any point.
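
The prune-then-pick flow described above can be sketched as follows. excludeFailedCdns and applyConstraints are named in the PR; the Constraint type, the activeCdn helper, the Track shape, and the example.com hosts are hypothetical and only meant to show why pruning moves the pick and un-pruning returns it.

```typescript
// Illustrative sketch; types and bodies are assumptions, not the PR's code.
interface Track {
  id: string;
  url: string;
}
type Constraint = (candidates: readonly Track[]) => readonly Track[];

const cdnOf = (url: string): string => new URL(url).origin;

// Constraint: drop every track served from a CDN currently marked failed.
function excludeFailedCdns(failedCdns: readonly string[]): Constraint {
  return (candidates) =>
    candidates.filter((track) => !failedCdns.includes(cdnOf(track.url)));
}

// Pre-pass: each constraint only removes candidates, so the surviving set
// does not depend on the order the constraints run in.
function applyConstraints(
  candidates: readonly Track[],
  constraints: readonly Constraint[],
): readonly Track[] {
  return constraints.reduce<readonly Track[]>(
    (survivors, constraint) => constraint(survivors),
    candidates,
  );
}

// Hypothetical helper: first CDN in priority order with surviving tracks.
function activeCdn(
  candidates: readonly Track[],
  cdnPriority: readonly string[],
): string | undefined {
  return cdnPriority.find((cdn) =>
    candidates.some((track) => cdnOf(track.url) === cdn),
  );
}

const A = 'https://cdn-a.example.com';
const B = 'https://cdn-b.example.com';
const tracks: Track[] = [
  { id: 'v720-a', url: `${A}/720p.m3u8` },
  { id: 'v720-b', url: `${B}/720p.m3u8` },
];
const cdnPriority = [A, B];

const pick = (failedCdns: readonly string[]): string | undefined =>
  activeCdn(applyConstraints(tracks, [excludeFailedCdns(failedCdns)]), cdnPriority);

const healthy = pick([]); // primary
const tripped = pick([A]); // falls to the backup
const recovered = pick([]); // snaps back once the CDN is un-pruned
const allDown = pick([A, B]); // nothing survives; a caller would keep its prior pick
console.log(healthy, tripped, recovered, allDown);
```

Nothing in this flow writes an active-CDN value; the pick is recomputed from the pruned set each time.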

Configurable CDN identity. getCdnId (origin-based default) is overridable
via engine config — e.g. to key on Mux's cdn= query param — and is threaded
to all four CDN-id sites so the keys used to build cdnPriority, trip
failedCdns, and evaluate the scope/constraint stay comparable.
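
A getCdnId override might look like the sketch below. The GetCdnId type shape is taken from the commit message; the `cdn` parameter values, the media.example.com URLs, and the fallback behavior are placeholders rather than the real Mux URL format.

```typescript
// Illustrative sketch; URLs and fallbacks are assumptions.
type GetCdnId = (url: string) => string;

// Origin-based default, with an assumed raw-string fallback.
const originCdnId: GetCdnId = (url) => {
  try {
    return new URL(url).origin;
  } catch {
    return url;
  }
};

// Hypothetical override: key on a `cdn` query param, else fall back to origin.
const queryParamCdnId: GetCdnId = (url) => {
  try {
    return new URL(url).searchParams.get('cdn') || originCdnId(url);
  } catch {
    return url;
  }
};

// Placeholder URLs: same origin, different `cdn` param.
const primary = 'https://media.example.com/720p.m3u8?cdn=alpha';
const backup = 'https://media.example.com/720p.m3u8?cdn=beta';

// The origin default cannot tell these apart; the override can.
console.log(originCdnId(primary) === originCdnId(backup));
console.log(queryParamCdnId(primary), queryParamCdnId(backup));

// The override would be supplied through engine config, e.g.
// { getCdnId: queryParamCdnId }, so every CDN-id site uses the same keys.
```

Whatever key function is chosen, the important property is that all sites call the same one, otherwise a tripped key would never match a cdnPriority entry.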

Notable design decisions

  • Constraint + scope in the track-switching model, not active-URI rotation
    in resolveTrack.
    Redundant variants already parse as per-CDN tracks, so
    CDN choice is a track-selection problem. Alternative considered: rewrite the
    selected track's URL at fetch time, or store a reactive activeCdn value.
    Rejected because the ordered-list shape (cdnPriority, active =
    first-with-survivors) makes failover a pure constraint — pruning moves the
    pick, un-pruning returns it — and composes with content-steering later as a
    plain list reorder, with no active-value rewrite to keep in sync.
  • Self-contained cooldown, not a network-resilience circuit-breaker.
    setupFailoverMonitor is a per-source cooldown timer. Alternative
    considered: depend on the (unbuilt) network-resilience cluster for
    retry/backoff/error-classification. Rejected because failover shouldn't block
    on that cluster; retries would sit below the trip (so it sees only
    post-retry terminal failures) as a future refinement, not a prerequisite.
  • Trip on the first terminal failure. A network error or non-OK
    media-playlist status trips immediately; cooldown is the only back-off.
    Alternative considered: an N-failures threshold or a windowed/decaying health
    metric. Rejected as over-built for the first cut. Reviewers: is
    trip-on-first too aggressive for flaky-but-not-dead CDNs? (flapping is called
    out below.)
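
For discussion of the trip-on-first and cooldown decisions above, here is a simplified sketch of a per-CDN cooldown. It collapses the signal-watching behavior into a small object with an injectable scheduler so it can run without real timers; createCooldowns, Schedule, and the 30000 ms value (borrowed from an earlier commit's stated default) are assumptions, not the shape of setupFailoverMonitor.

```typescript
// Illustrative sketch; not the PR's reactor-based implementation.
type Cancel = () => void;
type Schedule = (callback: () => void, delayMs: number) => Cancel;

const realSchedule: Schedule = (callback, delayMs) => {
  const handle = setTimeout(callback, delayMs);
  return () => clearTimeout(handle);
};

function createCooldowns(cooldownMs: number, schedule: Schedule = realSchedule) {
  const pending = new Map<string, Cancel>();
  return {
    // First terminal failure trips immediately. A repeat trip during the
    // cooldown is a no-op, so the deadline is not pushed out.
    trip(cdn: string): void {
      if (pending.has(cdn)) return;
      pending.set(cdn, schedule(() => { pending.delete(cdn); }, cooldownMs));
    },
    failedCdns(): string[] {
      return Array.from(pending.keys());
    },
    // Source unload: cancel timers and forget every tripped CDN.
    clear(): void {
      pending.forEach((cancel) => cancel());
      pending.clear();
    },
  };
}

// Fake scheduler so the example runs synchronously.
const queue: Array<() => void> = [];
const fakeSchedule: Schedule = (callback) => {
  queue.push(callback);
  return () => {
    const index = queue.indexOf(callback);
    if (index !== -1) queue.splice(index, 1);
  };
};

const cooldowns = createCooldowns(30000, fakeSchedule);
cooldowns.trip('https://cdn-a.example.com');
cooldowns.trip('https://cdn-a.example.com'); // no second timer
const duringCooldown = cooldowns.failedCdns();
const timersScheduled = queue.length;
queue.splice(0).forEach((fire) => fire()); // the cooldown lapses
const afterCooldown = cooldowns.failedCdns();
console.log(duringCooldown, timersScheduled, afterCooldown);
```

The no-op on a repeat trip is exactly the flapping limitation listed under known limitations: a CDN that fails again right after its cooldown lapses gets the same fixed cooldown, with no growing back-off.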

Reviewer callouts — known limitations

  • [Deferred] All-CDNs-down has no terminal state. When every CDN is pruned
    the candidate set is empty and the prior pick is silently left in place; a
    distinct "nothing playable" state is unmodeled (shared with the
    constraints-pre-pass work).
  • [Deferred] Flapping / no cooldown extension on re-failure. A flaky CDN
    can oscillate — trip → cooldown lapses → re-preferred (it's cdnPriority[0])
    → fails again. No hysteresis or growing back-off on repeated trips; re-failing
    mid-cooldown doesn't push the deadline out.
  • [Deferred] Coarse HTTP-status classification. The trip fires on a thrown
    fetch or a non-OK media-playlist status; finer classification (5xx-with-body
    vs 4xx, segment-side codes) is deferred to network-resilience. A 200 with
    an unparseable body deliberately does not trip (content issue, not CDN
    unavailability).
  • [Author] switchAudioTrack config tidiness (cosmetic). Audio spreads
    ...config, so video-only ABR fields ride into the shared ranker harmlessly
    (audio has no bandwidthState). A cross-cutting-only shared config type would
    keep them out.

Breaking changes

The resolveCdnPriority → deriveCdnPriority rename is internal SPF only — not
a public export, no external consumers. New engine config (failover?,
getCdnId?) and state (cdnPriority?, failedCdns?) are additive.

Test plan

  • Gated live failover smoke test — VITE_FAILOVER_SMOKE=1 pnpm -F @videojs/spf test src/playback/engines/hls/tests/failover-smoke.test.ts
    passes: trip → failover to backup → recovery to primary against a real
    Mux ?redundant_streams=true source.
  • E2E through the html player + real MSE — deferred: apps/e2e pages are
    generated from media.ts, so a redundant source would sweep into every
    generic + visual spec. The gated engine smoke test covers the round-trip,
    including recovery (which isn't observable at the player DOM).


vercel Bot commented Jun 9, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| v10-sandbox | Ready | Preview, Comment | Jun 10, 2026 7:02pm |


cjpillsbury and others added 11 commits June 10, 2026 11:57
…s pre-pass

The failover half of multi-CDN, as a hard constraint in track-switching, on
top of a new generic constraints pre-pass. (Auto-detection — the per-CDN
breaker that writes failedCdns from observed fetch failures — is the remaining
layer; failedCdns is externally driven for now.)

Constraints phase (generic; also unblocks capability-probing):
- applyConstraints runs before the rule chain, pruning the unplayable; wired
  into the candidateSet computed so a constraint's signal reads re-prune
  reactively. New `constraints` config slot. Default-empty → no behavior
  change on its own.

Failed-CDN constraint (multi-CDN failover):
- excludeFailedCdns removes tracks whose CDN is in `failedCdns`; the active-CDN
  scope then falls to the next CDN in cdnPriority and snaps back to the primary
  on recovery. Wired into both the video and audio chains.
- `failedCdns` state materialized via shareSignals — externally drivable until
  the breaker lands, and stays writable for manual CDN override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes multi-CDN auto-failover: a per-CDN failure monitor observes fetch
outcomes and trips a CDN into cooldown, which the failed-CDN constraint then
prunes — so the active-CDN scope fails over to the next CDN with no external
write, and returns to the primary when the cooldown lapses.

- network/failover-monitor.ts: pure, clock-free per-CDN failure tracker
  (consecutive-failure threshold + cooldown; `failedCdns(now)` query).
- setup-failover-monitor.ts: behavior that owns `failedCdns`, publishes a
  `failoverReporter` to context, and supplies the real clock + cooldown-lapse
  timers; per-source lifecycle.
- Fetch-site reporting: resolve-track (media playlists; honors response.ok) and
  the segment FetchBytes (reportFetchBytes wrapper). Aborts excluded.
- Wired into both HLS engines; engine config `failover` (threshold / cooldown,
  defaults 3 / 30s).

Minimal slice: the segment path counts only thrown fetch errors (HTTP-status
error classification is deferred to network-resilience); all-CDNs-down keeps
the last pick (terminal "nothing playable" modeling deferred).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rework the just-landed failover monitor into a simpler signal-driven
shape, removing the imperative reporter and the per-CDN failure count:

- Trip on the first terminal fetch failure — drop the consecutive-failure
  count + `failureThreshold` config; the cooldown is the only back-off.
  Absorbing transient blips is the retry layer's job once it exists.
- Fetch sites own the trip: resolve-track adds the failing CDN to
  `failedCdns` inline on a failed media-playlist fetch. setupFailoverMonitor
  no longer publishes a context reporter — it watches `failedCdns` and
  removes each CDN once its cooldown lapses (site-adds, behavior-expires).
- Delete network/failover-monitor.ts (bookkeeping folded into the behavior)
  and drop `failedCdns` from shareSignals (the monitor owns it now).

Segment fetches no longer trip failover for now; that coverage returns in
a second pass that abstracts the trip into a shared fetch wrapper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive createSimpleHlsEngine against a real Mux ?redundant_streams=true
source and exercise failover end-to-end: a fetch wrapper rejects the
primary origin (cdnPriority[0]) while the manifest + backup origin hit
the real network. Asserts the primary trips into failedCdns, the selected
video track resolves on the backup, and — once unblocked past its
cooldown — recovers back to the primary.

Hits the network, so it's gated behind VITE_FAILOVER_SMOKE and skipped in
the default suite. Run on demand:

  VITE_FAILOVER_SMOKE=1 pnpm -F @videojs/spf test \
    src/playback/engines/hls/tests/failover-smoke.test.ts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CDN grouping is derived at four sites that must agree — building
`cdnPriority`, recording the failover trip in `failedCdns`, and the
track-switching CDN scope + failover constraint. Hardcoding origin there
locked every consumer into host-based redundancy.

Introduce a `GetCdnId = (url) => string` type and an optional `getCdnId`
on both engine configs (default: origin). The single shared config is
broadcast to every behavior, so each site reads `config.getCdnId ??
getCdnId` and the keys stay comparable. `switchAudioTrack` now spreads
`...config` like `switchVideoTrack` so the override reaches the audio CDN
rules too (its video-only ABR fields ride along harmlessly — audio has no
bandwidthState to act on them).

Lets a consumer key CDNs on something other than origin (e.g. Mux's `cdn=`
query param). Default behavior is unchanged. Covered by an engine test
that keys on a query param and asserts cdnPriority, the trip, and the
constraint/scope all honor it end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull the failover trip out of resolve-track's task body into a decorated
fetch, mirroring how the segment loader decorates its FetchBytes:

- `network/fetch.ts` gains `FetchText` + a default `fetchResolvableText`
  (fetch → reject on non-OK → text), the text analog of `FetchBytes`.
- `reportFailedCdns(fetchText, failedCdns?, getCdnId)` decorates a
  `FetchText` so a failed fetch trips the track's CDN into `failedCdns`.
- Each `resolve*` behavior builds its decorated fetch via a small
  `failoverFetch(state, config)` bridge and passes it through config;
  `setupTrackResolution` just uses it. `failedCdns` stays owned by the
  failover monitor — the bridge reads it off an intersection-typed state
  slice, so `resolve*` never declares it.

Parse failures no longer trip the CDN (only fetch/non-OK do) — a 200 with
an unparseable body is a content issue, not CDN unavailability.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the CDN-failover trip back into resolve-track's local `failoverFetch`
helper (a flat try/catch around `fetchResolvableText`) instead of routing
through a `reportFailedCdns` decorator imported from `setup-failover-monitor`.

- `reportFailedCdns` was vestigial after this; delete it (and its now-unused
  `FetchText`/`GetCdnId` imports). `setup-failover-monitor` is purely the
  expiry behavior again, and `resolve-track` no longer imports from it.
- Drop the unused `fetchResolvableText` config-injection slot — playlists,
  unlike segments, have no per-type fetch variation to inject. `ResolveTrackConfig`
  is now just `{ getCdnId? }`.

`fetchResolvableText` rejects on non-OK, so the single catch covers both
non-OK status and network errors. `FetchText` + `fetchResolvableText` stay in
network/fetch.ts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Segments now trip multi-CDN failover, not just media playlists — closing
the gap where a CDN healthy at load but degrading mid-stream went
undetected (VOD playlists are fetched once, so segments are the only
ongoing signal).

- `setup-buffer-actors` decorates its per-type segment fetch
  (`trackedFetch` / `fetchStream`) with `failoverFetchBytes`, mirroring
  resolve-track's `failoverFetch`: a failed fetch trips the segment's CDN
  into `failedCdns`; aborts don't; reads the signal off an anchored state
  slice so the behavior never declares it.
- Extract the dedup-append into a pure `addFailedCdn(failed, cdn)` in
  `media/utils/cdn.ts` (idempotent, order-preserving, returns the same
  reference on a no-op trip). Both failover decorators use it; unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The doc was written to the "failover deferred, depends on network-resilience"
framing. Bring it in line with what shipped:

- Both sub-features implemented (`definition: implemented`); failover is
  self-contained (site-adds / behavior-expires, trip-on-first-failure +
  cooldown) — `network-resilience` reframed from hard prerequisite to an
  optional refinement that would sit below the trip.
- Constraints pre-pass built, `getCdnId` configurable, segment trips added
  (`failoverFetch` + `failoverFetchBytes`, `addFailedCdn`) — reflected in the
  phases table, implementation surface, and verification.
- New "Follow-up candidates" section tracks the intentionally-deferred
  refinements (no cooldown extension, flapping, all-CDNs-down terminal state,
  coarse HTTP-status classification, retries-below-the-trip, audio config
  tidiness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`resolve` is reserved for resolvables (presentations, tracks). This
behavior derives and owns the `cdnPriority` signal from the already-
resolved presentation, so `derive` is the accurate verb. Mirrors the
shape of `calculatePresentationDuration`.

Renames the behavior export, the `ResolveCdnPriorityState` type, the
module + test files, and all references across the HLS engines,
track-switching, and the multi-cdn-failover feature doc. The
`cdnPriority` signal/slot name is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled `effect()` + `stop()` inside the resolved state's
`entry` with the reactor's own `effects:` block, and move the timers map up
to the `setup` closure so the scheduler (`effects:`) and the exit cleanup
(`entry`'s returned teardown) can share it. The exit still clears pending
timers and resets `failedCdns` for the next source.

Behavior-preserving; now structurally identical to its sibling
`deriveCdnPriority` (entry-cleanup + effects:). Drops the unused `effect`
import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>