feat(spf): multi cdn failover #1671
Draft
cjpillsbury wants to merge 11 commits into
…s pre-pass
The failover half of multi-CDN, as a hard constraint in track-switching, on top of a new generic constraints pre-pass. (Auto-detection, the per-CDN breaker that writes failedCdns from observed fetch failures, is the remaining layer; failedCdns is externally driven for now.)
Constraints phase (generic; also unblocks capability-probing):
- applyConstraints runs before the rule chain, pruning the unplayable; wired into the candidateSet computed so a constraint's signal reads re-prune reactively. New `constraints` config slot. Default-empty → no behavior change on its own.
Failed-CDN constraint (multi-CDN failover):
- excludeFailedCdns removes tracks whose CDN is in `failedCdns`; the active-CDN scope then falls to the next CDN in cdnPriority and snaps back to the primary on recovery. Wired into both the video and audio chains.
- `failedCdns` state materialized via shareSignals: externally drivable until the breaker lands, and stays writable for manual CDN override.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
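The constraints pre-pass described in this commit can be pictured roughly as follows. This is a minimal sketch, not the SPF implementation: the `Track` and `Constraint` shapes, the plain-array `failedCdns` argument (the real one is a signal), and the local `getCdnId` helper are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the real SPF types are not shown in this PR text.
interface Track {
  id: string;
  url: string;
}
type Constraint = (tracks: readonly Track[]) => readonly Track[];

// Assumed CDN key: the URL's origin, falling back to the raw string when unparseable.
function getCdnId(url: string): string {
  try {
    return new URL(url).origin;
  } catch {
    return url;
  }
}

// Run every constraint before any selection rule sees the candidates.
// An empty constraint list returns the input unchanged.
function applyConstraints(
  tracks: readonly Track[],
  constraints: readonly Constraint[],
): readonly Track[] {
  return constraints.reduce((remaining, constraint) => constraint(remaining), tracks);
}

// Constraint factory: drop tracks served from a CDN currently listed as failed.
function excludeFailedCdns(failedCdns: readonly string[]): Constraint {
  return (tracks) => tracks.filter((t) => !failedCdns.includes(getCdnId(t.url)));
}
```

With an empty `failedCdns` list the constraint is a no-op, which matches the "default-empty → no behavior change" note above.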
Completes multi-CDN auto-failover: a per-CDN failure monitor observes fetch outcomes and trips a CDN into cooldown, which the failed-CDN constraint then prunes, so the active-CDN scope fails over to the next CDN with no external write, and returns to the primary when the cooldown lapses.
- network/failover-monitor.ts: pure, clock-free per-CDN failure tracker (consecutive-failure threshold + cooldown; `failedCdns(now)` query).
- setup-failover-monitor.ts: behavior that owns `failedCdns`, publishes a `failoverReporter` to context, and supplies the real clock + cooldown-lapse timers; per-source lifecycle.
- Fetch-site reporting: resolve-track (media playlists; honors response.ok) and the segment FetchBytes (reportFetchBytes wrapper). Aborts excluded.
- Wired into both HLS engines; engine config `failover` (threshold / cooldown, defaults 3 / 30s).
Minimal slice: the segment path counts only thrown fetch errors (HTTP-status error classification is deferred to network-resilience); all-CDNs-down keeps the last pick (terminal "nothing playable" modeling deferred).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
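A clock-free tracker of the kind this commit describes (consecutive-failure threshold plus cooldown, queried with an explicit `now`) might look like the sketch below. Only the `failedCdns(now)` query name and the 3 / 30s defaults come from the commit text; `createFailoverMonitor`, `reportSuccess`, `reportFailure`, and the option names are hypothetical.

```typescript
// Hypothetical option and method names; only `failedCdns(now)` is named in the commit.
interface FailoverMonitorOptions {
  failureThreshold: number; // consecutive failures before a CDN trips
  cooldownMs: number; // how long a tripped CDN stays excluded
}

interface CdnHealth {
  consecutiveFailures: number;
  trippedUntil: number; // timestamp (ms) at which the cooldown lapses
}

function createFailoverMonitor({ failureThreshold, cooldownMs }: FailoverMonitorOptions) {
  const health = new Map<string, CdnHealth>();

  const entry = (cdn: string): CdnHealth => {
    let h = health.get(cdn);
    if (!h) {
      h = { consecutiveFailures: 0, trippedUntil: 0 };
      health.set(cdn, h);
    }
    return h;
  };

  return {
    // A success resets the consecutive-failure streak.
    reportSuccess(cdn: string): void {
      entry(cdn).consecutiveFailures = 0;
    },
    // The caller supplies `now`, so the tracker never reads a clock itself.
    reportFailure(cdn: string, now: number): void {
      const h = entry(cdn);
      h.consecutiveFailures += 1;
      if (h.consecutiveFailures >= failureThreshold) {
        h.trippedUntil = now + cooldownMs;
        h.consecutiveFailures = 0;
      }
    },
    // CDNs whose cooldown has not lapsed at `now`.
    failedCdns(now: number): string[] {
      return Array.from(health.entries())
        .filter(([, h]) => now < h.trippedUntil)
        .map(([cdn]) => cdn);
    },
  };
}
```

Passing `now` in keeps the tracker deterministic under test; the behavior that wraps it would supply the real clock and the cooldown-lapse timers.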
Rework the just-landed failover monitor into a simpler signal-driven shape, removing the imperative reporter and the per-CDN failure count:
- Trip on the first terminal fetch failure: drop the consecutive-failure count + `failureThreshold` config; the cooldown is the only back-off. Absorbing transient blips is the retry layer's job once it exists.
- Fetch sites own the trip: resolve-track adds the failing CDN to `failedCdns` inline on a failed media-playlist fetch. setupFailoverMonitor no longer publishes a context reporter; it watches `failedCdns` and removes each CDN once its cooldown lapses (site-adds, behavior-expires).
- Delete network/failover-monitor.ts (bookkeeping folded into the behavior) and drop `failedCdns` from shareSignals (the monitor owns it now).
Segment fetches no longer trip failover for now; that coverage returns in a second pass that abstracts the trip into a shared fetch wrapper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive createSimpleHlsEngine against a real Mux ?redundant_streams=true
source and exercise failover end-to-end: a fetch wrapper rejects the
primary origin (cdnPriority[0]) while the manifest + backup origin hit
the real network. Asserts the primary trips into failedCdns, the selected
video track resolves on the backup, and — once unblocked past its
cooldown — recovers back to the primary.
Hits the network, so it's gated behind VITE_FAILOVER_SMOKE and skipped in
the default suite. Run on demand:
VITE_FAILOVER_SMOKE=1 pnpm -F @videojs/spf test \
src/playback/engines/hls/tests/failover-smoke.test.ts
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CDN grouping is derived at four sites that must agree: building `cdnPriority`, recording the failover trip in `failedCdns`, and the track-switching CDN scope + failover constraint. Hardcoding origin there locked every consumer into host-based redundancy.
Introduce a `GetCdnId = (url) => string` type and an optional `getCdnId` on both engine configs (default: origin). The single shared config is broadcast to every behavior, so each site reads `config.getCdnId ?? getCdnId` and the keys stay comparable. `switchAudioTrack` now spreads `...config` like `switchVideoTrack` so the override reaches the audio CDN rules too (its video-only ABR fields ride along harmlessly; audio has no bandwidthState to act on them).
Lets a consumer key CDNs on something other than origin (e.g. Mux's `cdn=` query param). Default behavior is unchanged. Covered by an engine test that keys on a query param and asserts cdnPriority, the trip, and the constraint/scope all honor it end-to-end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
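To make the `getCdnId` override concrete, here is a small sketch of an origin-based default next to a query-param variant. The `GetCdnId` type matches the commit text; the helper names `getCdnIdByOrigin`, `getCdnIdByQueryParam`, and `buildCdnPriority`, the example URLs, and the raw-string fallback for unparseable URLs are assumptions.

```typescript
type GetCdnId = (url: string) => string;

// Default keying: the URL's origin. Returning the raw input for an unparseable
// URL is an assumed fallback; the PR does not say what the real one returns.
const getCdnIdByOrigin: GetCdnId = (url) => {
  try {
    return new URL(url).origin;
  } catch {
    return url;
  }
};

// Hypothetical override: key on a `cdn=` query parameter, falling back to origin.
const getCdnIdByQueryParam: GetCdnId = (url) => {
  try {
    return new URL(url).searchParams.get('cdn') ?? getCdnIdByOrigin(url);
  } catch {
    return url;
  }
};

// Distinct CDN ids in first-seen (manifest) order. Every site must use the
// same `getCdnId`, otherwise the keys would not be comparable.
function buildCdnPriority(
  urls: readonly string[],
  getCdnId: GetCdnId = getCdnIdByOrigin,
): string[] {
  return Array.from(new Set(urls.map(getCdnId)));
}
```

When redundant variants share a host and differ only in `cdn=`, origin keying collapses them into one CDN, while the query-param key keeps them distinct.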
Pull the failover trip out of resolve-track's task body into a decorated fetch, mirroring how the segment loader decorates its FetchBytes:
- `network/fetch.ts` gains `FetchText` + a default `fetchResolvableText` (fetch → reject on non-OK → text), the text analog of `FetchBytes`.
- `reportFailedCdns(fetchText, failedCdns?, getCdnId)` decorates a `FetchText` so a failed fetch trips the track's CDN into `failedCdns`.
- Each `resolve*` behavior builds its decorated fetch via a small `failoverFetch(state, config)` bridge and passes it through config; `setupTrackResolution` just uses it. `failedCdns` stays owned by the failover monitor; the bridge reads it off an intersection-typed state slice, so `resolve*` never declares it.
Parse failures no longer trip the CDN (only fetch/non-OK do): a 200 with an unparseable body is a content issue, not CDN unavailability.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the CDN-failover trip back into resolve-track's local `failoverFetch`
helper (a flat try/catch around `fetchResolvableText`) instead of routing
through a `reportFailedCdns` decorator imported from `setup-failover-monitor`.
- `reportFailedCdns` was vestigial after this; delete it (and its now-unused
`FetchText`/`GetCdnId` imports). `setup-failover-monitor` is purely the
expiry behavior again, and `resolve-track` no longer imports from it.
- Drop the unused `fetchResolvableText` config-injection slot — playlists,
unlike segments, have no per-type fetch variation to inject. `ResolveTrackConfig`
is now just `{ getCdnId? }`.
`fetchResolvableText` rejects on non-OK, so the single catch covers both
non-OK status and network errors. `FetchText` + `fetchResolvableText` stay in
network/fetch.ts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
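The "flat try/catch around `fetchResolvableText`" can be sketched as below. This is an illustration under assumptions, not the code in `resolve-track`: the fetch implementation is injected so the sketch runs without a network, `onCdnFailed` stands in for the write to the `failedCdns` signal, and the abort check assumes aborts surface as errors named `AbortError`.

```typescript
type FetchText = (url: string) => Promise<string>;

// Minimal structural type for the parts of a fetch response the sketch needs.
type FetchImpl = (
  url: string,
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// fetch → reject on non-OK → text.
function createFetchResolvableText(fetchImpl: FetchImpl): FetchText {
  return async (url) => {
    const response = await fetchImpl(url);
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
    return response.text();
  };
}

// One catch covers both a thrown fetch and a non-OK status, because the
// wrapped fetchText already rejects on non-OK.
function createFailoverFetch(
  fetchText: FetchText,
  onCdnFailed: (cdn: string) => void, // hypothetical stand-in for the failedCdns write
  getCdnId: (url: string) => string = (url) => new URL(url).origin,
): FetchText {
  return async (url) => {
    try {
      return await fetchText(url);
    } catch (error) {
      // Assumption: an aborted request is not a CDN failure and must not trip.
      const aborted = (error as { name?: unknown } | null)?.name === 'AbortError';
      if (!aborted) onCdnFailed(getCdnId(url));
      throw error; // the caller still sees the original failure
    }
  };
}
```

Because parsing happens after the decorated fetch resolves, a 200 response with an unparseable body never reaches the catch, which matches the "parse failures don't trip" rule from the previous commit.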
Segments now trip multi-CDN failover, not just media playlists, closing the gap where a CDN healthy at load but degrading mid-stream went undetected (VOD playlists are fetched once, so segments are the only ongoing signal).
- `setup-buffer-actors` decorates its per-type segment fetch (`trackedFetch` / `fetchStream`) with `failoverFetchBytes`, mirroring resolve-track's `failoverFetch`: a failed fetch trips the segment's CDN into `failedCdns`; aborts don't; reads the signal off an anchored state slice so the behavior never declares it.
- Extract the dedup-append into a pure `addFailedCdn(failed, cdn)` in `media/utils/cdn.ts` (idempotent, order-preserving, returns the same reference on a no-op trip). Both failover decorators use it; unit-tested.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
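The three properties listed for `addFailedCdn` (idempotent, order-preserving, same reference on a no-op) fit in a one-line function. The version below is a sketch consistent with that description; the actual body in `media/utils/cdn.ts` is not shown in this PR text, and the `readonly string[]` parameter type is an assumption.

```typescript
// Sketch of a pure dedup-append. Returning the *same* array when the CDN is
// already present lets reference-equality checks treat a repeat trip as
// "nothing changed".
function addFailedCdn(failed: readonly string[], cdn: string): readonly string[] {
  return failed.includes(cdn) ? failed : [...failed, cdn];
}
```

A new array is built only on a real change, and the input is never mutated, so a signal holding the list can compare by reference to detect updates.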
The doc was written to the "failover deferred, depends on network-resilience" framing. Bring it in line with what shipped:
- Both sub-features implemented (`definition: implemented`); failover is self-contained (site-adds / behavior-expires, trip-on-first-failure + cooldown). `network-resilience` is reframed from hard prerequisite to an optional refinement that would sit below the trip.
- Constraints pre-pass built, `getCdnId` configurable, segment trips added (`failoverFetch` + `failoverFetchBytes`, `addFailedCdn`), reflected in the phases table, implementation surface, and verification.
- New "Follow-up candidates" section tracks the intentionally-deferred refinements (no cooldown extension, flapping, all-CDNs-down terminal state, coarse HTTP-status classification, retries-below-the-trip, audio config tidiness).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`resolve` is reserved for resolvables (presentations, tracks). This behavior derives and owns the `cdnPriority` signal from the already-resolved presentation, so `derive` is the accurate verb. Mirrors the shape of `calculatePresentationDuration`.
Renames the behavior export, the `ResolveCdnPriorityState` type, the module + test files, and all references across the HLS engines, track-switching, and the multi-cdn-failover feature doc. The `cdnPriority` signal/slot name is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled `effect()` + `stop()` inside the resolved state's `entry` with the reactor's own `effects:` block, and move the timers map up to the `setup` closure so the scheduler (`effects:`) and the exit cleanup (`entry`'s returned teardown) can share it. The exit still clears pending timers and resets `failedCdns` for the next source.
Behavior-preserving; now structurally identical to its sibling `deriveCdnPriority` (entry-cleanup + effects:). Drops the unused `effect` import.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TL;DR
Multi-CDN failover for HLS sources that publish the same content on more than
one host (Mux Video's `?redundant_streams=true`). The redundant variants
already parse as ordinary per-CDN candidate tracks, so this is modeled as
selecting a CDN and keeping the whole presentation on it inside the
track-switching rule model, as a scope rule (sticky pick) + a constraint
(failover), rather than URL rewriting at fetch time. Validated by engine
integration tests and a gated live round-trip against a real Mux source.
Internal SPF (alpha); engine config + state additions are additive.
For reviewers — how to read this PR
The single biggest churn is the design-doc rewrite (`multi-cdn-failover.md`,
+186/−143, docs-only). The runtime change is two new behaviors
(`setup-failover-monitor`, plus the `deriveCdnPriority` rename) and fetch-site
decorators threaded into existing behaviors/engines.
Runtime (focus here)
Read fully:
- `setup-failover-monitor.ts` (new): owns `failedCdns`; per-source cooldown timers that expire a tripped CDN; clears the set + timers on src unload (no cross-source leak); trips on a single terminal failure (retry/back-off is a future layer below the trip, not here)
Targeted careful read:
- `track-switching.ts` (+134/−37): `preferActiveCdn` scope (active CDN = highest-priority `cdnPriority` entry with surviving tracks; falls through when a CDN is empty) + `excludeFailedCdns` constraint, shared by the video + audio chains via one `CdnRuleConfig`; cross-type coherence (both chains read the same definition → agree on the CDN); `applyConstraints` pre-pass runs before the rule chain and should be order-independent; `SwitchableTrack` gains `url`
- `resolve-track.ts` (+76/−10): `failoverFetch` decorates the media-playlist fetch; one catch covers both a thrown fetch and a non-OK status (`fetchResolvableText` rejects on non-OK); the failing CDN is keyed off the track URL via the configured `getCdnId`
- `setup-buffer-actors.ts` (+43/−5): `failoverFetchBytes` trips on a segment-fetch failure the same way; the decorator wraps `trackedFetch` / `fetchStream` without touching buffer-actor lifecycle
- `cdn.ts` (+28/−10): `addFailedCdn` is pure + idempotent (returns the same array reference when the CDN is already present; the monitor's set-membership watch relies on this to not reschedule); `getCdnId` origin default + unparseable-URL fallback
Skim structure only:
- `engine.ts` (+38/−6) / `engine-audio-only.ts` (+28/−5): compose `deriveCdnPriority` + `setupFailoverMonitor` after `resolvePresentation`, add `failover?` / `getCdnId?` config + `cdnPriority?` / `failedCdns?` state; audio-only mirrors main.
- `derive-cdn-priority.ts` (+10/−7): the `resolveCdnPriority` → `deriveCdnPriority` rename + `getCdnId` threading, no behavior change.
- `network/fetch.ts` (new): `FetchText` + `fetchResolvableText` (fetch → reject on non-OK → text), the text analog of `FetchBytes`
Skim file: Tests (6 files, +411):
- `engine.test.ts` integration (manifest-order `cdnPriority` + pick on primary; reorder re-narrows; auto-failover when a media-playlist fetch fails; custom `getCdnId` honored end-to-end)
- `setup-failover-monitor.test.ts` (cooldown expiry, per-CDN independence, per-source clear)
- `track-switching.test.ts` (scope/constraint + failover-via-constraint)
- `cdn.test.ts` / `derive-cdn-priority.test.ts` / `resolve-track.test.ts`
- `failover-smoke.test.ts` is the gated live round-trip (see Smoke test)
Design doc (skim)
Skim file: `internal/design/spf/features/multi-cdn-failover.md` (+186/−143):
the feature spec, now `definition: implemented`. Carries the "constraint +
scope, not URL rewriting" rationale and the site-adds / behavior-expires
failover split; the Follow-up candidates section enumerates the deliberately
deferred gaps.
Smoke test
Sticky CDN pick is observable through the SPF segment-loading harness against a
real Mux redundant source. `preload=auto` is required: the harness defaults to
`preload=none`, which never activates the manifest fetch.
Sandbox: `/spf-segment-loading/?preload=auto&muted=true&autoplay=true&src=https://stream.mux.com/s41JYeqIpBMBzE4OzxDyGR2yrp2hD1CQ6gJN9SlVGDQ.m3u8?redundant_streams=true`
- playback.
- confirms the source really is multi-CDN (redundant variants parse as per-CDN tracks).
- `window.state().cdnPriority` → two hosts in manifest order (`…edgemv.mux.com`, then `…fastly.mux.com`).
- primary, `edgemv`): the whole presentation sticks to one CDN (cross-type coherence), not a per-type host split.
Failover / recovery has no built-in harness control (it needs a CDN to actually
fail). That round-trip (trip → failover to backup → recovery to primary) is
covered by the gated live test, `failover-smoke.test.ts`, which runs only when
`VITE_FAILOVER_SMOKE=1` is set.
What changed — by surface
Sticky CDN pick (scope). `deriveCdnPriority` publishes a per-presentation
ordered `cdnPriority` list (the manifest's distinct CDN hosts, most-preferred
first, mirroring HLS content steering's `PATHWAY-PRIORITY`), and clears it on
src unload. A shared `preferActiveCdn` scope rule narrows every track type's
candidates to the highest-priority CDN that still has tracks, so video / audio
/ text all resolve from one host. The active CDN is derived
(first-with-survivors), never stored.
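The first-with-survivors derivation can be illustrated with a short sketch. The `Track` shape, the origin-based `cdnOf` helper, and the choice to return the candidates unchanged when no priority entry matches are assumptions, not details taken from `track-switching.ts`.

```typescript
// Hypothetical track shape; the PR only says SwitchableTrack gains a `url`.
interface Track {
  id: string;
  url: string;
}

// Assumed CDN key for this sketch: the URL's origin.
const cdnOf = (url: string): string => new URL(url).origin;

// Scope rule: narrow candidates to the highest-priority CDN that still has
// tracks. Nothing is stored; the active CDN falls out of the inputs each time.
function preferActiveCdn<T extends Track>(
  tracks: readonly T[],
  cdnPriority: readonly string[],
): readonly T[] {
  for (const cdn of cdnPriority) {
    const survivors = tracks.filter((t) => cdnOf(t.url) === cdn);
    if (survivors.length > 0) return survivors;
  }
  // Assumption: with no matching CDN, leave the candidates as they are.
  return tracks;
}
```

If a constraint prunes the primary CDN's tracks before this rule runs, the loop simply lands on the next entry; once the pruning stops, the same call returns the primary again with no state to reset.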
Failover (constraint + trip/expiry). Site-adds, behavior-expires. Fetch
sites add the failing CDN's origin to a `failedCdns` set on a terminal fetch
failure: `failoverFetch` for media playlists in `resolve-track`,
`failoverFetchBytes` for segments in `setup-buffer-actors`. A shared
`excludeFailedCdns` constraint (in `setupTrackSwitching`'s new
`applyConstraints` pre-pass) prunes that CDN's tracks, so the scope's
first-with-survivors falls to the next CDN automatically.
`setupFailoverMonitor` removes the CDN once a cooldown lapses, and the scope
snaps back, with no reactive "active CDN" rewrite at any point.
Configurable CDN identity. `getCdnId` (origin-based default) is overridable
via engine config (e.g. to key on Mux's `cdn=` query param) and is threaded
to all four CDN-id sites so the keys used to build `cdnPriority`, trip
`failedCdns`, and evaluate the scope/constraint stay comparable.
Notable design decisions
- in `resolveTrack`. Redundant variants already parse as per-CDN tracks, so CDN choice is a track-selection problem. Alternative considered: rewrite the selected track's URL at fetch time, or store a reactive `activeCdn` value. Rejected because the ordered-list shape (`cdnPriority`, active = first-with-survivors) makes failover a pure constraint (pruning moves the pick, un-pruning returns it) and composes with content-steering later as a plain list reorder, with no active-value rewrite to keep in sync.
- `network-resilience` circuit-breaker. `setupFailoverMonitor` is a per-source cooldown timer. Alternative considered: depend on the (unbuilt) `network-resilience` cluster for retry/backoff/error-classification. Rejected because failover shouldn't block on that cluster; retries would sit below the trip (so it sees only post-retry terminal failures) as a future refinement, not a prerequisite.
- media-playlist status trips immediately; cooldown is the only back-off. Alternative considered: an N-failures threshold or a windowed/decaying health metric. Rejected as over-built for the first cut. Reviewers: is trip-on-first too aggressive for flaky-but-not-dead CDNs? (flapping is called out below.)
Reviewer callouts — known limitations
- the candidate set is empty and the prior pick is silently left in place; a distinct "nothing playable" state is unmodeled (shared with the constraints-pre-pass work).
- can oscillate: trip → cooldown lapses → re-preferred (it's `cdnPriority[0]`) → fails again. No hysteresis or growing back-off on repeated trips; re-failing mid-cooldown doesn't push the deadline out.
- fetch or a non-OK media-playlist status; finer classification (5xx-with-body vs 4xx, segment-side codes) is deferred to `network-resilience`. A 200 with an unparseable body deliberately does not trip (content issue, not CDN unavailability).
- `switchAudioTrack` config tidiness (cosmetic). Audio spreads `...config`, so video-only ABR fields ride into the shared ranker harmlessly (audio has no `bandwidthState`). A cross-cutting-only shared config type would keep them out.
Breaking changes
The `resolveCdnPriority` → `deriveCdnPriority` rename is internal SPF only: not
a public export, no external consumers. New engine config (`failover?`,
`getCdnId?`) and state (`cdnPriority?`, `failedCdns?`) are additive.
Test plan
- `VITE_FAILOVER_SMOKE=1 pnpm -F @videojs/spf test src/playback/engines/hls/tests/failover-smoke.test.ts` passes: trip → failover to backup → recovery to primary against a real Mux `?redundant_streams=true` source.
- `apps/e2e` pages are generated from `media.ts`, so a redundant source would sweep into every generic + visual spec. The gated engine smoke test covers the round-trip, including recovery (which isn't observable at the player DOM).