
iOS error-code → likely-cause classifier (raw codes aren't enough to guess root cause) #341

@jonathaneoliver

Why

When iOS playback fails, the numeric AVPlayer error code alone is rarely enough to identify the root cause, for three reasons. First, the same code is overloaded across domains: -12642 means kCMFormatDescriptionError_ValueNotAvailable in CoreMedia, but "No matching mediaFile from playlist" when AVPlayer raises it from the HLS path. Second, some codes are completely undocumented (-12860, -12888, -12174). Third, several of our worst incidents were cascades: one server-side defect produced a sequence of codes, where the first symptom carried the most diagnostic signal but the loudest log line was a downstream stale-playlist complaint.

We test against legal/well-formed streams, but spec-violating streams exist in the wild — we should be better at recognizing the effect of broken/malformed manifests on the player from error-code patterns alone.

What we know — go-live (server) bugs that produced iOS error codes

| Issue | Server defect | iOS effect |
|---|---|---|
| #109 | Loop-boundary playlist explosion (sliding window not capped across EXT-X-DISCONTINUITY; ~620 B → ~4.8 KB) | -12860 decode error → poll stops → -12888 "playlist unchanged for 1.5× target duration" on audio → playback dies |
| #110 | Residual #109 + content-type/codec-string mismatch on master variants | -12860 on 540p → ABR upshift to 2160p → repeated -12642 "No matching mediaFile" → -12888 cascade |
| #149 | Missing EXT-X-DISCONTINUITY-SEQUENCE (RFC 8216 §4.3.3.3 violation) + MAX_LIVE_WINDOW_DURATION too tight (36 s vs 20 s buffer) | -12642 + AVPlayer "Cannot Open" on cross-discontinuity ABR upshift; oldest-edge fall-off |
| #151 | v9 tags on EXT-X-VERSION:7 playlist; or HOLD-BACK < 3 × TARGETDURATION; or EXT-X-START inserted between #EXTM3U and #EXT-X-VERSION | -12646 "playlist parse error"; entire playlist rejected |
| #161 | HOLD-BACK not preserved across stall recovery (client-fix in iOS app) | Recurring stalls; player snapped to oldest-segment edge with zero margin |
| #94/#95/#96 | go-live concurrency (3-lock DASH cache, fresh LLHLSGenerator per 200 ms tick, double-lock duplicate MPD generation) | Lock contention → manifest stall under load → indirectly -12888/stalls |
| #137 | (LocalHTTPProxy bug) chunked TE forced on 206 Partial Content with Content-Length stripped | URLAsset err=-12174 flood on byte-range fetches |
| #145 | Newer iOS sims still emit -12174 after #137 fix; sim-only AVFoundation tightening | Cosmetic only; no real-device impact |
| #135 | Upstream URLSessionDataTask not cancelled when AVPlayer abandoned client | ECANCELED (POSIX 89) flood; wasted upstream traffic |
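
The cascades listed above (#109, #110) suggest that when -12888 follows an earlier error in the same session, the earlier error is the one to classify. Below is a minimal sketch of that heuristic. `ErrorEvent`, `likelyRootEvent`, the 30 s lookback window and the timestamps are all assumptions made for illustration; none of them exist in the codebase.

```swift
import Foundation

/// Hypothetical record of one player error (not an existing app type).
struct ErrorEvent {
    let code: Int
    let timestamp: TimeInterval  // seconds since session start
}

/// Codes that, in the incidents above, usually showed up as a consequence
/// of an earlier failure rather than as the first symptom.
let downstreamCodes: Set<Int> = [-12888]

/// Returns the event most likely to carry the root-cause signal: the earliest
/// non-downstream error within `lookback` seconds of the latest event.
/// Falls back to the latest event when nothing earlier qualifies.
func likelyRootEvent(in events: [ErrorEvent],
                     lookback: TimeInterval = 30) -> ErrorEvent? {
    let sorted = events.sorted { $0.timestamp < $1.timestamp }
    guard let latest = sorted.last else { return nil }
    let window = sorted.filter { latest.timestamp - $0.timestamp <= lookback }
    return window.first(where: { !downstreamCodes.contains($0.code) }) ?? latest
}

// Sequence modelled on #110 (timestamps are made up for the example).
let cascade = [
    ErrorEvent(code: -12860, timestamp: 2.0),
    ErrorEvent(code: -12642, timestamp: 6.5),
    ErrorEvent(code: -12642, timestamp: 8.0),
    ErrorEvent(code: -12888, timestamp: 17.0),
]
let root = likelyRootEvent(in: cascade)  // selects the -12860 event
```

A lone -12888 still comes back as its own root, which matches the "genuine go-live worker stall" case.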

Codes we've actually seen and what we currently know about them

| Code | Label / "errorComment" string | Cause we've identified |
|---|---|---|
| -12174 | URLAsset os_log warning (no AVError mapping) | 206 Partial Content returned with Transfer-Encoding: chunked and no Content-Length after a Range: request (#137); newer-sim variant: HTTP/1.1 Connection: close (#145) |
| -12642 | "No matching mediaFile from playlist" (HLS context); overloaded with kCMFormatDescriptionError_ValueNotAvailable | Missing EXT-X-DISCONTINUITY-SEQUENCE, live window too narrow vs forward buffer, or wrong content served (#149, #110) |
| -12646 | "playlist parse error" | v7-vs-v9 mismatch, HOLD-BACK underflow, or bad tag insertion order (#151) |
| -12860 | CoreMedia decode error (no public symbol) | Oversized/malformed playlist (#109); also genuine bad codec config in master |
| -12888 | "playlist unchanged for 1.5× target duration" | Almost always downstream: player stopped polling after an earlier -12860/-12646 and the audio playlist now appears stale; or genuine go-live worker stall |

PlaybackDiagnostics.swift:1150-1262 has the full reverse-engineered switch covering -11800…-11865 (AVError) and -12640…-12900 (CoreMedia families). Note the labels there are header-derived, not behavioral — -12642's real meaning in our incidents has nothing to do with "ValueNotAvailable".
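
To illustrate the header-derived vs behavioral distinction, here is a rough sketch of a lookup that prefers the label we observed in incidents when the error came from the HLS path, and otherwise falls back to the header-derived one. `observedHLSLabels` and `displayLabel` are hypothetical names; the actual switch in PlaybackDiagnostics.swift may be shaped quite differently.

```swift
/// Behavioral labels seen in our incidents, keyed by code.
/// Hypothetical table; it only restates the codes listed in this issue.
let observedHLSLabels: [Int: String] = [
    -12174: "URLAsset os_log warning (no AVError mapping)",
    -12642: "No matching mediaFile from playlist",
    -12646: "playlist parse error",
    -12860: "CoreMedia decode error (no public symbol)",
    -12888: "playlist unchanged for 1.5x target duration",
]

/// Picks the label to show on the dashboard.
/// - `headerDerived`: whatever the existing switch would return, if anything.
/// - `isHLSPath`: assumed to be known by the caller (for example, the asset
///   URL pointed at a playlist); how to detect this is left open.
func displayLabel(code: Int, headerDerived: String?, isHLSPath: Bool) -> String {
    if isHLSPath, let observed = observedHLSLabels[code] {
        return observed
    }
    return headerDerived ?? "unknown (\(code))"
}

let hlsLabel = displayLabel(
    code: -12642,
    headerDerived: "kCMFormatDescriptionError_ValueNotAvailable",
    isHLSPath: true)
let nonHLSLabel = displayLabel(
    code: -12642,
    headerDerived: "kCMFormatDescriptionError_ValueNotAvailable",
    isHLSPath: false)
```

Here `hlsLabel` is the playlist wording and `nonHLSLabel` keeps the CoreMedia symbol, so the header-derived label is only overridden where we have behavioral evidence.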

The core problem

A bare numeric code is one signal. To classify cause we need three:

  1. Code: AVPlayerItem.error.code / errorLog().events.last.errorStatusCode.
  2. Comment: errorComment / localizedDescription. This is what disambiguates the overloaded codes (-12642 → "No matching mediaFile" vs CM-domain).
  3. Context fingerprint captured at error time:

With these three, a rule classifier (code, comment_regex, predicates) → (probable_cause, confidence) turns -12642 into one of:

  • MISSING_DISCONTINUITY_SEQUENCE (high) — comment matches "No matching mediaFile" + a discontinuity wrap occurred in the last few seconds + variant switch was in flight
  • WINDOW_FELL_OFF_OLDEST_EDGE (medium) — comment matches "No matching mediaFile" + player position was within ~1× TARGETDURATION of MEDIA-SEQUENCE head
  • CM_FORMAT_VALUE_NOT_AVAILABLE (low) — error came from a non-HLS path
  • UNKNOWN (fallback)
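
The -12642 breakdown above could be expressed as ordered rules, first match wins. This is a sketch under stated assumptions, not a proposed final design: the `ErrorContext` fields, the 5 s reading of "last few seconds", and the `.low` confidence on the UNKNOWN fallback are all guesses made for the example.

```swift
import Foundation

enum Confidence: String { case low, medium, high }

/// Hypothetical context fingerprint; field names are illustrative only.
struct ErrorContext {
    let isHLSPath: Bool
    let secondsSinceDiscontinuityWrap: Double?  // nil if no wrap was seen
    let variantSwitchInFlight: Bool
    let secondsFromMediaSequenceHead: Double    // player position minus oldest edge
    let targetDuration: Double
}

struct Rule {
    let code: Int
    let commentRegex: String?                   // nil matches any comment
    let predicate: (ErrorContext) -> Bool
    let cause: String
    let confidence: Confidence
}

/// Evaluates rules in order and returns the first whose code, comment regex
/// and predicate all match; otherwise returns the UNKNOWN fallback.
func classify(code: Int, comment: String, context: ErrorContext,
              rules: [Rule]) -> (cause: String, confidence: Confidence) {
    for rule in rules where rule.code == code {
        if let pattern = rule.commentRegex,
           comment.range(of: pattern,
                         options: [.regularExpression, .caseInsensitive]) == nil {
            continue
        }
        if rule.predicate(context) {
            return (rule.cause, rule.confidence)
        }
    }
    return ("UNKNOWN", .low)  // fallback confidence is an assumption
}

let rules12642: [Rule] = [
    Rule(code: -12642, commentRegex: "No matching mediaFile",
         predicate: { ctx in
             guard let age = ctx.secondsSinceDiscontinuityWrap else { return false }
             return age <= 5 && ctx.variantSwitchInFlight  // 5 s is a guess
         },
         cause: "MISSING_DISCONTINUITY_SEQUENCE", confidence: .high),
    Rule(code: -12642, commentRegex: "No matching mediaFile",
         predicate: { $0.secondsFromMediaSequenceHead <= $0.targetDuration },
         cause: "WINDOW_FELL_OFF_OLDEST_EDGE", confidence: .medium),
    Rule(code: -12642, commentRegex: nil,
         predicate: { !$0.isHLSPath },
         cause: "CM_FORMAT_VALUE_NOT_AVAILABLE", confidence: .low),
]

let upshiftContext = ErrorContext(
    isHLSPath: true, secondsSinceDiscontinuityWrap: 2,
    variantSwitchInFlight: true, secondsFromMediaSequenceHead: 20,
    targetDuration: 6)
let verdict = classify(code: -12642,
                       comment: "No matching mediaFile from playlist",
                       context: upshiftContext, rules: rules12642)
```

Rule order matters here: the high-confidence rule is listed first so that a session matching both the discontinuity and the oldest-edge predicates gets the more specific label.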

Proposed direction (going forwards)

Hybrid client + server classifier:

Concrete first steps

  1. Add the missing observed codes (-12174, -12888) and HLS-context labels to interpretCoreMediaErrorCode / interpretAVErrorCode in PlaybackDiagnostics.swift. Cheap, immediate dashboard win.
  2. Capture the context fingerprint at error-emission time: extend LocalHTTPProxy's existing per-segment hook (#157, "Track per-segment identity via LocalHTTPProxy (iOS)") to keep a small ring buffer of the last N (~10) responses (URL, status, framing, size, byterange) so an errorLog event can attach the immediately-preceding HTTP context.
  3. Define the rule schema (likely YAML/JSON: {code, comment_regex, predicates[], cause, confidence, refs[]}) and seed it with the known cases from this issue.
  4. Wire the client-side classifier into the metric payload (player_metrics_probable_cause) and the dashboard error lane.
  5. Implement the same rule engine server-side under #336 ("Analytics sidecar: ClickHouse + Grafana for cross-session event analysis & historical replay") so retrospective labelling is consistent with live labelling.
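
For step 3, a JSON encoding of the rule schema could be decoded on the client with `Codable`, and the same file could be read server-side. The sketch below assumes JSON over YAML and seeds two rules from the #149 findings. `RuleSpec`, `decodeRules` and the predicate names are hypothetical; predicates are plain strings that a rule engine would have to resolve.

```swift
import Foundation

/// One classifier rule, mirroring the fields proposed in step 3.
struct RuleSpec: Codable, Equatable {
    let code: Int
    let commentRegex: String?   // optional: absent means "any comment"
    let predicates: [String]    // hypothetical predicate names
    let cause: String
    let confidence: String      // "low" | "medium" | "high"
    let refs: [String]

    enum CodingKeys: String, CodingKey {
        case code, predicates, cause, confidence, refs
        case commentRegex = "comment_regex"
    }
}

/// Decodes a JSON array of rules; throws on malformed input.
func decodeRules(_ json: String) throws -> [RuleSpec] {
    try JSONDecoder().decode([RuleSpec].self, from: Data(json.utf8))
}

let seedJSON = """
[
  {
    "code": -12642,
    "comment_regex": "No matching mediaFile",
    "predicates": ["discontinuity_wrap_recent", "variant_switch_in_flight"],
    "cause": "MISSING_DISCONTINUITY_SEQUENCE",
    "confidence": "high",
    "refs": ["#149"]
  },
  {
    "code": -12642,
    "comment_regex": "No matching mediaFile",
    "predicates": ["near_media_sequence_head"],
    "cause": "WINDOW_FELL_OFF_OLDEST_EDGE",
    "confidence": "medium",
    "refs": ["#149"]
  }
]
"""

// An empty array here would mean the seed file failed to parse.
let seedRules = (try? decodeRules(seedJSON)) ?? []
```

Keeping `confidence` as a string in the schema leaves the score-vs-level question open; it can be tightened to an enum once that is decided.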

Open questions

  • Should probable_cause be a free-form string or an enum? Enum is cleaner for analytics; free-form is more honest about partial knowledge.
  • How much context history is enough? Last 10 HTTP responses is cheap; a full HAR replay is overkill for live labelling.
  • Do we want confidence as a score (0-1) or a level (low/medium/high)? Levels are easier to reason about; scores are easier to aggregate.
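
On the context-history question, the cheap option (the last ~10 HTTP responses) amounts to a fixed-capacity ring buffer. Here is a minimal sketch, assuming a hypothetical `ResponseSummary` with the fields named in the first steps. It is not thread-safe, so a real proxy hook would need to serialize access.

```swift
/// Hypothetical summary of one proxied HTTP response.
struct ResponseSummary {
    let url: String
    let status: Int
    let framing: String      // e.g. "content-length" or "chunked"
    let size: Int            // body size in bytes
    let byteRange: String?   // raw Range header value, if any
}

/// Fixed-capacity ring buffer that keeps only the most recent entries.
struct RecentResponses {
    private var storage: [ResponseSummary?]
    private var nextIndex = 0
    private(set) var count = 0

    init(capacity: Int = 10) {
        precondition(capacity > 0, "capacity must be positive")
        storage = Array(repeating: nil, count: capacity)
    }

    /// Stores a response, overwriting the oldest entry once full.
    mutating func record(_ response: ResponseSummary) {
        storage[nextIndex] = response
        nextIndex = (nextIndex + 1) % storage.count
        count = min(count + 1, storage.count)
    }

    /// Entries in arrival order, oldest first.
    func snapshot() -> [ResponseSummary] {
        let start = (nextIndex - count + storage.count) % storage.count
        return (0..<count).compactMap { storage[(start + $0) % storage.count] }
    }
}

var recent = RecentResponses(capacity: 3)
for index in 1...5 {
    recent.record(ResponseSummary(url: "seg\(index).ts", status: 200,
                                  framing: "content-length",
                                  size: index, byteRange: nil))
}
let attached = recent.snapshot()  // holds seg3, seg4, seg5 in that order
```

Memory cost is bounded by the capacity, which is the main argument for this over a full HAR capture during live labelling.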
