iOS error-code → likely-cause classifier (raw codes aren't enough to guess root cause) #341

## Why

When iOS playback fails, the numeric AVPlayer error code alone is rarely enough to identify the root cause, for three reasons. First, the same code is overloaded across domains: `-12642` means **kCMFormatDescriptionError_ValueNotAvailable** in CoreMedia *and* **"No matching mediaFile from playlist"** when AVPlayer raises it from the HLS path. Second, some codes are completely undocumented (`-12860`, `-12888`, `-12174`). Third, several of our worst incidents were *cascades*: one server-side defect produced a sequence of codes, and the first symptom carried the most diagnostic signal while the loudest log line was a downstream stale-playlist complaint.

We test against legal, well-formed streams, but spec-violating streams exist in the wild. We should get better at recognizing the *effect* of broken or malformed manifests on the player from error-code patterns alone.

## What we know — go-live (server) bugs that produced iOS error codes

| Issue | Server defect | iOS effect |
|---|---|---|
| #109 | Loop-boundary playlist explosion (sliding window not capped across `EXT-X-DISCONTINUITY` — ~620 B → ~4.8 KB) | `-12860` decode error → poll stops → `-12888` "playlist unchanged for 1.5× target duration" on audio → playback dies |
| #110 | Residual #109 + content-type/codec-string mismatch on master variants | `-12860` on 540p → ABR upshift to 2160p → repeated `-12642` "No matching mediaFile" → `-12888` cascade |
| #149 | Missing `EXT-X-DISCONTINUITY-SEQUENCE` (RFC 8216 §4.3.3.3 violation) + `MAX_LIVE_WINDOW_DURATION` too tight (36 s vs 20 s buffer) | `-12642` + AVPlayer "Cannot Open" on cross-discontinuity ABR upshift; oldest-edge fall-off |
| #151 | v9 tags on `EXT-X-VERSION:7` playlist; or `HOLD-BACK < 3 × TARGETDURATION`; or `EXT-X-START` inserted between `#EXTM3U` and `#EXT-X-VERSION` | `-12646` "playlist parse error" — entire playlist rejected |
| #161 | `HOLD-BACK` not preserved across stall recovery (client-fix in iOS app) | Recurring stalls — player snapped to oldest-segment edge with zero margin |
| #94/#95/#96 | go-live concurrency (3-lock DASH cache, fresh `LLHLSGenerator` per 200 ms tick, double-lock duplicate MPD generation) | Lock contention → manifest stall under load → indirectly `-12888`/stalls |
| #137 | (LocalHTTPProxy bug) chunked TE forced on `206 Partial Content` with `Content-Length` stripped | `URLAsset err=-12174` flood on byte-range fetches |
| #145 | Newer iOS sims still emit `-12174` after #137 fix; sim-only AVFoundation tightening | Cosmetic only — no real-device impact |
| #135 | Upstream `URLSessionDataTask` not cancelled when AVPlayer abandoned client | `ECANCELED` (POSIX 89) flood — wasted upstream traffic |

## Codes we've actually seen and what we currently know about them

| Code | Label / "errorComment" string | Cause we've identified |
|---|---|---|
| `-12174` | `URLAsset` os_log warning (no AVError mapping) | `206 Partial Content` returned with `Transfer-Encoding: chunked` and no `Content-Length` after a `Range:` request (#137); newer-sim variant: HTTP/1.1 `Connection: close` (#145) |
| `-12642` | "No matching mediaFile from playlist" (HLS context) — overloaded with `kCMFormatDescriptionError_ValueNotAvailable` | Missing `EXT-X-DISCONTINUITY-SEQUENCE`, live window too narrow vs forward buffer, or wrong content served (#149, #110) |
| `-12646` | "playlist parse error" | v7-vs-v9 mismatch, `HOLD-BACK` underflow, or bad tag insertion order (#151) |
| `-12860` | CoreMedia decode error (no public symbol) | Oversized/malformed playlist (#109); also genuine bad codec config in master |
| `-12888` | "playlist unchanged for 1.5× target duration" | **Almost always downstream** — player stopped polling after an earlier `-12860`/`-12646` and the audio playlist now appears stale; or genuine go-live worker stall |

`PlaybackDiagnostics.swift:1150-1262` has the full reverse-engineered switch covering `-11800…-11865` (`AVError`) and `-12640…-12900` (CoreMedia families). Note the labels there are *header-derived*, not behavioral — `-12642`'s real meaning in our incidents has nothing to do with "ValueNotAvailable".
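
The "almost always downstream" note on `-12888` in the table above suggests reporting the earliest upstream code in a cascade rather than the last one logged. A minimal sketch of that idea follows; `ObservedError`, `cascadeRoot`, and the 10-second window are hypothetical names and values chosen for illustration, not anything in `PlaybackDiagnostics.swift`.

```swift
import Foundation

/// One observed error, in the order it was logged. Hypothetical type for illustration.
struct ObservedError {
    let code: Int
    let time: TimeInterval   // seconds since session start
}

/// Codes that, per the table above, tend to precede a downstream `-12888`.
let upstreamOfStalePlaylist: Set<Int> = [-12860, -12646]

/// Returns the error that most likely started the cascade.
/// `window` (seconds) is an assumed tuning value, not a measured one.
func cascadeRoot(of errors: [ObservedError], window: TimeInterval = 10) -> ObservedError? {
    guard let last = errors.last else { return nil }
    // Only -12888 is treated as a possible downstream symptom in this sketch.
    guard last.code == -12888 else { return last }
    // Earliest upstream code logged within `window` seconds before the stale-playlist error.
    let root = errors.first(where: { candidate in
        upstreamOfStalePlaylist.contains(candidate.code)
            && candidate.time <= last.time
            && last.time - candidate.time <= window
    })
    // No upstream code found: the -12888 may be a genuine worker stall.
    return root ?? last
}
```

For the #110 sequence (`-12860`, then `-12642`, then `-12888`) this picks `-12860`; a lone `-12888` is returned unchanged, which matches the "genuine go-live worker stall" case.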

## The core problem

A bare numeric code is one signal. To classify cause we need three:

1. **Code**: `(playerItem.error as NSError?)?.code` / `playerItem.errorLog()?.events.last?.errorStatusCode`.
2. **Comment**: `errorComment` on the error-log event / `localizedDescription` on the error. This is what disambiguates the overloaded codes (`-12642` → "No matching mediaFile" vs CM-domain).
3. **Context fingerprint** captured at error time:
   - last playlist response: size, body hash, status, framing (`Content-Length` vs chunked TE), `Content-Type`
   - in-flight variant switch? distance (segments / wallclock) since last loop boundary
   - last `EXT-X-DISCONTINUITY-SEQUENCE` value seen
   - last byte-range request and response
   - server-side go-proxy state (already captured by SSE / 911 HAR (#308) / freeze HAR (#273))

With these three, a rule classifier `(code, comment_regex, predicates) → (probable_cause, confidence)` turns `-12642` into one of:
- `MISSING_DISCONTINUITY_SEQUENCE` (high) — comment matches "No matching mediaFile" + a discontinuity wrap occurred in the last few seconds + variant switch was in flight
- `WINDOW_FELL_OFF_OLDEST_EDGE` (medium) — comment matches "No matching mediaFile" + player position was within ~1× TARGETDURATION of `MEDIA-SEQUENCE` head
- `CM_FORMAT_VALUE_NOT_AVAILABLE` (low) — error came from a non-HLS path
- `UNKNOWN` (fallback)
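
The `-12642` rules above can be sketched as a first-match rule table. This is an illustration under stated assumptions, not the planned implementation: `ContextFingerprint` and its fields, `Rule`, `classify`, and the 5-second reading of "last few seconds" are all hypothetical.

```swift
import Foundation

enum Confidence: String { case low, medium, high }

/// Hypothetical context fingerprint; field names are illustrative only.
struct ContextFingerprint {
    var fromHLSPath: Bool
    var variantSwitchInFlight: Bool
    var secondsSinceDiscontinuityWrap: Double?
    var targetDurationsFromMediaSequenceHead: Double?
}

struct Rule {
    let code: Int
    let commentPattern: String?                     // regex; nil matches any comment
    let predicates: [(ContextFingerprint) -> Bool]  // all must hold
    let cause: String
    let confidence: Confidence
}

struct Classification: Equatable {
    let cause: String
    let confidence: Confidence?
}

/// Returns the first rule whose code, comment pattern, and predicates all match.
func classify(code: Int, comment: String?, context: ContextFingerprint, rules: [Rule]) -> Classification {
    for rule in rules where rule.code == code {
        if let pattern = rule.commentPattern {
            guard let comment = comment,
                  comment.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
            else { continue }
        }
        if rule.predicates.allSatisfy({ $0(context) }) {
            return Classification(cause: rule.cause, confidence: rule.confidence)
        }
    }
    return Classification(cause: "UNKNOWN", confidence: nil)
}

let noMatchingMediaFile = "No matching mediaFile"

// Ordered from most to least specific; the thresholds are assumptions.
let rules12642: [Rule] = [
    Rule(code: -12642, commentPattern: noMatchingMediaFile,
         predicates: [{ $0.variantSwitchInFlight },
                      { ($0.secondsSinceDiscontinuityWrap ?? .infinity) < 5 }],
         cause: "MISSING_DISCONTINUITY_SEQUENCE", confidence: .high),
    Rule(code: -12642, commentPattern: noMatchingMediaFile,
         predicates: [{ ($0.targetDurationsFromMediaSequenceHead ?? .infinity) <= 1 }],
         cause: "WINDOW_FELL_OFF_OLDEST_EDGE", confidence: .medium),
    Rule(code: -12642, commentPattern: nil,
         predicates: [{ !$0.fromHLSPath }],
         cause: "CM_FORMAT_VALUE_NOT_AVAILABLE", confidence: .low),
]
```

Rule order matters in this sketch: the first matching rule wins, so higher-confidence rules should come first.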

## Proposed direction (going forwards)

**Hybrid client + server classifier:**

- **Client-side (iOS app)** — small, high-confidence rule set in `PlaybackDiagnostics` for the ~5 codes we've actually triggered. Stamps `probable_cause` + `confidence` into the metric + on-device 911 HAR (#308) / freeze HAR (#273). Slow to update (ships with the app) but lets the dashboard label incidents in real time.
- **Server-side (analytics sidecar #336)** — richer ruleset that runs over the HAR + SSE log retrospectively. Can use cross-session context (e.g. "all sessions on this content in the last 5 minutes hit `-12860` → server-side regression"), update freely without app release, and backfill labels onto historical incidents.
- **Disambiguation cheat sheet** — checked-in doc (e.g. `apple/InfiniteStreamPlayer/IOS_ERROR_CODES.md` or PRD section) capturing the (code, comment, context) → cause mappings as we learn them, alongside the issue refs that proved each cause. Both classifiers source rules from the same catalogue.

**Concrete first steps**
1. Add the missing observed codes (`-12174`, `-12888`) and HLS-context labels to `interpretCoreMediaErrorCode` / `interpretAVErrorCode` in `PlaybackDiagnostics.swift`. Cheap, immediate dashboard win.
2. Capture the **context fingerprint** at error-emission time: extend `LocalHTTPProxy`'s existing per-segment hook (#157) to keep a small ring buffer of the last N (~10) responses (URL, status, framing, size, byterange) so an `errorLog` event can attach the immediately-preceding HTTP context.
3. Define the rule schema (likely YAML/JSON: `{code, comment_regex, predicates[], cause, confidence, refs[]}`) and seed it with the known cases from this issue.
4. Wire the client-side classifier into the metric payload (`player_metrics_probable_cause`) and the dashboard error lane.
5. Implement the same rule-engine server-side under #336 so retrospective labelling is consistent with live labelling.
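
The ring buffer in step 2 could look roughly like the sketch below. `ResponseRecord` and `ResponseRing` are hypothetical names, the fields mirror the ones listed in step 2, and how this would attach to the real `LocalHTTPProxy` hook is left open.

```swift
import Foundation

/// Summary of one proxied HTTP response. Field names are hypothetical.
struct ResponseRecord: Equatable {
    let url: String
    let status: Int
    let chunked: Bool        // true = Transfer-Encoding: chunked, false = Content-Length framing
    let size: Int
    let byteRange: String?   // raw Range header value, if any
}

/// Fixed-capacity ring buffer that keeps the most recent `capacity` records.
/// Not thread-safe; a real proxy would guard it with a lock or a serial queue.
struct ResponseRing {
    private var storage: [ResponseRecord?]
    private var next = 0     // slot the next append overwrites
    private(set) var count = 0

    init(capacity: Int = 10) {
        precondition(capacity > 0, "capacity must be positive")
        storage = Array(repeating: nil, count: capacity)
    }

    mutating func append(_ record: ResponseRecord) {
        storage[next] = record
        next = (next + 1) % storage.count
        count = min(count + 1, storage.count)
    }

    /// Records in arrival order, oldest first.
    var snapshot: [ResponseRecord] {
        let start = (next - count + storage.count) % storage.count
        return (0..<count).compactMap { storage[(start + $0) % storage.count] }
    }
}
```

At error-emission time the classifier would read `snapshot` and attach it to the `errorLog` event, so memory stays bounded at N records regardless of session length.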

**Open questions**
- Should `probable_cause` be a free-form string or an enum? Enum is cleaner for analytics; free-form is more honest about partial knowledge.
- How much context history is enough? Last 10 HTTP responses is cheap; a full HAR replay is overkill for live labelling.
- Do we want confidence as a score (0-1) or a level (low/medium/high)? Levels are easier to reason about; scores are easier to aggregate.
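
One possible middle ground for the first and third questions, sketched against the rule schema from step 3: keep `cause` as a string on the wire, map known labels to enum cases with a free-form fallback, and make confidence levels ordered. All type names, the predicate names, and the second rule's cause label are hypothetical.

```swift
import Foundation

/// Confidence as ordered levels. Manual Comparable, since raw-value enums get no synthesis.
enum Confidence: String, Decodable, Comparable {
    case low, medium, high

    private var rank: Int {
        switch self {
        case .low: return 0
        case .medium: return 1
        case .high: return 2
        }
    }

    static func < (lhs: Confidence, rhs: Confidence) -> Bool { lhs.rank < rhs.rank }
}

/// Enum for analytics, with a free-form escape hatch for causes not yet catalogued.
enum ProbableCause: Equatable {
    case missingDiscontinuitySequence
    case windowFellOffOldestEdge
    case cmFormatValueNotAvailable
    case other(String)

    init(label: String) {
        switch label {
        case "MISSING_DISCONTINUITY_SEQUENCE": self = .missingDiscontinuitySequence
        case "WINDOW_FELL_OFF_OLDEST_EDGE": self = .windowFellOffOldestEdge
        case "CM_FORMAT_VALUE_NOT_AVAILABLE": self = .cmFormatValueNotAvailable
        default: self = .other(label)
        }
    }
}

/// One rule as it might appear in the shared JSON catalogue.
struct RuleSpec: Decodable {
    let code: Int
    let commentRegex: String?
    let predicates: [String]   // names resolved to closures by each classifier
    let cause: String
    let confidence: Confidence
    let refs: [String]

    enum CodingKeys: String, CodingKey {
        case code, predicates, cause, confidence, refs
        case commentRegex = "comment_regex"
    }

    var probableCause: ProbableCause { ProbableCause(label: cause) }
}

// Illustrative catalogue; predicate names and the second cause label are made up.
let catalogue = """
[
  {"code": -12642, "comment_regex": "No matching mediaFile",
   "predicates": ["variant_switch_in_flight", "recent_discontinuity_wrap"],
   "cause": "MISSING_DISCONTINUITY_SEQUENCE", "confidence": "high", "refs": ["#149"]},
  {"code": -12646, "comment_regex": "playlist parse error",
   "predicates": [], "cause": "PLAYLIST_VERSION_MISMATCH", "confidence": "medium", "refs": ["#151"]}
]
"""

let rules = try! JSONDecoder().decode([RuleSpec].self, from: Data(catalogue.utf8))
```

With this shape an older app build that meets a newer cause label degrades to `.other(label)` instead of failing to decode, which matters because the client ruleset ships with the app.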

## References

- Server bugs: #109 #110 #149 #151 #161 #94 #95 #96
- Client/proxy: #137 #145 #135 #157
- Adjacent platform work: #308 (911 HAR), #281 (HAR incident context), #273 (auto-HAR on freeze), #336 (analytics sidecar), #272 (player error-characterization framework)
- Existing reverse-engineered map: `apple/InfiniteStreamPlayer/InfiniteStreamPlayer/PlaybackDiagnostics.swift:1150-1262`
- Server comments cross-referencing iOS codes: `go-live/pkg/generator/range_hls.go:183-200`, `go-live/pkg/generator/ll_hls.go:20,129`, `go-live/internal/api/handlers.go:2010`, `go-proxy/cmd/server/main.go:4765,4802`


