Improve model load error handling with structured diagnostics, retry logic, and self-hosting support #96

@Sumanth-806307

Problem

Models fail to load across multiple surfaces (chat.webllm.ai, JSFiddle examples, Chrome extensions) as reported in #85. Current error handling lacks:

  • Structured error classification
  • Automatic retry mechanisms
  • Cache recovery logic
  • User-actionable error messages
  • Self-hosting capabilities

Root Causes Identified

  1. Insufficient Error Diagnostics - Generic error messages without classification codes
  2. No Retry Logic - Transient network/CDN failures cause hard stops
  3. Cache Corruption - No automatic cache clearing and retry
  4. No Self-Hosting Support - Users locked into default CDN with no override option

Proposed Solution

Phase 1: Enhanced Error Diagnostics (High Priority)

  • Add ModelLoadErrorCode enum (manifest_fetch_failed, artifact_fetch_failed, worker_init_failed, webgpu_init_failed, cache_invalid)
  • Implement error classification in webllm.ts
  • Add structured error display with actionable guidance
  • Include "Copy Diagnostics" feature for bug reports

Files: app/client/api.ts, app/client/webllm.ts, app/store/chat.ts
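
A minimal sketch of how the error codes and classification step might look. The five code values come from the list above; the `LoadStage` tag, the `ModelLoadError` shape, the `classifyLoadError` helper, the retryable flags, and the guidance strings are all assumptions made for illustration, not the contents of `webllm.ts`.

```typescript
// Code values are the ones proposed in Phase 1.
enum ModelLoadErrorCode {
  ManifestFetchFailed = "manifest_fetch_failed",
  ArtifactFetchFailed = "artifact_fetch_failed",
  WorkerInitFailed = "worker_init_failed",
  WebGPUInitFailed = "webgpu_init_failed",
  CacheInvalid = "cache_invalid",
}

// Hypothetical: the caller tags each failure with the load stage it came from.
type LoadStage = "manifest" | "artifact" | "worker" | "webgpu" | "cache";

// Hypothetical structured error shape for display and bug reports.
interface ModelLoadError {
  code: ModelLoadErrorCode;
  retryable: boolean; // assumption: which codes are worth retrying
  guidance: string;   // user-actionable message (placeholder wording)
  detail: string;     // original error text, kept for diagnostics
}

const STAGE_TABLE: Record<LoadStage, Omit<ModelLoadError, "detail">> = {
  manifest: {
    code: ModelLoadErrorCode.ManifestFetchFailed,
    retryable: true,
    guidance: "Check your network connection and try again.",
  },
  artifact: {
    code: ModelLoadErrorCode.ArtifactFetchFailed,
    retryable: true,
    guidance: "Model files could not be downloaded. Try again or use a custom source.",
  },
  worker: {
    code: ModelLoadErrorCode.WorkerInitFailed,
    retryable: false,
    guidance: "The background worker failed to start. Reload the page.",
  },
  webgpu: {
    code: ModelLoadErrorCode.WebGPUInitFailed,
    retryable: false,
    guidance: "WebGPU is unavailable. Check browser and driver support.",
  },
  cache: {
    code: ModelLoadErrorCode.CacheInvalid,
    retryable: true, // assumption: retried only after the cache is cleared
    guidance: "Cached model data looks corrupted. It will be cleared and re-downloaded.",
  },
};

// Map a raw failure at a known stage to a structured, classified error.
function classifyLoadError(stage: LoadStage, err: unknown): ModelLoadError {
  const detail = err instanceof Error ? err.message : String(err);
  return { ...STAGE_TABLE[stage], detail };
}
```

Keeping the mapping in one table means every load failure resolves to exactly one defined code, which is what the first acceptance criterion asks for.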

Phase 2: Retry Logic & Self-Recovery

  • Automatic retry with exponential backoff (up to 3 retries, with delays of 1s → 2s → 4s)
  • Automatic cache clearing on cache_invalid errors
  • Progress indication during retries
  • Only retry on retryable error types

Files: app/client/webllm.ts
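
The retry behaviour described above could be sketched roughly as follows. This reads "max 3" as up to three retries after the initial attempt, since the 1s → 2s → 4s schedule lists three delays; that reading is an assumption. The names `loadWithRetry`, `backoffDelayMs`, and `RetryOptions` are hypothetical, and `sleep` is injectable only so the logic can be exercised without real waiting.

```typescript
const BASE_DELAY_MS = 1000;

// Delay before retry number retryIndex + 1: 1000, 2000, 4000, ...
function backoffDelayMs(retryIndex: number): number {
  return BASE_DELAY_MS * 2 ** retryIndex;
}

interface RetryOptions {
  maxRetries?: number;                      // default 3
  isRetryable?: (err: unknown) => boolean;  // default: everything is retryable
  // Called before each wait; a place for progress UI or cache clearing.
  onRetry?: (retryNumber: number, delayMs: number, err: unknown) => void | Promise<void>;
  sleep?: (ms: number) => Promise<void>;    // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `load`, retrying with exponential backoff on retryable failures.
async function loadWithRetry<T>(
  load: () => Promise<T>,
  opts: RetryOptions = {},
): Promise<T> {
  const maxRetries = opts.maxRetries ?? 3;
  const isRetryable = opts.isRetryable ?? (() => true);
  const sleep = opts.sleep ?? defaultSleep;

  for (let retry = 0; ; retry++) {
    try {
      return await load();
    } catch (err) {
      // Give up on non-retryable errors or once the retry budget is spent.
      if (retry >= maxRetries || !isRetryable(err)) throw err;
      const delayMs = backoffDelayMs(retry);
      if (opts.onRetry) await opts.onRetry(retry + 1, delayMs, err);
      await sleep(delayMs);
    }
  }
}
```

In this shape, the `onRetry` hook is where progress indication would be updated and where a `cache_invalid` error could trigger a cache clear before the next attempt, while `isRetryable` keeps non-transient failures from being retried.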

Phase 3: Custom Artifact Source Support

Files: app/store/config.ts, app/components/model-config.tsx, app/client/webllm.ts
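
One way the custom artifact source could work is a small resolver that prefers a user-supplied base URL and falls back to the default. This is only a sketch: `DEFAULT_ARTIFACT_BASE_URL` is a placeholder rather than the real CDN address, and `normalizeBaseUrl` and `resolveArtifactUrl` are hypothetical helper names.

```typescript
// Placeholder only; the project's actual default CDN URL is not shown here.
const DEFAULT_ARTIFACT_BASE_URL = "https://cdn.example.com/models/";

// Validate a user-entered base URL; returns null when it is unusable.
function normalizeBaseUrl(input: string): string | null {
  const trimmed = input.trim();
  if (trimmed === "") return null;
  let url: URL;
  try {
    url = new URL(trimmed);
  } catch (_err) {
    return null; // not an absolute URL
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return null;
  // A trailing slash keeps the last path segment when resolving relative paths.
  if (url.pathname.slice(-1) !== "/") url.pathname += "/";
  return url.toString();
}

// Resolve an artifact path against the custom base if valid, else the default.
function resolveArtifactUrl(relativePath: string, customBaseUrl?: string): string {
  const custom = customBaseUrl ? normalizeBaseUrl(customBaseUrl) : null;
  const base = custom ?? DEFAULT_ARTIFACT_BASE_URL;
  // Strip leading slashes so the path stays under the base directory.
  return new URL(relativePath.replace(/^\/+/, ""), base).toString();
}
```

A Settings form would more likely surface a validation message when `normalizeBaseUrl` returns `null` instead of silently falling back; the fallback here just keeps the sketch short.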

Phase 4: Documentation

  • Troubleshooting guide with error code explanations
  • Self-hosting setup instructions
  • Updated issue templates with diagnostic fields

Files: docs/TROUBLESHOOTING.md, docs/SELF_HOSTING.md, .github/ISSUE_TEMPLATE/bug_report.md

Acceptance Criteria

  • All model load errors map to defined error codes
  • Retryable errors trigger automatic retry (max 3)
  • Cache corruption triggers automatic clear + retry
  • Custom base URL configurable in Settings
  • Error messages include actionable guidance
  • "Copy Diagnostics" provides complete debug info
  • Documentation covers all error codes and self-hosting
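
For the "Copy Diagnostics" criterion, the payload could be assembled along these lines. Which fields count as complete debug info is an assumption; the `DiagnosticsInput` fields and both function names are hypothetical. `navigator.clipboard.writeText` is the standard browser Clipboard API and needs a secure context.

```typescript
// Hypothetical set of fields to include in a bug report.
interface DiagnosticsInput {
  errorCode: string;       // e.g. "artifact_fetch_failed"
  message: string;         // original error message
  modelId: string;         // model the user tried to load
  retries: number;         // retries performed before giving up
  userAgent: string;       // e.g. navigator.userAgent
  webgpuAvailable: boolean;
  timestamp: Date;
}

// Serialize diagnostics as pretty-printed JSON for pasting into an issue.
function formatDiagnostics(input: DiagnosticsInput): string {
  return JSON.stringify(
    { ...input, timestamp: input.timestamp.toISOString() },
    null,
    2,
  );
}

// Browser-only: copy the formatted diagnostics to the clipboard.
async function copyDiagnostics(input: DiagnosticsInput): Promise<void> {
  await navigator.clipboard.writeText(formatDiagnostics(input));
}
```

Keeping `formatDiagnostics` separate from the clipboard call makes the payload easy to unit test and reuse in the updated issue template.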

Implementation Details

Full implementation plan available in plan-85.md with:

  • Detailed code examples for each phase
  • Testing strategy (unit, integration, manual)
  • Rollout strategy with risk assessment
  • Success metrics and monitoring approach

Related Issues

Estimated Effort

Time: 3-4 weeks (1 developer)
Priority: High (affects user experience across all surfaces)
Risk: Low-Medium (Phase 1-2), Low (Phase 3-4)
