feat: Industrial-Grade Reliability, Cascading LLM Fallback, and Tiered Service Support #600

Open
amaynez wants to merge 6 commits into 666ghj:main from amaynez:feature/industrial-reliability

amaynez commented May 3, 2026

Overview

This PR hardens the MiroFish engine for production use. It adds a multi-layered resilience system for both LLM generation and calls to the external Zep service, and lays the foundation for tiered service support.

Key Changes

1. Robust LLM Client with Cascading Fallback

  • Truncation Detection: Automatically detects if an LLM response was cut off (e.g., due to token limits).
  • JSON Repair: Repairs malformed or truncated JSON responses where possible.
  • Boost Fallback: If the primary LLM fails or returns broken JSON, the system automatically falls back to a high-capacity "Boost" model (configured via LLM_BOOST_* env vars).
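
The cascade described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the names `generate_json`, `close_open_json`, and `is_truncated` are hypothetical, each model is assumed to be a callable returning `(text, finish_reason)`, and a `finish_reason` of `"length"` is assumed to signal a token-limit cutoff, as in OpenAI-compatible APIs.

```python
import json
from typing import Any, Callable, Optional, Tuple

# Hypothetical shape of a model call: prompt -> (text, finish_reason).
ModelCall = Callable[[str], Tuple[str, str]]


def is_truncated(finish_reason: str) -> bool:
    """Assumption: 'length' means the output hit the token limit."""
    return finish_reason == "length"


def close_open_json(text: str) -> Optional[Any]:
    """Close any unterminated string and open brackets, then try to parse."""
    closers = []          # closing characters still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = text + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # not repairable by this simple strategy


def generate_json(prompt: str, primary: ModelCall, boost: ModelCall) -> Any:
    """Try the primary model, then the boost model; repair truncated output."""
    for call in (primary, boost):
        try:
            text, finish_reason = call(prompt)
        except Exception:
            continue  # any call failure moves on to the next model
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            if is_truncated(finish_reason):
                repaired = close_open_json(text)
                if repaired is not None:
                    return repaired
    raise ValueError("no model returned usable JSON")
```

In this sketch a repair is attempted only when the response was flagged as truncated; any other malformed output goes straight to the next model. Whether the PR orders repair and fallback the same way is not stated in the description.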

2. Zep Resilience Layer

  • Smart Retries & Rate Limiting: Added a handling layer that retries failed calls and limits the request rate, to avoid HTTP 429 errors and handle quota limits gracefully.
  • Robust Paging: Re-implemented graph reading with robust paging to handle large-scale data without timeouts.
  • Localized Error Handling: Improved error messages to inform users specifically when Zep quotas are exceeded.
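
As a rough illustration of the retry and paging ideas above, here is a minimal sketch. It does not use the Zep SDK: `RateLimitError`, `with_retries`, `read_all`, and the `fetch_page(cursor, page_size)` signature are all hypothetical stand-ins, and cursor-based paging is an assumption about how the graph is read.

```python
import time
from typing import Any, Callable, Iterator, List, Optional, Tuple


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the client raises on HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # server hint in seconds, if any


def with_retries(
    fn: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, backing off exponentially on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after          # prefer the server's hint
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)


# Hypothetical page fetcher: (cursor, page_size) -> (items, next_cursor).
FetchPage = Callable[[Optional[Any], int], Tuple[List[Any], Optional[Any]]]


def read_all(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Any]:
    """Yield every item, one page at a time, retrying each page fetch."""
    cursor = None
    while True:
        items, cursor = with_retries(lambda: fetch_page(cursor, page_size))
        yield from items
        if cursor is None:  # no next page
            return
```

Fetching in bounded pages keeps each request small, so a single slow or rate-limited call costs one page rather than the whole graph read.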

3. Tiered Service Foundation

  • Configuration-Driven Polling: Introduced a /config endpoint that allows the backend to control frontend polling behavior.
  • Conditional Polling: The UI now dynamically enables/disables automatic graph updates based on the service tier (Free vs. Premium).
  • Response Caching: Implemented server-side caching for graph data to optimize performance and reduce API costs.
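
To make the tiering and caching ideas concrete, here is a framework-free sketch. Everything in it is an assumption for illustration: the payload fields, the `build_client_config` and `TTLCache` names, the choice that only the premium tier polls automatically, and the time-based cache expiry. The actual /config response and cache policy in this PR may differ.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


def build_client_config(tier: str) -> Dict[str, Any]:
    """Hypothetical payload a /config endpoint could return to the frontend."""
    premium = tier == "premium"
    return {
        "tier": tier,
        # Assumption: only the premium tier gets automatic graph updates.
        "auto_polling": premium,
        "poll_interval_ms": 5000 if premium else None,
    }


class TTLCache:
    """Tiny time-based cache: reuse a value until it is ttl_seconds old."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                 # still fresh: skip the upstream call
        value = loader()                  # expired or missing: reload
        self._entries[key] = (now, value)
        return value
```

A frontend following this scheme would read `auto_polling` once and start its update timer only when the flag is true, while the cache lets repeated graph requests within the TTL reuse one upstream read.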

Why this benefits MiroFish users

  • Fewer Failed Simulations: Simulations are significantly less likely to crash due to minor AI hiccups or transient network issues.
  • Clearer Feedback: Users are no longer met with a generic "500 Internal Server Error" when external service limits are hit; they get clear, actionable messages.
  • Scalability: The engine is now better equipped to handle large documents and complex simulations that previously caused timeouts or JSON parsing errors.

Commit Breakdown

  1. feat(utils): implement robust LLM client with cascading fallback and JSON repair
  2. feat(zep): add resilience layer with retries, rate limiting, and robust paging
  3. feat(graph): refactor core services for high-availability simulations
  4. feat(api): add tiered configuration and graph data caching
  5. feat(frontend): implement conditional polling and service-tier UI
  6. fix(i18n): add localized error messages for service quotas

dosubot (bot) added the labels size:XL (this PR changes 500-999 lines, ignoring generated files) and enhancement (new feature or request) on May 3, 2026