How the deterministic "logic-gate" system and the LLM divide the work. The short version: the LLM owns WHAT and WHY; deterministic code owns HOW and WHETHER-IT'S-SAFE.
- Pure logic-gates can't handle stages 6–8 of JOB.md (open-ended reject interpretation, novel combinations, drafting outreach).
- Pure LLM-in-control is unreliable and unsafe on stages 2–5 (navigating a PHI terminal, injecting keystrokes) and can't be audited.
So Shield uses three "brains":
- Reflexes — deterministic state machine ("logic gates"). Screen navigation, field entry, screen-state verification, safety interlocks. Fast, testable, identical every run. The LLM never emits a keystroke.
- Knowledge — deterministic reference data + lookups. NCPDP reject codes, formularies, BIN/PCN routing, override rules. Facts the LLM looks up via tools — never recalls from memory (LLMs hallucinate reject codes; a table doesn't).
- Judgment — the LLM. Interpreting an ambiguous screen, diagnosing why a claim rejected, choosing among resolution paths, planning the fix, drafting outreach, summarizing the case.
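As a minimal sketch of the Knowledge brain, the lookup tool below shows how the LLM could be handed exact facts instead of recalling them. The names `RejectCode` and `lookup_reject_code` are hypothetical, and the two table entries are illustrative only; a real deployment would presumably load the complete, versioned NCPDP table from its reference source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RejectCode:
    code: str
    meaning: str


# Illustrative entries only (assumption): the real table would be loaded
# from the versioned NCPDP reference data, not hard-coded.
_REJECT_CODES = {
    "75": RejectCode("75", "Prior Authorization Required"),
    "79": RejectCode("79", "Refill Too Soon"),
}


def lookup_reject_code(code: str) -> Optional[RejectCode]:
    """Tool the LLM calls instead of recalling a code from memory.

    Returns None for an unknown code so the caller can escalate
    rather than accept a guessed meaning.
    """
    return _REJECT_CODES.get(code.strip())
```

Returning `None` for an unknown code, rather than a best guess, keeps the "a table doesn't hallucinate" property: a missing entry becomes a visible gap to route to a human, not a plausible-sounding answer.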
| Stage | Driver | Why |
|---|---|---|
| Capture/OCR → structured state | Deterministic (Vision + classifier); LLM only as fallback on ambiguous screens | Reliable, cheap, on-device |
| Navigation & input injection | Deterministic only | Safety, auditability, no hallucinated clicks |
| Reject-code & formulary lookup | Deterministic data + tools | Facts must be exact |
| "Why did this reject / what now?" | LLM | Synthesis across context |
| Choosing a resolution path | LLM proposes → rules validate → human confirms | Judgment + guardrail |
| Drafting outreach | LLM | Language generation |
| Documentation / case summary | LLM drafts, deterministic logs the facts | Narrative vs. record |
This is how the two halves combine without letting the LLM near the controls:
- LLM plans. Given the case + current ScreenState, the LLM (via tool use / structured output) emits a typed plan, not raw keystrokes: a sequence of intents like EnterField(member_id, "X"), RunTestClaim, ReadRejectCodes.
- Deterministic engine executes each intent through the state machine, with pre/post-condition interlocks (ARCHITECTURE.md §5–6). It refuses any intent whose precondition doesn't match the live screen.
- Verify & re-plan. After each step it re-captures the screen; if reality diverges from the plan, it hands the new state back to the LLM to re-plan. This is what makes it robust to a UI that misbehaves.
The LLM is an advisor beside the loop, not on the hot path. It's consulted at decision points; it never has its finger on the keyboard.
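The plan / execute / verify loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine in ARCHITECTURE.md: the `ScreenState` fields, the `REQUIRED_SCREEN` precondition table, the `"claim_entry"` screen name, and the `capture` / `act` callables are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class ScreenState:
    screen: str               # hypothetical screen identifier
    fields: Dict[str, str]    # field name -> value currently shown


@dataclass(frozen=True)
class EnterField:
    name: str
    value: str


@dataclass(frozen=True)
class RunTestClaim:
    pass


Intent = Union[EnterField, RunTestClaim]

# Assumed precondition table: the screen each intent type requires.
REQUIRED_SCREEN = {EnterField: "claim_entry", RunTestClaim: "claim_entry"}


def postcondition_holds(intent: Intent, after: ScreenState) -> bool:
    """Check that the live screen reflects the intent's effect."""
    if isinstance(intent, EnterField):
        return after.fields.get(intent.name) == intent.value
    return True  # no field-level check is modeled for other intents here


def execute_plan(
    plan: List[Intent],
    capture: Callable[[], ScreenState],
    act: Callable[[Intent], None],
) -> Tuple[List[Intent], Optional[ScreenState]]:
    """Run intents in order and stop at the first divergence.

    Returns (completed intents, divergent state). A non-None state is
    what would be handed back to the LLM for re-planning.
    """
    done: List[Intent] = []
    for intent in plan:
        before = capture()
        if before.screen != REQUIRED_SCREEN[type(intent)]:
            return done, before      # refuse: precondition mismatch
        act(intent)                  # only the engine touches the keyboard
        after = capture()
        if not postcondition_holds(intent, after):
            return done, after       # reality diverged from the plan
        done.append(intent)
    return done, None
```

The LLM only ever produces the `plan` list; `act` is owned by the deterministic engine, so a hallucinated intent can at worst be refused, never typed.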
A council (multiple models proposing/critiquing/voting) is valuable only where there's no deterministic ground truth — i.e. the judgment steps (6–7). It is not the right tool where a lookup table already has the answer (reject-code meaning, formulary status) — there you call the table, not a committee.
Recommended use: cheap model proposes a resolution path → smart model (or a small
council on genuinely hard/novel rejects) critiques → deterministic rules
validate the chosen path → human confirms. Reserve multi-model voting for
low-confidence or high-stakes cases; running a council on every claim is slow and
expensive for little gain. Cost/model mechanics live in MODEL-STRATEGY.md.
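One way to picture that escalation order is the sketch below. It assumes the proposer reports a confidence score and that the callables (`propose`, `critique`, `council`, `rules_allow`) wrap whatever models and rule engine are actually used; the names and the `0.8` floor are illustrative assumptions, not values from MODEL-STRATEGY.md.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    path: str          # proposed resolution path (hypothetical label)
    confidence: float  # proposer's confidence in [0.0, 1.0]


Reviewer = Callable[[str, Proposal], Proposal]


def choose_path(
    case: str,
    propose: Callable[[str], Proposal],       # cheap model
    critique: Reviewer,                       # single smart model
    council: Reviewer,                        # multi-model vote
    rules_allow: Callable[[str, str], bool],  # deterministic validation
    confidence_floor: float = 0.8,            # assumed value, tune per deployment
    high_stakes: bool = False,
) -> Optional[Proposal]:
    """Return a validated proposal for human confirmation, or None."""
    draft = propose(case)
    # Reserve the council for low-confidence or high-stakes cases.
    needs_council = high_stakes or draft.confidence < confidence_floor
    reviewed = (council if needs_council else critique)(case, draft)
    # Deterministic rules have the last word before a human sees it.
    if not rules_allow(case, reviewed.path):
        return None  # caller routes the case to the review queue
    return reviewed
```

A returned `Proposal` is still only a candidate: in this sketch the human confirmation step happens in the caller, after the rules have passed.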
Target: the system closes everything closable end-to-end and frees ~5 of 8 hours; the remaining ~3 hours is the human's active work — answering its prompts (budget: ≤1 prompt/hour), reviewing, and handling what it can't close. That goal is reconciled with safety by two asymmetries, not by confirming every keystroke.
1. Read/research/draft is most of the work and is low-risk; the dangerous writes are a tiny surface. Eligibility checks, benefits investigation, reject interpretation, COB discovery, PA drafting, and documentation are read-only or draft-only — run them fully autonomous, near-zero blast radius. The actions that can actually harm: submit a real claim, submit a PA, change COB order. Gate those hard (post-condition verify; prefer test-claim-first — a test adjudication is inherently safe, real submission is the one defended line).
2. Autonomy is earned per-workflow-shape, not granted globally. A procedure goes autonomous only after it passes an eval bar (N clean runs on its eval cases). Below the bar → assisted or routed to the human-review queue. The ≤1-prompt/hour budget is enforced through a tunable confidence threshold: below the threshold → queue or ask; above it → act.
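The second asymmetry can be sketched as a small gate. The names (`WorkflowRecord`, `mode_for`, `decide`) are hypothetical, and treating "N clean runs" as consecutive runs that reset on a failure is an assumption; the source only says a procedure must pass an eval bar.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    ASSISTED = "assisted"


@dataclass
class WorkflowRecord:
    clean_runs: int = 0  # consecutive clean eval runs (assumed semantics)

    def record(self, clean: bool) -> None:
        # Assumption: any failed run resets the streak to zero.
        self.clean_runs = self.clean_runs + 1 if clean else 0


def mode_for(record: WorkflowRecord, required_clean_runs: int) -> Mode:
    """A workflow shape is autonomous only once it clears the eval bar."""
    if record.clean_runs >= required_clean_runs:
        return Mode.AUTONOMOUS
    return Mode.ASSISTED


def decide(mode: Mode, confidence: float, threshold: float) -> str:
    """Act only when the workflow is autonomous AND confidence clears the threshold."""
    if mode is Mode.AUTONOMOUS and confidence >= threshold:
        return "act"
    return "queue_or_ask"
```

Raising `threshold` trades autonomy for more prompts and queue items; lowering it does the reverse, which is the knob the prompt budget would tune.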
Outreach is act-then-notify. When a case needs outreach, the system may draft and send the note through the software unattended, then push a notification for after-the-fact review — it does not block waiting for approval. Everything it can't close routes to the review queue and waits.
This supersedes the earlier "human confirms everything" default. Approvals,
oversight, and the review queue surface through the companion apps (macOS /
watchOS / iOS) — see ARCHITECTURE.md §3a. The learning loop that lets the
autonomous slice grow over time is in TRAINABILITY.md.