Reverse chronological. Most recent entry first.
Ran a full quality validation sweep across 4 backends after prompt and post-processing changes made earlier in the session. The changes were: 3 code examples + suffix delimiter rule added to SYSTEM_PROMPT, mode-gated minOverlap in trimSuffixOverlap (code=1 char, prose=10 chars), and 3 suffix echo regression scenarios.
New test infrastructure:
- Created `src/test/quality/rapid-test.ts`: standalone script for fast iteration during prompt engineering (~30-60s per run, 7 cherry-picked scenarios, objective pass/fail criteria)
- Added 7 new code scenarios to `scenarios.ts`: prefix-only (TS, Python), Java mid-file, YAML config, partial word, deep nesting, empty function body
Results — code completion pass rates:
| Backend | Before (pre-changes) | After | Delta |
|---|---|---|---|
| CLI haiku | 100% (24/24) | 97% (33/34) | -3% (1 null on new scenario) |
| xAI Grok | 86.9% (20/23) | 100% (26/26) | +13.1% |
| GPT-4.1 Nano | 56.5% (13/23) | 93.9% (31/33) | +37.4% |
| Anthropic Haiku | (new baseline) | 87.9% (29/33) | — |
What fixed the failures:
The 3 code examples in SYSTEM_PROMPT (template literal, list comprehension, shell echo) taught instruction-extraction models to respect suffix delimiters. The mode-gated minOverlap (code=1 vs prose=10) in post-processing catches any remaining single-character suffix echoes (closing ], }, `, etc.) without false positives on prose.
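A minimal sketch of the mode-gated trim, for reference. The name `trimSuffixOverlap` and the thresholds (code=1, prose=10) come from this entry; the signature and the longest-match loop are assumptions and may not match the real post-processing code.

```typescript
type Mode = "code" | "prose";

// Minimum overlap (in chars) before a suffix echo is trimmed, per mode.
const MIN_OVERLAP: Record<Mode, number> = { code: 1, prose: 10 };

// Removes the longest tail of `completion` that duplicates the head of `suffix`,
// provided the duplicate is at least MIN_OVERLAP[mode] characters long.
function trimSuffixOverlap(completion: string, suffix: string, mode: Mode): string {
  const max = Math.min(completion.length, suffix.length);
  for (let len = max; len >= MIN_OVERLAP[mode]; len--) {
    if (completion.endsWith(suffix.slice(0, len))) {
      return completion.slice(0, completion.length - len);
    }
  }
  return completion;
}
```

In code mode a single echoed `]` is stripped; in prose mode a one-character match such as a trailing period is left alone because it is far more likely to be coincidence.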
No iterative prompt engineering was needed — the pre-existing changes already exceeded all stop criteria (Grok ≥90%, Nano ≥70%).
Remaining failures:
- CLI haiku: 1 null on `code-full-py-pipeline-dispatch` (new large-context scenario)
- GPT-4.1 Nano: 2 failures: `code-mid-file-py-module-full` (prose in docstring, borderline), `regression-code-suffix-echo-shell-quote` (suffix echo in shell quoting)
- Anthropic Haiku: 4 nulls from "thinking leak" pattern (known prefill extraction issue where model immediately closes `</COMPLETION>` tag)
Documentation: Updated reference model set to 9 models (added CLI sonnet and Ollama), added Tested Models section to README.
OpenRouter presets: Replaced the two auto-routing OpenRouter presets with specific-model presets: openrouter-haiku (anthropic/claude-haiku-4.5, prefill strategy) and openrouter-gpt-4.1-nano (openai/gpt-4.1-nano, instruction strategy). Auto-routing was unpredictable — the prompt strategy must match the underlying model.
Pricing removal: Removed hardcoded pricing tables from all presets and deleted calculateCost(). Token tracking (input, output, cache-read, cache-creation) is preserved in the usage ledger. For actual costs, users should check their provider's dashboard. The costUsd field is retained on LedgerEntry because the Claude Code SDK writes real total_cost_usd values. OpenRouter also returns usage.cost per response (not currently captured).
extraBody/extraHeaders passthrough: Users can now pass arbitrary provider-specific parameters via bespokeAI.api.customPresets. The fields are spread into the adapter's request body (...this.preset.extraBody) and merged into default headers (Object.assign(defaultHeaders, this.preset.extraHeaders)). Use cases: OpenRouter transforms, provider routing, custom headers for any provider.
OpenRouter reasoning disabled: Built-in OpenRouter presets include extraBody: { reasoning: { enabled: false } } to prevent reasoning tokens from activating on autocomplete requests.
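A rough sketch of how the passthrough behaves. Only `extraBody`, `extraHeaders`, the spread, and the `Object.assign` merge are taken from the notes above; `ApiPreset`, `OutgoingRequest`, and `buildRequest` are hypothetical names, and the real adapters build SDK calls rather than plain objects.

```typescript
// Hypothetical preset shape; only extraBody/extraHeaders are named in this log.
interface ApiPreset {
  model: string;
  extraBody?: Record<string, unknown>;
  extraHeaders?: Record<string, string>;
}

interface OutgoingRequest {
  headers: Record<string, string>;
  body: Record<string, unknown>;
}

function buildRequest(preset: ApiPreset, prompt: string): OutgoingRequest {
  const defaultHeaders: Record<string, string> = { "Content-Type": "application/json" };
  // Preset headers are merged last, so they can override the defaults.
  Object.assign(defaultHeaders, preset.extraHeaders);
  return {
    headers: defaultHeaders,
    // Spreading undefined is a no-op, so presets without extraBody are unaffected.
    body: { model: preset.model, prompt, ...preset.extraBody },
  };
}

// Mirrors the built-in OpenRouter preset described above.
const openrouterHaiku: ApiPreset = {
  model: "anthropic/claude-haiku-4.5",
  extraBody: { reasoning: { enabled: false } },
};
```

Because `extraBody` is spread after the base fields, a preset could also override `model` or `prompt` by accident. Whether the real adapter guards against that is not recorded here.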
Known issue — intermittent chain-of-thought in OpenRouter completions: On one test run, the OpenRouter Haiku preset produced a completion where the model output its reasoning as regular text content (not reasoning tokens) before generating the actual completion. The model quoted back system prompt instructions ("NEVER output empty COMPLETION tags...") while deliberating. This is NOT the OpenRouter reasoning tokens feature — reasoning_tokens: 0 was confirmed in the raw API response. It's a model behavior where Claude "thinks out loud" in the content field, particularly on minimal-context prompts. Only observed once on an edge-case scenario (very short prefix, no suffix). The direct Anthropic Haiku adapter did not exhibit this on the same prompt, but the behavior is likely non-deterministic. If it recurs, the fix should be in the extraction logic (discard text between </COMPLETION> and a subsequent <COMPLETION> tag).
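If the fix does land in extraction, it could look something like the sketch below. This is not implemented; `extractCompletion` is a hypothetical helper, and it assumes the raw text may or may not start with an early `</COMPLETION>` followed by deliberation and a reopened tag.

```typescript
const OPEN_TAG = "<COMPLETION>";
const CLOSE_TAG = "</COMPLETION>";

// Keeps only the content of the last <COMPLETION> block. Anything between an
// early closing tag and a later opening tag (the "thinking out loud" text)
// is discarded.
function extractCompletion(raw: string): string {
  const lastOpen = raw.lastIndexOf(OPEN_TAG);
  const body = lastOpen >= 0 ? raw.slice(lastOpen + OPEN_TAG.length) : raw;
  const close = body.indexOf(CLOSE_TAG);
  return close >= 0 ? body.slice(0, close) : body;
}
```

With a prefilled opening tag, a normal response has no `<COMPLETION>` in the returned text, so the function falls through to "everything before the first closing tag".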
Expanded the quality test suite from 38 scenarios to 63 with a focus on realistic, full-window editing contexts. The existing scenarios had near-zero coverage of the most common editing situation: working mid-document with both prefix and suffix at or near the context window limits (2500/2000 chars).
New scenario categories (25 scenarios):
- Mid-document editing (8) — Prose editing inside full-length documents. Prefix 2600-3800 chars, suffix 2100-3700 chars. Covers READMEs, design docs, blog posts, tutorials, and mixed-structure documents.
- Dev journal (5) — Chronological dev logs and meeting notes. Tests the user's primary writing use case with date markers, multiple topics per entry, and varied entry lengths.
- Fill-in-the-middle bridging (6) — Cursor in a gap between existing text. The hardest completion task. Varying gap sizes from ~10 words to ~3 sentences.
- Realistic mid-file code (6) — Full code files (TypeScript, Python, Go, Rust) with imports, types, multiple functions. Much larger and more realistic than the existing 50-200 char code stubs.
Approach: Wrote ~11 anchor documents (7000-10000 chars each) and extracted scenarios by placing the cursor at different positions. This produces naturally coherent contexts.
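For illustration, the extraction step amounts to slicing an anchor document at a cursor offset. The helper names here are hypothetical; the 2500/2000 limits are the context window sizes mentioned above, and whether scenarios are stored clamped or unclamped is an assumption.

```typescript
interface Scenario {
  prefix: string;
  suffix: string;
}

// Context window limits quoted in this entry.
const PREFIX_WINDOW = 2500;
const SUFFIX_WINDOW = 2000;

// Splits an anchor document at the cursor offset.
function scenarioAt(doc: string, cursor: number): Scenario {
  return { prefix: doc.slice(0, cursor), suffix: doc.slice(cursor) };
}

// Applies the same truncation the extension would: keep the text nearest the cursor.
function clampToWindow(s: Scenario): Scenario {
  return {
    prefix: s.prefix.slice(-PREFIX_WINDOW),
    suffix: s.suffix.slice(0, SUFFIX_WINDOW),
  };
}
```

Moving the cursor through one anchor document yields several scenarios whose prefix and suffix stay coherent with each other, which is the point of the approach.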
Documentation: Added docs/autocomplete-approach.md consolidating autocomplete philosophy, prompt design, and testing strategy.
Added a direct-API autocomplete backend alongside the existing Claude Code CLI backend. Motivation: Claude Code's chat-based subprocess can't use assistant prefill (pre-seeding the model's response with the cursor anchor), so the model often fails to continue from the right place. Direct API calls fix this and unlock provider-specific features like prompt caching and streaming.
Architecture:
A BackendRouter implements CompletionProvider and delegates to either the existing PoolClient (Claude Code) or a new ApiCompletionProvider (direct API). The router slots into the existing CompletionProvider orchestrator — cache, debounce, mode detection are all unchanged.
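A simplified sketch of the delegation. `BackendRouter` and `CompletionProvider` are the names used above; the method name `getCompletion`, the request shape, and the `getBackend` callback are assumptions made for the example.

```typescript
interface CompletionRequest {
  prefix: string;
  suffix: string;
}

interface CompletionProvider {
  getCompletion(req: CompletionRequest): Promise<string | null>;
}

type Backend = "claude-code" | "api";

class BackendRouter implements CompletionProvider {
  private poolClient: CompletionProvider;
  private apiProvider: CompletionProvider;
  private getBackend: () => Backend;

  constructor(poolClient: CompletionProvider, apiProvider: CompletionProvider, getBackend: () => Backend) {
    this.poolClient = poolClient;
    this.apiProvider = apiProvider;
    this.getBackend = getBackend;
  }

  // Resolved per request, so a settings change applies to the next completion.
  active(): CompletionProvider {
    return this.getBackend() === "api" ? this.apiProvider : this.poolClient;
  }

  getCompletion(req: CompletionRequest): Promise<string | null> {
    return this.active().getCompletion(req);
  }
}
```

Since the router satisfies the same interface as the providers it wraps, the orchestrator does not need to know which backend is active.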
Three SDK adapters cover five providers:
- `AnthropicAdapter` (`@anthropic-ai/sdk`): Anthropic models, prompt caching, assistant prefill
- `OpenAICompatAdapter` (`openai` package): OpenAI, xAI/Grok, Ollama (local)
- `GeminiAdapter` (`@google/genai`): Google Gemini models, context caching
Presets define model ID, provider, generation params (temperature, max_tokens), feature flags (caching, prefill), and pricing for cost tracking. Built-in presets are bundled in src/providers/api/presets.ts. Switching presets via the status bar enables A/B testing between models or settings.
Key design decisions:
- Direct SDKs over Vercel AI SDK: Vercel's abstraction doesn't fully expose provider-specific features (Anthropic prompt caching, Gemini context caching). Three thin adapters give us full control with minimal code.
- Commands stay on Claude Code: Commit message, suggest-edits, and expand commands work fine with the Claude Code backend and benefit from longer context. Only inline completions get the API backend.
- Separate debounce: API mode uses 400ms (vs 8000ms for Claude Code) because API roundtrips are much faster than subprocess communication.
- Cost transparency: Every API call records tokens, cost, duration, and model to the UsageLedger. Test runs also track total cost and per-scenario latency.
Phase 1 (this commit): Anthropic adapter + end-to-end wiring. OpenAI-compat and Gemini adapters are stubbed out for Phase 2/3.
Refactored the Claude Code provider from disposable slots (kill + respawn after every completion) to reusable slots (one subprocess handles up to 24 completions before recycling).
Before: Each completion acquired a slot → pushed a message → awaited the result → recycled the slot (closed the channel, killed the subprocess, spawned a fresh one). The AbortSignal was raced against the result promise via raceAbort().
After: The consumeStream loop continues after delivering a result — it resets the result promise, marks the slot available, and waits for the next message. A slot recycles only after 24 completions (warmup=1, maxTurns=50, leaves headroom) or on stream error. The raceAbort helper is deleted entirely.
Key design decisions:
- Single-waiter queue replaces the 100ms polling loop. A `pendingWaiter` field holds one waiting request. When a new request arrives and both slots are busy, it cancels the previous waiter (`resolve(null)`) and registers itself. Slots notify the waiter immediately when they become available (via `notifyWaiter()`).
- Committed in-flight requests: once `acquireSlot()` returns a slot index, the request awaits `slot.resultPromise` unconditionally. The `AbortSignal` parameter is kept (interface requirement) but ignored. This eliminates the abort race that could leave a slot in an inconsistent state.
- Slot states simplified: removed `recycling` state, renamed `ready` → `available`. States: `initializing → available → busy → available → ... → recycleSlot → initializing`.
Why not abort? With reusable slots, aborting mid-request is unsafe: the subprocess is still processing the message and will produce a result that needs to be consumed before the slot can be reused. Ignoring the result would desync the stream. The cost of waiting for a committed result is low (sub-second) and avoids the complexity of stream resynchronization.
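A stripped-down model of the single-waiter queue and the reuse limit, to make the control flow concrete. The real code is promise-based and manages subprocesses; here the waiter is a plain callback so the sketch stays synchronous, `recycleCount` stands in for "kill and respawn", and the `initializing` state is omitted. `SlotPool` and `releaseSlot` are hypothetical names.

```typescript
type SlotState = "available" | "busy";
type Waiter = (slotIndex: number | null) => void;

class SlotPool {
  private states: SlotState[];
  private completions: number[];
  private pendingWaiter: Waiter | null = null;
  private maxReuse: number;
  recycleCount = 0; // stands in for "kill and respawn the subprocess"

  constructor(slotCount: number, maxReuse = 24) {
    this.states = new Array<SlotState>(slotCount).fill("available");
    this.completions = new Array<number>(slotCount).fill(0);
    this.maxReuse = maxReuse;
  }

  acquireSlot(waiter: Waiter): void {
    const free = this.states.indexOf("available");
    if (free >= 0) {
      this.states[free] = "busy";
      waiter(free);
      return;
    }
    // All slots busy: the newest request supersedes any older waiter.
    if (this.pendingWaiter) this.pendingWaiter(null);
    this.pendingWaiter = waiter;
  }

  // Called after a slot has delivered its result.
  releaseSlot(index: number): void {
    const done = (this.completions[index] ?? 0) + 1;
    const recycle = done >= this.maxReuse;
    this.completions[index] = recycle ? 0 : done;
    if (recycle) this.recycleCount += 1;
    this.states[index] = "available";
    this.notifyWaiter();
  }

  // Hands a freed slot to the pending waiter, if there is one.
  private notifyWaiter(): void {
    const waiter = this.pendingWaiter;
    if (!waiter) return;
    const free = this.states.indexOf("available");
    if (free < 0) return;
    this.pendingWaiter = null;
    this.states[free] = "busy";
    waiter(free);
  }
}
```

Only the most recent request ever waits, which matches autocomplete semantics: a superseded request's result would be stale by the time a slot frees up.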
Major prompt engineering session to improve Claude Code completion quality. Added new examples, renamed the fill marker, and restructured guidance.
Marker rename: >>>HOLE_TO_FILL<<< → >>>GAP_TO_FILL<<<
The word "hole" implied something that must be filled. "Gap" better conveys that the space might need bridging with substantial content, minimal content, or nothing at all depending on context.
Why not a self-closing XML tag like <fill/>?
Tested earlier — self-closing tags like <fill/> caused Claude Code to output matching closing patterns like </filled> or </fill> in completions. The model treated it as an XML structure to complete rather than a marker to replace. The >>>MARKER<<< format avoids this by being visually distinct from XML.
New examples added (8 total, up from 4):
| # | Pattern | Teaches |
|---|---|---|
| 1 | Bullet list (`- `) | Don't repeat marker |
| 2 | JSON object | Indentation + raw code |
| 3 | Function body | Indentation + code |
| 4 | Mid-word (`quic`) | Complete partial word |
| 5 | Prose bridging | Short phrase fill between prefix/suffix |
| 6 | After heading | Start prose, not structure |
| 7 | Numbered list (`3. `) | Don't repeat marker |
| 8 | Before structured content | Brief lead-in, don't duplicate/elaborate |
New prompt structure:
- Examples section with clear "What you receive" / "What you should output" format
- Engine mode transition: "The examples are complete. From now on, act as the gap-filling engine — no more examples, just raw output."
- Length guidance: "Use judgment to decide how much to output: from a single character (completing a partial word) to several sentences (when substantial content is needed). When in doubt, prefer brevity."
- Tightened rules with back-references to examples
Rules (updated):
- Always wrap in `<output>` tags
- No code fences, commentary, or meta-text
- Never repeat structural markers (see example 1)
- Don't duplicate/elaborate on suffix content (see example 8)
- Not a chat — tool pipeline
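On the consuming side, the `<output>` rule implies an extraction step roughly like this. `extractOutput` is a hypothetical name, and returning `null` for untagged responses is an assumption about how rule violations are handled.

```typescript
// Returns the text inside the first <output>...</output> pair, or null when the
// model ignored the wrapping rule. Whitespace inside the tags is preserved
// because leading indentation can be part of the completion.
function extractOutput(raw: string): string | null {
  const match = /<output>([\s\S]*?)<\/output>/.exec(raw);
  return match?.[1] ?? null;
}
```

The non-greedy match stops at the first closing tag, so trailing commentary after `</output>` is dropped rather than inserted into the document.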
Regressions fixed:
- List marker echo (`- - **content**`): fixed via Example 1 + explicit rule
- JSON markdown fencing (``` wrapping): fixed via "no code fences" rule
New regression captured:
- Partial date continuation: user types `0` to start `01-31-26`, model should continue with `1-31-26` but sometimes inserts full date. Added as regression test for tracking.
Replaced the Claude Code provider's anchor echo strategy with a ${TEXT_TO_FILL} placeholder approach. The old approach instructed the model to echo the current line as an anchor, then trimPrefixOverlap stripped the echo — fragile and indirect. The new approach wraps the document in <incomplete_text> tags with a ${TEXT_TO_FILL} placeholder at the cursor position. The model fills the hole directly.
What changed:
- `src/providers/claude-code.ts`: New system prompt with `${TEXT_TO_FILL}` example. Removed `extractAnchor()` function. Removed `PromptBuilder` dependency (reads `maxTokens`/`temperature` directly from mode config). Message assembly now wraps `prefix + ${TEXT_TO_FILL} + suffix` in `<incomplete_text>` tags, with the same format for prose and code. Passes `undefined` for prefix in `postProcessCompletion()` to skip `trimPrefixOverlap`.
- `src/scripts/dump-prompts.ts`: Updated to reflect new prompt format, removed `extractAnchor` import.
- `src/test/unit/anchor.test.ts`: Deleted (tested the removed `extractAnchor` function).
- `src/test/api/anchor-echo.test.ts`: Rewritten as TEXT_TO_FILL adherence test. Verifies the model fills the placeholder without echoing prefix or suffix text.
- `CLAUDE.md`: Added Claude Code provider entry, updated prompt-builder description.
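The message assembly is small enough to show in full. This is a sketch of the format described above, not a copy of `claude-code.ts`; `buildFillMessage` is a hypothetical name, and it assumes no whitespace is added around the tags.

```typescript
// The placeholder is a literal string. It is written with double quotes so
// TypeScript does not treat ${...} as a template substitution.
const PLACEHOLDER = "${TEXT_TO_FILL}";

// Wraps the document in <incomplete_text> tags with the placeholder at the cursor.
function buildFillMessage(prefix: string, suffix: string): string {
  return "<incomplete_text>" + prefix + PLACEHOLDER + suffix + "</incomplete_text>";
}
```

For example, `buildFillMessage("Hello ", " world")` places the placeholder exactly where the cursor sits, between the two fragments.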
Test results: 226 unit tests passing, lint + type-check clean. TEXT_TO_FILL adherence API test: 6/6 scenarios clean (no prefix echo, no suffix echo). Completions are contextually appropriate across prose continuation, bullet lists, code FIM, heading continuation, and short input scenarios.
Deep-dived into both the Anthropic SDK and Ollama API to ensure correct, token-efficient usage. Two research agents ran in parallel. Full findings documented in:
- `docs/anthropic-sdk-reference.md`: prefill behavior, stop sequence constraints, caching limits, timeout config, error classes
- `docs/ollama-api-reference.md`: raw vs templated mode, FIM via suffix param, keep_alive, KV cache reuse, endpoint selection
Code changes based on research:
- Anthropic: Removed dead prefill stripping logic (API doesn't echo prefill). Set client timeout to 30s (was 10min default). Added `\n\n` post-processing trim (Anthropic drops whitespace-only stop sequences). Added `APIUserAbortError` catch.
- Ollama: Redesigned raw mode usage: prose uses `raw: true` (continuation), code FIM uses `raw: false` with `suffix` param (Ollama handles model-specific FIM tokens). Added `keep_alive: "30m"`. System prompt now sent in non-raw mode.
- Types/prompt-builder: Added `suffix` to `BuiltPrompt` for providers with native FIM support.
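The Ollama split can be summarized as a request builder. The field names (`raw`, `suffix`, `system`, `keep_alive`, `stream`) are `/api/generate` parameters; `buildOllamaRequest`, the `OllamaGenerateRequest` type, and `stream: false` are assumptions for this sketch.

```typescript
type Mode = "prose" | "code";

interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
  keep_alive: string;
  raw?: boolean;
  suffix?: string;
  system?: string;
}

function buildOllamaRequest(
  mode: Mode,
  model: string,
  prefix: string,
  suffix: string,
  system: string,
): OllamaGenerateRequest {
  const base = { model, prompt: prefix, stream: false, keep_alive: "30m" };
  if (mode === "prose") {
    // Raw continuation: the prompt bypasses the chat template, so no system prompt.
    return { ...base, raw: true };
  }
  // Code FIM: Ollama wraps prompt/suffix in the model-specific FIM tokens.
  return { ...base, raw: false, suffix, system };
}
```

The body would be sent as JSON to `/api/generate`; keeping the model loaded for 30 minutes avoids paying the load cost on every completion.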
Test results: 64 unit tests passing (+3 new), 4 API tests passing (Anthropic), 4 skipped (Ollama — no local model). npm run check clean.
Ran a two-pass code review (my contextual review + fresh-eyes subagent reviewer). The background reviewer found 25 issues ranging from critical bugs to minor polish items.
Critical bugs fixed:
- Missing `isAvailable()` method: Added `isAvailable()` to the `CompletionProvider` interface (types.ts:14) and implemented it in `OllamaProvider` (ollama.ts:21-24). The `ProviderRouter` was calling this method but it wasn't part of the interface contract.
- Memory leak in debouncer: The `onCancellationRequested` listener was never disposed (debouncer.ts:28-42). Fixed by storing the listener and calling `dispose()` in both resolution paths (timeout completes or cancellation fires).
- Dead code ('universal' mode): The `CompletionMode` type included `'universal'` but `ModeDetector` never returned it. Removed from type definition (types.ts:1), removed unused `buildUniversalPrompt()` method (prompt-builder.ts), and simplified switch statement. The README called it "Universal (auto)", which was confusing; renamed to just "Auto" for clarity.
- Cross-platform filename bug: Used `split('/')`, which fails on Windows (context-builder.ts:29). Changed to `path.basename()` for proper cross-platform support.
- Documentation accuracy: README claimed "last 3-5 words" for Anthropic prefill but code uses exactly 4 words (`slice(-4)` in prompt-builder.ts:36). Updated README to say "last 4 words".
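The debouncer leak fix follows a common pattern: keep the listener handle and dispose it on every exit path. A self-contained sketch, using minimal stand-in interfaces instead of the real `vscode` types (`ListenerHandle` and `CancellationTokenLike` are hypothetical; the actual debouncer also returns an `AbortSignal`, omitted here):

```typescript
interface ListenerHandle {
  dispose(): void;
}

interface CancellationTokenLike {
  onCancellationRequested(listener: () => void): ListenerHandle;
}

// Resolves true when the delay elapses, false when the token is cancelled first.
function debounce(delayMs: number, token: CancellationTokenLike): Promise<boolean> {
  return new Promise((resolve) => {
    let listener: ListenerHandle | undefined;
    const timer = setTimeout(() => {
      listener?.dispose(); // timeout path: release the listener
      resolve(true);
    }, delayMs);
    listener = token.onCancellationRequested(() => {
      clearTimeout(timer);
      listener?.dispose(); // cancellation path: release the listener
      resolve(false);
    });
  });
}
```

Without the two `dispose()` calls, every keystroke would leave one more listener attached to its token.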
Other findings not fixed (yet):
The reviewer identified 20 additional issues including: Anthropic prefill stripping logic may be based on API misunderstanding, cache key collision potential, missing config validation, no cross-platform file permissions check for API key file, console.log instead of output channels, etc. These are documented for future work.
All fixes verified with npm run check — zero errors, zero warnings.
Built the entire extension from scratch based on a design plan developed in a prior Claude Code session. The plan laid out a VSCodium extension for inline ghost-text completions with three modes (prose, code, universal/auto-detect) and two backends (Anthropic Claude, Ollama).
What got built (11 source files, compiles clean with zero TypeScript errors):
- `extension.ts`: Entry point. Loads config from VS Code settings, wires up the inline completion provider, status bar, and three commands (trigger, toggle, cycle mode).
- `types.ts`: Central type definitions (`CompletionMode`, `Backend`, `CompletionContext`, `CompletionProvider` interface, `BuiltPrompt`, `ExtensionConfig`).
- `completion-provider.ts`: The orchestrator. Implements `InlineCompletionItemProvider`. Runs the full chain: mode detection → context extraction → cache check → debounce → provider call → cache write → return.
- `mode-detector.ts`: Maps VS Code `languageId` to prose/code. Maintains sets of known code languages and prose languages. Unrecognized languages default to prose (intentional: primary use case is writing).
- `prompt-builder.ts`: Constructs mode-specific prompts. Prose mode uses continuation-style prompting with assistant prefill support. Code mode includes filename/language context and prefix+suffix framing. Universal mode falls back to a generic continuation prompt.
- `providers/anthropic.ts`: Claude API client using `@anthropic-ai/sdk`. Implements prefill (seeds assistant response with last few words to force continuation). Supports prompt caching via `cache_control: { type: "ephemeral" }`. Passes `AbortSignal` for cancellation.
- `providers/ollama.ts`: Ollama client using native `fetch`. Hits `/api/generate` with `raw: true` for base models. Passes `AbortSignal` for cancellation.
- `providers/provider-router.ts`: Thin router that holds both provider instances and returns the active one based on config.
- `utils/debouncer.ts`: Promise-based debounce that integrates with VS Code's `CancellationToken` and returns an `AbortSignal` for HTTP request cancellation. Clears previous timers and aborts in-flight requests on new keystrokes.
- `utils/cache.ts`: LRU cache with TTL (50 entries, 5min). Key is `mode + last 500 chars of prefix + first 200 chars of suffix`.
- `utils/context-builder.ts`: Extracts prefix/suffix from a `TextDocument` at a given position with configurable context window sizes.
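For the cache, a compact sketch of an LRU with TTL built on `Map` insertion order. The limits (50 entries, 5min) and the key recipe come from the description above; `CompletionCache`, the `|` separator, and the injectable clock are assumptions and may differ from `utils/cache.ts`.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number;
}

// Key recipe from the description: mode + prefix tail + suffix head.
function cacheKey(mode: string, prefix: string, suffix: string): string {
  return `${mode}|${prefix.slice(-500)}|${suffix.slice(0, 200)}`;
}

class CompletionCache {
  private entries = new Map<string, CacheEntry>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries = 50, ttlMs = 5 * 60 * 1000, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    this.entries.delete(key);
    if (entry.expiresAt <= this.now()) return undefined; // expired: stay deleted
    this.entries.set(key, entry); // re-insert so the key becomes most recently used
    return entry.value;
  }

  set(key: string, value: string): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Keying on only the text nearest the cursor means edits far from the cursor still hit the cache, at the cost of occasional collisions between documents that share the same local context.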
Also set up: package.json with full configuration schema (all settings, commands, keybindings), tsconfig.json, esbuild.js build script, .vscode/launch.json for F5 debugging, .vscode/tasks.json for watch mode.
What was NOT built:
- No tests. No test framework installed, no test files.
- No linting. ESLint was referenced in `package.json` scripts but wasn't installed or configured. (Being fixed today.)
- No design document or dev log. The plan only existed in the conversation transcript. (This file and README.md fix that.)
- No CLAUDE.md project instructions file for future Claude Code sessions.
Decisions made during implementation:
- Used esbuild for bundling (fast, simple, standard for VS Code extensions).
- Anthropic SDK is the only runtime dependency. Ollama uses native `fetch`.
- Prompt caching applies `cache_control` to the system message array, which required a type assertion since the SDK types don't expose it cleanly on `TextBlockParam`.
- The debouncer creates a new `AbortController` per debounce cycle and aborts the previous one, ensuring only one HTTP request is in flight at a time.
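The abort-previous behavior reduces to a few lines. `RequestGate` is a hypothetical name used only for this sketch; in the extension the logic lives inside the debouncer.

```typescript
class RequestGate {
  private current: AbortController | null = null;

  // Call once per debounce cycle; pass the returned signal to the HTTP request.
  next(): AbortSignal {
    this.current?.abort(); // cancel whatever the previous cycle started
    this.current = new AbortController();
    return this.current.signal;
  }
}
```

Each call to `next()` invalidates the signal handed out by the previous call, so an in-flight `fetch` holding the old signal rejects with an abort error while the new request proceeds.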