Skip to content

Feat/evaluation planner system prompt split#98

Merged
liamlaverty merged 6 commits into
developmentfrom
feat/evaluation-planner-system-prompt-split
Apr 22, 2026
Merged

Feat/evaluation planner system prompt split#98
liamlaverty merged 6 commits into
developmentfrom
feat/evaluation-planner-system-prompt-split

Conversation

@liamlaverty

@liamlaverty liamlaverty commented Apr 22, 2026

Copy link
Copy Markdown
Owner

Description

  • Split monolithic prompts into separate (system_prompt, user_prompt) pairs across all three VLM client classes (StrokeVLMClient, EvaluationVLMClient, PlannerLLMClient)
  • Added required system_prompt keyword argument to VLMClient.query(), query_multimodal(), and query_multimodal_multi_image()
  • Introduced cached_images parameter on multi-image queries so static stroke sample images form the Anthropic cache prefix, while the per-iteration canvas image remains dynamic
  • Placed block-level cache_control: {"type": "ephemeral"} on the Anthropic system content block and on the last cached image block, replacing the previous top-level cache_control key
  • For OpenAI-compatible providers (Mistral, LMStudio), system prompt is prepended as a role: system message; cached_images are prepended without cache markers
  • Extracted stroke prompt templates into external files: stroke_system_prompt.txt and stroke_user_prompt.txt
  • Added VLMClient.last_usage attribute exposing raw API usage/token counts from the most recent response
  • Changed prompt log directory structure from flat to date-based subdirectories (YYYY/MM/DD/)
  • Updated Mistral VLM model references from mistral-large-2512 to mistral-large-latest
  • Commented out early-stop on target score reached in the generation orchestrator (painting now continues until max_iterations)

Devs need to know

  • Breaking API change: VLMClient.query(), query_multimodal(), and query_multimodal_multi_image() now require a system_prompt= keyword argument. Any direct callers of these methods must be updated.
  • No new Python packages or environment variable changes
  • Prompt log files are now written to date-partitioned subdirectories under src/prompt_logs/ (YYYY/MM/DD/). Existing flat log files are unaffected but new logs will appear in the new structure.

Testing

  • All existing unit tests updated to pass the new system_prompt keyword argument
  • New unit tests verify:
    • _build_*_prompts() methods return (system_prompt, user_prompt) tuples
    • System prompt is byte-identical across consecutive calls for the same artist configuration (proving cacheability)
    • Anthropic payloads place system_prompt in the top-level system field as a content block array with cache_control
    • Anthropic payloads do not include a top-level cache_control key
    • OpenAI-compatible payloads prepend system_prompt as role: system message
    • cached_images are prepended before dynamic images with cache_control on only the last cached image block (Anthropic) and no markers (OpenAI-compat)
  • New manual integration test (tests/manual_test_anthropic_caching.py) sends two identical requests to the Anthropic API and asserts cache_creation_input_tokens > 0 on the first and cache_read_input_tokens > 0 on the second
  • Pre-flight checks: Python lint ✅, mypy ✅, frontend format/lint/type-check ✅

Design Files, screenshots

No visual changes. This is a backend prompt-engineering and API integration change.

How & (optional) omissions

  • Approach: Each client's single _build_*_prompt() method was refactored into a _build_*_prompts() method returning a (system_prompt, user_prompt) tuple. The system prompt contains stable-per-run content (artist persona, response format specification, scoring rubric) while the user prompt contains dynamic-per-iteration content (canvas reference, iteration number, strategy context, layer progress). This separation allows Anthropic's prompt caching to match the system prefix across repeated calls within a single painting run.
  • cached_images: Stroke sample images are byte-identical across all iterations of a run. Routing them through a separate cached_images parameter places them before the dynamic canvas image in the user message, with a cache_control breakpoint on the last cached image block. This extends the cache prefix to cover both the system prompt and the static sample images.
  • Legacy _build_stroke_prompt(): The original combined-prompt method is retained alongside the new split method for backward compatibility. It is no longer called from suggest_strokes().
  • Orchestrator early-stop: The return True on target score reached is commented out rather than removed, as this is a temporary behavioural change pending further tuning.

Additional info

  • 17 files changed, 1,379 insertions, 290 deletions
  • Anthropic prompt caching documentation reference: cache breakpoints are placed at the block level (up to 4 per request); Haiku 4.5 requires a minimum of 4,096 tokens in the cacheable prefix

@liamlaverty liamlaverty self-assigned this Apr 22, 2026
…s in use, just that a claude model is in use
@liamlaverty liamlaverty merged commit f0f47aa into development Apr 22, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant