Feat/evaluation planner system prompt split by liamlaverty · Pull Request #98 · liamlaverty/paint-by-language-model

liamlaverty · 2026-04-22T09:57:53Z

Description

Split monolithic prompts into separate (system_prompt, user_prompt) pairs across all three VLM client classes (StrokeVLMClient, EvaluationVLMClient, PlannerLLMClient)
Added required system_prompt keyword argument to VLMClient.query(), query_multimodal(), and query_multimodal_multi_image()
Introduced cached_images parameter on multi-image queries so static stroke sample images form the Anthropic cache prefix, while the per-iteration canvas image remains dynamic
Placed block-level cache_control: {"type": "ephemeral"} on the Anthropic system content block and on the last cached image block, replacing the previous top-level cache_control key
For OpenAI-compatible providers (Mistral, LMStudio), system prompt is prepended as a role: system message; cached_images are prepended without cache markers
Extracted stroke prompt templates into external files: stroke_system_prompt.txt and stroke_user_prompt.txt
Added VLMClient.last_usage attribute exposing raw API usage/token counts from the most recent response
Changed prompt log directory structure from flat to date-based subdirectories (YYYY/MM/DD/)
Updated Mistral VLM model references from mistral-large-2512 to mistral-large-latest
Commented out early-stop on target score reached in the generation orchestrator (painting now continues until max_iterations)

Devs need to know

Breaking API change: VLMClient.query(), query_multimodal(), and query_multimodal_multi_image() now require a system_prompt= keyword argument. Any direct callers of these methods must be updated.
No new Python packages or environment variable changes
Prompt log files are now written to date-partitioned subdirectories under src/prompt_logs/ (YYYY/MM/DD/). Existing flat log files are unaffected but new logs will appear in the new structure.

Testing

All existing unit tests updated to pass the new system_prompt keyword argument
New unit tests verify:
- _build_*_prompts() methods return (system_prompt, user_prompt) tuples
- System prompt is byte-identical across consecutive calls for the same artist configuration (proving cacheability)
- Anthropic payloads place system_prompt in the top-level system field as a content block array with cache_control
- Anthropic payloads do not include a top-level cache_control key
- OpenAI-compatible payloads prepend system_prompt as role: system message
- cached_images are prepended before dynamic images with cache_control on only the last cached image block (Anthropic) and no markers (OpenAI-compat)
New manual integration test (tests/manual_test_anthropic_caching.py) sends two identical requests to the Anthropic API and asserts cache_creation_input_tokens > 0 on the first and cache_read_input_tokens > 0 on the second
Pre-flight checks: Python lint ✅, mypy ✅, frontend format/lint/type-check ✅

Design Files, screenshots

No visual changes. This is a backend prompt-engineering and API integration change.

How & (optional) omissions

Approach: Each client's single _build_*_prompt() method was refactored into a _build_*_prompts() method returning a (system_prompt, user_prompt) tuple. The system prompt contains stable-per-run content (artist persona, response format specification, scoring rubric) while the user prompt contains dynamic-per-iteration content (canvas reference, iteration number, strategy context, layer progress). This separation allows Anthropic's prompt caching to match the system prefix across repeated calls within a single painting run.
cached_images: Stroke sample images are byte-identical across all iterations of a run. Routing them through a separate cached_images parameter places them before the dynamic canvas image in the user message, with a cache_control breakpoint on the last cached image block. This extends the cache prefix to cover both the system prompt and the static sample images.
Legacy _build_stroke_prompt(): The original combined-prompt method is retained alongside the new split method for backward compatibility. It is no longer called from suggest_strokes().
Orchestrator early-stop: The return True on target score reached is commented out rather than removed, as this is a temporary behavioural change pending further tuning.

Additional info

17 files changed, 1,379 insertions, 290 deletions
Anthropic prompt caching documentation reference: cache breakpoints are placed at the block level (up to 4 per request); Haiku 4.5 requires a minimum of 4,096 tokens in the cacheable prefix

…caching

…ovider

…pic prompt

…s in use, just that a claude model is in use

liamlaverty added 5 commits April 21, 2026 11:42

feat(system-prompts): adapting system prompts to work with ephemeral …

9899c1f

…caching

Merge branch 'development' into feat/vlm-client-system-prompt-refactor

90763c8

feat(clients): update clients to use prompt caching with Anthropic pr…

2cd4b28

…ovider

feat(config): update min model for claude to sonnet-4-6

00c9e91

feat(vlm-clients): pass cachecontrol to the correct part of an anthro…

b203e91

…pic prompt

liamlaverty self-assigned this Apr 22, 2026

test(test-config): no longer care about which specific claude model i…

ed7e720

…s in use, just that a claude model is in use

liamlaverty merged commit f0f47aa into development Apr 22, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/evaluation planner system prompt split#98

Feat/evaluation planner system prompt split#98
liamlaverty merged 6 commits into
developmentfrom
feat/evaluation-planner-system-prompt-split

liamlaverty commented Apr 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

liamlaverty commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Devs need to know

Testing

Design Files, screenshots

How & (optional) omissions

Additional info

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

liamlaverty commented Apr 22, 2026 •

edited

Loading