Feat/evaluation planner system prompt split#98
Merged
liamlaverty merged 6 commits intoApr 22, 2026
Conversation
…s in use, just that a claude model is in use
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
(system_prompt, user_prompt)pairs across all three VLM client classes (StrokeVLMClient,EvaluationVLMClient,PlannerLLMClient)system_promptkeyword argument toVLMClient.query(),query_multimodal(), andquery_multimodal_multi_image()cached_imagesparameter on multi-image queries so static stroke sample images form the Anthropic cache prefix, while the per-iteration canvas image remains dynamiccache_control: {"type": "ephemeral"}on the Anthropic system content block and on the last cached image block, replacing the previous top-levelcache_controlkeyrole: systemmessage;cached_imagesare prepended without cache markersstroke_system_prompt.txtandstroke_user_prompt.txtVLMClient.last_usageattribute exposing raw API usage/token counts from the most recent responseYYYY/MM/DD/)mistral-large-2512tomistral-large-latestmax_iterations)Devs need to know
VLMClient.query(),query_multimodal(), andquery_multimodal_multi_image()now require asystem_prompt=keyword argument. Any direct callers of these methods must be updated.src/prompt_logs/(YYYY/MM/DD/). Existing flat log files are unaffected but new logs will appear in the new structure.Testing
system_promptkeyword argument_build_*_prompts()methods return(system_prompt, user_prompt)tuplessystem_promptin the top-levelsystemfield as a content block array withcache_controlcache_controlkeysystem_promptasrole: systemmessagecached_imagesare prepended before dynamic images withcache_controlon only the last cached image block (Anthropic) and no markers (OpenAI-compat)tests/manual_test_anthropic_caching.py) sends two identical requests to the Anthropic API and assertscache_creation_input_tokens > 0on the first andcache_read_input_tokens > 0on the secondDesign Files, screenshots
No visual changes. This is a backend prompt-engineering and API integration change.
How & (optional) omissions
_build_*_prompt()method was refactored into a_build_*_prompts()method returning a(system_prompt, user_prompt)tuple. The system prompt contains stable-per-run content (artist persona, response format specification, scoring rubric) while the user prompt contains dynamic-per-iteration content (canvas reference, iteration number, strategy context, layer progress). This separation allows Anthropic's prompt caching to match the system prefix across repeated calls within a single painting run.cached_images: Stroke sample images are byte-identical across all iterations of a run. Routing them through a separatecached_imagesparameter places them before the dynamic canvas image in the user message, with acache_controlbreakpoint on the last cached image block. This extends the cache prefix to cover both the system prompt and the static sample images._build_stroke_prompt(): The original combined-prompt method is retained alongside the new split method for backward compatibility. It is no longer called fromsuggest_strokes().return Trueon target score reached is commented out rather than removed, as this is a temporary behavioural change pending further tuning.Additional info