refactor(sdk): remove Jinja prompts superseded by the Python prompt registry #3796
VascoSch92 wants to merge 8 commits into
Conversation
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
I verified the SDK prompt-rendering paths as a real user would; the main refactor works for default, disabled, normal custom-policy, custom-template, and planning-template usage, but an empty custom policy file now leaks the built-in default policy.
Does this PR achieve its stated goal?
Mostly yes. The PR set out to remove redundant built-in Jinja prompts while keeping the Python prompt registry behavior equivalent and routing custom security policies through it; running Agent(...).static_system_message on base vs. the PR showed identical default Claude/GPT-5 prompt hashes and working normal custom/disabled policy behavior. However, a zero-byte custom security_policy_filename regresses: base rendered no default policy for that custom file, while the PR falls back to the built-in security policy.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully. |
| CI Status | 🟡 No failures observed; many checks pass, with coverage/QA/build jobs still pending when checked. |
| Functional Verification |
Functional Verification
Test 1: Default prompt parity and normal custom-policy behavior
Step 1 — Establish baseline on origin/main:
Ran git checkout --quiet origin/main && uv run python /tmp/qa_prompt_check.py:
default_claude: len=14504 sha=c9d7d3f665687ed7 security_tag=True default_policy=True self_doc=True important=True gpt5_inline_review=False claude_specific=True
gpt5: len=15202 sha=0ecf2532b0abacfc security_tag=True default_policy=True self_doc=True important=True gpt5_inline_review=True claude_specific=False
disabled_policy: security_tag=True default_policy=False len=13020 sha=6ac4a8b90cd27349
custom_policy: custom_rule=True jinja_variable_literal=False jinja_variable_rendered=False default_policy_leaked=False security_tag_count=1 len=13062 sha=adb80e715e640b47
custom_template: CUSTOM_TEMPLATE model_family=openai_gpt policy=security_policy.j2
planning_template: len=5249 sha=9dfe768574be2daf planning_marker=True
This establishes the old user-facing behavior: default prompts include the expected sections/model-specific instructions; custom policy content replaces the default policy; custom system templates and the planning template render through Jinja.
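The `len=`/`sha=` pairs above work as fingerprints for comparing one rendered prompt across branches. The probe script itself is not shown, so the following is only a minimal sketch under assumptions: `fingerprint` is a hypothetical helper, and SHA-256 truncated to 16 hex characters is a guess at how the 16-character `sha` values were produced.

```python
import hashlib


def fingerprint(prompt: str) -> str:
    """Return a short, stable fingerprint for a rendered prompt.

    Assumption: SHA-256 over the UTF-8 bytes, truncated to 16 hex characters.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return f"len={len(prompt)} sha={digest}"


baseline = fingerprint("You are a helpful agent.\n")
candidate = fingerprint("You are a helpful agent.\n")
changed = fingerprint("You are a helpful agent.\n\n")

print(baseline == candidate)  # identical text gives an identical fingerprint
print(baseline == changed)    # a single extra newline changes it
```

A matching fingerprint on both branches is strong evidence of byte-identical output, while a mismatch only says that something changed, not what.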
Step 2 — Apply the PR's changes:
Checked out vasco/clean-jinja at b140712e05c8d9c77b40c1f6116e14223ffd2da5.
Step 3 — Re-run with the PR in place:
Ran git checkout --quiet vasco/clean-jinja && uv run python /tmp/qa_prompt_check.py:
default_claude: len=14504 sha=c9d7d3f665687ed7 security_tag=True default_policy=True self_doc=True important=True gpt5_inline_review=False claude_specific=True
gpt5: len=15202 sha=0ecf2532b0abacfc security_tag=True default_policy=True self_doc=True important=True gpt5_inline_review=True claude_specific=False
disabled_policy: security_tag=False default_policy=False len=12991 sha=5b8bface75f397d6
custom_policy: custom_rule=True jinja_variable_literal=True jinja_variable_rendered=False default_policy_leaked=False security_tag_count=1 len=13073 sha=27cb578ec84d66b9
custom_template: CUSTOM_TEMPLATE model_family=openai_gpt policy=security_policy.j2
planning_template: len=5249 sha=9dfe768574be2daf planning_marker=True
This confirms the main refactor goal for normal usage: default Claude and GPT-5 prompts are byte-identical by hash, normal custom policy content is included without default-policy leakage, disabled policy no longer emits empty <SECURITY> tags, and retained Jinja escape hatches still render.
Test 2: Edge case — empty custom security policy file
Step 1 — Establish baseline on origin/main:
Ran git checkout --quiet origin/main && uv run python /tmp/qa_empty_policy_check.py:
empty_custom_policy: security_tag=True default_policy=False policy_path=/tmp/tmp9zhd0u84/empty_policy.j2 len=13022
This shows that, before the PR, choosing a custom policy file did not silently reintroduce the built-in policy even when the file was empty.
Step 2 — Apply the PR's changes:
Checked out vasco/clean-jinja at b140712e05c8d9c77b40c1f6116e14223ffd2da5.
Step 3 — Re-run with the PR in place:
Ran git checkout --quiet vasco/clean-jinja && uv run python /tmp/qa_empty_policy_check.py:
empty_custom_policy: security_tag=True default_policy=True policy_path=/tmp/tmp3c64fja7/empty_policy.j2 len=14504
This shows a regression: a user-supplied empty custom policy file is treated like no custom content and the built-in default policy appears instead.
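One plausible way this kind of regression arises is a truthiness check on the file content, since an empty string is falsy in Python. The PR's actual code path is not quoted here, so this is a sketch under assumptions: `resolve_policy_truthy`, `resolve_policy_explicit`, and `DEFAULT_POLICY` are hypothetical names, not SDK identifiers.

```python
import tempfile
from pathlib import Path
from typing import Optional

DEFAULT_POLICY = "built-in default policy"  # placeholder, not the SDK's text


def resolve_policy_truthy(custom_path: Optional[Path]) -> str:
    """Fall back whenever the custom content is falsy, which includes ""."""
    custom = custom_path.read_text() if custom_path is not None else None
    return custom or DEFAULT_POLICY


def resolve_policy_explicit(custom_path: Optional[Path]) -> str:
    """Fall back only when no custom file was selected at all."""
    if custom_path is None:
        return DEFAULT_POLICY
    return custom_path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    empty_policy = Path(tmp) / "empty_policy.j2"
    empty_policy.write_text("")  # zero-byte custom policy file
    truthy_result = resolve_policy_truthy(empty_policy)
    explicit_result = resolve_policy_explicit(empty_policy)

print(truthy_result == DEFAULT_POLICY)  # the default policy leaks in
print(explicit_result == "")            # the empty custom file is honored
```

Distinguishing "no file selected" from "file selected but empty" with an explicit `is None` check would preserve the baseline behavior observed on `origin/main`.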
Issues Found
- 🟠 Issue: Empty custom security policy files now fall back to the built-in default security policy instead of honoring the selected custom file content.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
🟡 Taste Rating: Acceptable — the refactor removes a dead second prompt implementation, but the custom security-policy path now changes a public “template filename” behavior and has a small fallback bug.
[CRITICAL ISSUES]
- None found that require a blocking review from this automation.
[IMPROVEMENT OPPORTUNITIES]
- See the inline comments on custom security policy rendering. The data-flow simplification is good, but custom policy files should either preserve the documented Jinja-template contract or the breaking change should be made explicit with docs/deprecation and stronger tests.
[TESTING GAPS]
- The PR description includes targeted prompt-render evidence and relevant test commands. I did not see eval-monitor evidence plus human confirmation; because this changes prompt assembly/agent behavior, the repo-specific review guide says not to approve yet and to defer to a human maintainer/eval validation.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
This removes legacy prompt templates and changes the custom security-policy path in agent prompt construction. The deletion itself is low-complexity and covered by snapshots/tests, but prompt assembly is eval-sensitive and can affect benchmark/agent behavior; custom-policy compatibility is the main user-facing risk.
VERDICT:
✅ Worth merging after maintainer validation: The core cleanup direction is sound, but please address/acknowledge the custom policy compatibility concerns and have a human maintainer decide after the appropriate lightweight evals.
KEY INSIGHT:
The refactor is only safe if the Python registry remains a drop-in replacement not just for default prompts, but also for documented custom security-policy template behavior.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
✅ QA Report: PASS
Prompt rendering works as described: default/model-specific prompts remain stable, custom security policies now route through the registry as raw text, and disabled policies no longer emit empty <SECURITY> tags.
Does this PR achieve its stated goal?
Yes. The PR set out to remove superseded Jinja prompt files while preserving the default Python-registry prompt path and moving custom security policies into that registry path. I exercised the SDK as a user would by constructing real Agent/LLM objects and rendering static_system_message on both origin/main and commit db2ada2e90078f0107613f37c24c9ea6f92fa8da; the default Claude and GPT-5 prompt markers/lengths stayed identical, while the documented custom-policy behavior changed from Jinja-rendered to verbatim raw content and disabled policies stopped producing empty security tags.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run python created the project .venv, installed packages, and imported Agent/LLM successfully. |
| CI Status | 🟡 At review time: 21 checks passing, 8 pending, 1 skipped; no failing checks observed. |
| Functional Verification | ✅ Real SDK prompt rendering exercised before/after for default, GPT-5, disabled security policy, custom security policy, and retained planning-template rendering. |
Functional Verification
Test 1: Default and model-specific static prompts remain stable
Step 1 — Establish baseline on origin/main:
Ran git checkout --quiet origin/main && uv run python - <<'PY' ... PY, where the script constructed real Agent/LLM objects for default Claude, GPT-5, disabled policy, and custom policy cases and printed prompt markers.
Observed baseline excerpt:
default len 14504
default.has_security_tag=True
default.has_default_policy=True
default.has_self_documentation=True
default.has_claude_important=True
gpt5 len 15202
gpt5.has_gpt5_inline_review_guidance=True
This establishes the user-visible static prompt content that the refactor needs to preserve for normal/default agents.
Step 2 — Apply the PR's changes:
Checked out db2ada2e90078f0107613f37c24c9ea6f92fa8da.
Step 3 — Re-run with the PR in place:
Ran the same SDK script and observed:
default len 14504
default.has_security_tag=True
default.has_default_policy=True
default.has_self_documentation=True
default.has_claude_important=True
gpt5 len 15202
gpt5.has_gpt5_inline_review_guidance=True
This confirms the default registry path still renders the expected security/self-documentation/model-specific sections after deleting the redundant Jinja files.
Test 2: Custom and disabled security policies behave as documented
Step 1 — Establish baseline on origin/main:
The same baseline script created a temporary custom custom_policy.j2 containing CUSTOM_RULE plus Jinja syntax and rendered it through Agent.static_system_message.
Observed baseline excerpt:
disabled.has_security_tag=True
disabled.has_default_policy=False
disabled.has_empty_security_block=True
custom.has_security_tag=True
custom.has_default_policy=False
custom.has_custom_rule=True
custom.literal_jinja_kept=False
custom.rendered_jinja_seen=True
This shows the old behavior: disabled policies left an empty <SECURITY> wrapper, and custom policy files were rendered as Jinja.
Step 2 — Apply the PR's changes:
Checked out db2ada2e90078f0107613f37c24c9ea6f92fa8da.
Step 3 — Re-run with the PR in place:
Ran the same SDK script and observed:
disabled.has_security_tag=False
disabled.has_default_policy=False
disabled.has_empty_security_block=False
custom.has_security_tag=True
custom.has_default_policy=False
custom.has_custom_rule=True
custom.literal_jinja_kept=True
This confirms the intended behavior change: custom policy content is inserted verbatim via the registry, the default policy does not leak into custom prompts, and disabling the policy removes the security block rather than emitting an empty one.
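The two behavior changes observed here (no wrapper when the policy is disabled, verbatim insertion otherwise) can be sketched as a single helper. This is an illustration only: `security_block` is a hypothetical function, and using `None` to represent a disabled policy is an assumption about the internal representation.

```python
from typing import Optional


def security_block(policy_text: Optional[str]) -> str:
    """Wrap policy text in <SECURITY> tags, or emit nothing when disabled.

    Assumption: None stands for a disabled policy. Any string, including
    one containing Jinja syntax, is inserted verbatim without rendering.
    """
    if policy_text is None:
        return ""
    return f"<SECURITY>\n{policy_text}\n</SECURITY>"


custom = security_block("CUSTOM_RULE for {{ model_name }}")
disabled = security_block(None)

print("{{ model_name }}" in custom)  # Jinja syntax stays literal
print(disabled == "")                # no empty <SECURITY> wrapper
```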
Test 3: Retained planning Jinja render path still executes
Step 1 — Apply the PR's changes:
Used the PR commit db2ada2e90078f0107613f37c24c9ea6f92fa8da.
Step 2 — Exercise the surviving render path:
Ran uv run python - <<'PY' ... render_template(agent.prompt_dir, 'system_prompt_planning.j2') ... PY.
Observed:
planning_rendered True
planning_len 5249
This confirms render_template can still render the retained built-in planning template after the deleted prompt files are removed.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
✅ QA Report: PASS
Verified the SDK prompt-rendering paths affected by this refactor with real Agent/LLM instantiation; behavior matches the PR’s stated goal and no functional issues were found.
Does this PR achieve its stated goal?
Yes. The PR set out to remove superseded built-in Jinja prompt templates while preserving default prompt behavior and routing custom security policies through the Python prompt registry. Running Agent(...).static_system_message on origin/main and on vasco/clean-jinja showed the default Claude/GPT-5 prompts still include the expected security, self-documentation, and model-specific content, while the custom security-policy behavior changed exactly as documented: the PR injects the custom policy file verbatim, avoids default-policy leakage, and no longer emits <SECURITY> tags when the policy is disabled. I also verified the retained Jinja escape hatch still renders the planning prompt and an absolute custom template.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed; uv dependencies were synced. No test suite, linter, formatter, or pre-commit run was executed. |
| CI Status | 🟡 33 passing, 3 skipped, 1 pending (qa-changes) at verification time. |
| Functional Verification | ✅ SDK prompt rendering and retained Jinja template paths exercised successfully. |
Functional Verification
Test 1: Default prompt behavior and custom security-policy delta
Step 1 — Establish baseline without the PR:
Ran git switch --detach origin/main && uv run python /tmp/qa_prompt_probe.py using a script that imports Agent/LLM, instantiates real agents, and reads static_system_message:
default_claude: len=14504
default_claude: security_tag=True
default_claude: default_policy=True
default_claude: self_documentation=True
default_claude: gpt5_inline_thread_help=False
gpt5: len=15202
gpt5: security_tag=True
gpt5: default_policy=True
gpt5: self_documentation=True
gpt5: gpt5_inline_thread_help=True
custom: security_tag=True
custom: custom_rule=True
custom: default_policy_leaked=False
custom: literal_model_expr=False
custom: rendered_model_name=True
custom: false_branch_visible=False
custom: jinja_if_literal=False
disabled: security_tag=True
disabled: default_policy=False
status: baseline=0 pr=0
This confirms the pre-PR state: default prompts rendered expected sections, custom security policies were rendered as Jinja (rendered_model_name=True, literal Jinja markers absent), and disabling the policy still left an empty <SECURITY> block.
Step 2 — Apply the PR's changes:
Checked out vasco/clean-jinja at db2ada2e90078f0107613f37c24c9ea6f92fa8da.
Step 3 — Re-run with the PR in place:
Ran git switch vasco/clean-jinja && uv run python /tmp/qa_prompt_probe.py:
default_claude: len=14504
default_claude: security_tag=True
default_claude: default_policy=True
default_claude: self_documentation=True
default_claude: gpt5_inline_thread_help=False
gpt5: len=15202
gpt5: security_tag=True
gpt5: default_policy=True
gpt5: self_documentation=True
gpt5: gpt5_inline_thread_help=True
custom: security_tag=True
custom: custom_rule=True
custom: default_policy_leaked=False
custom: literal_model_expr=True
custom: rendered_model_name=False
custom: false_branch_visible=True
custom: jinja_if_literal=True
disabled: security_tag=False
disabled: default_policy=False
status: baseline=0 pr=0
This shows the default prompt path is preserved for the sampled Claude/GPT-5 models (same lengths and expected markers), while the changed custom-policy behavior matches the PR description: Jinja syntax remains literal, the built-in default policy does not leak, and a disabled security policy emits no <SECURITY> tag.
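For readers who want to reproduce marker lines like the ones above, substring checks on the rendered prompt are probably enough. The probe script is not included in this comment, so the sketch below is an assumption about its shape: `probe` and `default_marker` are hypothetical, and only the marker names are taken from the output.

```python
from typing import Dict


def probe(prompt: str, default_marker: str) -> Dict[str, bool]:
    """Compute boolean markers for a rendered prompt via substring checks.

    default_marker is a hypothetical string known to appear only in the
    built-in default policy.
    """
    return {
        "security_tag": "<SECURITY>" in prompt,
        "default_policy_leaked": default_marker in prompt,
        "literal_model_expr": "{{ model_name }}" in prompt,
        "jinja_if_literal": "{% if" in prompt,
    }


rendered = (
    "<SECURITY>\n"
    "# Custom Policy for {{ model_name }}\n"
    "- {% if cli_mode %}CLI_TRUE{% endif %}\n"
    "</SECURITY>"
)
for name, value in probe(rendered, default_marker="DEFAULT_ONLY_TEXT").items():
    print(f"custom: {name}={value}")
```

Substring markers are coarse: they show presence or absence of a feature but cannot prove byte-for-byte equality.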
Test 2: Retained Jinja prompt escape hatch still works
Step 1 — Baseline expectation:
The PR description says render_template and system_prompt_planning.j2 are retained for the planning preset, custom user templates, and subclasses; this path should continue to be executable after the deletion.
Step 2 — Apply the PR's changes:
Using the same checked-out PR branch, I exercised Agent.static_system_message with system_prompt_filename="system_prompt_planning.j2" and then with an absolute custom Jinja template path.
Step 3 — Run the retained path:
Ran uv run python /tmp/qa_jinja_escape_probe.py:
planning: rendered=True
planning: role_tag=True
planning: security_tag=False
custom_template: CUSTOM TEMPLATE FOR test-model / cli=
This confirms the surviving built-in planning template renders successfully, and an absolute custom system-prompt template is still Jinja-rendered with SDK template variables.
Issues Found
None.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
Functional SDK prompt rendering works as claimed; the only issue I observed is a non-functional PR description validator failure.
Does this PR achieve its stated goal?
Yes. The PR set out to remove redundant Jinja prompt templates while preserving default prompt behavior and routing custom security policies through the Python registry. I exercised the SDK by instantiating real Agent/LLM objects and rendering static_system_message: default prompt markers remained present, custom policies were inserted verbatim without leaking the default policy, disabled security policies emitted no <SECURITY> block, and retained Jinja escape hatches still rendered.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the uv workspace environment. |
| CI Status | Validate PR description fails with 3 template-section errors. I did not rerun tests. |
| Functional Verification | ✅ Before/after SDK execution confirmed the changed behavior. |
Functional Verification
Test 1: Default prompt and custom security policy rendering
Step 1 — Establish baseline on origin/main (b0324013):
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY' ... PY with a script that instantiated Agent(llm=LLM(model="gpt-5"), tools=[]), rendered static_system_message, then repeated with a custom security_policy_filename containing Jinja syntax and with security_policy_filename="".
Output excerpt:
commit b0324013
default markers True True True
custom security block
<SECURITY>
# Custom Policy for gpt-5
-
</SECURITY>
disabled security block <SECURITY>
</SECURITY>
This shows the baseline default prompt worked, but custom policy files were Jinja-rendered and disabling the policy still left an empty <SECURITY> block.
Step 2 — Apply the PR changes:
Checked out vasco/clean-jinja at db2ada2e.
Step 3 — Re-run with the PR in place:
Ran the same SDK rendering flow.
Output excerpt:
commit db2ada2e
default markers True True True
custom security block
<SECURITY>
# Custom Policy for {{ model_name }}
- {% if cli_mode %}CLI_TRUE{% endif %}
</SECURITY>
default policy leaked False
disabled security block <NO SECURITY BLOCK>
This confirms the PR preserves default prompt content (<ROLE>, default security policy, and GPT-5 model-specific preamble all present), inserts custom security policy files verbatim, avoids default-policy leakage, and removes the disabled-policy empty tag block.
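The "security block" excerpts above can be pulled out of a rendered prompt with a non-greedy regular expression. This is a sketch rather than the script that was actually run: `extract_security_block` is a hypothetical helper, and the `<NO SECURITY BLOCK>` sentinel simply mirrors the output shown.

```python
import re

# DOTALL lets "." match newlines, since the block spans several lines.
_SECURITY_RE = re.compile(r"<SECURITY>.*?</SECURITY>", re.DOTALL)


def extract_security_block(prompt: str) -> str:
    """Return the first <SECURITY>...</SECURITY> block, or a sentinel."""
    match = _SECURITY_RE.search(prompt)
    return match.group(0) if match else "<NO SECURITY BLOCK>"


prompt = (
    "<ROLE>agent</ROLE>\n"
    "<SECURITY>\n# Custom Policy for {{ model_name }}\n</SECURITY>\n"
    "rest of prompt"
)
print(extract_security_block(prompt))
print(extract_security_block("<ROLE>agent</ROLE>"))
```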
Test 2: Retained Jinja escape hatches still render
Step 1 — Baseline:
On origin/main, the same SDK script rendered a temporary custom system prompt and the retained planning template:
custom_system_prompt: CUSTOM TEMPLATE: gpt-5 / security_policy.j2
planning_template: len= 5249 has_PLANNING= True
This establishes that user-supplied system prompt templates and system_prompt_planning.j2 were valid entry points before the refactor.
Step 2 — PR branch:
On db2ada2e, I re-ran equivalent user-facing rendering:
custom system prompt CUSTOM TEMPLATE: gpt-5 / security_policy.j2
planning template True True
This confirms the retained Jinja pathways still work after deleting the obsolete built-in prompt templates.
CI note
I checked the failing validator log with gh run view 27833734611 --repo OpenHands/software-agent-sdk --job 82376587822 --log-failed.
Output excerpt:
##[error]Keep the `## Why` section from the PR template.
##[error]Keep the `## Summary` section from the PR template.
##[error]Keep the `## How to Test` section from the PR template.
PR description validation failed with 3 error(s).
I did not edit the PR description.
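The three errors suggest the validator looks for specific headings from the PR template. The validator's source was not inspected, so the following is a guess at its logic: `validate_description` and the exact-line matching rule are assumptions, and only the heading names and error wording come from the log.

```python
from typing import List

REQUIRED_SECTIONS = ("## Why", "## Summary", "## How to Test")


def validate_description(body: str) -> List[str]:
    """Return one error per required heading missing from the PR body.

    Assumption: a heading counts as present only if it appears on its own
    line, ignoring surrounding whitespace.
    """
    lines = {line.strip() for line in body.splitlines()}
    return [
        f"Keep the `{section}` section from the PR template."
        for section in REQUIRED_SECTIONS
        if section not in lines
    ]


errors = validate_description("Why\nsome text\nSummary\nmore text\n")
for error in errors:
    print(error)
print(f"PR description validation failed with {len(errors)} error(s).")
```

Under this assumption, headings written as plain text or with a different heading level would fail even though the content is there.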
Issues Found
- 🟡 PR hygiene / CI: `Validate PR description` is failing because the PR template section headings `## Why`, `## Summary`, and `## How to Test` are not present in the expected form. No functional SDK issue was found in the exercised behavior.
This review was created by an AI agent (OpenHands) on behalf of the user.
The default prompt is assembled from the typed section registry, which also
resolves a custom ``security_policy_filename``. Escape hatches keep the Jinja
path: an inline ``system_prompt`` is returned verbatim; a custom
``system_prompt_filename`` or subclass ``prompt_dir`` renders its own template.
There was a problem hiding this comment.
We still support a .j2 file if custom?
"verbatim (NOT rendered as a Jinja template). Can be either:\n"
"- A relative filename (e.g., 'security_policy.j2') loaded from the "
"agent's prompts directory\n"
"- An absolute path (e.g., '/path/to/custom_security_policy.j2')\n"
There was a problem hiding this comment.
I'm a bit confused here, the comment says it's not rendered as jinja, but then it continues saying it can be jinja? Sorry I jumped directly in the code, maybe it was explained somewhere?
There was a problem hiding this comment.
Sorry, I didn't see this comment, and it is misleading. I'll delete it.
What I mean is: if you pass a custom Jinja template, it is rendered as Jinja (with the method render_template); if you don't, we use the built-in prompts, which are verbatim.
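To make the routing described in this thread concrete, here is a rough sketch of the dispatch between the escape hatches and the registry. It is not the SDK's implementation: `resolve_system_prompt` is hypothetical, the two callables are stand-ins for the real `render_template` and registry assembly, and both the precedence order and the use of `None` for "not set" are assumptions.

```python
from typing import Callable, Optional


def resolve_system_prompt(
    system_prompt: Optional[str],
    system_prompt_filename: Optional[str],
    render_template: Callable[[str], str],
    build_from_registry: Callable[[], str],
) -> str:
    """Pick the prompt source: inline text, a Jinja template, or the registry."""
    if system_prompt is not None:
        return system_prompt  # inline prompt: returned verbatim
    if system_prompt_filename is not None:
        return render_template(system_prompt_filename)  # Jinja escape hatch
    return build_from_registry()  # default: typed section registry


def fake_render(filename: str) -> str:
    return f"rendered:{filename}"  # stand-in for a real Jinja render


def fake_registry() -> str:
    return "registry prompt"  # stand-in for registry assembly


print(resolve_system_prompt("inline {{ x }}", None, fake_render, fake_registry))
print(resolve_system_prompt(None, "custom.j2", fake_render, fake_registry))
print(resolve_system_prompt(None, None, fake_render, fake_registry))
```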
# Security Risk Policy
When using tools that support the security_risk parameter, assess the safety risk of your actions:

{% if cli_mode | default(true) %}
There was a problem hiding this comment.
This is one reason why I think maybe we need jinja. I know we first made them python, but that way is not as easily customizable as this... I wonder if the next step could be, on the contrary, to put more in jinja rather than less 🤔
There was a problem hiding this comment.
You're right: a template can be more customizable than Python, but I checked, and the prompts never actually used that: no {% block %}/{% extends %} anywhere, and system_prompt.j2 had no inline logic beyond {{ soul_content }} and include-guards.
So every hook the old Jinja exposed is preserved on the registry, just relocated from {% if %}/{% include %} into .guard()/dict lookups.
The only thing genuinely removed is Jinja inside a custom policy file (now verbatim), which was deliberate.
The "override one block without forking" story you're describing would be a new capability; neither the old templates nor the registry exposes it today.
If that's what we want, I think the right move is to design a real section-override API (works for both), rather than keep the duplicate templates that only gave us that power in theory.
Moreover, you can always pass a template and kwargs argument for it.
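As an illustration of what "relocated from `{% if %}`/`{% include %}` into `.guard()`/dict lookups" could look like, here is a toy registry. It is a sketch under assumptions, not the SDK's registry: `Section`, `assemble`, the `guard` field, and the `MODEL_SPECIFIC` contents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]

# Dict lookup standing in for per-model {% include %} files (placeholder text).
MODEL_SPECIFIC = {"openai_gpt": "GPT-specific guidance"}


@dataclass
class Section:
    name: str
    render: Callable[[Context], str]
    guard: Callable[[Context], bool] = lambda ctx: True  # plays the role of {% if %}


def assemble(sections: List[Section], ctx: Context) -> str:
    """Render guarded sections and join them with single blank lines."""
    parts = [s.render(ctx).strip() for s in sections if s.guard(ctx)]
    return "\n\n".join(p for p in parts if p)


SECTIONS = [
    Section("role", lambda ctx: "<ROLE>agent</ROLE>"),
    Section(
        "security",
        lambda ctx: f"<SECURITY>\n{ctx['policy']}\n</SECURITY>",
        guard=lambda ctx: ctx.get("policy") is not None,
    ),
    Section("model", lambda ctx: MODEL_SPECIFIC.get(str(ctx.get("model_family")), "")),
]

print(assemble(SECTIONS, {"policy": None, "model_family": "openai_gpt"}))
```

A section-override API of the kind discussed above could, in a design like this, amount to replacing one `Section` by name before calling `assemble`, but that is speculation about a feature that does not exist yet.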
HUMAN:
Deleting old Jinja prompts that are no longer used.
AGENT:
End-to-end evidence: instantiated real agents and rendered the full system prompt
(not just unit tests). Before deleting `system_prompt.j2`, I confirmed the registry
reproduces its Jinja render byte-for-byte for the default and custom policies
(including the Windows shell refinement). Post-deletion render:
Suites: `ruff` clean; full `agent/`, `context/`, fncall-converter, `delegate/`, and the
API-breakage signature test pass (1260 passed in the broad sweep).
Why
The default system prompt is now assembled from the pure-Python section registry
(#3610 port), but the original Jinja templates were left behind — mostly unused or
duplicating the registry. Keeping both is a drift/maintenance hazard: two sources of
truth for the same prompt.
Summary
- `.j2` templates: 6 dead (unreferenced `{% include %}` wrappers, Python-ported in-context examples, an orphaned delegate description) and 4 `model_specific/*` (byte-for-byte duplicated in `ModelSpecificSection`).
- `{% include %}`, into `SecuritySection` (`AgentBase._read_custom_security_policy`), then delete `system_prompt.j2` + `self_documentation.j2` + `security_risk_assessment.j2` + `security_policy.j2`; a custom `security_policy_filename` now routes through the registry.
- `render_template` and `system_prompt_planning.j2` are retained for the planning preset, custom user templates, and subclasses.
Issue Number
Relates to #3610 (prompt-registry port).
How to Test
Then render a prompt directly:
Video/Screenshots
N/A — library/CLI change with no UI. Render output included in the AGENT section above.
Type
Notes
Minor behavior changes for custom security policies (default-policy agents are
unaffected, since they are already on the registry): content is identical, but inter-section spacing
is normalized to single blank lines; a disabled policy (`security_policy_filename=""`)
no longer emits empty `<SECURITY></SECURITY>` tags; and custom policy files are inserted
verbatim (no longer Jinja-rendered). The static "matches-legacy" test oracle had become
self-referential (default agents take the registry path), so it's removed along with
`system_prompt.j2`; `test_prompt_snapshot.py` (byte-exact registry snapshots) carries the
static regression guard, and the dynamic oracle against `system_message_suffix.j2` stays.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdk
nikolaik/python-nodejs:python3.13-nodejs22-slim
golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:73475ac-python
Run
All tags pushed for this build
About Multi-Architecture Support
73475ac-python) is a multi-arch manifest supporting both amd64 and arm64
73475ac-python-amd64) are also available if needed