
Conversation


@sstults sstults commented Jul 1, 2025

Description

The LLM judgment template functionality allows users to create and manage custom prompt templates for LLM-based search relevance evaluation. This feature provides a way to standardize and customize how LLMs evaluate the relevance of search results.

Key Components:

1. Template Management

  • LlmPromptTemplate Model: Stores template metadata including ID, name, description, template content, and timestamps
  • CRUD Operations: Full create, read, update, and delete operations via REST APIs (/_plugins/_search_relevance/llm_prompt_templates)
  • Persistent Storage: Templates are stored in OpenSearch indices with proper mappings

2. Template Structure

  • Variable Substitution: Templates support dynamic variables using {variableName} syntax (an example template appears after this list)

  • Supported Variables:

    • {searchText} - The search query text
    • {reference} - Reference answer (optional)
    • {hits} - JSON-formatted search results
  • Template Validation: Ensures only supported variables are used
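
For illustration, a template using the variables above might look like the following; the wording is only an example, not a default shipped with the plugin:

You are a search relevance judge. Rate how well each result satisfies the query.

Query: {searchText}
Reference answer (may be empty): {reference}
Search results (JSON): {hits}

Return a numeric rating between 0 and 1 for each result.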

3. Integration with LLM Judgments

  • Optional Template Usage: When creating LLM judgments, users can specify a templateId parameter
  • Fallback Behavior: If template is missing or invalid, the system falls back to default prompts
  • Variable Substitution: The TemplateUtils class handles dynamic replacement of template variables with actual values (a minimal sketch of this idea appears after this list)
  • Caching Integration: Template-based judgments are cached like standard judgments
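
To make the substitution concrete, here is a minimal sketch of the idea; the actual TemplateUtils implementation in this PR may differ in method names and error handling:

import java.util.Map;

public class TemplateSubstitutionSketch {

    // Hypothetical helper, not the actual TemplateUtils code:
    // replaces each {variableName} placeholder with its value.
    public static String substitute(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "Query: {searchText}\nReference: {reference}\nHits: {hits}";
        String prompt = substitute(template, Map.of(
            "searchText", "wireless headphones",
            "reference", "noise-cancelling over-ear headphones",
            "hits", "[{\"_id\": \"1\", \"title\": \"Bluetooth headphones\"}]"
        ));
        System.out.println(prompt);
    }
}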

4. Workflow

  1. User creates a custom prompt template with variables (both REST calls in this workflow are sketched after this list)
  2. During LLM judgment generation, the template is retrieved
  3. Variables are substituted with actual query text, reference answers, and search hits
  4. The customized prompt is sent to the LLM for evaluation
  5. Results are processed and cached for future use
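
For illustration, the two REST calls in this workflow might look roughly like the sketch below, using plain Java HTTP calls against a local cluster. The template endpoint is the one listed above, but the exact path shape, the request body fields, and the judgments endpoint are assumptions for illustration, not a verified contract:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LlmTemplateWorkflowSketch {

    private static final String BASE_URL = "http://localhost:9200";

    // Sends a PUT request with a JSON body and returns the raw response body.
    private static String put(HttpClient client, String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(BASE_URL + path))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Create a custom prompt template (body field names are illustrative,
        //    based on the model description above, not a verified contract).
        System.out.println(put(client,
            "/_plugins/_search_relevance/llm_prompt_templates/my_template",
            "{ \"name\": \"ecommerce-binary\","
            + " \"description\": \"Binary relevance for product search\","
            + " \"template\": \"Query: {searchText}\\nHits: {hits}\\nRate each hit 0 or 1.\" }"));

        // 2. Create an LLM judgment that references the template via templateId.
        //    The judgments path and the other body fields are assumptions for illustration.
        System.out.println(put(client,
            "/_plugins/_search_relevance/judgments",
            "{ \"type\": \"LLM_JUDGMENT\","
            + " \"templateId\": \"my_template\","
            + " \"querySetId\": \"my_query_set\","
            + " \"modelId\": \"my_model_id\" }"));
    }
}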

This functionality enables organizations to standardize their relevance evaluation criteria and customize prompts for specific domains or use cases while maintaining the automated nature of LLM-based search relevance assessment.

Issues Resolved

n/a

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

sstults added 4 commits July 1, 2025 08:23
- Add LlmPromptTemplate model with validation
- Implement CRUD operations for prompt templates
- Add REST endpoints for template management
- Include comprehensive test coverage
- Update plugin registration and indices

Signed-off-by: Scott Stults <[email protected]>
- Replace toLowerCase() with toLowerCase(Locale.ROOT) to comply with forbidden APIs
- Ensures consistent locale-independent string conversion
- All tests passing including LLM template and judgment functionality

Signed-off-by: Scott Stults <[email protected]>
- Add templateId parameter support to PutLlmJudgmentRequest
- Update REST API to accept templateId in LLM judgment requests
- Enhance transport action to pass templateId through metadata
- Add TEMPLATE_ID constant to PluginConstants
- Create comprehensive integration test for template workflow
- Complete end-to-end integration from REST API to LLM processor

This completes Phase 4 of the LLM Prompt Template integration,
enabling users to specify custom templates when creating LLM judgments.
The system now supports full template lifecycle management and usage.

Signed-off-by: Scott Stults <[email protected]>
- Fixed refresh operation to use specific index instead of global refresh
- Added error handling for refresh operations to prevent system index warnings
- All four previously failing tests now pass:
  * testLlmJudgmentWithMissingTemplate
  * testLlmJudgmentWithCustomTemplate
  * testLlmJudgmentWithoutTemplate
  * testLlmJudgmentTemplateVariableSubstitution

Signed-off-by: Scott Stults <[email protected]>
@sstults sstults marked this pull request as ready for review July 1, 2025 23:22
@epugh
Collaborator

epugh commented Jul 2, 2025

Let's not forget that we need to add Docs that demonstrate what a "good user prompt" would look like. Also, let's make sure to tag on to the "Advanced User Journey" blog post that @wrigleyDan is writing....

sstults added 3 commits July 3, 2025 06:01
- Extend OpenSearchTestCase instead of using standalone JUnit tests
- Remove @Test annotations (OpenSearch framework uses method naming conventions)
- Remove explicit JUnit imports
- All unit tests pass successfully (205+ tests)

Signed-off-by: Scott Stults <[email protected]>
- Update connector URLs from 127.0.0.1 to host.docker.internal for Docker compatibility
- Add host.docker.internal patterns to ML Commons trusted endpoints
- Configure ML Commons settings for private IP access
- Script now successfully connects OpenSearch (Docker) to Ollama (host)
- Test demonstrates working LLM judgment with accurate relevance ratings

Signed-off-by: Scott Stults <[email protected]>

*/
public static boolean validateTemplate(String template) {
    if (template == null || template.isEmpty()) {
        return true;

Member: This should be false, shouldn't it?

Author: Good point. I'm either going to rename this method or refactor how it's used (or both).

* @param template The template to validate
* @return true if template contains only supported variables, false otherwise
*/
public static boolean validateTemplate(String template) {

Member: Let's add validation for max length; otherwise this is a potential gap in system resiliency.

Author: Agreed. I'm uncertain about what we should do though, so your advice would be welcome.

I think what we should do is add a dynamic cluster setting to SearchRelevanceSettings with a default value. That is, I think it would be nice to let this be overridden at runtime and not require a restart to change. I don't know enough about multitenancy configurations to say whether a cluster or node setting is more appropriate though.

The harder question is whether this should be string length or token length. String length is easy to implement, but token length is more useful in the LLM world (for things like cost calculations). Token length isn't much harder to implement, but the token counts probably won't exactly match the target model's. It'll be close though, so that might be good enough.

Lastly, we want to set a default value that's not too high and not too low. In the last year the max token length in popular models has expanded by a couple orders of magnitude (like 2k to 100k). Maybe we just pick something medium like 10k?
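
To make the idea concrete, a rough sketch of such a setting might look like this (the key name, default, and minimum are placeholders, not proposed values):

import org.opensearch.common.settings.Setting;

public final class SearchRelevanceTemplateSettingsSketch {

    // Rough sketch only: key, default, and minimum are placeholders.
    public static final Setting<Integer> LLM_PROMPT_TEMPLATE_MAX_LENGTH = Setting.intSetting(
        "plugins.search_relevance.llm_prompt_template.max_length",
        10_000,   // default maximum length in characters
        1,        // minimum allowed value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    private SearchRelevanceTemplateSettingsSketch() {}
}

It would also need to be returned from the plugin's getSettings() so that cluster settings updates are accepted at runtime.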

@fen-qin or @heemin32, do either of you have opinions about these three questions?

public class PutLlmPromptTemplateRequest extends ActionRequest {

    private String templateId;
    private LlmPromptTemplate template;

Member: Can we add some basic metadata, like version and description?

*/
public class LlmPromptTemplateIT extends BaseSearchRelevanceIT {

    private static final String LLM_PROMPT_TEMPLATE_ENDPOINT = "/_plugins/_search_relevance/llm_prompt_templates";

Member: This should use a constant from SearchRelevanceIndices.

@fen-qin
Collaborator

fen-qin commented Jul 7, 2025

@epugh @sstults
Hi, I have a couple of high-level questions on these changes:

  1. For different prompt templates, the output schema may vary. How will the response processor align with the selected prompt template? For example:
    * rating between 0 and 1 → rating = 0.8
    * rating of RELEVANT or IRRELEVANT → rating = RELEVANT
  2. Research shows word choice significantly impacts prompt effectiveness. What is the rationale for using customized prompt templates instead of having the search-relevance plugin provide pre-optimized prompts? To determine optimal prompts, we would need:
    * Performance benchmarking
    * Analysis of accuracy impacts per template
    Our primary goal should be maximizing search relevance accuracy.
  3. This interface change affects how users interact with the search-relevance plugin. As with any interface modification, a new security review will be required.

@sstults
Author

sstults commented Jul 8, 2025

Good questions, @fen-qin. I'll try to address them here and in the eventual documentation.

1. For different prompt templates, the output schema may vary. How will the response processor align with the selected prompt template? For example:

Right now we only support integer and decimal numbers for the rating. What the scale of those ratings means is delegated to the metrics. nDCG will always be between 0 and 1 regardless, but DCG would directly reflect whatever scale we use in our templates. This means that if we start off with a scale from 0 to 1 but later use a scale from 0 to 4, the history of our DCG scores will see a 4x jump for no reason other than the scale change.

We might consider adding an output schema spec to the template creation request that would enforce the type and scale of the output we expect. That might allow us to do categorical judgments as well (like Exact Match, Substitute, Complement, Irrelevant). I think our judgments are pretty agnostic about what values we record, but it would be good to know which judgments are appropriate to the metrics we use to summarize and track them.
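
To make that concrete, assuming the linear-gain form of DCG (rather than the exponential-gain variant):

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}

Multiplying every rel_i by 4 multiplies DCG@k by 4, while nDCG is unchanged because the ideal DCG scales by the same factor.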

2. Research shows word choice significantly impacts prompt effectiveness. What is the rationale for using customized prompt templates instead of having the search-relevance plugin provide pre-optimized prompts?

I'm hoping to work a template optimization process into 3.2 and 3.3. Regardless of whether we supply a set of pre-packaged templates, I think our users are going to want the capability to modify and add to them. In general, though, I'm pessimistic that we could predetermine what "optimal" means.

3. This interface change affects how users interact with the search-relevance plugin. As with any interface modification, a new security review will be required.

Should we convert this PR to a draft pending the review?

@martin-gaievski martin-gaievski changed the base branch from main to feature/prompt_templates_for_llm_judgments July 9, 2025 17:29
@martin-gaievski
Member

Thanks for bringing up valid points, @fen-qin. Because this PR adds a new API, it has to go through a security review.

@sstults, there's no need to convert this to a draft PR; I just changed the destination branch from main to a new feature branch, and that is sufficient. Please share the technical doc that covers the details of this change; it's needed for the security review. You can use the structure from RFCs the team has created in the past (ref1, ref2), or come up with your own format.

I can't find a link to a proper security review process for external contributors. I'll keep looking, but for now let's assume it's a two-stage process: you prepare the technical docs (the tech design, plus one more document that describes the threat model), and we take it from there and talk with the security experts.

@fen-qin
Collaborator

fen-qin commented Jul 11, 2025

I propose an alternative way to support customized prompt templates.

Instead of creating a database wrapper, the prompt template can be selectable or customizable during judgment creation.

  • Customers are allowed to select from a list of optimized prompt templates:
    • question answering prompt template
    • chain of conversation template
    • each of which has a corresponding output processor defined, such as:
      • JUDGMENT_RATING [0-5]
      • RELEVANCE_OR_IRRELEVANCE
      • A_IS_BETTER_THAN_B
  • Customers are allowed to have their own customized prompt template, but it needs to be paired with an existing output processor.

These parameters can come as one of the configurations for LLM-as-a-judge when making a judgment PUT API call.

@epugh
Collaborator

epugh commented Aug 8, 2025

During discussion today with @heemin32 and @smacrakis about how much effort to put toward getting this in, given the other items on the plate for 3.3, we decided that it made the most sense for @fen-qin to take the lead on evolving the LLM judgments and to use this PR as inspiration. @heemin32, can you follow up with @fen-qin?

@fen-qin Please don't hesitate to use the #search-revelevance-workbench team to bounce ideas off of, and we're all here to support getting this next step in LLM-as-a-Judge done.

@epugh epugh closed this Aug 8, 2025
@epugh epugh reopened this Sep 9, 2025
@epugh
Collaborator

epugh commented Sep 9, 2025

Realizing that closing it means folks won't see it. I was showing it to someone at lunch today who wanted LLM-as-a-judge in OpenSearch.
