
Conversation


@sstults sstults commented Jul 1, 2025

Description

The LLM judgment template functionality allows users to create and manage custom prompt templates for LLM-based search relevance evaluation. This feature provides a way to standardize and customize how LLMs evaluate the relevance of search results.

Key Components:

1. Template Management

  • LlmPromptTemplate Model: Stores template metadata including ID, name, description, template content, and timestamps
  • CRUD Operations: Full create, read, update, and delete operations via REST APIs (/_plugins/_search_relevance/llm_prompt_templates)
  • Persistent Storage: Templates are stored in OpenSearch indices with proper mappings

2. Template Structure

  • Variable Substitution: Templates support dynamic variables using {variableName} syntax (an example template appears after this list)

  • Supported Variables:

    • {searchText} - The search query text
    • {reference} - Reference answer (optional)
    • {hits} - JSON-formatted search results
  • Template Validation: Ensures only supported variables are used
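
For illustration, a template using the variables above might look like the following; the wording is only an example, not a default shipped with the plugin:

You are a search relevance judge. Rate how well each result satisfies the query.

Query: {searchText}
Reference answer (may be empty): {reference}
Search results (JSON): {hits}

Return a numeric rating between 0 and 1 for each result.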

3. Integration with LLM Judgments

  • Optional Template Usage: When creating LLM judgments, users can specify a templateId parameter
  • Fallback Behavior: If template is missing or invalid, the system falls back to default prompts
  • Variable Substitution: The TemplateUtils class handles dynamic replacement of template variables with actual values (a minimal sketch of this idea appears after this list)
  • Caching Integration: Template-based judgments are cached like standard judgments
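
To make the substitution concrete, here is a minimal sketch of the idea; the actual TemplateUtils implementation in this PR may differ in method names and error handling:

import java.util.Map;

public class TemplateSubstitutionSketch {

    // Hypothetical helper, not the actual TemplateUtils code:
    // replaces each {variableName} placeholder with its value.
    public static String substitute(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "Query: {searchText}\nReference: {reference}\nHits: {hits}";
        String prompt = substitute(template, Map.of(
            "searchText", "wireless headphones",
            "reference", "noise-cancelling over-ear headphones",
            "hits", "[{\"_id\": \"1\", \"title\": \"Bluetooth headphones\"}]"
        ));
        System.out.println(prompt);
    }
}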

4. Workflow

  1. User creates a custom prompt template with variables (both REST calls in this workflow are sketched after this list)
  2. During LLM judgment generation, the template is retrieved
  3. Variables are substituted with actual query text, reference answers, and search hits
  4. The customized prompt is sent to the LLM for evaluation
  5. Results are processed and cached for future use
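
For illustration, the two REST calls in this workflow might look roughly like the sketch below, using plain Java HTTP calls against a local cluster. The template endpoint is the one listed above, but the exact path shape, the request body fields, and the judgments endpoint are assumptions for illustration, not a verified contract:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LlmTemplateWorkflowSketch {

    private static final String BASE_URL = "http://localhost:9200";

    // Sends a PUT request with a JSON body and returns the raw response body.
    private static String put(HttpClient client, String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(BASE_URL + path))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Create a custom prompt template (body field names are illustrative,
        //    based on the model description above, not a verified contract).
        System.out.println(put(client,
            "/_plugins/_search_relevance/llm_prompt_templates/my_template",
            "{ \"name\": \"ecommerce-binary\","
            + " \"description\": \"Binary relevance for product search\","
            + " \"template\": \"Query: {searchText}\\nHits: {hits}\\nRate each hit 0 or 1.\" }"));

        // 2. Create an LLM judgment that references the template via templateId.
        //    The judgments path and the other body fields are assumptions for illustration.
        System.out.println(put(client,
            "/_plugins/_search_relevance/judgments",
            "{ \"type\": \"LLM_JUDGMENT\","
            + " \"templateId\": \"my_template\","
            + " \"querySetId\": \"my_query_set\","
            + " \"modelId\": \"my_model_id\" }"));
    }
}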

This functionality enables organizations to standardize their relevance evaluation criteria and customize prompts for specific domains or use cases while maintaining the automated nature of LLM-based search relevance assessment.

Issues Resolved

n/a

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

sstults added 4 commits July 1, 2025 08:23
- Add LlmPromptTemplate model with validation
- Implement CRUD operations for prompt templates
- Add REST endpoints for template management
- Include comprehensive test coverage
- Update plugin registration and indices

Signed-off-by: Scott Stults <[email protected]>
- Replace toLowerCase() with toLowerCase(Locale.ROOT) to comply with forbidden APIs
- Ensures consistent locale-independent string conversion
- All tests passing including LLM template and judgment functionality

Signed-off-by: Scott Stults <[email protected]>
- Add templateId parameter support to PutLlmJudgmentRequest
- Update REST API to accept templateId in LLM judgment requests
- Enhance transport action to pass templateId through metadata
- Add TEMPLATE_ID constant to PluginConstants
- Create comprehensive integration test for template workflow
- Complete end-to-end integration from REST API to LLM processor

This completes Phase 4 of the LLM Prompt Template integration,
enabling users to specify custom templates when creating LLM judgments.
The system now supports full template lifecycle management and usage.

Signed-off-by: Scott Stults <[email protected]>
- Fixed refresh operation to use specific index instead of global refresh
- Added error handling for refresh operations to prevent system index warnings
- All four previously failing tests now pass:
  * testLlmJudgmentWithMissingTemplate
  * testLlmJudgmentWithCustomTemplate
  * testLlmJudgmentWithoutTemplate
  * testLlmJudgmentTemplateVariableSubstitution

Signed-off-by: Scott Stults <[email protected]>
@sstults sstults marked this pull request as ready for review July 1, 2025 23:22
@epugh
Collaborator

epugh commented Jul 2, 2025

Let's not forget that we need to add Docs that demonstrate what a "good user prompt" would look like. Also, let's make sure to tag on to the "Advanced User Journey" blog post that @wrigleyDan is writing....

sstults added 3 commits July 3, 2025 06:01
- Extend OpenSearchTestCase instead of using standalone JUnit tests
- Remove @Test annotations (OpenSearch framework uses method naming conventions)
- Remove explicit JUnit imports
- All unit tests pass successfully (205+ tests)

Signed-off-by: Scott Stults <[email protected]>
- Update connector URLs from 127.0.0.1 to host.docker.internal for Docker compatibility
- Add host.docker.internal patterns to ML Commons trusted endpoints
- Configure ML Commons settings for private IP access
- Script now successfully connects OpenSearch (Docker) to Ollama (host)
- Test demonstrates working LLM judgment with accurate relevance ratings

Signed-off-by: Scott Stults <[email protected]>

*/
public static boolean validateTemplate(String template) {
    if (template == null || template.isEmpty()) {
        return true;

Member: This should be false, shouldn't it?

Author: Good point. I'm either going to rename this method or refactor how it's used (or both).

* @param template The template to validate
* @return true if template contains only supported variables, false otherwise
*/
public static boolean validateTemplate(String template) {

Member: Let's add validation for max length; otherwise this is a potential gap in system resiliency.

Author: Agreed. I'm uncertain about what we should do though, so your advice would be welcome.

I think what we should do is add a dynamic cluster setting to SearchRelevanceSettings with a default value. That is, I think it would be nice to let this be overridden at runtime and not require a restart to change. I don't know enough about multitenancy configurations to say whether a cluster or node setting is more appropriate though.

The harder question is whether this should be string length or token length. String length is easy to implement, but token length is more useful in the LLM world (for things like cost calculations). Token length isn't much harder to implement, but the token counts probably won't exactly match the target model's. It'll be close though, so that might be good enough.

Lastly, we want to set a default value that's not too high and not too low. In the last year the max token length in popular models has expanded by a couple orders of magnitude (like 2k to 100k). Maybe we just pick something medium like 10k?
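
To make the idea concrete, a rough sketch of such a setting might look like this (the key name, default, and minimum are placeholders, not proposed values):

import org.opensearch.common.settings.Setting;

public final class SearchRelevanceTemplateSettingsSketch {

    // Rough sketch only: key, default, and minimum are placeholders.
    public static final Setting<Integer> LLM_PROMPT_TEMPLATE_MAX_LENGTH = Setting.intSetting(
        "plugins.search_relevance.llm_prompt_template.max_length",
        10_000,   // default maximum length in characters
        1,        // minimum allowed value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    private SearchRelevanceTemplateSettingsSketch() {}
}

It would also need to be returned from the plugin's getSettings() so that cluster settings updates are accepted at runtime.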

@fen-qin or @heemin32, do either of you have opinions about these three questions?

public class PutLlmPromptTemplateRequest extends ActionRequest {

    private String templateId;
    private LlmPromptTemplate template;

Member: Can we add some basic metadata, like version and description?

*/
public class LlmPromptTemplateIT extends BaseSearchRelevanceIT {

    private static final String LLM_PROMPT_TEMPLATE_ENDPOINT = "/_plugins/_search_relevance/llm_prompt_templates";

Member: This should use a constant from SearchRelevanceIndices.

@fen-qin
Collaborator

fen-qin commented Jul 7, 2025

@epugh @sstults
Hi, I have a couple of high-level questions on these changes:

  1. For different prompt templates, the output schema may vary. How will the response processor align with the selected prompt template? For example:
    * rating between 0 and 1 → rating = 0.8
    * rating of RELEVANT or IRRELEVANT → rating = RELEVANT
  2. Research shows word choice significantly impacts prompt effectiveness. What is the rationale for using customized prompt templates instead of having the search-relevance plugin provide pre-optimized prompts? To determine optimal prompts, we would need:
    * Performance benchmarking
    * Analysis of accuracy impacts per template
    Our primary goal should be maximizing search relevance accuracy.
  3. This interface change affects how users interact with the search-relevance plugin. As with any interface modification, a new security review will be required.

@sstults
Author

sstults commented Jul 8, 2025

Good questions, @fen-qin. I'll try to address them here and in the eventual documentation.

1. For different prompt templates, the output schema may vary. How will the response processor align with the selected prompt template? For example:

Right now we only support integer and decimal numbers for the rating. What the scale of those ratings means is delegated to the metrics. nDCG will always be between 0 and 1 regardless, but DCG would directly reflect whatever scale we use in our templates. This means that if we start off with a scale from 0 to 1 but later use a scale from 0 to 4, the history of our DCG scores will see a 4x jump for no reason other than the scale change.

We might consider adding an output schema spec to the template creation request that would enforce the type and scale of the output we expect. That might allow us to do categorical judgments as well (like Exact Match, Substitute, Complement, Irrelevant). I think our judgments are pretty agnostic about what values we record, but it would be good to know which judgments are appropriate to the metrics we use to summarize and track them.
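
To make that concrete, assuming the linear-gain form of DCG (rather than the exponential-gain variant):

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}

Multiplying every rel_i by 4 multiplies DCG@k by 4, while nDCG is unchanged because the ideal DCG scales by the same factor.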

2. Research shows word choice significantly impacts prompt effectiveness. What is the rationale for using customized prompt templates instead of having the search-relevance plugin provide pre-optimized prompts?

I'm hoping to work a template optimization process into 3.2 and 3.3. Regardless of whether we supply a set of pre-packaged templates, I think our users are going to want the capability to modify and add to them. In general, though, I'm pessimistic that we could predetermine what "optimal" means.

3. This interface change affects how users interact with the search-relevance plugin. As with any interface modification, a new security review will be required.

Should we convert this PR to a draft pending the review?

@martin-gaievski martin-gaievski changed the base branch from main to feature/prompt_templates_for_llm_judgments July 9, 2025 17:29
@martin-gaievski
Member

Thanks for bringing up valid points, @fen-qin. Because this PR adds a new API, it has to go through a security review.

@sstults, there's no need to convert this to a draft PR; I just changed the destination branch from main to a new feature branch, and that is sufficient. Please share the technical doc that covers the details of this change; it's needed for the security review. You can use the structure from RFCs the team has created in the past (ref1, ref2), or come up with your own format.

I can't find a link to a proper security review process for external contributors. I'll keep looking, but for now let's assume it's a two-stage process: you prepare the technical docs (the tech design, plus one more document that describes the threat model), and we take it from there and talk with the security experts.

@fen-qin
Collaborator

fen-qin commented Jul 11, 2025

I propose an alternative way to support customized prompt templates.

Instead of creating a database wrapper, the prompt template can be selectable or customizable during judgment creation.

  • Customers are allowed to select from a list of optimized prompt templates:
    • question answering prompt template
    • chain of conversation template
    • each of which has a corresponding output processor defined, such as:
      • JUDGMENT_RATING [0-5]
      • RELEVANCE_OR_IRRELEVANCE
      • A_IS_BETTER_THAN_B
  • Customers are allowed to have their own customized prompt template, but it needs to be paired with an existing output processor.

These parameters can come as one of the configurations for LLM-as-a-judge when making a judgment PUT API call.

@epugh
Collaborator

epugh commented Aug 8, 2025

During discussion today with @heemin32 and @smacrakis about how much effort to put toward getting this in, given the other items on the plate for 3.3, we decided that it made the most sense for @fen-qin to take the lead on evolving the LLM judgments and to use this PR as inspiration. @heemin32, can you follow up with @fen-qin?

@fen-qin Please don't hesitate to use the #search-revelevance-workbench team to bounce ideas off of, and we're all here to support getting this next step in LLM-as-a-Judge done.

@epugh epugh closed this Aug 8, 2025
@epugh epugh reopened this Sep 9, 2025
@epugh
Collaborator

epugh commented Sep 9, 2025

Realizing that closing it means folks won't see it. I was showing it to someone at lunch today who wanted LLM-as-a-judge in OpenSearch.
