base: feature/prompt_templates_for_llm_judgments
Add prompt templates for LLM judgments #163
Conversation
- Add LlmPromptTemplate model with validation
- Implement CRUD operations for prompt templates
- Add REST endpoints for template management
- Include comprehensive test coverage
- Update plugin registration and indices

Signed-off-by: Scott Stults <[email protected]>

- Replace toLowerCase() with toLowerCase(Locale.ROOT) to comply with forbidden APIs
- Ensures consistent locale-independent string conversion
- All tests passing including LLM template and judgment functionality

Signed-off-by: Scott Stults <[email protected]>

- Add templateId parameter support to PutLlmJudgmentRequest
- Update REST API to accept templateId in LLM judgment requests
- Enhance transport action to pass templateId through metadata
- Add TEMPLATE_ID constant to PluginConstants
- Create comprehensive integration test for template workflow
- Complete end-to-end integration from REST API to LLM processor

This completes Phase 4 of the LLM Prompt Template integration, enabling users to specify custom templates when creating LLM judgments. The system now supports full template lifecycle management and usage.

Signed-off-by: Scott Stults <[email protected]>

- Fixed refresh operation to use specific index instead of global refresh
- Added error handling for refresh operations to prevent system index warnings
- All four previously failing tests now pass:
  * testLlmJudgmentWithMissingTemplate
  * testLlmJudgmentWithCustomTemplate
  * testLlmJudgmentWithoutTemplate
  * testLlmJudgmentTemplateVariableSubstitution

Signed-off-by: Scott Stults <[email protected]>
Let's not forget that we need to add docs that demonstrate what a "good user prompt" would look like. Also, let's make sure to tag on to the "Advanced User Journey" blog post that @wrigleyDan is writing...
- Extend OpenSearchTestCase instead of using standalone JUnit tests
- Remove @Test annotations (OpenSearch framework uses method naming conventions)
- Remove explicit JUnit imports
- All unit tests pass successfully (205+ tests)

Signed-off-by: Scott Stults <[email protected]>

- Update connector URLs from 127.0.0.1 to host.docker.internal for Docker compatibility
- Add host.docker.internal patterns to ML Commons trusted endpoints
- Configure ML Commons settings for private IP access
- Script now successfully connects OpenSearch (Docker) to Ollama (host)
- Test demonstrates working LLM judgment with accurate relevance ratings

Signed-off-by: Scott Stults <[email protected]>
…sTests Signed-off-by: Scott Stults <[email protected]>
 */
public static boolean validateTemplate(String template) {
    if (template == null || template.isEmpty()) {
        return true;
this should be false, isn't it?
Good point. I'm either going to rename this method or refactor how it's used (or both).
 * @param template The template to validate
 * @return true if template contains only supported variables, false otherwise
 */
public static boolean validateTemplate(String template) {
let's add validation for max length, otherwise this is a potential gap in system resiliency
Agreed. I'm uncertain about what we should do though so your advice would be welcome.
I think what we should do is add a dynamic cluster setting to SearchRelevanceSettings with a default value. That is, I think it would be nice to let this be overridden at runtime rather than requiring a restart to change. I don't know enough about multitenancy configurations to say whether a cluster or node setting is more appropriate, though.
The harder question is: should this be string length or token length? String length is easy to implement, but token length is more useful in the LLM world (for things like cost calculations). Token length isn't too much different, but the token counts probably won't exactly match the target model's. It'll be close, though, so that might be good enough.
Lastly, we want to set a default value that's not too high and not too low. In the last year the max token length in popular models has expanded by a couple orders of magnitude (like 2k to 100k). Maybe we just pick something medium like 10k?
@fen-qin or @heemin32, do either of you have opinions about these three questions?
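For illustration, a dynamic cluster setting along those lines might look like the sketch below. The setting key, the character-length (rather than token-length) limit, and the 10k default are assumptions for discussion, not part of this PR.

```java
import org.opensearch.common.settings.Setting;

public class SearchRelevanceSettings {

    // Dynamic: can be changed at runtime via the cluster settings API, no restart needed.
    // NodeScope: registered as a node-level setting that applies cluster-wide once published.
    public static final Setting<Integer> LLM_PROMPT_TEMPLATE_MAX_LENGTH = Setting.intSetting(
        "plugins.search_relevance.llm_prompt_template.max_length",
        10_000, // default: "something medium like 10k" characters
        1,      // minimum allowed value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );
}
```

The length check in validateTemplate (or whatever replaces it) could then read the current value from the cluster settings at request time.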
public class PutLlmPromptTemplateRequest extends ActionRequest {

    private String templateId;
    private LlmPromptTemplate template;
can we add some basic metadata, like version and description
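For example, something along these lines on the template model; this is only a sketch, and the field names are assumptions rather than the PR's actual fields:

```java
// Sketch of how version and description metadata might be added to the template model.
public class LlmPromptTemplate {
    private String template;    // prompt text with {variable} placeholders

    // suggested metadata additions (hypothetical)
    private String description; // human-readable summary of the template's purpose
    private long version;       // incremented on each update, useful for change tracking
}
```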
 */
public class LlmPromptTemplateIT extends BaseSearchRelevanceIT {

    private static final String LLM_PROMPT_TEMPLATE_ENDPOINT = "/_plugins/_search_relevance/llm_prompt_templates";
this should use a constant from SearchRelevanceIndices
Signed-off-by: Scott Stults <[email protected]>
@epugh @sstults
Good questions, @fen-qin. I'll try to address them here and in the eventual documentation.
Right now we only support integer and decimal numbers for the rating. What the scale of those ratings means is delegated to the metrics. nDCG will always be between 0 and 1 regardless, but DCG would directly reflect whatever scale we use in our templates. This means that if we start off with a scale from 0 to 1 but later use a scale from 0 to 4, the history of our DCG scores will see a 4x jump for no reason other than the scale change. We might consider adding an output schema spec to the template creation request that would enforce the type and scale of the output we expect. That might also allow us to do categorical judgments (like Exact Match, Substitute, Complement, Irrelevant). I think our judgments are pretty agnostic about what values we record, but it would be good to know which judgments are appropriate for the metrics we use to summarize and track them.
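To make the output schema idea concrete, it might look something like the sketch below; the class and field names are assumptions for discussion, not anything in this PR:

```java
import java.util.List;

// Hypothetical sketch of an output schema spec attached to a template at creation time.
public class JudgmentOutputSchema {

    public enum OutputType { NUMERIC, CATEGORICAL }

    private OutputType type;

    // For NUMERIC judgments: the expected rating scale, e.g. 0.0 to 4.0.
    private double minRating;
    private double maxRating;

    // For CATEGORICAL judgments: the allowed labels,
    // e.g. ["Exact", "Substitute", "Complement", "Irrelevant"].
    private List<String> labels;
}
```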
I'm hoping to work a template optimization process into 3.2 and 3.3. Regardless of whether we supply a set of pre-packaged templates, I think our users are going to want the capability to modify and add to them. In general, though, I'm pessimistic that we could predetermine what "optimal" means.
Should we convert this PR to a draft pending the review?
Thanks for bringing up valid points, @fen-qin. Because this PR adds a new API, it has to go through the security review. @sstults, no need to convert to a draft PR; I just changed the destination branch.
I can't find a link to a proper security review process for external contributors. I'll keep looking, but as of now let's assume it's a two-stage process: you prep the technical docs (a tech design plus one more document that describes the threat model), and we take it from there and talk with security experts.
I propose an alternative way to support customized prompt templates. Instead of creating a database wrapper, the prompt template could be selectable or customizable during judgment creation.
These parameters could come in as one of the configurations for LLM-as-a-judge when making a judgment PUT API call.
During discussion today with @heemin32 and @smacrakis about how much effort to put toward getting this in, given the other items on the plate for 3.3, we decided that it made the most sense for @fen-qin to take the lead on evolving the LLM judgments and use this PR as inspiration. @heemin32, can you follow up with @fen-qin? @fen-qin, please don't hesitate to use the #search-relevance-workbench team to bounce ideas off of; we're all here to support getting this next step in LLM-as-a-Judge done.
Realizing that closing it means folks won't see it. I was showing it to someone who wanted LLM-as-a-judge in OS at lunch today.
Description
The LLM judgment template functionality allows users to create and manage custom prompt templates for LLM-based search relevance evaluation. This feature provides a way to standardize and customize how LLMs evaluate the relevance of search results.
Key Components:
1. Template Management

CRUD operations for prompt templates via REST endpoints (/_plugins/_search_relevance/llm_prompt_templates)

2. Template Structure
Variable Substitution: Templates support dynamic variables using {variableName} syntax

Supported Variables:
- {searchText} - The search query text
- {reference} - Reference answer (optional)
- {hits} - JSON-formatted search results

Template Validation: Ensures only supported variables are used
3. Integration with LLM Judgments

LLM judgment requests accept a templateId parameter, and the TemplateUtils class handles dynamic replacement of template variables with actual values (see the illustrative sketch below)

4. Workflow
This functionality enables organizations to standardize their relevance evaluation criteria and customize prompts for specific domains or use cases while maintaining the automated nature of LLM-based search relevance assessment.
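As a rough illustration of the variable substitution described above, a minimal sketch might look like the following. This is not the plugin's actual TemplateUtils implementation, and the sample prompt is only one possible example of a user-supplied template:

```java
import java.util.Map;

// Minimal sketch of {variable} substitution; not the actual TemplateUtils code.
public class TemplateSubstitutionSketch {

    static String render(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Example of what a user prompt template could look like.
        String template = "Rate the relevance of each result to the query \"{searchText}\" "
            + "on a scale from 0 (irrelevant) to 3 (exactly relevant). "
            + "Reference answer (may be empty): {reference}\n"
            + "Results as JSON: {hits}";

        String prompt = render(template, Map.of(
            "searchText", "wireless noise cancelling headphones",
            "reference", "",
            "hits", "[{\"_id\":\"1\",\"title\":\"Bluetooth ANC Headphones\"}]"
        ));
        System.out.println(prompt);
    }
}
```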
Issues Resolved
n/a
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.