SAK-52338 Search Refactor content producers to remove html in search results#14367
SAK-52338 Search Refactor content producers to remove html in search results#14367kunaljaykam wants to merge 1 commit intosakaiproject:masterfrom
Conversation
WalkthroughThis pull request refactors content extraction across multiple EntityContentProducer implementations by removing HTML parsing and cleaning logic, while simultaneously introducing HTML stripping during search indexing and highlighting in the Elasticsearch module. The changes shift HTML processing responsibility from individual content producers to centralized search indexing components. Changes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lessonbuilder/tool/src/java/org/sakaiproject/lessonbuildertool/service/LessonsEntityContentProducer.java (1)
154-168:⚠️ Potential issue | 🟠 Major
item.getHtml()may returnnull, causing downstream NPE.
SimplePageItem.getHtml()can returnnull. Previously, the HTMLParser intermediary likely handled this. Now, returningnullfromgetContentwill cause aNullPointerExceptioningetContentReader(Line 151:new StringReader(getContent(reference))) and potentially in the search indexer.Proposed fix
if (item != null) { - return item.getHtml(); + String html = item.getHtml(); + return html != null ? html : ""; }conversations/impl/src/main/java/org/sakaiproject/conversations/impl/ConversationsEntityContentProducerImpl.java (1)
256-277:⚠️ Potential issue | 🟡 Minor
getMessage()may returnnull, causing"null"to be indexed.Other content producers in this PR null-check before returning (e.g.,
AssignmentContentProducerline 108:instructions != null ? instructions : "";MessageForumsEntityContentProducerline 162:m.getBody() != null ? m.getBody() : ""). Here,StringBuilder.append(null)will append the literal string"null"into the indexed content.Proposed fix: guard against null messages
if (!topic.getDraft() && !topic.getHidden()) { - sb.append(topic.getMessage()); + String message = topic.getMessage(); + if (message != null) { + sb.append(message); + } topic.getTagIds().forEach(tagId -> { tagRepository.findById(tagId).ifPresent(t -> sb.append(" ").append(t.getLabel())); }); } break; case "p": ConversationsPost post = postRepository.findById(ref.getId()) .orElseThrow(() -> new IllegalArgumentException("Invalid post reference: " + reference)); if (!post.getDraft() && !post.getHidden()) { - sb.append(post.getMessage()); + String postMessage = post.getMessage(); + if (postMessage != null) { + sb.append(postMessage); + } } break; case "c": ConversationsComment comment = commentRepository.findById(ref.getId()) .orElseThrow(() -> new IllegalArgumentException("Invalid comment reference: " + reference)); - sb.append(comment.getMessage()); + String commentMessage = comment.getMessage(); + if (commentMessage != null) { + sb.append(commentMessage); + } break;
🤖 Fix all issues with AI agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`:
- Around line 150-169: The variable name "hightlighter" is misspelled—rename it
to "highlighter" in the ElasticSearchResult code block where you create the
Highlighter instance (currently: Highlighter hightlighter = new
Highlighter(...)) and update the subsequent call to
hightlighter.getBestFragments(...) to use highlighter.getBestFragments(...);
check the surrounding code that constructs the TermQuery, QueryScorer and uses
analyzer.tokenStream to ensure the renamed symbol is used consistently.
- Around line 170-174: The catch blocks in ElasticSearchResult.java that
currently return e.getMessage() for IOException and InvalidTokenOffsetsException
should not expose internal messages; update the handlers in the method inside
class ElasticSearchResult to log the caught exception via the class logger
(e.g., LOG.error or the existing logger) including context, and return a safe
fallback such as an empty string or a generic message (e.g., "") instead of
e.getMessage(); ensure both the IOException and InvalidTokenOffsetsException
catch blocks are changed consistently.
| // Try to highlight the search terms in the content | ||
| TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms)); | ||
| Scorer scorer = new QueryScorer(query); | ||
| Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer); | ||
|
|
||
| TokenStream tokenStream = analyzer.tokenStream( | ||
| SearchService.FIELD_CONTENTS, new StringReader(text)); | ||
| return hightlighter.getBestFragments(tokenStream, text, 5, " ... "); //$NON-NLS-1$ | ||
| String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... "); | ||
|
|
||
| // If highlighting didn't find matches, return a truncated version of the content | ||
| if (highlighted == null || highlighted.isEmpty()) { | ||
| // Return first 200 characters as a preview | ||
| int maxLength = 200; | ||
| if (text.length() <= maxLength) { | ||
| return text; | ||
| } | ||
| return text.substring(0, maxLength) + " ..."; | ||
| } | ||
|
|
||
| return highlighted; |
There was a problem hiding this comment.
Typo: hightlighter → highlighter.
Line 153 introduces a misspelled variable name in changed code.
Proposed fix
- Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
+ Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
TokenStream tokenStream = analyzer.tokenStream(
SearchService.FIELD_CONTENTS, new StringReader(text));
- String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");
+ String highlighted = highlighter.getBestFragments(tokenStream, text, 5, " ... ");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Try to highlight the search terms in the content | |
| TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms)); | |
| Scorer scorer = new QueryScorer(query); | |
| Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer); | |
| TokenStream tokenStream = analyzer.tokenStream( | |
| SearchService.FIELD_CONTENTS, new StringReader(text)); | |
| return hightlighter.getBestFragments(tokenStream, text, 5, " ... "); //$NON-NLS-1$ | |
| String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... "); | |
| // If highlighting didn't find matches, return a truncated version of the content | |
| if (highlighted == null || highlighted.isEmpty()) { | |
| // Return first 200 characters as a preview | |
| int maxLength = 200; | |
| if (text.length() <= maxLength) { | |
| return text; | |
| } | |
| return text.substring(0, maxLength) + " ..."; | |
| } | |
| return highlighted; | |
| // Try to highlight the search terms in the content | |
| TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms)); | |
| Scorer scorer = new QueryScorer(query); | |
| Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer); | |
| TokenStream tokenStream = analyzer.tokenStream( | |
| SearchService.FIELD_CONTENTS, new StringReader(text)); | |
| String highlighted = highlighter.getBestFragments(tokenStream, text, 5, " ... "); | |
| // If highlighting didn't find matches, return a truncated version of the content | |
| if (highlighted == null || highlighted.isEmpty()) { | |
| // Return first 200 characters as a preview | |
| int maxLength = 200; | |
| if (text.length() <= maxLength) { | |
| return text; | |
| } | |
| return text.substring(0, maxLength) + " ..."; | |
| } | |
| return highlighted; |
🤖 Prompt for AI Agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`
around lines 150 - 169, The variable name "hightlighter" is misspelled—rename it
to "highlighter" in the ElasticSearchResult code block where you create the
Highlighter instance (currently: Highlighter hightlighter = new
Highlighter(...)) and update the subsequent call to
hightlighter.getBestFragments(...) to use highlighter.getBestFragments(...);
check the surrounding code that constructs the TermQuery, QueryScorer and uses
analyzer.tokenStream to ensure the renamed symbol is used consistently.
| } catch (IOException e) { | ||
| return e.getMessage(); //$NON-NLS-1$ | ||
| return e.getMessage(); | ||
| } catch (InvalidTokenOffsetsException e) { | ||
| return e.getMessage(); | ||
| } |
There was a problem hiding this comment.
Returning e.getMessage() as search result content leaks internal details.
Both IOException and InvalidTokenOffsetsException handlers return the raw exception message to the caller. If this surfaces in the UI, it exposes implementation details to end users. Consider returning an empty string (or a generic fallback) and logging the exception instead.
Proposed fix
} catch (IOException e) {
- return e.getMessage();
+ log.error("Error generating search result for reference: {}", getReference(), e);
+ return "";
} catch (InvalidTokenOffsetsException e) {
- return e.getMessage();
+ log.error("Error highlighting search result for reference: {}", getReference(), e);
+ return "";
}🤖 Prompt for AI Agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`
around lines 170 - 174, The catch blocks in ElasticSearchResult.java that
currently return e.getMessage() for IOException and InvalidTokenOffsetsException
should not expose internal messages; update the handlers in the method inside
class ElasticSearchResult to log the caught exception via the class logger
(e.g., LOG.error or the existing logger) including context, and return a safe
fallback such as an empty string or a generic message (e.g., "") instead of
e.getMessage(); ensure both the IOException and InvalidTokenOffsetsException
catch blocks are changed consistently.
Summary by CodeRabbit
Release Notes
Bug Fixes
Refactor