Skip to content

SAK-52338 Search Refactor content producers to remove html in search results#14367

Draft
kunaljaykam wants to merge 1 commit intosakaiproject:masterfrom
kunaljaykam:SAK-52338
Draft

SAK-52338 Search Refactor content producers to remove html in search results#14367
kunaljaykam wants to merge 1 commit intosakaiproject:masterfrom
kunaljaykam:SAK-52338

Conversation

@kunaljaykam
Copy link
Member

@kunaljaykam kunaljaykam commented Feb 7, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved search result highlighting and text extraction accuracy by refining how content is processed and displayed.
    • Simplified content indexing across assignment, discussion, lesson, wiki, test, and syllabus modules for more consistent search behavior.
    • Enhanced search performance by optimizing content cleanup before indexing.
  • Refactor

    • Streamlined internal content extraction logic to reduce redundant processing.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 7, 2026

Walkthrough

This pull request refactors content extraction across multiple EntityContentProducer implementations by removing HTML parsing and cleaning logic, while simultaneously introducing HTML stripping during search indexing and highlighting in the Elasticsearch module. The changes shift HTML processing responsibility from individual content producers to centralized search indexing components.

Changes

Cohort / File(s) Summary
ContentProducer Simplifications
assignment/impl/src/java/.../AssignmentContentProducer.java, conversations/impl/src/main/java/.../ConversationsEntityContentProducerImpl.java, lessonbuilder/tool/src/java/.../LessonsEntityContentProducer.java, message/search-adapters/impl/src/java/.../MessageContentProducer.java, rwiki/rwiki-impl/impl/src/java/.../RWikiEntityContentProducer.java, samigo/samigo-impl/src/java/.../ItemContentProducer.java, samigo/samigo-impl/src/java/.../PublishedItemContentProducer.java, syllabus/syllabus-impl/src/java/.../SyllabusContentProducer.java, search/search-impl/impl/src/java/.../SiteContentProducer.java
Simplified getContent methods to return raw/direct content without HTML parsing. Removed HTMLParser and Jsoup imports and processing loops. Titles are now excluded from content payloads.
MessageForums Refactoring
msgcntr/messageforums-component-impl/src/java/.../MessageForumsEntityContentProducer.java
Removed FormattedText dependency and simplified getContent to return only message body, eliminating author, title, and attachment handling logic.
MessageForums Configuration
msgcntr/messageforums-component-impl/src/webapp/WEB-INF/components.xml
Removed FormattedText bean dependency injection from ForumsAdaptor configuration.
Elasticsearch Search Enhancements
search/elasticsearch/impl/src/java/.../ElasticSearchResult.java
Refactored highlighting logic to fetch content via EntityContentProducer, strip HTML with HTMLParser, and highlight on FIELD_CONTENTS. Added ElasticSearchTermFrequency inner class implementing TermFrequency interface.
Elasticsearch Indexing
search/elasticsearch/util/src/java/.../BaseElasticSearchIndexBuilder.java
Added HTML stripping step during indexing by parsing entity content with HTMLParser and indexing cleaned text.
Elasticsearch Dependencies
search/elasticsearch/util/pom.xml
Added search-util dependency to support search utilities in Elasticsearch module.

Possibly related PRs

  • sakaiproject/sakai#13974: Modifies SyllabusContentProducer to refactor getContent behavior, building upon similar content extraction changes in this PR.

Suggested labels

do-not-close

Suggested reviewers

  • ottenhoff
  • ern
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the main objective: refactoring content producers to remove HTML from search results, which aligns with the systematic changes across multiple producer classes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lessonbuilder/tool/src/java/org/sakaiproject/lessonbuildertool/service/LessonsEntityContentProducer.java (1)

154-168: ⚠️ Potential issue | 🟠 Major

item.getHtml() may return null, causing downstream NPE.

SimplePageItem.getHtml() can return null. Previously, the HTMLParser intermediary likely handled this. Now, returning null from getContent will cause a NullPointerException in getContentReader (Line 151: new StringReader(getContent(reference))) and potentially in the search indexer.

Proposed fix
 		if (item != null) {
-			return item.getHtml();
+			String html = item.getHtml();
+			return html != null ? html : "";
         }
conversations/impl/src/main/java/org/sakaiproject/conversations/impl/ConversationsEntityContentProducerImpl.java (1)

256-277: ⚠️ Potential issue | 🟡 Minor

getMessage() may return null, causing "null" to be indexed.

Other content producers in this PR null-check before returning (e.g., AssignmentContentProducer line 108: instructions != null ? instructions : ""; MessageForumsEntityContentProducer line 162: m.getBody() != null ? m.getBody() : ""). Here, StringBuilder.append(null) will append the literal string "null" into the indexed content.

Proposed fix: guard against null messages
                if (!topic.getDraft() && !topic.getHidden()) {
-                    sb.append(topic.getMessage());
+                    String message = topic.getMessage();
+                    if (message != null) {
+                        sb.append(message);
+                    }
                    topic.getTagIds().forEach(tagId -> {
                        tagRepository.findById(tagId).ifPresent(t -> sb.append(" ").append(t.getLabel()));
                    });
                }
                break;
            case "p":
                ConversationsPost post = postRepository.findById(ref.getId())
                    .orElseThrow(() -> new IllegalArgumentException("Invalid post reference: " + reference));
                if (!post.getDraft() && !post.getHidden()) {
-                    sb.append(post.getMessage());
+                    String postMessage = post.getMessage();
+                    if (postMessage != null) {
+                        sb.append(postMessage);
+                    }
                }
                break;
            case "c":
                ConversationsComment comment = commentRepository.findById(ref.getId())
                    .orElseThrow(() -> new IllegalArgumentException("Invalid comment reference: " + reference));
-                sb.append(comment.getMessage());
+                String commentMessage = comment.getMessage();
+                if (commentMessage != null) {
+                    sb.append(commentMessage);
+                }
                break;
🤖 Fix all issues with AI agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`:
- Around line 150-169: The variable name "hightlighter" is misspelled—rename it
to "highlighter" in the ElasticSearchResult code block where you create the
Highlighter instance (currently: Highlighter hightlighter = new
Highlighter(...)) and update the subsequent call to
hightlighter.getBestFragments(...) to use highlighter.getBestFragments(...);
check the surrounding code that constructs the TermQuery, QueryScorer and uses
analyzer.tokenStream to ensure the renamed symbol is used consistently.
- Around line 170-174: The catch blocks in ElasticSearchResult.java that
currently return e.getMessage() for IOException and InvalidTokenOffsetsException
should not expose internal messages; update the handlers in the method inside
class ElasticSearchResult to log the caught exception via the class logger
(e.g., LOG.error or the existing logger) including context, and return a safe
fallback such as an empty string or a generic message (e.g., "") instead of
e.getMessage(); ensure both the IOException and InvalidTokenOffsetsException
catch blocks are changed consistently.

Comment on lines +150 to +169
// Try to highlight the search terms in the content
TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms));
Scorer scorer = new QueryScorer(query);
Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);

TokenStream tokenStream = analyzer.tokenStream(
SearchService.FIELD_CONTENTS, new StringReader(text));
return hightlighter.getBestFragments(tokenStream, text, 5, " ... "); //$NON-NLS-1$
String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");

// If highlighting didn't find matches, return a truncated version of the content
if (highlighted == null || highlighted.isEmpty()) {
// Return first 200 characters as a preview
int maxLength = 200;
if (text.length() <= maxLength) {
return text;
}
return text.substring(0, maxLength) + " ...";
}

return highlighted;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo: hightlighterhighlighter.

Line 153 introduces a misspelled variable name in changed code.

Proposed fix
-            Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
+            Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
             
             TokenStream tokenStream = analyzer.tokenStream(
                     SearchService.FIELD_CONTENTS, new StringReader(text));
-            String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");
+            String highlighted = highlighter.getBestFragments(tokenStream, text, 5, " ... ");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Try to highlight the search terms in the content
TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms));
Scorer scorer = new QueryScorer(query);
Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
TokenStream tokenStream = analyzer.tokenStream(
SearchService.FIELD_CONTENTS, new StringReader(text));
return hightlighter.getBestFragments(tokenStream, text, 5, " ... "); //$NON-NLS-1$
String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");
// If highlighting didn't find matches, return a truncated version of the content
if (highlighted == null || highlighted.isEmpty()) {
// Return first 200 characters as a preview
int maxLength = 200;
if (text.length() <= maxLength) {
return text;
}
return text.substring(0, maxLength) + " ...";
}
return highlighted;
// Try to highlight the search terms in the content
TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms));
Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
TokenStream tokenStream = analyzer.tokenStream(
SearchService.FIELD_CONTENTS, new StringReader(text));
String highlighted = highlighter.getBestFragments(tokenStream, text, 5, " ... ");
// If highlighting didn't find matches, return a truncated version of the content
if (highlighted == null || highlighted.isEmpty()) {
// Return first 200 characters as a preview
int maxLength = 200;
if (text.length() <= maxLength) {
return text;
}
return text.substring(0, maxLength) + " ...";
}
return highlighted;
🤖 Prompt for AI Agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`
around lines 150 - 169, The variable name "hightlighter" is misspelled—rename it
to "highlighter" in the ElasticSearchResult code block where you create the
Highlighter instance (currently: Highlighter hightlighter = new
Highlighter(...)) and update the subsequent call to
hightlighter.getBestFragments(...) to use highlighter.getBestFragments(...);
check the surrounding code that constructs the TermQuery, QueryScorer and uses
analyzer.tokenStream to ensure the renamed symbol is used consistently.

Comment on lines 170 to 174
} catch (IOException e) {
return e.getMessage(); //$NON-NLS-1$
return e.getMessage();
} catch (InvalidTokenOffsetsException e) {
return e.getMessage();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Returning e.getMessage() as search result content leaks internal details.

Both IOException and InvalidTokenOffsetsException handlers return the raw exception message to the caller. If this surfaces in the UI, it exposes implementation details to end users. Consider returning an empty string (or a generic fallback) and logging the exception instead.

Proposed fix
         } catch (IOException e) {
-            return e.getMessage();
+            log.error("Error generating search result for reference: {}", getReference(), e);
+            return "";
         } catch (InvalidTokenOffsetsException e) {
-            return e.getMessage();
+            log.error("Error highlighting search result for reference: {}", getReference(), e);
+            return "";
         }
🤖 Prompt for AI Agents
In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`
around lines 170 - 174, The catch blocks in ElasticSearchResult.java that
currently return e.getMessage() for IOException and InvalidTokenOffsetsException
should not expose internal messages; update the handlers in the method inside
class ElasticSearchResult to log the caught exception via the class logger
(e.g., LOG.error or the existing logger) including context, and return a safe
fallback such as an empty string or a generic message (e.g., "") instead of
e.getMessage(); ensure both the IOException and InvalidTokenOffsetsException
catch blocks are changed consistently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant