SAK-52338 Search Refactor content producers to remove html in search results by kunaljaykam · Pull Request #14367 · sakaiproject/sakai

kunaljaykam · 2026-02-07T07:45:33Z

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved search result highlighting and text extraction accuracy by refining how content is processed and displayed.
- Simplified content indexing across assignment, discussion, lesson, wiki, test, and syllabus modules for more consistent search behavior.
- Enhanced search performance by optimizing content cleanup before indexing.

Refactor
- Streamlined internal content extraction logic to reduce redundant processing.


coderabbitai · 2026-02-07T08:03:18Z

Walkthrough

This pull request refactors content extraction across multiple EntityContentProducer implementations by removing HTML parsing and cleaning logic, while simultaneously introducing HTML stripping during search indexing and highlighting in the Elasticsearch module. The changes shift HTML processing responsibility from individual content producers to centralized search indexing components.

Changes

| Cohort / File(s) | Summary |
|---|---|
| ContentProducer Simplifications `assignment/impl/src/java/.../AssignmentContentProducer.java`, `conversations/impl/src/main/java/.../ConversationsEntityContentProducerImpl.java`, `lessonbuilder/tool/src/java/.../LessonsEntityContentProducer.java`, `message/search-adapters/impl/src/java/.../MessageContentProducer.java`, `rwiki/rwiki-impl/impl/src/java/.../RWikiEntityContentProducer.java`, `samigo/samigo-impl/src/java/.../ItemContentProducer.java`, `samigo/samigo-impl/src/java/.../PublishedItemContentProducer.java`, `syllabus/syllabus-impl/src/java/.../SyllabusContentProducer.java`, `search/search-impl/impl/src/java/.../SiteContentProducer.java` | Simplified getContent methods to return raw/direct content without HTML parsing. Removed HTMLParser and Jsoup imports and processing loops. Titles are now excluded from content payloads. |
| MessageForums Refactoring `msgcntr/messageforums-component-impl/src/java/.../MessageForumsEntityContentProducer.java` | Removed FormattedText dependency and simplified getContent to return only message body, eliminating author, title, and attachment handling logic. |
| MessageForums Configuration `msgcntr/messageforums-component-impl/src/webapp/WEB-INF/components.xml` | Removed FormattedText bean dependency injection from ForumsAdaptor configuration. |
| Elasticsearch Search Enhancements `search/elasticsearch/impl/src/java/.../ElasticSearchResult.java` | Refactored highlighting logic to fetch content via EntityContentProducer, strip HTML with HTMLParser, and highlight on FIELD_CONTENTS. Added ElasticSearchTermFrequency inner class implementing TermFrequency interface. |
| Elasticsearch Indexing `search/elasticsearch/util/src/java/.../BaseElasticSearchIndexBuilder.java` | Added HTML stripping step during indexing by parsing entity content with HTMLParser and indexing cleaned text. |
| Elasticsearch Dependencies `search/elasticsearch/util/pom.xml` | Added search-util dependency to support search utilities in Elasticsearch module. |
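
To make the centralized stripping step concrete, here is a minimal sketch of turning raw producer output into indexable plain text. It is not the PR's code: the PR uses Sakai's HTMLParser, whose API is not shown here, so a simple regex stands in for it. The class name `HtmlStripSketch` and the method `toIndexableText` are hypothetical.

```java
import java.util.regex.Pattern;

/**
 * Illustrative only: a regex-based stand-in for the HTML stripping step.
 * The PR relies on Sakai's HTMLParser; every name in this class is hypothetical.
 */
final class HtmlStripSketch {
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns plain text for indexing; null input becomes an empty string. */
    static String toIndexableText(String rawContent) {
        if (rawContent == null) {
            return "";
        }
        // Replace each tag with a space so adjacent words do not merge.
        String withoutTags = TAGS.matcher(rawContent).replaceAll(" ");
        // Collapse the runs of whitespace left behind by removed tags.
        // Note: this sketch does not decode HTML entities such as &amp;.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toIndexableText("<p>Read <b>chapter 3</b> before class.</p>"));
    }
}
```

With stripping done in one place, each content producer can return its stored markup unchanged and the indexer decides how to clean it.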

Possibly related PRs

sakaiproject/sakai#13974: Modifies SyllabusContentProducer to refactor getContent behavior, building upon similar content extraction changes in this PR.

Suggested labels

do-not-close

Suggested reviewers

ottenhoff
ern

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main objective: refactoring content producers to remove HTML from search results, which aligns with the systematic changes across multiple producer classes. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

lessonbuilder/tool/src/java/org/sakaiproject/lessonbuildertool/service/LessonsEntityContentProducer.java (1)

154-168: ⚠️ Potential issue | 🟠 Major

item.getHtml() may return null, causing downstream NPE.

SimplePageItem.getHtml() can return null. Previously, the HTMLParser intermediary likely handled this. Now, returning null from getContent will cause a NullPointerException in getContentReader (Line 151: new StringReader(getContent(reference))) and potentially in the search indexer.
Proposed fix
 		if (item != null) {
-			return item.getHtml();
+			String html = item.getHtml();
+			return html != null ? html : "";
 		}
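
The failure mode described above can be sketched in isolation. The interface `HtmlSource` below is a hypothetical stand-in for `SimplePageItem`, and the class is an assumption for illustration rather than the project's code; the only library fact relied on is that `new StringReader(null)` throws a `NullPointerException`.

```java
import java.io.StringReader;

/** Illustrative sketch; HtmlSource is a hypothetical stand-in for SimplePageItem. */
final class NullSafeContentSketch {
    interface HtmlSource {
        String getHtml(); // may return null, as the review notes for getHtml()
    }

    /** Never returns null, so callers can wrap the result safely. */
    static String getContent(HtmlSource item) {
        if (item == null) {
            return "";
        }
        String html = item.getHtml();
        return html != null ? html : "";
    }

    /** Safe only because getContent guarantees a non-null String. */
    static StringReader getContentReader(HtmlSource item) {
        return new StringReader(getContent(item));
    }

    public static void main(String[] args) {
        // An item whose HTML is null yields empty content instead of an NPE.
        System.out.println("[" + getContent(() -> null) + "]");
    }
}
```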

conversations/impl/src/main/java/org/sakaiproject/conversations/impl/ConversationsEntityContentProducerImpl.java (1)

256-277: ⚠️ Potential issue | 🟡 Minor

getMessage() may return null, causing "null" to be indexed.

Other content producers in this PR null-check before returning (e.g., AssignmentContentProducer line 108: instructions != null ? instructions : ""; MessageForumsEntityContentProducer line 162: m.getBody() != null ? m.getBody() : ""). Here, StringBuilder.append(null) will append the literal string "null" into the indexed content.

Proposed fix: guard against null messages

                if (!topic.getDraft() && !topic.getHidden()) {
-                    sb.append(topic.getMessage());
+                    String message = topic.getMessage();
+                    if (message != null) {
+                        sb.append(message);
+                    }
                    topic.getTagIds().forEach(tagId -> {
                        tagRepository.findById(tagId).ifPresent(t -> sb.append(" ").append(t.getLabel()));
                    });
                }
                break;
            case "p":
                ConversationsPost post = postRepository.findById(ref.getId())
                    .orElseThrow(() -> new IllegalArgumentException("Invalid post reference: " + reference));
                if (!post.getDraft() && !post.getHidden()) {
-                    sb.append(post.getMessage());
+                    String postMessage = post.getMessage();
+                    if (postMessage != null) {
+                        sb.append(postMessage);
+                    }
                }
                break;
            case "c":
                ConversationsComment comment = commentRepository.findById(ref.getId())
                    .orElseThrow(() -> new IllegalArgumentException("Invalid comment reference: " + reference));
-                sb.append(comment.getMessage());
+                String commentMessage = comment.getMessage();
+                if (commentMessage != null) {
+                    sb.append(commentMessage);
+                }
                break;
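
The behavior behind this comment is standard Java: `StringBuilder.append` turns a null `String` into the four characters `null`. The sketch below shows the unguarded and guarded variants side by side; the class and method names are hypothetical and the tag handling is simplified compared with the producer.

```java
/** Illustrative sketch of why a null message ends up indexed as "null". */
final class AppendNullSketch {
    /** Mirrors the unguarded pattern: a null message is appended as "null". */
    static String joinUnsafe(String message, String tagLabel) {
        StringBuilder sb = new StringBuilder();
        sb.append(message);
        sb.append(" ").append(tagLabel);
        return sb.toString();
    }

    /** Mirrors the proposed fix: skip the message when it is null. */
    static String joinGuarded(String message, String tagLabel) {
        StringBuilder sb = new StringBuilder();
        if (message != null) {
            sb.append(message);
        }
        sb.append(" ").append(tagLabel);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinUnsafe(null, "exam"));  // prints "null exam"
        System.out.println(joinGuarded(null, "exam")); // prints " exam"
    }
}
```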

🤖 Fix all issues with AI agents

In
`@search/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java`:
- Around line 150-169: The variable name "hightlighter" is misspelled—rename it
to "highlighter" in the ElasticSearchResult code block where you create the
Highlighter instance (currently: Highlighter hightlighter = new
Highlighter(...)) and update the subsequent call to
hightlighter.getBestFragments(...) to use highlighter.getBestFragments(...);
check the surrounding code that constructs the TermQuery, QueryScorer and uses
analyzer.tokenStream to ensure the renamed symbol is used consistently.
- Around line 170-174: The catch blocks in ElasticSearchResult.java that
currently return e.getMessage() for IOException and InvalidTokenOffsetsException
should not expose internal messages; update the handlers in the method inside
class ElasticSearchResult to log the caught exception via the class logger
(e.g., LOG.error or the existing logger) including context, and return a safe
fallback such as an empty string or a generic message (e.g., "") instead of
e.getMessage(); ensure both the IOException and InvalidTokenOffsetsException
catch blocks are changed consistently.

coderabbitai · 2026-02-07T08:03:21Z

...h/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java

+            // Try to highlight the search terms in the content
+            TermQuery query = new TermQuery(new Term(SearchService.FIELD_CONTENTS, searchTerms));
+            Scorer scorer = new QueryScorer(query);
+            Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
+
            TokenStream tokenStream = analyzer.tokenStream(
                    SearchService.FIELD_CONTENTS, new StringReader(text));
-            return hightlighter.getBestFragments(tokenStream, text, 5, " ... "); //$NON-NLS-1$
+            String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");
+
+            // If highlighting didn't find matches, return a truncated version of the content
+            if (highlighted == null || highlighted.isEmpty()) {
+                // Return first 200 characters as a preview
+                int maxLength = 200;
+                if (text.length() <= maxLength) {
+                    return text;
+                }
+                return text.substring(0, maxLength) + " ...";
+            }
+
+            return highlighted;


⚠️ Potential issue | 🟡 Minor

Typo: hightlighter → highlighter.

Line 153 introduces a misspelled variable name in changed code.

Proposed fix

-            Highlighter hightlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);
+            Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new SimpleHTMLEncoder(), scorer);

             TokenStream tokenStream = analyzer.tokenStream(
                     SearchService.FIELD_CONTENTS, new StringReader(text));
-            String highlighted = hightlighter.getBestFragments(tokenStream, text, 5, " ... ");
+            String highlighted = highlighter.getBestFragments(tokenStream, text, 5, " ... ");
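
Independent of the rename, the preview fallback in the diff above can be read as a small pure function. The sketch below isolates it under that assumption; `PreviewSketch` and `previewOrHighlight` are hypothetical names, and the Lucene highlighting call is replaced by a plain `highlighted` argument.

```java
/** Illustrative sketch of the fallback: use the highlight if present, else a preview. */
final class PreviewSketch {
    static final int MAX_PREVIEW_LENGTH = 200;

    /**
     * @param highlighted result of the highlighter, possibly null or empty
     * @param text        the plain-text content, assumed non-null here
     */
    static String previewOrHighlight(String highlighted, String text) {
        if (highlighted != null && !highlighted.isEmpty()) {
            return highlighted;
        }
        // No matches were highlighted: fall back to the start of the content.
        if (text.length() <= MAX_PREVIEW_LENGTH) {
            return text;
        }
        return text.substring(0, MAX_PREVIEW_LENGTH) + " ...";
    }

    public static void main(String[] args) {
        System.out.println(previewOrHighlight("", "short content"));
    }
}
```

Keeping the fallback separate from the highlighter call makes the 200-character limit easy to unit test without an analyzer or index.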



coderabbitai · 2026-02-07T08:03:21Z

...h/elasticsearch/impl/src/java/org/sakaiproject/search/elasticsearch/ElasticSearchResult.java

        } catch (IOException e) {
-            return e.getMessage(); //$NON-NLS-1$
+            return e.getMessage();
        } catch (InvalidTokenOffsetsException e) {
            return e.getMessage();
        }


⚠️ Potential issue | 🟡 Minor

Returning e.getMessage() as search result content leaks internal details.

Both IOException and InvalidTokenOffsetsException handlers return the raw exception message to the caller. If this surfaces in the UI, it exposes implementation details to end users. Consider returning an empty string (or a generic fallback) and logging the exception instead.

Proposed fix

         } catch (IOException e) {
-            return e.getMessage();
+            log.error("Error generating search result for reference: {}", getReference(), e);
+            return "";
         } catch (InvalidTokenOffsetsException e) {
-            return e.getMessage();
+            log.error("Error highlighting search result for reference: {}", getReference(), e);
+            return "";
         }
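
The log-and-fall-back pattern suggested here can be sketched without Sakai or Lucene on the classpath. This version uses `java.util.logging` purely so it is self-contained; the project's own logger would be used in practice. `SafeFallbackSketch` and `FragmentSource` are hypothetical names standing in for the result class and the highlighting call.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative sketch: log the failure with context, return a safe fallback. */
final class SafeFallbackSketch {
    private static final Logger LOG = Logger.getLogger(SafeFallbackSketch.class.getName());

    /** Hypothetical stand-in for the highlighting work that may fail. */
    interface FragmentSource {
        String bestFragments() throws IOException;
    }

    static String searchResultText(String reference, FragmentSource source) {
        try {
            return source.bestFragments();
        } catch (IOException e) {
            // The exception detail goes to the log, never to the caller or the UI.
            LOG.log(Level.SEVERE, "Error generating search result for reference: " + reference, e);
            return "";
        }
    }

    public static void main(String[] args) {
        String shown = searchResultText("/hypothetical/ref", () -> {
            throw new IOException("internal detail that should not reach users");
        });
        System.out.println("[" + shown + "]");
    }
}
```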


Commit 0fe6240: SAK-52338 Search Refactor content producers to remove html in search results

kunaljaykam wants to merge 1 commit into sakaiproject:master from kunaljaykam:SAK-52338
