fix(crawler): escape + character in MIME type regex patterns #3035

Merged: marevol merged 1 commit into master from fix/crawler-mime-type-regex-escaping on Feb 4, 2026
marevol (Contributor) commented Feb 4, 2026

Summary

This PR extends the regex pattern fix from #3032 to escape the + character in additional MIME type patterns:

  • crawler/rule.xml: Escape + in application/xhtml+xml and application/rdf+xml patterns for both webFileRule and fsFileRule
  • fess_thumbnail.xml: Fix double-escaped \\+ to single-escaped \+ for image/svg+xml pattern

Changes Made

  • Escape + character in application/xhtml+xml regex pattern (webFileRule, fsFileRule)
  • Escape + character in application/rdf+xml regex pattern (webFileRule, fsFileRule)
  • Fix over-escaped image/svg\\+xml to image/svg\+xml in thumbnail generator config
  • Add comprehensive unit tests for MIME type pattern matching in CrawlerRuleMimeTypePatternTest.java
  • Add additional tests for SVG/XHTML/RDF MIME type handling in BaseThumbnailGeneratorTest.java
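The over-escaping fix can be illustrated with a small sketch. It assumes the pattern text from the XML file reaches the regex engine unchanged (XML itself gives the backslash no special meaning), and it uses a full match for illustration; the class and method names are hypothetical and are not taken from the Fess codebase.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names are hypothetical, not from the Fess codebase.
final class SvgEscapeDemo {
    // Raw regex text image/svg\\+xml, written here as a Java string literal.
    static final String OVER_ESCAPED = "image/svg\\\\+xml";
    // Raw regex text image/svg\+xml, written here as a Java string literal.
    static final String FIXED = "image/svg\\+xml";

    // Assumption: a full match is used for illustration.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \\+ means "one or more literal backslashes", so a real '+' does not match.
        System.out.println(fullMatch(OVER_ESCAPED, "image/svg+xml"));   // false
        // The over-escaped pattern matches a string containing a backslash instead.
        System.out.println(fullMatch(OVER_ESCAPED, "image/svg\\xml"));  // true
        // \+ is a literal plus sign, so the actual MIME type matches.
        System.out.println(fullMatch(FIXED, "image/svg+xml"));          // true
    }
}
```

Doubling the backslash is only needed inside a Java string literal. In a plain XML value the same doubling changes the meaning of the regex.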

Technical Details

The + character is a regex quantifier that matches one or more repetitions of the preceding token. Left unescaped, a pattern such as application/xhtml+xml does not match the literal MIME type string: the engine reads l+ as "one or more 'l' characters", so the pattern matches application/xhtmlxml but never a string that contains a literal +. Escaping the character as \+ makes the pattern match the actual MIME type.
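A minimal sketch of this behavior, using only java.util.regex. The class and method names are hypothetical, and a full match is assumed for illustration; how the crawler rules actually apply their patterns is not shown in this PR description.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names are hypothetical, not from the Fess codebase.
final class MimePlusDemo {
    // Assumption: a full match is used for illustration.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String mime = "application/xhtml+xml";

        // Unescaped: "l+" is "one or more 'l'", so the literal '+' in the input cannot match.
        System.out.println(fullMatch("application/xhtml+xml", mime));                    // false

        // The unescaped pattern accepts repeated 'l' with no plus sign instead.
        System.out.println(fullMatch("application/xhtml+xml", "application/xhtmllxml")); // true

        // Escaped: "\\+" in Java source is the regex \+, a literal plus sign.
        System.out.println(fullMatch("application/xhtml\\+xml", mime));                  // true
    }
}
```

The same reasoning applies to application/rdf+xml, where the unescaped pattern treats f+ as a quantified 'f'.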

Testing

  • Added CrawlerRuleMimeTypePatternTest.java with 18 test cases covering:
    • Pattern matching for all MIME types in webFileRule and fsFileRule
    • Demonstration of the bug (unescaped + fails to match)
    • Verification that escaped patterns work correctly
  • Extended BaseThumbnailGeneratorTest.java with tests for isTarget() method with SVG, XHTML, and RDF MIME types

🤖 Generated with Claude Code

Escape the + character in regex patterns for MIME types containing +
in crawler/rule.xml (application/xhtml+xml, application/rdf+xml) and
fix double-escaped pattern in fess_thumbnail.xml (image/svg+xml).

The + character has special meaning in regex (one or more of previous),
so it must be escaped with backslash to match literally.

Add comprehensive unit tests for MIME type pattern matching.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
marevol self-assigned this Feb 4, 2026
marevol added this to the 15.5.0 milestone Feb 4, 2026
marevol merged commit 064e5b6 into master Feb 4, 2026
1 check passed