Strip invisible Unicode from content model at editor initialization #3299
Merged: romanisa merged 9 commits into microsoft:master from romanisa:romasha/strip-invisible-unicode (Mar 17, 2026)
Changes from 6 commits
Commits
- ea0d958 fix: strip invisible Unicode from link hrefs for defense-in-depth (romanisa)
- f0da586 Update packages/roosterjs-content-model-dom/lib/formatHandlers/segmen… (romanisa)
- 7211d20 fix: address PR review - strip before script: check, guard empty href… (romanisa)
- 1a7c8eb fix: expand invisible Unicode regex coverage and add defense-in-depth… (romanisa)
- 5355ae4 fix: handle percent-encoded invisible Unicode in hrefs (romanisa)
- 658cd45 refactor: move invisible Unicode stripping to editor init instead of … (romanisa)
- 26bbb44 fix: remove decodeURIComponent from stripInvisibleUnicode (romanisa)
- 9b5a4a0 fix test (JiuqingSong)
- f0fa815 Merge branch 'master' into romasha/strip-invisible-unicode (JiuqingSong)
11 changes: 8 additions & 3 deletions
packages/roosterjs-content-model-api/lib/publicApi/utils/checkXss.ts
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,15 @@ | ||
| + import { stripInvisibleUnicode } from 'roosterjs-content-model-dom'; | ||
| + | ||
|   /** | ||
|    * @internal Check if there is XSS attack in the link | ||
|    * @param link The link to be checked | ||
| -  * @returns The safe link, or empty string if there is XSS attack | ||
| -  * @remarks This function checks for patterns like s\nc\nr\ni\np\nt: to prevent XSS attacks. This may block some valid links, | ||
| +  * @returns The safe link with invisible Unicode characters stripped, or empty string if there is XSS attack | ||
| +  * @remarks This function strips invisible Unicode characters (zero-width chars, Unicode Tags, etc.) | ||
| +  * and checks for patterns like s\nc\nr\ni\np\nt: to prevent XSS attacks. This may block some valid links, | ||
|    * but it is necessary for security reasons. We treat the word "script" as safe if there are "/" before it. | ||
|    */ | ||
|   export function checkXss(link: string): string { | ||
| -     return link.match(/^[^\/]*s\n*c\n*r\n*i\n*p\n*t\n*:/i) ? '' : link; | ||
| +     // Defense-in-depth: strip invisible Unicode even if already handled elsewhere | ||
| +     const sanitized = stripInvisibleUnicode(link); | ||
| +     return sanitized.match(/^[^\/]*s\n*c\n*r\n*i\n*p\n*t\n*:/i) ? '' : sanitized; | ||
|   } | ||
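
The reason for stripping before the `script:` check can be illustrated with a small, self-contained sketch. It copies the pattern from `checkXss` but replaces `stripInvisibleUnicode` with a hypothetical `INVISIBLE_SUBSET` regex that covers only a few of the characters the PR targets; `checkXssBefore` and `checkXssAfter` are illustrative names, not functions from the repository. Whether any browser would actually execute such a disguised link is not claimed here; the point is only that the old pattern did not recognize it.

```typescript
// Assumption: a small subset of the PR's INVISIBLE_UNICODE_REGEX, for illustration only.
const INVISIBLE_SUBSET = /[\u00AD\u200B-\u200F\u2060-\u2064\uFEFF]/g;

// Pattern copied from checkXss in the diff.
const SCRIPT_SCHEME = /^[^\/]*s\n*c\n*r\n*i\n*p\n*t\n*:/i;

// Behavior before the change: test the raw link.
function checkXssBefore(link: string): string {
    return SCRIPT_SCHEME.test(link) ? '' : link;
}

// Behavior after the change: strip invisible characters, then test.
function checkXssAfter(link: string): string {
    const sanitized = link.replace(INVISIBLE_SUBSET, '');
    return SCRIPT_SCHEME.test(sanitized) ? '' : sanitized;
}

// Zero-width spaces (U+200B) break up the word "script".
const disguised = 'java\u200Bscr\u200Bipt:alert(1)';
const ordinary = 'https://example.com/script:1';

console.log(checkXssBefore(disguised) === disguised); // true: the old check lets it through
console.log(checkXssAfter(disguised) === ''); // true: the new check rejects it
console.log(checkXssAfter(ordinary) === ordinary); // true: "script" after a "/" is still allowed
```
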
29 changes: 29 additions & 0 deletions
packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| + const INVISIBLE_UNICODE_REGEX = | ||
| +     // eslint-disable-next-line no-misleading-character-class | ||
| +     /[\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180B-\u180E\u200B-\u200F\u202A-\u202E\u2028\u2029\u2060-\u2064\u2066-\u2069\u3164\uFEFF\uFFA0\uFFF9-\uFFFB]|\uDB40[\uDC01-\uDCFF]/g; | ||
| + | ||
| + /** | ||
| +  * Strip invisible Unicode characters from a string. | ||
| +  * This removes zero-width characters, bidirectional marks, Unicode Tags (U+E0001-U+E00FF), | ||
| +  * interlinear annotation anchors, Mongolian free variation selectors, | ||
| +  * and other invisible formatting characters that can be used to hide content in links. | ||
| +  * Percent-encoded invisible characters (e.g. %E2%80%8B for U+200B) are also handled | ||
| +  * by decoding the string first. | ||
| +  * | ||
| +  * @remarks This function strips ZWJ (U+200D) which may affect emoji sequences. | ||
| +  * It should only be applied to href attributes, not to visible text content. | ||
| +  * @param value The string to strip invisible characters from | ||
| +  * @returns The string with invisible characters removed | ||
| +  */ | ||
| + export function stripInvisibleUnicode(value: string): string { | ||
| +     let decoded: string; | ||
| + | ||
| +     try { | ||
| +         decoded = decodeURIComponent(value); | ||
| +     } catch { | ||
| +         // If decoding fails (malformed percent-encoding), use the original value | ||
| +         decoded = value; | ||
| +     } | ||
| + | ||
| +     return decoded.replace(INVISIBLE_UNICODE_REGEX, ''); | ||
| + } | ||
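
This view covers only the first 6 commits, so it still shows the `decodeURIComponent` call that commit 26bbb44 removes according to its title. The sketch below shows one plausible side effect of decoding first. It is an assumption about the motivation, not something stated in the PR: because the decoded string is what gets returned, every percent-escape in the href is rewritten, not just the invisible ones. `stripWithDecode`, `stripWithoutDecode`, and `ZERO_WIDTH_SUBSET` are hypothetical names, and the regex is a reduced stand-in for `INVISIBLE_UNICODE_REGEX`.

```typescript
// Assumption: reduced stand-in for the PR's INVISIBLE_UNICODE_REGEX.
const ZERO_WIDTH_SUBSET = /[\u200B-\u200F\uFEFF]/g;

// Mirrors the version shown in this diff: decode, then strip.
function stripWithDecode(value: string): string {
    let decoded: string;
    try {
        decoded = decodeURIComponent(value);
    } catch {
        // Malformed percent-encoding: fall back to the original value.
        decoded = value;
    }
    return decoded.replace(ZERO_WIDTH_SUBSET, '');
}

// Variant without decoding: only literal invisible characters are removed.
function stripWithoutDecode(value: string): string {
    return value.replace(ZERO_WIDTH_SUBSET, '');
}

const encodedHref = 'https://example.com/a%20b?q=1%262';
const hiddenHref = 'https://example.com/%E2%80%8Bx';
const malformedHref = 'https://example.com/100%';

console.log(stripWithDecode(encodedHref)); // https://example.com/a b?q=1&2 (escapes lost)
console.log(stripWithoutDecode(encodedHref)); // https://example.com/a%20b?q=1%262 (unchanged)
console.log(stripWithDecode(hiddenHref)); // https://example.com/x (encoded U+200B removed)
console.log(stripWithDecode(malformedHref)); // https://example.com/100% (decode failed, original kept)
```

The trade-off in this sketch is that decoding catches percent-encoded invisible characters but silently changes the meaning of escapes such as `%26`, while skipping the decode leaves the href intact and the encoded characters in place.
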
89 changes: 89 additions & 0 deletions
packages/roosterjs-content-model-dom/lib/modelApi/common/sanitizeInvisibleUnicode.ts
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| + import { stripInvisibleUnicode } from '../../domUtils/stripInvisibleUnicode'; | ||
| + import type { | ||
| +     ContentModelBlock, | ||
| +     ContentModelBlockGroup, | ||
| +     ContentModelDocument, | ||
| +     ContentModelSegment, | ||
| + } from 'roosterjs-content-model-types'; | ||
| + | ||
| + /** | ||
| +  * Strip invisible Unicode characters from all text and link hrefs in a content model. | ||
| +  * This sanitizes the model at initialization time to prevent hidden content in links | ||
| +  * or text (e.g. zero-width chars, bidirectional marks, Unicode Tags). | ||
| +  * For General segments, all Text nodes under the element are also sanitized. | ||
| +  * @param model The content model document to sanitize in-place | ||
| +  */ | ||
| + export function sanitizeInvisibleUnicode(model: ContentModelDocument): void { | ||
| +     sanitizeBlockGroup(model); | ||
| + } | ||
| + | ||
| + function sanitizeBlockGroup(group: ContentModelBlockGroup): void { | ||
| +     for (const block of group.blocks) { | ||
| +         sanitizeBlock(block); | ||
| +     } | ||
| + } | ||
| + | ||
| + function sanitizeBlock(block: ContentModelBlock): void { | ||
| +     switch (block.blockType) { | ||
| +         case 'Paragraph': | ||
| +             for (const segment of block.segments) { | ||
| +                 sanitizeSegment(segment); | ||
| +             } | ||
| +             break; | ||
| + | ||
| +         case 'Table': | ||
| +             for (const row of block.rows) { | ||
| +                 for (const cell of row.cells) { | ||
| +                     sanitizeBlockGroup(cell); | ||
| +                 } | ||
| +             } | ||
| +             break; | ||
| + | ||
| +         case 'BlockGroup': | ||
| +             sanitizeBlockGroup(block); | ||
| + | ||
| +             if (block.blockGroupType === 'General' && block.element) { | ||
| +                 sanitizeTextNodes(block.element); | ||
| +             } | ||
| +             break; | ||
| + | ||
| +         case 'Entity': | ||
| +         case 'Divider': | ||
| +             break; | ||
| +     } | ||
| + } | ||
| + | ||
| + function sanitizeSegment(segment: ContentModelSegment): void { | ||
| +     if (segment.link?.format.href) { | ||
| +         segment.link.format.href = stripInvisibleUnicode(segment.link.format.href); | ||
| +     } | ||
| + | ||
| +     switch (segment.segmentType) { | ||
| +         case 'Text': | ||
| +             segment.text = stripInvisibleUnicode(segment.text); | ||
| +             break; | ||
| + | ||
| +         case 'General': | ||
| +             sanitizeTextNodes(segment.element); | ||
| +             sanitizeBlockGroup(segment); | ||
| +             break; | ||
| + | ||
| +         case 'Image': | ||
| +         case 'Entity': | ||
| +         case 'Br': | ||
| +         case 'SelectionMarker': | ||
| +             break; | ||
| +     } | ||
| + } | ||
| + | ||
| + function sanitizeTextNodes(element: HTMLElement): void { | ||
| +     const walker = element.ownerDocument.createTreeWalker(element, NodeFilter.SHOW_TEXT); | ||
| + | ||
| +     let node: Text | null; | ||
| + | ||
| +     while ((node = walker.nextNode() as Text | null)) { | ||
| +         if (node.nodeValue) { | ||
| +             node.nodeValue = stripInvisibleUnicode(node.nodeValue); | ||
| +         } | ||
| +     } | ||
| + } | ||
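
The in-place traversal can be tried without the editor using a heavily simplified model. The `Sketch*` types below are hypothetical stand-ins for the real types in `roosterjs-content-model-types`: only paragraphs with `Text` and `Br` segments are modeled, and `stripSubset` covers far fewer characters than `stripInvisibleUnicode`. The second text segment illustrates the caveat from the `stripInvisibleUnicode` doc comment: since `sanitizeSegment` also strips visible text, an emoji joined with ZWJ (U+200D) is split into its parts.

```typescript
// Hypothetical, simplified shapes; the real content model has many more block and segment types.
interface SketchLink {
    format: { href?: string };
}
interface SketchText {
    segmentType: 'Text';
    text: string;
    link?: SketchLink;
}
interface SketchBr {
    segmentType: 'Br';
    link?: SketchLink;
}
type SketchSegment = SketchText | SketchBr;
interface SketchParagraph {
    blockType: 'Paragraph';
    segments: SketchSegment[];
}
interface SketchDocument {
    blocks: SketchParagraph[];
}

// Assumption: reduced stand-in for stripInvisibleUnicode.
const stripSubset = (value: string): string => value.replace(/[\u200B-\u200F\uFEFF]/g, '');

// Walk every segment and sanitize hrefs and text in place, as the PR's helper does.
function sanitizeSketch(model: SketchDocument): void {
    for (const block of model.blocks) {
        for (const segment of block.segments) {
            if (segment.link?.format.href) {
                segment.link.format.href = stripSubset(segment.link.format.href);
            }
            if (segment.segmentType === 'Text') {
                segment.text = stripSubset(segment.text);
            }
        }
    }
}

const linkedText: SketchText = {
    segmentType: 'Text',
    text: 'pay\u200Bpal',
    link: { format: { href: 'https://exam\u200Bple.com/' } },
};
// U+1F468 (man) + ZWJ + U+1F469 (woman), written as surrogate pairs.
const emojiText: SketchText = {
    segmentType: 'Text',
    text: '\uD83D\uDC68\u200D\uD83D\uDC69',
};
const sketchDoc: SketchDocument = {
    blocks: [{ blockType: 'Paragraph', segments: [linkedText, { segmentType: 'Br' }, emojiText] }],
};

sanitizeSketch(sketchDoc);

console.log(linkedText.text); // paypal
console.log(linkedText.link?.format.href); // https://example.com/
console.log(emojiText.text.length); // 4: the ZWJ is gone, leaving two separate emoji
```
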