Treat Indic conjunct clusters as a single grapheme (UAX #29 GB9c) #6074
Open
greymoth-jp wants to merge 2 commits into
Conversation
🦋 Changeset detected. Latest commit: 9763fd2. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
dylans (Collaborator) requested changes on Jun 30, 2026 and left a comment:
Prettier is complaining so please clean up the PR, then this should be ok. There's a bit of extra comments that feel unnecessary that could be cleaned up.
Author
Ran prettier and trimmed the comments down to the GB9c note. Thanks for the review.
Description
getCharacterDistance implements the UAX #29 extended grapheme cluster rules (GB6 through GB13) but not GB9c, which was added in Unicode 15.1. GB9c keeps an Indic conjunct together.
Without it, a Consonant + virama + Consonant sequence breaks right after the virama. Since getCharacterDistance and getWordDistance back Editor.positions (and the Editor.before/Editor.after built on it), moving the cursor by character through scripts like Devanagari or Bengali stops in the middle of a conjunct that the reader sees as one glyph.
क्त (U+0915 U+094D U+0924) is one grapheme, but getCharacterDistance('क्त') returned 2 (just क्) instead of 3.
Issue
No existing issue.
Example
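As a rough illustration of the GB9c behaviour described above (a standalone sketch, not the code in this PR): the function below measures the first cluster of a string using only GB9 and GB9c. The names inCBOf and leadingClusterLength are hypothetical, and the property table is a hand-picked Devanagari subset rather than the full data from DerivedCoreProperties.txt; vowel signs and every other rule are ignored.

```typescript
// Hypothetical, hand-picked subset of Indic_Conjunct_Break values (Devanagari only).
type InCB = 'Consonant' | 'Linker' | 'Extend' | 'None'

const inCBOf = (ch: string | undefined): InCB => {
  const cp = ch === undefined ? -1 : ch.codePointAt(0) ?? -1
  if (cp >= 0x0915 && cp <= 0x0939) return 'Consonant' // KA..HA
  if (cp === 0x094d) return 'Linker' // virama
  if (cp === 0x200d || cp === 0x093c) return 'Extend' // ZWJ, nukta
  return 'None'
}

// Length (in UTF-16 code units) of the first cluster, under GB9 + GB9c only.
function leadingClusterLength(text: string): number {
  const chars = Array.from(text) // split by code point
  if (chars.length === 0) return 0

  const startsWithConsonant = inCBOf(chars[0]) === 'Consonant'
  let sawLinker = false
  let i = 1

  while (i < chars.length) {
    const kind = inCBOf(chars[i])
    if (kind === 'Linker' || kind === 'Extend') {
      // GB9: no break before Extend or ZWJ (the virama is also Extend for GB9).
      if (kind === 'Linker') sawLinker = true
      i++
    } else if (kind === 'Consonant' && startsWithConsonant && sawLinker) {
      // GB9c: Consonant, then marks containing at least one Linker, then Consonant.
      sawLinker = false // a further consonant needs its own linker
      i++
    } else {
      break
    }
  }
  return chars.slice(0, i).join('').length
}

console.log(leadingClusterLength('\u0915\u094d\u0924')) // 3
console.log(leadingClusterLength('\u0915\u0924')) // 1
```

Resetting sawLinker after each joined consonant is what lets a chain continue only while every consonant is preceded by its own virama.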
The change also covers a conjunct with an intervening ZWJ (क्त) and chained conjuncts (क्त्य), and still breaks between two consonants that have no linker between them (क|त).
Context
The implementation follows the existing deferred-rule pattern. Three regex classes for the Indic_Conjunct_Break property values (Consonant, Extend, Linker) are taken from DerivedCoreProperties.txt, in the same \uXXXX style as the existing tables. GB9c is added to NonBoundaryPairs as [InCBLinker | InCBExtend, InCBConsonant], with a deferred look-behind (endsWithConjunctLinker) that confirms the full left context before suppressing the break, the same way the GB11 emoji-ZWJ case is handled.
I checked the change against the official Unicode GraphemeBreakTest.txt (17.0.0): every GB9c line now passes in both LTR and RTL directions, and no previously passing line changed. The only failures left in that file are CR/LF control-character sequences (GB3 to GB5), which are unrelated to this change. Regression cases for the conjunct strings were added to test/utils/string.ts.
Checks
yarn test. yarn lint. (Fix errors with yarn fix.) yarn start.)
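For anyone wanting to repeat the GraphemeBreakTest.txt check mentioned under Context, here is a minimal sketch of how such a run could look. It is not the script used for this PR: parseBreakTestLine and failingLines are hypothetical helpers, the distance callback stands in for a function like getCharacterDistance, and only the left-to-right direction is covered. It assumes the file's usual line format, where ÷ marks a boundary, × marks a non-boundary, code points are hex, and anything after # is a comment.

```typescript
const BREAK = '\u00f7' // ÷ : boundary allowed here
const NO_BREAK = '\u00d7' // × : no boundary here

interface BreakTestCase {
  text: string // the full test string
  clusters: string[] // the expected grapheme clusters, in order
}

// Parse one line of the test file; returns null for blank or comment-only lines.
function parseBreakTestLine(line: string): BreakTestCase | null {
  const data = (line.split('#')[0] ?? '').trim()
  if (data === '') return null

  const clusters: string[] = []
  let current = ''
  for (const token of data.split(/\s+/)) {
    if (token === BREAK) {
      if (current !== '') clusters.push(current)
      current = ''
    } else if (token !== NO_BREAK) {
      current += String.fromCodePoint(parseInt(token, 16))
    }
  }
  if (current !== '') clusters.push(current)
  return { text: clusters.join(''), clusters }
}

// Return the lines where `distance` disagrees with an expected cluster length.
function failingLines(
  lines: string[],
  distance: (text: string) => number
): string[] {
  return lines.filter(line => {
    const testCase = parseBreakTestLine(line)
    if (testCase === null) return false
    let rest = testCase.text
    for (const cluster of testCase.clusters) {
      if (distance(rest) !== cluster.length) return true
      rest = rest.slice(cluster.length)
    }
    return false
  })
}
```

Comparing the set of failing lines before and after a change is one way to confirm that no previously passing line regressed.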