Treat Indic conjunct clusters as a single grapheme (UAX #29 GB9c)#6074

Open
greymoth-jp wants to merge 2 commits into ianstormtaylor:main from greymoth-jp:fix/grapheme-indic-conjunct-gb9c

@greymoth-jp

Description

getCharacterDistance implements the UAX #29 extended grapheme cluster rules (GB6 through GB13) but not GB9c, which was added in Unicode 15.1. GB9c keeps an Indic conjunct together:

Consonant [Extend Linker]* Linker [Extend Linker]* × Consonant

Without it, a Consonant + virama + Consonant sequence breaks right after the virama. Since getCharacterDistance and getWordDistance back Editor.positions (and the Editor.before / Editor.after built on it), moving the cursor by character through scripts like Devanagari or Bengali stops in the middle of a conjunct that the reader sees as one glyph.

क्त (U+0915 U+094D U+0924) is one grapheme, but getCharacterDistance('क्त') returned 2 (just क्) instead of 3.

Issue

No existing issue.

Example

// before
getCharacterDistance('क्त') // 2  ->  क् | त
// after
getCharacterDistance('क्त') // 3  ->  क्त

This also covers a conjunct with an intervening ZWJ (क्‍त) and chained conjuncts (क्त्य). A break is still reported between two consonants that have no linker between them.
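To make these cases concrete, here is a minimal forward-scan sketch of GB9 plus GB9c. It is not the code in this PR: `inCB` and `toyConjunctDistance` are hypothetical names, and the class table is a hand-picked Devanagari subset (consonants, virama, nukta, ZWJ) rather than the full Indic_Conjunct_Break data.

```typescript
// Toy Indic_Conjunct_Break classes: a tiny Devanagari subset, NOT the full tables.
type InCB = 'Consonant' | 'Linker' | 'Extend' | 'None'

function inCB(cp: number): InCB {
  if ((cp >= 0x0915 && cp <= 0x0939) || (cp >= 0x0958 && cp <= 0x095f)) {
    return 'Consonant'
  }
  if (cp === 0x094d) return 'Linker' // virama
  if (cp === 0x093c || cp === 0x200d) return 'Extend' // nukta, ZWJ
  return 'None'
}

// Hypothetical helper: length (in UTF-16 code units) of the first cluster,
// modelling only GB9 (no break before Extend/Linker marks) and GB9c.
function toyConjunctDistance(text: string): number {
  const chars = Array.from(text) // split by code point
  if (chars.length === 0) return 0

  let distance = chars[0].length
  // GB9c only applies when the cluster starts with a consonant.
  const startsWithConsonant = inCB(chars[0].codePointAt(0)!) === 'Consonant'
  let sawLinker = false

  for (let i = 1; i < chars.length; i++) {
    const cls = inCB(chars[i].codePointAt(0)!)
    if (cls === 'Linker' || cls === 'Extend') {
      // GB9: marks always attach to what precedes them.
      if (cls === 'Linker') sawLinker = true
    } else if (cls === 'Consonant' && startsWithConsonant && sawLinker) {
      // GB9c: a linker was seen since the previous consonant, so no break.
      sawLinker = false
    } else {
      break
    }
    distance += chars[i].length
  }
  return distance
}

console.log(toyConjunctDistance('\u0915\u094D\u0924')) // 3: one conjunct
console.log(toyConjunctDistance('\u0915\u094D\u200D\u0924')) // 4: ZWJ after the virama
console.log(toyConjunctDistance('\u0915\u094D\u0924\u094D\u092F')) // 5: chained conjunct
console.log(toyConjunctDistance('\u0915\u0924')) // 1: no linker, so a break
```

Resetting `sawLinker` after each consonant is what lets a chain continue only when every consonant pair has its own linker.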

Context

The implementation follows the existing deferred-rule pattern. Three regex classes for the Indic_Conjunct_Break property values (Consonant, Extend, Linker) are taken from DerivedCoreProperties.txt, in the same \uXXXX style as the existing tables. GB9c is added to NonBoundaryPairs as [InCBLinker | InCBExtend, InCBConsonant], with a deferred look-behind (endsWithConjunctLinker) that confirms the full left context before suppressing the break, the same way the GB11 emoji-ZWJ case is handled.
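For readers unfamiliar with the deferred-rule idea, the sketch below shows its general shape: a pair table flags a candidate non-boundary, and a look-behind then confirms the full left context. Only the name `endsWithConjunctLinker` comes from this PR. The body, the bit flags, `classify`, and `gb9cSuppressesBreak` are assumptions made for illustration, and the classifier covers only a few Devanagari code points.

```typescript
// Hypothetical bit flags; Slate's real code point types are not reproduced here.
const NONE = 0
const CONSONANT = 1 << 0
const LINKER = 1 << 1
const EXTEND = 1 << 2

// Toy classifier over a hand-picked Devanagari subset (assumption).
function classify(cp: number): number {
  if ((cp >= 0x0915 && cp <= 0x0939) || (cp >= 0x0958 && cp <= 0x095f)) {
    return CONSONANT
  }
  if (cp === 0x094d) return LINKER // virama
  if (cp === 0x093c || cp === 0x200d) return EXTEND // nukta, ZWJ
  return NONE
}

const toCodePoints = (s: string): number[] =>
  Array.from(s, ch => ch.codePointAt(0)!)

// Candidate pair for GB9c: (Linker | Extend) followed by Consonant.
const deferredPairs: [number, number][] = [[LINKER | EXTEND, CONSONANT]]

// Look-behind: does the left context end with
// Consonant [Extend Linker]* Linker [Extend Linker]* ?
function endsWithConjunctLinkerSketch(left: number[]): boolean {
  let i = left.length - 1
  let sawLinker = false
  while (i >= 0 && (classify(left[i]) & (LINKER | EXTEND)) !== 0) {
    if (classify(left[i]) === LINKER) sawLinker = true
    i--
  }
  return sawLinker && i >= 0 && classify(left[i]) === CONSONANT
}

// True when GB9c forbids a break between leftText and rightText.
function gb9cSuppressesBreak(leftText: string, rightText: string): boolean {
  const left = toCodePoints(leftText)
  const right = toCodePoints(rightText)
  if (left.length === 0 || right.length === 0) return false
  const l = classify(left[left.length - 1])
  const r = classify(right[0])
  const candidate = deferredPairs.some(
    ([a, b]) => (l & a) !== 0 && (r & b) !== 0
  )
  // The pair alone is not enough; the look-behind must confirm the context.
  return candidate && endsWithConjunctLinkerSketch(left)
}

console.log(gb9cSuppressesBreak('\u0915\u094D', '\u0924')) // true
console.log(gb9cSuppressesBreak('\u0915\u200D', '\u0924')) // false: ZWJ but no linker
console.log(gb9cSuppressesBreak('\u094D', '\u0924')) // false: no leading consonant
```

The last two calls show why the pair table cannot decide on its own: both match the candidate pair, but neither has the Consonant ... Linker context that GB9c requires.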

I checked the change against the official Unicode GraphemeBreakTest.txt (17.0.0): every GB9c line now passes in both LTR and RTL directions, and no previously passing line changed. The only failures left in that file are CR/LF control-character sequences (GB3 to GB5), which are unrelated to this change. Regression cases for the conjunct strings were added to test/utils/string.ts.
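A check like this can be reproduced with a small parser for the test file's line format, where `÷` marks a break, `×` marks no break, and everything after `#` is a comment. The sketch below is not the harness used for this PR; `parseBreakTestLine` and `BreakTestCase` are hypothetical names, and the sample line's comment text is abbreviated.

```typescript
interface BreakTestCase {
  text: string // the decoded test string
  clusterLengths: number[] // expected cluster lengths in UTF-16 code units
}

const BREAK = '\u00F7' // ÷
const NO_BREAK = '\u00D7' // ×

// Parses one data line; returns null for blank or comment-only lines.
function parseBreakTestLine(line: string): BreakTestCase | null {
  const data = line.split('#')[0].trim()
  if (data === '') return null

  let text = ''
  let current = 0
  const clusterLengths: number[] = []

  for (const token of data.split(/\s+/)) {
    if (token === BREAK) {
      // Close the cluster collected so far (the leading ÷ closes nothing).
      if (current > 0) {
        clusterLengths.push(current)
        current = 0
      }
    } else if (token !== NO_BREAK) {
      // Any other token is a hexadecimal code point.
      const ch = String.fromCodePoint(parseInt(token, 16))
      text += ch
      current += ch.length
    }
  }
  return { text, clusterLengths }
}

const sample = '\u00F7 0915 \u00D7 094D \u00D7 0924 \u00F7\t# sample comment'
console.log(parseBreakTestLine(sample))
// { text: 'क्त', clusterLengths: [ 3 ] }
```

Each parsed case can then be compared against repeated calls to `getCharacterDistance` on the remaining text, once per expected cluster.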

Checks

  • The new code matches the existing patterns and styles.
  • The tests pass with yarn test.
  • The linter passes with yarn lint. (Fix errors with yarn fix.)
  • The relevant examples still work. (Run examples with yarn start.)
  • You've added a changeset.

@changeset-bot

changeset-bot Bot commented Jun 28, 2026

🦋 Changeset detected

Latest commit: 9763fd2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
slate Patch

@dylans dylans left a comment

Collaborator

Prettier is complaining, so please clean up the PR; then this should be OK. There are also a few extra comments that feel unnecessary and could be removed.

@greymoth-jp

Author

Ran prettier and trimmed the comments down to the GB9c note. Thanks for the review.

2 participants