Implement hyphenless word breaking #3268

carlobeltrame · 2025-12-30T16:56:29Z

Implements the feature discussed in #3267

After the previous implementation in #3188 was partially reverted in #3267, this PR re-implements the following use cases:
Fixes #1642
Fixes #2456
Fixes #2564
Fixes #2739
Re-enables better solutions for the following issues:
Fixes #1238
Fixes #1380
Fixes #1416
Fixes #1662

changeset-bot · 2025-12-30T16:56:33Z

🦋 Changeset detected

Latest commit: 68fb141

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages

| Name | Type |
| --- | --- |
| @react-pdf/textkit | Minor |
| @react-pdf/layout | Patch |
| @react-pdf/render | Patch |
| @react-pdf/renderer | Patch |
| next-14 | Patch |
| next-15 | Patch |
| @react-pdf/vite-example | Patch |
| @react-pdf/e2e-node-cjs | Patch |
| @react-pdf/e2e-node-esm | Patch |


packages/textkit/src/types.ts

diegomura · 2025-12-31T11:02:37Z

Lmk when this is no longer a WIP and it's ready for review!

carlobeltrame · 2025-12-31T13:10:59Z

Yeah, I have some issues installing canvas with node-gyp on my current system. I'll let you know once it's ready.

carlobeltrame · 2026-01-02T12:29:47Z

@diegomura this is ready for review now. I have so far taken the following executive decisions (subject to discussion):

Since your first look at this PR, I have switched the types to use the wording "part" instead of "syllable". This is because in advanced use cases and in non-western languages, syllable does not necessarily describe the word parts.
The last part of a word is forced to always have no added hyphen, and this is not overridable like [..., { string: 'sum', hyphen: '-' }]. This is because IMO it's not a realistic use case, and having hyphens at the end of the word would require additional changes to the Knuth-Plass algorithm.
The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.
I have added new integration-style tests to wrapWords.test.ts. So far, the unit tests in there used a lot of mocking and unrealistic scenarios. I left those intact, but added more realistic tests which actually use the internal default wordHyphenation engine, in addition to some custom hyphenation callbacks. I am aware that the wordHyphenation engine has its own unit tests, but custom hyphenation callback tests wouldn't really fit in there. Let me know if you prefer a different place for the new integration tests.
I have implemented splitting after dashes in the default hyphenation algorithm.
I have not yet implemented automatic URL detection, because:
- detecting URLs involves either a very complex Regexp or usage of a new dependency
- in addition to URLs, email addresses are probably also a good candidate for the same rules, adding even more to the complexity of detection
- In URLs or emails or other technical identifiers, setting hyphen: null may not be enough. Disabling syllable breaking using the hyphen library may or may not be necessary in addition, and breaking on e.g. slashes, question marks, ampersand etc. may be desired. Otherwise, the URL "react-pdf.org/advanced?foo=bar#someAnchor" in a very narrow container could be broken into the following (which may or may not be what devs want):
```
re
act-
pdf.org/ad
van
ced?foo=bar#so
me
An
chor
```
IMO, a better hyphenation of this specific example would be:
```
react-
pdf.org/
advanced?
foo=bar
#someAnchor
```

Let me know if you disagree on any of these choices. Once we have these resolved, I can work on a documentation PR for when the feature is released.

Refs diegomura#3267 (comment)

Custom hyphenation callbacks can now decide, for each split point, whether a hyphen should be inserted at the end of the line when breaking at that split point.

This change is backwards compatible. If a custom hyphenation callback returns an array of strings, it is assumed that all except the last part should get a hyphen added when breaking.

Hyphens can be turned off for all parts of a word by returning an object in the format: `{ parts: <string array as previously>, hyphen: null }`. This may be useful for URLs or similar technical identifiers which should be broken but should not be hyphenated.

Alternatively, the hyphen can be activated or deactivated for each part separately, by returning an array of part objects:
```
[
  { string: 'blue', hyphen: '-' },
  { string: 'ish-', hyphen: null },
  { string: 'green', hyphen: null }
]
```
For now, only a dash `'-'` or `null` are valid choices for `hyphen`. Also, for now, the last part of a word is always forced to `hyphen: null`. Finally, unless explicitly specified, parts ending in a dash will not get an added hyphen.
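The three accepted return formats and the defaulting rules can be sketched roughly as follows. This is an illustration of the rules stated in this thread, not the code in `wrapWords.ts`: the type names and the functions `normalizePart` and `normalizeHyphenation` are hypothetical, and the handling of an invalid `hyphen` value on a part that ends in a dash is an assumption.

```typescript
// Hypothetical types; the real ones in packages/textkit/src/types.ts may differ.
type Hyphen = '-' | null;
type HyphenatedPart = { string: string; hyphen: Hyphen };
type PartInput = string | { string: string; hyphen?: unknown };
type CallbackResult = PartInput[] | { parts: string[]; hyphen?: unknown };

function normalizePart(part: PartInput, wordHyphen?: null): HyphenatedPart {
  const text = typeof part === 'string' ? part : part.string;
  const explicit = typeof part === 'string' ? undefined : part.hyphen;
  // An explicit null, on the part or on the whole word, turns the hyphen off.
  if (explicit === null || wordHyphen === null) {
    return { string: text, hyphen: null };
  }
  // An explicitly requested '-' is honored as-is.
  if (explicit === '-') return { string: text, hyphen: '-' };
  // Anything else counts as "not specified": default to '-', except that a
  // part already ending in a dash gets no second hyphen (assumption for
  // invalid values such as 'X' on such a part).
  return { string: text, hyphen: text.endsWith('-') ? null : '-' };
}

function normalizeHyphenation(result: CallbackResult): HyphenatedPart[] {
  const normalized = Array.isArray(result)
    ? result.map((part) => normalizePart(part))
    : result.parts.map((part) =>
        normalizePart(part, result.hyphen === null ? null : undefined),
      );
  // The last part of a word never gets an added hyphen.
  if (normalized.length > 0) normalized[normalized.length - 1].hyphen = null;
  return normalized;
}

// Legacy format: plain string array.
const legacy = normalizeHyphenation(['ip', 'sum']);
// Whole-word switch: no hyphens anywhere.
const url = normalizeHyphenation({ parts: ['foo/', 'bar'], hyphen: null });
```

With these rules, `['Lo-', 'rem']` normalizes to two parts with `hyphen: null`, because the first part already ends in a dash and the second is the last part of the word.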

Since our default hyphenation algorithm is optimized for English and western language text, we can make the assumption that line breaking on dashes is usually correct. The previous changes already made sure that no duplicate hyphen is inserted after a hyphen.

diegomura · 2026-01-02T12:38:31Z

Thanks @carlobeltrame ! I'll check the code soon. In the meantime

> The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.

Why is this? Seems straightforward to support at least other code points, right?

carlobeltrame · 2026-01-02T12:44:09Z

> Thanks @carlobeltrame ! I'll check the code soon. In the meantime
>
> > The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.
>
> Why is this? Seems straightforward to support at least other code points, right?

Ah yes, forgot to explain. In the Knuth-Plass algorithm, the width of the added hyphen character is currently hardcoded to be 5, but this assumption is stretched much further if we allow anything other than '-' as hyphenation character. Also, currently the linebreaker algorithm adds the additional hyphen via its hardcoded code point, and supporting other characters or even multi-character strings (which Emoji are an edge case of) would require much more sophisticated logic in there, while benefitting only extremely rare use cases.

diegomura · 2026-01-02T12:45:59Z

packages/textkit/src/layout/wrapWords.ts

+            hyphenatedWord.hyphen === null ? null : undefined,
+          ),
+        );
+  if (normalized.length > 0) normalized[normalized.length - 1].hyphen = null;


Is this needed? Doesn't K&P handle this by not adding a penalty at the end of words?

Yes, we could omit this. I felt like it would be good for debuggability to normalize this already here, but if you insist, we can remove it.

diegomura · 2026-01-02T12:50:09Z

packages/textkit/src/layout/wrapWords.ts

+      : hyphenatedWord.parts.map((part) =>
+          normalizeHyphenatedPart(
+            part,
+            hyphenatedWord.hyphen === null ? null : undefined,


I find this quite confusing. All we need to do here is transform strings into { string, hyphen }, with hyphen being hyphenatedWord.hyphen on each item. Can't we just do it here right away?

So you are saying it should be the following?

      : hyphenatedWord.parts.map((part) => ({
          string: part,
          hyphen: hyphenatedWord.hyphen as '-',
        }));

This breaks two tests.

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Failed Tests 2 ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

 FAIL  tests/layout/wrapWords.test.ts > wrapWords > with builtin engine > should split at hyphen and not duplicate hyphen in hyphenation
AssertionError: expected [ Array(5) ] to deeply equal [ …(5) ]

- Expected
+ Received

  Array [
    Object {
-     "hyphen": null,
+     "hyphen": "-",
      "string": "Lo-",
    },
    Object {
      "hyphen": null,
      "string": "rem",
    },
    Object {
      "hyphen": null,
      "string": " ",
    },
    Object {
      "hyphen": "-",
      "string": "ip",
    },
    Object {
      "hyphen": null,
      "string": "sum-",
    },
  ]

 ❯ tests/layout/wrapWords.test.ts:257:32
    255|     });
    256|
    257|     expect(result.syllables).toEqual([
       |                                ^
    258|       { string: 'Lo-', hyphen: null },
    259|       { string: 'rem', hyphen: null },

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[1/2]⎯

 FAIL  tests/layout/wrapWords.test.ts > wrapWords > with custom hyphenation callback > should tolerate missing hyphen specification
AssertionError: expected [ …(4) ] to deeply equal [ …(4) ]

- Expected
+ Received

  Array [
    Object {
      "hyphen": null,
      "string": "Lorem",
    },
    Object {
      "hyphen": null,
      "string": " ",
    },
    Object {
-     "hyphen": "-",
+     "hyphen": undefined,
      "string": "ip",
    },
    Object {
      "hyphen": null,
      "string": "sum",
    },
  ]

 ❯ tests/layout/wrapWords.test.ts:395:32
    393|     });
    394|
    395|     expect(result.syllables).toEqual([
       |                                ^
    396|       { string: 'Lorem', hyphen: null },
    397|       { string: ' ', hyphen: null },

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[2/2]⎯

 Test Files  1 failed | 86 passed (87)
      Tests  2 failed | 705 passed (707)
   Start at  14:14:23
   Duration  1.02s (transform 4.18s, setup 2ms, collect 8.15s, tests 307ms, environment 4ms, prepare 1.75s)

The first one is because your proposal ignores a trailing hyphen in a syllable: "Lo-rem" would wrongly be split to

[
  { string: 'Lo-', hyphen: '-' }, // <-- wrong hyphen character here
  { string: 'rem', hyphen: null },
]

The second test failure is due to an edge case when a custom hyphenation callback returns something like { parts: ['Lo', 'rem'], hyphen: undefined } or { parts: ['Lo', 'rem'], hyphen: 'X' } or even { parts: ['Lo', 'rem'] }.

We could choose to remove that second test (titled "should tolerate missing hyphen specification") and call this case undefined behaviour. But the first test indicates a real error in your reasoning.

Alternatively, we could refactor this ternary to let + if:

let normalized: HyphenatedPart[];
if (hyphenatedWord instanceof Array) {
  normalized = hyphenatedWord.map((part) => normalizeHyphenatedPart(part));
} else {
  normalized = hyphenatedWord.parts.map((part) =>
    normalizeHyphenatedPart(
      part,
      hyphenatedWord.hyphen === null ? null : undefined,
    ),
  );
}

But I'm not sure that really addresses your concerns about the confusing nature of this transformation.

diegomura · 2026-01-02T12:51:04Z

packages/textkit/src/engines/linebreaker/knuthPlass.ts

  width: number,
  penalty: number,
  flagged: number,
+  hyphen?: '-',


Why does K&P need to have this char? Seems unnecessary to expose this. K&P just knows about widths which should be enough right?

It's needed in the linebreaker algorithm after K&P, and reconstructing the specified hyphen given only the box starts and ends would be much harder.
Have a look at the usage of this in the linebreaker. If we aren't allowed to store the hyphen value on the penalty box, that if condition first has to work out which syllable corresponds to prevNode.
It's possible if you want to avoid this at all costs, but it's going to be more complex to implement.

In other words, it's for the same reason that K&P box nodes get the start and end indices. K&P does not need it, but the linebreaker algorithm afterwards needs it.
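To illustrate why carrying the hyphen on the penalty node keeps the later step simple, here is a rough sketch of turning chosen breakpoints into lines. It is not the react-pdf linebreaker: the node shapes, the `buildLines` function and the `breaks` parameter (indices of the nodes at which lines end) are all hypothetical. Only the ideas that box nodes carry `start`/`end` indices and that penalties carry an optional `hyphen` are taken from this discussion.

```typescript
// Hypothetical node shapes, loosely modeled on the discussion above.
type BoxNode = { type: 'box'; start: number; end: number };
type GlueNode = { type: 'glue'; start: number; end: number };
type PenaltyNode = { type: 'penalty'; hyphen?: '-' };
type KPNode = BoxNode | GlueNode | PenaltyNode;

function buildLines(text: string, nodes: KPNode[], breaks: number[]): string[] {
  const lines: string[] = [];
  // Treat the end of the node list as a final, implicit break.
  const ends = breaks.concat(nodes.length);
  let from = 0;
  for (let b = 0; b < ends.length; b++) {
    const breakIndex = ends[b];
    let line = '';
    for (let i = from; i < breakIndex; i++) {
      const node = nodes[i];
      // Boxes and glue contribute their slice of the original text.
      if (node.type !== 'penalty') line += text.slice(node.start, node.end);
    }
    // The break node itself knows whether a hyphen must be appended, so no
    // lookup of the corresponding word part is needed here.
    const breakNode = nodes[breakIndex];
    if (breakNode && breakNode.type === 'penalty' && breakNode.hyphen) {
      line += breakNode.hyphen;
    }
    lines.push(line);
    from = breakIndex + 1;
  }
  return lines;
}

const text = 'Lo-rem ipsum';
const nodes: KPNode[] = [
  { type: 'box', start: 0, end: 3 }, // 'Lo-'
  { type: 'penalty' }, // break allowed, no added hyphen
  { type: 'box', start: 3, end: 6 }, // 'rem'
  { type: 'glue', start: 6, end: 7 }, // ' '
  { type: 'box', start: 7, end: 9 }, // 'ip'
  { type: 'penalty', hyphen: '-' }, // break allowed, hyphen added
  { type: 'box', start: 9, end: 12 }, // 'sum'
];
const lines = buildLines(text, nodes, [1, 5]);
// lines is ['Lo-', 'rem ip-', 'sum']
```

Without the `hyphen` field, the inner `if` would first have to map the break position back to the word part that precedes it, which is the extra complexity mentioned above.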

diegomura reviewed Dec 30, 2025


carlobeltrame force-pushed the hyphenless-hyphenation branch 3 times, most recently from d8e95a8 to 1fe2d89 Compare January 2, 2026 12:05

carlobeltrame marked this pull request as ready for review January 2, 2026 12:05

carlobeltrame force-pushed the hyphenless-hyphenation branch from 1fe2d89 to 2454d27 Compare January 2, 2026 12:31

carlobeltrame added 3 commits January 2, 2026 13:35

Add changeset

68fb141

carlobeltrame force-pushed the hyphenless-hyphenation branch from 2454d27 to 68fb141 Compare January 2, 2026 12:35

carlobeltrame changed the title ~~WIP implement hyphenless word breaking~~ Implement hyphenless word breaking Jan 2, 2026

diegomura reviewed Jan 2, 2026

2 participants
