@carlobeltrame
Collaborator

@carlobeltrame carlobeltrame commented Dec 30, 2025

Implements the feature discussed in #3267

After the previous implementation of #3188 was partially reverted in #3267, this PR re-implements the following use cases:
Fixes #1642
Fixes #2456
Fixes #2564
Fixes #2739
Re-enables better solutions for the following issues:
Fixes #1238
Fixes #1380
Fixes #1416
Fixes #1662

@changeset-bot

changeset-bot bot commented Dec 30, 2025

🦋 Changeset detected

Latest commit: 68fb141

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
| Name | Type |
| --- | --- |
| @react-pdf/textkit | Minor |
| @react-pdf/layout | Patch |
| @react-pdf/render | Patch |
| @react-pdf/renderer | Patch |
| next-14 | Patch |
| next-15 | Patch |
| @react-pdf/vite-example | Patch |
| @react-pdf/e2e-node-cjs | Patch |
| @react-pdf/e2e-node-esm | Patch |

@diegomura
Owner

Lmk when this is no longer a WIP and it's ready for review!

@carlobeltrame
Collaborator Author

Yeah, I have some issues installing canvas with node-gyp on my current system. I'll let you know once it's ready.

@carlobeltrame carlobeltrame force-pushed the hyphenless-hyphenation branch 3 times, most recently from d8e95a8 to 1fe2d89 Compare January 2, 2026 12:05
@carlobeltrame carlobeltrame marked this pull request as ready for review January 2, 2026 12:05
@carlobeltrame
Collaborator Author

@diegomura this is ready for review now. I have so far taken the following executive decisions (subject to discussion):

  • Since your first look at this PR, I have switched the types to use the wording "part" instead of "syllable". This is because in advanced use cases and in non-western languages, syllable does not necessarily describe the word parts.
  • The last part of a word is forced to always have no added hyphen, and this is not overridable like [..., { string: 'sum', hyphen: '-' }]. This is because IMO it's not a realistic use case, and having hyphens at the end of the word would require additional changes to the Knuth-Plass algorithm.
  • The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.
  • I have added new integration-style tests to wrapWords.test.ts. So far, the unit tests in there used a lot of mocking and unrealistic scenarios. I left those intact, but added more realistic tests which actually use the internal default wordHyphenation engine, in addition to some custom hyphenation callbacks. I am aware that the wordHyphenation engine has its own unit tests, but custom hyphenation callback tests wouldn't really fit in there. Let me know if you prefer a different place for the new integration tests.
  • I have implemented splitting after dashes in the default hyphenation algorithm.
  • I have not yet implemented automatic URL detection, because:
    • detecting URLs involves either a very complex Regexp or usage of a new dependency
    • in addition to URLs, email addresses are probably also a good candidate for the same rules, adding even more to the complexity of detection
    • In URLs, emails or other technical identifiers, setting hyphen: null may not be enough. Disabling the syllable breaking done by the hyphen library may or may not be necessary in addition, and breaking on e.g. slashes, question marks, ampersands etc. may be desired. Otherwise, the URL "react-pdf.org/advanced?foo=bar#someAnchor" in a very narrow container could be broken into the following (which may or may not be what devs want):
    re
    act-
    pdf.org/ad
    van
    ced?foo=bar#so
    me
    An
    chor
    
    IMO, a better hyphenation of this specific example would be:
    react-
    pdf.org/
    advanced?
    foo=bar
    #someAnchor
    

Let me know if you disagree on any of these choices. Once we have these resolved, I can work on a documentation PR for when the feature is released.
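
Without automatic URL detection, the preferred breaking of the example URL above could be approximated by a custom hyphenation callback. The following is a minimal sketch and not part of this PR: the function name `breakTechnicalIdentifier`, the `HyphenlessWord` type and the set of break characters are assumptions for illustration. It relies on the `{ parts, hyphen: null }` return format proposed in this PR.

```typescript
// Return shape proposed in this PR for turning hyphens off for a whole word.
// The interface name is hypothetical.
interface HyphenlessWord {
  parts: string[];
  hyphen: null;
}

// Break opportunities: after '-', '/', '?' or '&', and before '#'.
// This character set is an assumption for illustration only.
const BREAK_POINTS = /(?<=[-\/?&])|(?=#)/;

// Hypothetical callback: splits a URL-like word at technical separators and
// asks for no hyphen to be added at any of the resulting break points.
function breakTechnicalIdentifier(word: string): HyphenlessWord {
  const parts = word.split(BREAK_POINTS).filter((part) => part.length > 0);
  return { parts, hyphen: null };
}

console.log(
  breakTechnicalIdentifier('react-pdf.org/advanced?foo=bar#someAnchor').parts,
);
// → [ 'react-', 'pdf.org/', 'advanced?', 'foo=bar', '#someAnchor' ]
```

A real callback would still need to decide which words count as technical identifiers, which is exactly the detection problem described above.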

@carlobeltrame carlobeltrame force-pushed the hyphenless-hyphenation branch from 1fe2d89 to 2454d27 Compare January 2, 2026 12:31
Refs diegomura#3267 (comment)

Custom hyphenation callbacks can now decide, for each split point, whether
a hyphen should be inserted at the end of the line when breaking at that
split point.
This change is backwards compatible. If a custom hyphenation callback
returns an array of strings, it is assumed that all except the last part
should get a hyphen added when breaking.
Hyphens can be turned off for all parts of a word by returning an object
in the format:
`{ parts: <string array as previously>, hyphen: null }`
This may be useful for URLs or similar technical identifiers which
should be broken but should not be hyphenated.
Alternatively, the hyphen can be activated or deactivated for each part
separately, by returning an array of part objects:
```
[
  { string: 'blue', hyphen: '-' },
  { string: 'ish-', hyphen: null },
  { string: 'green', hyphen: null }
]
```

For now, only a dash `'-'` or `null` is a valid choice for `hyphen`.
Also, for now, the last part of a word is always forced to `hyphen: null`.
Finally, unless explicitly specified otherwise, parts ending in a dash will
not get an added hyphen.
Since our default hyphenation algorithm is optimized for English and
Western-language text, we can assume that line breaking on dashes is
usually correct.
The previous changes already made sure that no duplicate hyphen is
inserted after a hyphen.
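
The rules in this commit message can be summarized in a small normalization sketch. This is not the code from the PR: all names (`normalizeSketch`, `toPart`, `PartInput`, `CallbackResult`) are hypothetical, and applying the trailing-dash rule to part objects with a missing or invalid `hyphen` is an assumption.

```typescript
type Hyphen = '-' | null;
interface HyphenatedPart { string: string; hyphen: Hyphen }
interface PartInput { string: string; hyphen?: unknown }
// The three return formats described above (hypothetical type name).
type CallbackResult =
  | (string | PartInput)[]
  | { parts: string[]; hyphen?: unknown };

// Default when nothing valid is specified: parts that already end in a dash
// get no added hyphen, all others get '-'.
const defaultHyphen = (s: string): Hyphen => (s.endsWith('-') ? null : '-');

function toPart(input: string | PartInput, wordLevelNull: boolean): HyphenatedPart {
  if (typeof input === 'string') {
    return { string: input, hyphen: wordLevelNull ? null : defaultHyphen(input) };
  }
  if (input.hyphen === null) return { string: input.string, hyphen: null };
  if (input.hyphen === '-') return { string: input.string, hyphen: '-' };
  // Any other value (undefined, false, '', 'X', ...) counts as unspecified.
  return { string: input.string, hyphen: defaultHyphen(input.string) };
}

function normalizeSketch(result: CallbackResult): HyphenatedPart[] {
  let parts: HyphenatedPart[];
  if (Array.isArray(result)) {
    parts = result.map((part) => toPart(part, false));
  } else {
    // Only an explicit word-level `hyphen: null` turns hyphens off.
    const wordLevelNull = result.hyphen === null;
    parts = result.parts.map((part) => toPart(part, wordLevelNull));
  }
  // The last part of a word never gets an added hyphen.
  if (parts.length > 0) parts[parts.length - 1].hyphen = null;
  return parts;
}

console.log(normalizeSketch(['blue', 'ish-', 'green']));
console.log(normalizeSketch({ parts: ['Lo', 'rem'], hyphen: null }));
```

With the plain string array `['blue', 'ish-', 'green']`, the sketch yields the same part objects as the explicit example above.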
@carlobeltrame carlobeltrame force-pushed the hyphenless-hyphenation branch from 2454d27 to 68fb141 Compare January 2, 2026 12:35
@diegomura
Owner

Thanks @carlobeltrame ! I'll check the code soon. In the meantime

> The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.

Why is this? Seems straightforward to support at least other code points, right?

@carlobeltrame carlobeltrame changed the title WIP implement hyphenless word breaking Implement hyphenless word breaking Jan 2, 2026
@carlobeltrame
Collaborator Author

> Thanks @carlobeltrame ! I'll check the code soon. In the meantime
>
> > The hyphen character is forced to be either '-' or null for now. All other values (including undefined, false and the empty string) are silently changed to '-'.
>
> Why is this? Seems straightforward to support at least other code points, right?

Ah yes, forgot to explain. In the Knuth-Plass algorithm, the width of the added hyphen character is currently hardcoded to 5, and that assumption would be stretched much further if we allowed anything other than '-' as the hyphenation character. Also, the linebreaker algorithm currently adds the additional hyphen via its hardcoded code point. Supporting other characters or even multi-character strings (of which emoji are an edge case) would require much more sophisticated logic in there, while benefitting only extremely rare use cases.

hyphenatedWord.hyphen === null ? null : undefined,
),
);
if (normalized.length > 0) normalized[normalized.length - 1].hyphen = null;
Owner

Is this needed? Doesn't K&P handle this by not adding a penalty at the end of words?

Collaborator Author

Yes, we could omit this. I felt like it would be good for debuggability to normalize this already here, but if you insist, we can remove it.

: hyphenatedWord.parts.map((part) =>
normalizeHyphenatedPart(
part,
hyphenatedWord.hyphen === null ? null : undefined,
Owner

I find this quite confusing. All we need to do here is transform strings into { string, hyphen }, with hyphen being hyphenatedWord.hyphen on each item. Can't we just do it here right away?

Collaborator Author

@carlobeltrame carlobeltrame Jan 2, 2026

So you are saying it should be the following?

: hyphenatedWord.parts.map((part) => ({
    string: part,
    hyphen: hyphenatedWord.hyphen as '-',
  }));
This breaks two tests.
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Failed Tests 2 ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

 FAIL  tests/layout/wrapWords.test.ts > wrapWords > with builtin engine > should split at hyphen and not duplicate hyphen in hyphenation
AssertionError: expected [ Array(5) ] to deeply equal [ …(5) ]

- Expected
+ Received

  Array [
    Object {
-     "hyphen": null,
+     "hyphen": "-",
      "string": "Lo-",
    },
    Object {
      "hyphen": null,
      "string": "rem",
    },
    Object {
      "hyphen": null,
      "string": " ",
    },
    Object {
      "hyphen": "-",
      "string": "ip",
    },
    Object {
      "hyphen": null,
      "string": "sum-",
    },
  ]

 ❯ tests/layout/wrapWords.test.ts:257:32
    255|       });
    256| 
    257|       expect(result.syllables).toEqual([
       |                                ^
    258|         { string: 'Lo-', hyphen: null },
    259|         { string: 'rem', hyphen: null },

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[1/2]⎯

 FAIL  tests/layout/wrapWords.test.ts > wrapWords > with custom hyphenation callback > should tolerate missing hyphen specification
AssertionError: expected [ …(4) ] to deeply equal [ …(4) ]

- Expected
+ Received

  Array [
    Object {
      "hyphen": null,
      "string": "Lorem",
    },
    Object {
      "hyphen": null,
      "string": " ",
    },
    Object {
-     "hyphen": "-",
+     "hyphen": undefined,
      "string": "ip",
    },
    Object {
      "hyphen": null,
      "string": "sum",
    },
  ]

 ❯ tests/layout/wrapWords.test.ts:395:32
    393|       });
    394| 
    395|       expect(result.syllables).toEqual([
       |                                ^
    396|         { string: 'Lorem', hyphen: null },
    397|         { string: ' ', hyphen: null },

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[2/2]⎯

 Test Files  1 failed | 86 passed (87)
      Tests  2 failed | 705 passed (707)
   Start at  14:14:23
   Duration  1.02s (transform 4.18s, setup 2ms, collect 8.15s, tests 307ms, environment 4ms, prepare 1.75s)

The first one is because your proposal ignores a trailing hyphen in a syllable: "Lo-rem" would wrongly be split to

[
  { string: 'Lo-', hyphen: '-' }, // <-- wrong hyphen character here
  { string: 'rem', hyphen: null },
]

The second test failure is due to an edge case when a custom hyphenation callback returns something like { parts: ['Lo', 'rem'], hyphen: undefined } or { parts: ['Lo', 'rem'], hyphen: 'X' } or even { parts: ['Lo', 'rem'] }.

We could choose to remove that second test (titled "should tolerate missing hyphen specification") and call this case undefined behaviour. But the first test indicates a real error in your reasoning.

Alternatively, we could refactor this ternary to let + if:

  let normalized: HyphenatedPart[];
  if (hyphenatedWord instanceof Array) {
    normalized = hyphenatedWord.map((part) => normalizeHyphenatedPart(part));
  } else {
    normalized = hyphenatedWord.parts.map((part) =>
      normalizeHyphenatedPart(
        part,
        hyphenatedWord.hyphen === null ? null : undefined,
      ),
    );
  }

But I'm not sure that really addresses your concerns about the confusing nature of this transformation.

width: number,
penalty: number,
flagged: number,
hyphen?: '-',
Owner

Why does K&P need to have this char? Seems unnecessary to expose this. K&P just knows about widths which should be enough right?

Collaborator Author

@carlobeltrame carlobeltrame Jan 2, 2026

It's needed in the linebreaker algorithm after K&P, and reconstructing the specified hyphen given only the box starts and ends would be much harder.
Have a look at the usage of this in the linebreaker. If we aren't allowed to store the hyphen value on the penalty box, that if condition first has to work out which syllable corresponds to prevNode.
It's possible if you want to avoid this at all costs, but it's going to be more complex to implement.

In other words, it's for the same reason that K&P box nodes get the start and end indices. K&P does not need it, but the linebreaker algorithm afterwards needs it.
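
To illustrate the point, here is a simplified sketch of a post-processing step that reads the hyphen directly from the penalty node at each chosen break. The node shapes, the `buildLines` function and the representation of breaks as node indices are hypothetical and much simpler than the actual textkit linebreaker. Glue nodes are left out for brevity.

```typescript
// Simplified, hypothetical node types. Box nodes carry start/end indices into
// the text. Penalty nodes optionally carry the hyphen to add when breaking.
type KPNode =
  | { type: 'box'; width: number; start: number; end: number }
  | { type: 'penalty'; width: number; penalty: number; flagged: number; hyphen?: '-' };

// Builds line strings from the node list and the node indices at which the
// line-breaking algorithm chose to break.
function buildLines(text: string, nodes: KPNode[], breaks: number[]): string[] {
  const lines: string[] = [];
  let lineStart = 0;
  for (const breakIndex of [...breaks, nodes.length]) {
    let line = '';
    for (let i = lineStart; i < breakIndex; i++) {
      const node = nodes[i];
      if (node.type === 'box') line += text.slice(node.start, node.end);
    }
    // No lookup of the matching word part is needed: the penalty node at the
    // break already knows whether a hyphen must be appended.
    const breakNode = nodes[breakIndex];
    if (breakNode && breakNode.type === 'penalty' && breakNode.hyphen) {
      line += breakNode.hyphen;
    }
    lines.push(line);
    lineStart = breakIndex + 1;
  }
  return lines;
}

// 'bluish-green' split into 'blu', 'ish-' and 'green'. Only the first break
// point adds a hyphen. Widths and penalties are placeholder values.
const text = 'bluish-green';
const nodes: KPNode[] = [
  { type: 'box', width: 15, start: 0, end: 3 },
  { type: 'penalty', width: 5, penalty: 100, flagged: 1, hyphen: '-' },
  { type: 'box', width: 20, start: 3, end: 7 },
  { type: 'penalty', width: 0, penalty: 100, flagged: 1 },
  { type: 'box', width: 25, start: 7, end: 12 },
];

console.log(buildLines(text, nodes, [1, 3]));
// → [ 'blu-', 'ish-', 'green' ]
```

Without the `hyphen` field, the loop would have to map each break back to the word part it follows, using only the box start and end indices.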
