-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Implement hyphenless word breaking #3268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Implement hyphenless word breaking #3268
Conversation
🦋 Changeset detectedLatest commit: 68fb141 The changes in this PR will be included in the next version bump. This PR includes changesets to release 9 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Lmk when this is no longer a WIP and it's ready for review! |
|
Yeah, I have some issues installing canvas with node-gyp on my current system. I'll let you know once it's ready. |
d8e95a8 to
1fe2d89
Compare
|
@diegomura this is ready for review now. I have so far taken the following executive decisions (subject to discussion):
Let me know if you disagree on any of these choices. Once we have these resolved, I can work on a documentation PR for when the feature is released. |
1fe2d89 to
2454d27
Compare
Refs diegomura#3267 (comment) Custom hyphenation callbacks can now decide for each split point, whether a hyphen should be inserted at the end of the line when breaking at said split point. This change is backwards compatible. If a custom hyphenation callback returns an array of strings, it is assumed that all except the last part should get a hyphen added when breaking. Hyphens can be turned off for all parts of a word by returning an object in the format: `{ parts: <string array as previously>, hyphen: null }` This may be useful for URLs or similar technical identifiers which should be broken but should not be hyphenated. Alternatively, the hyphen can be activated or deactivated for each part separately, by returning an array of part objects: ``` [ { string: 'blue', hyphen: '-' }, { string: 'ish-', hyphen: null }, { string: 'green', hyphen: null } ] ``` For now, only a dash `'-'` or `null` are valid choices for `hyphen`. Also, for now, the last part of a word is always forced to `hyphen: null`. Finally, unless specified specifically, parts ending in a dash will not get an added hyphen.
Since our default hyphenation algorithm is optimized for English and western language text, we can make the assumption that line breaking on dashes is usually correct. The previous changes already made sure that no duplicate hyphen is inserted after a hyphen.
2454d27 to
68fb141
Compare
|
Thanks @carlobeltrame ! I'll check the code soon. In the meantime
Why is this? Seems straighforward to support at least other code points right? |
Ah yes, forgot to explain. In the Knuth-Plass algorithm, the width of the added hyphen character is currently hardcoded to be 5, but this assumption is stretched much further if we allow anything other than |
| hyphenatedWord.hyphen === null ? null : undefined, | ||
| ), | ||
| ); | ||
| if (normalized.length > 0) normalized[normalized.length - 1].hyphen = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? Doesn't K&P handles this by nod adding a penalty at the end of words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could omit this. I felt like it would be good for debuggability to normalize this already here, but if you insist, we can remove it.
| : hyphenatedWord.parts.map((part) => | ||
| normalizeHyphenatedPart( | ||
| part, | ||
| hyphenatedWord.hyphen === null ? null : undefined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this quite confusing. All we need to do here is transform strings into { string, hyphen }, hyphen being hyphenatedWord.hyphen on each item . Can't we just do it here right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are saying it should be the following?
: hyphenatedWord.parts.map((part) => ({
string: part,
hyphen: hyphenatedWord.hyphen as '-',
}));This breaks two tests.
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Failed Tests 2 ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
FAIL tests/layout/wrapWords.test.ts > wrapWords > with builtin engine > should split at hyphen and not duplicate hyphen in hyphenation
AssertionError: expected [ Array(5) ] to deeply equal [ …(5) ]
- Expected
+ Received
Array [
Object {
- "hyphen": null,
+ "hyphen": "-",
"string": "Lo-",
},
Object {
"hyphen": null,
"string": "rem",
},
Object {
"hyphen": null,
"string": " ",
},
Object {
"hyphen": "-",
"string": "ip",
},
Object {
"hyphen": null,
"string": "sum-",
},
]
❯ tests/layout/wrapWords.test.ts:257:32
255| });
256|
257| expect(result.syllables).toEqual([
| ^
258| { string: 'Lo-', hyphen: null },
259| { string: 'rem', hyphen: null },
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[1/2]⎯
FAIL tests/layout/wrapWords.test.ts > wrapWords > with custom hyphenation callback > should tolerate missing hyphen specification
AssertionError: expected [ …(4) ] to deeply equal [ …(4) ]
- Expected
+ Received
Array [
Object {
"hyphen": null,
"string": "Lorem",
},
Object {
"hyphen": null,
"string": " ",
},
Object {
- "hyphen": "-",
+ "hyphen": undefined,
"string": "ip",
},
Object {
"hyphen": null,
"string": "sum",
},
]
❯ tests/layout/wrapWords.test.ts:395:32
393| });
394|
395| expect(result.syllables).toEqual([
| ^
396| { string: 'Lorem', hyphen: null },
397| { string: ' ', hyphen: null },
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[2/2]⎯
Test Files 1 failed | 86 passed (87)
Tests 2 failed | 705 passed (707)
Start at 14:14:23
Duration 1.02s (transform 4.18s, setup 2ms, collect 8.15s, tests 307ms, environment 4ms, prepare 1.75s)
The first one is because your proposal ignores a trailing hyphen in a syllable: "Lo-rem" would wrongly be split to
[
{ string: 'Lo-', hyphen: '-' }, // <-- wrong hyphen character here
{ string: 'rem', hyphen: null },
}The second test failure is due to an edge case when a custom hyphenation callback returns something like { parts: ['Lo', 'rem'], hyphen: undefined } or { parts: ['Lo', 'rem'], hyphen: 'X' } or even { parts: ['Lo', 'rem'] }.
We could choose to remove that second test (titled "should tolerate missing hyphen specification") and call this case undefined behaviour. But the first test indicates a real error in your reasoning.
Alternatively, we could refactor this ternary to let + if:
let normalized: HyphenatedPart[];
if (hyphenatedWord instanceof Array) {
normalized = hyphenatedWord.map((part) => normalizeHyphenatedPart(part))
} else {
normalized = hyphenatedWord.parts.map((part) =>
normalizeHyphenatedPart(
part,
hyphenatedWord.hyphen === null ? null : undefined,
),
);
}But I'm not sure that really addresses your concerns about the confusing nature of this transformation.
| width: number, | ||
| penalty: number, | ||
| flagged: number, | ||
| hyphen?: '-', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does K&P need to have this char? Seems unnecessary to expose this. K&P just knows about widths which should be enough right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed in the linebreaker algorithm after K&P, and reconstructing the specified hyphen given only the box starts and ends would be much harder.
Have a look at the usage of this in the linebreaker. If we aren't allowed to store the hyphen value on the penalty box, that if condition first has to work out which syllable corresponds to prevNode.
It's possible if you want to avoid this at all costs, but it's going to be more complex to implement.
In other words, it's for the same reason that K&P box nodes get the start and end indices. K&P does not need it, but the linebreaker algorithm afterwards needs it.
Implements the feature discussed in #3267
After the previous implementation of #3188 being partially reverted in #3267, this PR re-implements the following use cases:
Fixes #1642
Fixes #2456
Fixes #2564
Fixes #2739
Re-enables better solutions for the following issues:
Fixes #1238
Fixes #1380
Fixes #1416
Fixes #1662