[v10] Add character count countType: "words" option using Intl.Segmenter#1899
[v10] Add character count countType: "words" option using Intl.Segmenter#1899colinrotherham wants to merge 4 commits into
countType: "words" option using Intl.Segmenter#1899Conversation
4933601 to
e51d1a4
Compare
6b80700 to
07676d2
Compare
e51d1a4 to
f61329c
Compare
07676d2 to
649b198
Compare
f61329c to
7fc007c
Compare
649b198 to
e76c2e7
Compare
8008c6a to
9e21bc3
Compare
6c80311 to
3290d27
Compare
9e21bc3 to
ea4e0f4
Compare
3290d27 to
9492e12
Compare
ea4e0f4 to
25511df
Compare
9492e12 to
e93b9da
Compare
25511df to
fa996f5
Compare
fa996f5 to
5c1cbb6
Compare
| }) | ||
| } | ||
|
|
||
| // Use improved word splitting if supported |
There was a problem hiding this comment.
If we keep both regexes then we have 3 levels of support for this feature, which seems like it would complicate testing and responding to issues, even if 2 of them have the same behaviour
- Doesn't support Intl.Segementer or the regex classes -> basic regex used by maxwords
- Doesn't support Intl.Segmenter but does support the regex classes -> good regex
- Supports Intl.Segmenter -> Intl.Segmenter
I think it would be fine to fall back to the basic regex whenever Intl.Segmenter is not available, rather than including a 3rd implementation. I.e. combine 1 & 2 into one implementation.
I might be wrong, but to me countType: 'words' feels like a soft limit that you might use if there is no hard constraint on the server side. So treating hyphenated words as one word probably won't break anything?
Another alternative might be to provide a polyfill that uses unicode ranges [\u0009-\u000D...] instead of the character classes, and use that as the fallback for Intl.Segmenter.
If we decide to have behaviour that differs across browsers we should probably document that wherever we introduce the feature.
There was a problem hiding this comment.
Thanks for looking at this @MatMoore
I've added a reply about the Babel regex transform over in #1899 (comment)
These are the browsers that shipped type="module" without Unicode character class escape support:
- Chrome 61–63
- Firefox 60–77
- Safari 11
We could simply upgrade the basic regex when opting in via countType: "words"?
- this.separator = /\s+/g;
+ this.separator = /[\s\-‑–—.,;:!\\/]+/g;Whilst it was nice to use \p escapes (via the u flag) since so few browsers lacked support, this "good regex" is probably good enough for all other browsers lacking Intl.Segmenter
There was a problem hiding this comment.
yeah that seems reasonable to me, it avoids the question of whether \p escapes are supported or not, and then when we drop maxwords we can get rid of the old regex.
There was a problem hiding this comment.
I've removed the various levels of regex (by browsers support) from this PR
This PR has been updated to apply the suggestion in alphagov/govuk-frontend#6995 (review) so that browsers without Intl.Segmenter fall back to the no-JS version
- Current options
maxlengthandmaxwordswork as usual for backwards compatibility - New options
countType: "characters"orcountType: "words"useIntl.Segmenter
e93b9da to
078f07a
Compare
5c1cbb6 to
7c7bae5
Compare
7c7bae5 to
0cc0ba3
Compare
078f07a to
7b1a233
Compare
Unless the (deprecated) `maxwords` option is used
…nter Unless the (deprecated) `maxwords` option is used
0cc0ba3 to
c71dbcd
Compare
|



Description
This PR extends #1895 to add character count support for
countType: "words"using Intl.Segmenter{{ characterCount({ label: { text: "Can you provide more detail?" }, name: "more-detail", - maxwords: 200 + maxlength: 200, + countType: "words" }) }}Issues with separator regex
The existing word count behaviour uses
/\S+/gto count all consecutive non-whitespace charactersBut consider this phrase where hyphens and em-dashes are used as separators:
It matches only 3 words currently:
Yet using the segmenter with
granularity: "word"it matches 6 words:To solve this, but avoid breaking changes:
countType: "wordswill use Intl.Segmenter in modern browsersmaxwords(deprecated) will use the existing word count behaviourBut what about older browsers that lack Intl.Segmenter support?
I've updated the word separator regex based on this comment: alphagov/govuk-frontend#1364 (comment)
Checklist