Add character count Intl.Segmenter support with customisable count function#6995
Add character count Intl.Segmenter support with customisable count function#6995colinrotherham wants to merge 18 commits into
Intl.Segmenter support with customisable count function#6995Conversation
| /** | ||
| * @private | ||
| * @type {Intl.Segmenter | null} | ||
| */ | ||
| segmenter = null |
There was a problem hiding this comment.
Not sure whether this should be @private?
For example this.segmenter can be accessed via the custom count function:
createAll(CharacterCount, {
countFunction(text) {
// this.segmenter
}
})
36degrees
left a comment
There was a problem hiding this comment.
I haven't had a chance to do a full review, but my gut feeling is that we should avoid scenarios where the character count can give a different result depending on what browser you're using.
That being the case, in browsers that do not support Intl.Segementer I think we should fall back to the no-JS behaviour, rather than using a regex.
|
Thanks @36degrees Did you have any thoughts on the new
That makes sense, and means we can drop the fallback regexes too For balance, there are some examples where browser differences are expected:
But for the latter issue polyfill weight was involved
We're happy with this though and I can update the PR |
66b3afb to
405410f
Compare
|
Pushed an update to do this:
|
romaricpascal
left a comment
There was a problem hiding this comment.
Cheers for proposing this @colinrotherham and the care around avoiding a breaking change 🙌🏻
Besides the comments about technical implementation, I'm concerned about a two things:
- keeping
maxlengthas the source for the maximum for both characters may be a bit confusing, both to users used to it being used only for characters and when switching codebases using different versions of GOV.UK Frontend. I'd be keen to use a completely new option (saymaximum) that'll be associated to the newcountTypeto avoid confusion - only offering to count words the way
Intl.Segmenterdoes with thecountTypeoption that the component is moving towards. I think we should check which way backends count words to make sure our default matches. It might be that we need two ways of counting: one counting likeIntl.Segmenterand another only considering the whitespace as we do now, even if it does not match Unicode definition of a word.
Let me know what you think 😊
| this.count = text.match(/\S+/g)?.length ?? 0 | ||
| break | ||
| } | ||
| this.count = this.countFunctions[countType].call(this, text) |
There was a problem hiding this comment.
suggestion Rather than executing the function like if it was a method of the CharacterCount, passing the segmenter as a second argument, inside an options object makes the boundary between the component and the count function clearer, allowing us to control what's exposed to the count function.
| this.count = this.countFunctions[countType].call(this, text) | |
| this.count = this.countFunctions[countType](text, {segmenter: this.segmenter}) |
There was a problem hiding this comment.
Maybe we could use a getter to only instantiate the segmenter if the function accesses it, but that's more an optimisation than anything.
There was a problem hiding this comment.
Sadly that means the custom countFunction would lose access to:
this.separatorto split words yourselfthis.segmenterto filter the segments yourselfthis.$textareato get the value yourself (e.g. trim, normalised line endings, row count etc)
Appreciate that all of these things are accessible anyway as @private isn't really private
Let me know if you'd like me to do anything
There was a problem hiding this comment.
I think I'm fine witht the countFunction not having access to much at the start. We can always expand what we provide to the function in minor releases, but we can only restrict what the function receives in breaking releases if we went too far at the start.
Using an object as a second parameter would also clarify what this represents in the component's count functions (where you may think it's the countFunctions object where the functions are defined if you miss the typings).
Overall, if that's OK, I'd prefer we:
- pass a second argument to the count function rather than use
this(should have flagged that as an 'issue' rather than a 'suggestion') - restrict what the function receives to only the
segmenterfor now and expand as demand grows, keeping theseparatorin the function for counting words (thinking that long term, if we want people to manipulate texts before counting 'like the component does', we'd be better off exposing thecountFunctionsthemselves rather than granular details of their implementation).
Hope that makes sense 😊
There was a problem hiding this comment.
Instead of a second parameter I've set a custom (restricted) this and updated the types:
- this.count = countFunction.call(this, text)
+ this.count = countFunction.call({ segmenter: this.segmenter }, text)Have a look at the diff for my last push to see this.separator has been removed too
There was a problem hiding this comment.
To avoid recreating the count function context object every time, it could be persisted?
Either using .call() as this
// Limit access via `this` when calling the count function to prevent
// unintended access to internal properties and methods
- this.count = countFunction.call(
- {
- config: this.config,
- segmenter: this.segmenter
- },
- text
- )
+ this.count = countFunction.call(this.countFunctionContext, text)Or as a 2nd param as you prefer:
// Limit access via `this` when calling the count function to prevent
// unintended access to internal properties and methods
- this.count = countFunction.call(
- {
- config: this.config,
- segmenter: this.segmenter
- },
- text
- )
+ this.count = countFunction(text, this.countFunctionContext)There was a problem hiding this comment.
Would definitely prefer a context as a parameter so that things are explicit rather than accessed through this, thanks. Feels more natural as second parameter. Not against caching it on the instance if you think that's an issue for performance to re-create it at each call.
Another thought that occured to me is that if this.config.maxwords is set, there is no this.segmenter, right? This means that the words function could branch on whether this.segmenter is definer rather than the value in the config. That would allow the public API for the context to be narrower.
Potentially, the characters function could work the same way for consistency. That would also clearly split which part of the component are responsible for what:
- constructor decides whether to create a segmenter or not based on the config
- count functions decide how to count based on whether they have a segmenter or not
How does that sound?
There was a problem hiding this comment.
Sounds good on the 2nd parameter
I'm a little bit lost on the rest 😆
Another thought that occured to me is that if
this.config.maxwordsis set, there is nothis.segmenter, right? This means that the words function could branch on whetherthis.segmenteris definer rather than the value in the config. That would allow the public API for the context to be narrower.
So this is what I did originally—I think?
Where branching on this.segmenter for countType: "words" gave different results by browser
But it differs to the feelings Ollie set in an earlier comment where he said:
I haven't had a chance to do a full review, but my gut feeling is that we should avoid scenarios where the character count can give a different result depending on what browser you're using.
That being the case, in browsers that do not support
Intl.SegementerI think we should fall back to the no-JS behaviour, rather than using a regex.
So from this we've determined:
- Users that set
maxwords(deprecated) should the existing regex word count - Users that set
countType: "words"should get segmenter word counting where supported - Users that set
countType: "words"should get the no-JS behaviour where NOT supported
i.e. If you opt-in to use Intl.Segmenter then that's what you get (or the no-JS behaviour)
Hope that's still alright?
Regarding browser support
Knowing that Intl.Segmenter is in Baseline 2024 compare the following queries:
- Audience coverage: 94.9 % for
supports es6-module, versus - Audience coverage: 9.1 % for
supports es6-module and not baseline 2024
Note: There sadly isn't a feature query intl-segmenter like there is for intl-pluralrules
There was a problem hiding this comment.
Would definitely prefer a context as a parameter so that things are explicit rather than accessed through
this, thanks. Feels more natural as second parameter. Not against caching it on the instance if you think that's an issue for performance to re-create it at each call.
✅ Done (pushed)
There was a problem hiding this comment.
So this is what I did originally—I think?
Where branching on
this.segmenterforcountType: "words"gave different results by browserBut it differs to the feelings Ollie set in an earlier comment where he said:
I think I didn't explain well. Found it easier to attach a comment to the countFunctionContext to explain 😊
|
Thanks @romaricpascal You might have missed that word counting retains the current approach if Regarding ✅ GOV.UK Frontend v2.2.0+The {{ govukCharacterCount({
label: {
text: "Always works"
},
name: "example",
maxlength: 200
}) }} |
|
Do think about a future opt-out though, like Or having the word Keen to lock in the API so we can release this on NHS.UK frontend |
We must only use `characters` or `words` as the translation key prefix, regardless of the `countType` value
405410f to
579f202
Compare
579f202 to
29c44bf
Compare
That's a great point, hadn't thought of that. 🙌🏻
My worry was for after we remove
Appreciate that'd reduce the divergence between both our Design Systems. However, we can't guarantee our responsiveness when looking at a topic we're not currently focusing on (like what happened for this PR), so please don't stay stuck because of us. |
Unless the (deprecated) `maxwords` option is used
…nter Unless the (deprecated) `maxwords` option is used
29c44bf to
b18ddd1
Compare
| this.countFunctionContext = { | ||
| config: this.config, | ||
| segmenter: this.segmenter | ||
| } |
There was a problem hiding this comment.
Having that context in place makes it easier to explain what I was on about with using the segmenter when counting words.
The idea is still to throw on line 148 if the Segmenter API is not available, not to fallback on the other way of counting when the API is not there.
Because we know the component will only keep initialising when a segmenter is needed if the Segmenter API is available, we can reduce the context to the following, keeping the initial public API narrower (less risk of breaking change in the future) and keeping all config related computations internal to the constructor.
| this.countFunctionContext = { | |
| config: this.config, | |
| segmenter: this.segmenter | |
| } | |
| this.countFunctionContext = { | |
| segmenter: this.segmenter | |
| } |
Then words can check if (this.segmenter) instead of if (this.config.maxwords).
Hope that makes more sense, aim is to control how much of a public API we offer at the start to avoid having to roll back on it down the line 😊
There was a problem hiding this comment.
Thanks, glad that made sense
Hmm you might want to hold off hiding the config for a major breaking release though?
Users that provide countFunction will use the config to:
- Determine whether the (deprecated)
maxwordsoption is used - Determine whether they're counting
"length","characters"or"words" - Provide their own non-segmenter fallback based on
config.countType
Especially when passing JavaScript configuration via initAll() or createAll() because a single application-wide countFunction will at least need to know the config.countType?
Without a config they can't provide their own fallbacks should the non-JS fallback be unsuitable
This PR updates the character count component to (optionally) use Intl.Segmenter
It's a non-breaking change and requires a new
countTypeoption to be set:countType: "length"(default) continues to count code pointscountType: "characters"counts graphemes (user-perceived characters)countType: "words"counts words regardless of punctuationCloses #1104, #1364 and partly #2888
Test coverage
I've skipped on tests until you're happy with the proposal (and comments) in:
These changes are lifted from NHS.UK frontend in:
maxwordsand addcountTypeoption nhsuk/nhsuk-frontend#1893countFunctionoption nhsuk/nhsuk-frontend#1897countType: "characters"option using Intl.Segmenter nhsuk/nhsuk-frontend#1895countType: "words"option using Intl.Segmenter nhsuk/nhsuk-frontend#1899With some related configuration changes in: