Add character count Intl.Segmenter support with customisable count function#6995

Open
colinrotherham wants to merge 18 commits into
alphagov:mainfrom
colinrotherham:character-count-segmenter

Conversation

@colinrotherham
Contributor

@colinrotherham colinrotherham commented Apr 28, 2026

This PR updates the character count component to (optionally) use Intl.Segmenter

It's a non-breaking change and requires a new countType option to be set:

  • countType: "length" (default) continues to count code points
  • countType: "characters" counts graphemes (user-perceived characters)
  • countType: "words" counts words regardless of punctuation
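To make the difference between the three count types concrete, here is a minimal sketch. `countText` is a hypothetical helper rather than the component's implementation, and it assumes "length" maps to the string's `length` property (which in JavaScript is measured in UTF-16 code units).

```javascript
// Hypothetical helper for illustration; not the component's implementation
function countText(text, countType = 'length', locale = 'en') {
  if (countType === 'length') {
    // String length in JavaScript is measured in UTF-16 code units
    return text.length
  }

  const granularity = countType === 'words' ? 'word' : 'grapheme'
  const segmenter = new Intl.Segmenter(locale, { granularity })

  let count = 0
  for (const segment of segmenter.segment(text)) {
    // Word segmentation also yields spaces and punctuation, so skip those
    if (countType === 'words' && !segment.isWordLike) continue
    count++
  }
  return count
}

// Family emoji: three emoji joined by two zero-width joiners
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'

countText(family, 'length') // 8
countText(family, 'characters') // 1
countText('Hello, world!', 'words') // 2
```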

Closes #1104, #1364 and partly #2888

Test coverage

I've held off on tests until you're happy with the proposal (and comments) in:


These changes are lifted from NHS.UK frontend in:

With some related configuration changes in:

Comment on lines +36 to +40
/**
 * @private
 * @type {Intl.Segmenter | null}
 */
segmenter = null
Contributor Author

Not sure whether this should be @private?

For example this.segmenter can be accessed via the custom count function:

createAll(CharacterCount, {
  countFunction(text) {
    // this.segmenter
  }
})

Contributor

@36degrees 36degrees left a comment

I haven't had a chance to do a full review, but my gut feeling is that we should avoid scenarios where the character count can give a different result depending on what browser you're using.

That being the case, in browsers that do not support Intl.Segmenter I think we should fall back to the no-JS behaviour, rather than using a regex.

@colinrotherham
Contributor Author

Thanks @36degrees

Did you have any thoughts on the new countType Nunjucks option?

I haven't had a chance to do a full review, but my gut feeling is that we should avoid scenarios where the character count can give a different result depending on what browser you're using.

That makes sense, and means we can drop the fallback regexes too

For balance, there are some examples where browser differences are expected:

But for the latter issue polyfill weight was involved

That being the case, in browsers that do not support Intl.Segmenter I think we should fall back to the no-JS behaviour, rather than using a regex.

We're happy with this though and I can update the PR

@colinrotherham colinrotherham force-pushed the character-count-segmenter branch from 66b3afb to 405410f Compare May 11, 2026 14:11
@colinrotherham
Contributor Author

Pushed an update to do this:

That being the case, in browsers that do not support Intl.Segmenter I think we should fall back to the no-JS behaviour, rather than using a regex.

  • Current options maxlength and maxwords work as usual for backwards compatibility
  • New options countType: "characters" or countType: "words" use Intl.Segmenter
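For reference, the support check behind that fallback could look roughly like this. Both function names are hypothetical, the `intl` parameter exists only so the check can be exercised without a real legacy browser, and returning `null` is a stand-in for "stop initialising and leave the no-JS behaviour in place" (the component itself may signal that differently).

```javascript
// Hypothetical helper: true when the runtime provides Intl.Segmenter
function supportsSegmenter(intl = globalThis.Intl) {
  return (
    typeof intl === 'object' &&
    intl !== null &&
    typeof intl.Segmenter === 'function'
  )
}

// Hypothetical initialiser: no regex fallback, just "segmenter or nothing"
function createSegmenterOrNull(granularity, locale = 'en', intl = globalThis.Intl) {
  if (!supportsSegmenter(intl)) {
    // Caller should leave the server-rendered (no-JS) message untouched
    return null
  }
  return new intl.Segmenter(locale, { granularity })
}

// Simulate an older browser by passing an Intl object without Segmenter
createSegmenterOrNull('word', 'en', {}) // null
```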

Member

@romaricpascal romaricpascal left a comment

Cheers for proposing this @colinrotherham and the care around avoiding a breaking change 🙌🏻

Besides the comments about technical implementation, I'm concerned about two things:

  1. keeping maxlength as the source for the maximum for both characters and words may be a bit confusing, both to users used to it applying only to characters and when switching between codebases using different versions of GOV.UK Frontend. I'd be keen to use a completely new option (say maximum) that'd be associated with the new countType to avoid confusion
  2. only offering to count words the way Intl.Segmenter does with the countType option that the component is moving towards. I think we should check how backends count words to make sure our default matches. It might be that we need two ways of counting: one counting like Intl.Segmenter and another considering only whitespace as we do now, even if it does not match the Unicode definition of a word.
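To show where the two ways of counting words diverge, here is a small comparison. The whitespace version uses the regex from the current code; the segmenter version is a sketch with hypothetical function names and an assumed "en" locale.

```javascript
// Current approach: count runs of non-whitespace characters
function countWordsByWhitespace(text) {
  return text.match(/\S+/g)?.length ?? 0
}

// Segmenter approach: count only segments flagged as word-like
function countWordsBySegmenter(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return [...segmenter.segment(text)].filter((s) => s.isWordLike).length
}

const sample = 'Rock - paper - scissors'

countWordsByWhitespace(sample) // 5 (each standalone hyphen counts as a word)
countWordsBySegmenter(sample) // 3
```

A backend that splits on whitespace would agree with the first result, which is why checking the server-side definition matters before picking a default.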

Let me know what you think 😊

Comment thread packages/govuk-frontend/src/govuk/common/configuration.mjs Outdated
Comment thread packages/govuk-frontend/src/govuk/init.mjs
Comment thread packages/govuk-frontend/src/govuk/components/character-count/character-count.mjs Outdated
this.count = text.match(/\S+/g)?.length ?? 0
break
}
this.count = this.countFunctions[countType].call(this, text)
Member

suggestion Rather than executing the function as if it were a method of the CharacterCount, passing the segmenter as a second argument, inside an options object, makes the boundary between the component and the count function clearer and lets us control what's exposed to the count function.

Suggested change
- this.count = this.countFunctions[countType].call(this, text)
+ this.count = this.countFunctions[countType](text, {segmenter: this.segmenter})

Member

Maybe we could use a getter to only instantiate the segmenter if the function accesses it, but that's more an optimisation than anything.
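A rough sketch of that getter idea, assuming the context is a plain object handed to the count function. `createCountContext` and its factory argument are hypothetical names used only for illustration.

```javascript
// Hypothetical factory: the segmenter is only built on first access
function createCountContext(createSegmenter) {
  let cached = null

  return {
    get segmenter() {
      if (cached === null) {
        cached = createSegmenter()
      }
      return cached
    }
  }
}

const context = createCountContext(
  () => new Intl.Segmenter('en', { granularity: 'grapheme' })
)

// Destructuring `segmenter` triggers the getter, so the segmenter is created here
const countGraphemes = (text, { segmenter }) =>
  [...segmenter.segment(text)].length

countGraphemes('abc', context) // 3
```

A count function that never reads `segmenter` would never pay for constructing one.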

Contributor Author

@colinrotherham colinrotherham May 13, 2026

Sadly that means the custom countFunction would lose access to:

  • this.separator to split words yourself
  • this.segmenter to filter the segments yourself
  • this.$textarea to get the value yourself (e.g. trim, normalised line endings, row count etc)

Appreciate that all of these things are accessible anyway as @private isn't really private

Let me know if you'd like me to do anything

Member

I think I'm fine with the countFunction not having access to much at the start. We can always expand what we provide to the function in minor releases, but we can only restrict what the function receives in breaking releases if we went too far at the start.

Using an object as a second parameter would also clarify what this represents in the component's count functions (where you may think it's the countFunctions object where the functions are defined if you miss the typings).

Overall, if that's OK, I'd prefer we:

  • pass a second argument to the count function rather than use this (should have flagged that as an 'issue' rather than a 'suggestion')
  • restrict what the function receives to only the segmenter for now and expand as demand grows, keeping the separator in the function for counting words (thinking that long term, if we want people to manipulate texts before counting 'like the component does', we'd be better off exposing the countFunctions themselves rather than granular details of their implementation).

Hope that makes sense 😊

Contributor Author

Instead of a second parameter I've set a custom (restricted) this and updated the types:

- this.count = countFunction.call(this, text)
+ this.count = countFunction.call({ segmenter: this.segmenter }, text)

Have a look at the diff for my last push to see this.separator has been removed too

Contributor Author

To avoid recreating the count function context object every time, it could be persisted?

Either using .call() as this

  // Limit access via `this` when calling the count function to prevent
  // unintended access to internal properties and methods
- this.count = countFunction.call(
-   {
-     config: this.config,
-     segmenter: this.segmenter
-   },
-   text
- )
+ this.count = countFunction.call(this.countFunctionContext, text)

Or as a 2nd param as you prefer:

  // Limit access via `this` when calling the count function to prevent
  // unintended access to internal properties and methods
- this.count = countFunction.call(
-   {
-     config: this.config,
-     segmenter: this.segmenter
-   },
-   text
- )
+ this.count = countFunction(text, this.countFunctionContext)
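From the consumer's side, the 2nd param option might look like the sketch below. `CounterSketch` is a hypothetical stand-in for the component reduced to this one call site, and the context carries only `segmenter` here for brevity (an assumption, since the diff above also passes `config`).

```javascript
// Hypothetical stand-in for the component, reduced to the call site above
class CounterSketch {
  constructor(countFunction, locale = 'en') {
    this.countFunction = countFunction
    this.segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' })

    // Built once and reused, rather than recreated on every update
    this.countFunctionContext = Object.freeze({ segmenter: this.segmenter })
    this.count = 0
  }

  update(text) {
    this.count = this.countFunction(text, this.countFunctionContext)
    return this.count
  }
}

// Example custom count function: graphemes, ignoring whitespace
function countNonWhitespaceGraphemes(text, { segmenter }) {
  let count = 0
  for (const { segment } of segmenter.segment(text)) {
    if (segment.trim() !== '') count++
  }
  return count
}

new CounterSketch(countNonWhitespaceGraphemes).update('a b\nc') // 3
```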

Member

Would definitely prefer a context as a parameter so that things are explicit rather than accessed through this, thanks. Feels more natural as second parameter. Not against caching it on the instance if you think that's an issue for performance to re-create it at each call.

Another thought that occurred to me is that if this.config.maxwords is set, there is no this.segmenter, right? This means that the words function could branch on whether this.segmenter is defined rather than the value in the config. That would allow the public API for the context to be narrower.

Potentially, the characters function could work the same way for consistency. That would also clearly split which part of the component are responsible for what:

  • constructor decides whether to create a segmenter or not based on the config
  • count functions decide how to count based on whether they have a segmenter or not
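Roughly, that split of responsibilities could be sketched like this. `WordCounterSketch` and `countWords` are hypothetical names, the "en" locale is an assumption, and the sketch presumes Intl.Segmenter is available (the unsupported case is out of scope here).

```javascript
// Hypothetical sketch: the constructor owns all config-related decisions
class WordCounterSketch {
  constructor(config) {
    // Deprecated maxwords keeps whitespace counting, so no segmenter is created
    this.segmenter =
      config.maxwords !== undefined
        ? null
        : new Intl.Segmenter('en', { granularity: 'word' })

    this.countFunctionContext = { segmenter: this.segmenter }
  }

  count(text) {
    return countWords(text, this.countFunctionContext)
  }
}

// The count function only looks at whether it was given a segmenter
function countWords(text, { segmenter }) {
  if (!segmenter) {
    return text.match(/\S+/g)?.length ?? 0
  }

  let count = 0
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) count++
  }
  return count
}

new WordCounterSketch({ maxwords: 10 }).count('a - b') // 3
new WordCounterSketch({ countType: 'words' }).count('a - b') // 2
```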

How does that sound?

Contributor Author

@colinrotherham colinrotherham May 21, 2026

Sounds good on the 2nd parameter

I'm a little bit lost on the rest 😆

Another thought that occurred to me is that if this.config.maxwords is set, there is no this.segmenter, right? This means that the words function could branch on whether this.segmenter is defined rather than the value in the config. That would allow the public API for the context to be narrower.

So this is what I did originally—I think?

Where branching on this.segmenter for countType: "words" gave different results by browser

But it differs to the feelings Ollie set in an earlier comment where he said:

I haven't had a chance to do a full review, but my gut feeling is that we should avoid scenarios where the character count can give a different result depending on what browser you're using.

That being the case, in browsers that do not support Intl.Segmenter I think we should fall back to the no-JS behaviour, rather than using a regex.

So from this we've determined:

  • Users that set maxwords (deprecated) should get the existing regex word count
  • Users that set countType: "words" should get segmenter word counting where supported
  • Users that set countType: "words" should get the no-JS behaviour where NOT supported

i.e. If you opt-in to use Intl.Segmenter then that's what you get (or the no-JS behaviour)
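The three cases above could be written down as a small decision function. This is only a sketch of the rule as described in this thread: `resolveWordCounting` and the returned labels are hypothetical and do not appear in the PR.

```javascript
// Hypothetical decision function mirroring the three bullets above
function resolveWordCounting(config, segmenterSupported) {
  if (config.maxwords !== undefined) {
    // Deprecated option: keep the existing regex word count in every browser
    return 'regex'
  }

  if (config.countType === 'words') {
    // Opt-in: segmenter where supported, otherwise leave the no-JS behaviour
    return segmenterSupported ? 'segmenter' : 'no-js'
  }

  return 'not-counting-words'
}

resolveWordCounting({ maxwords: 150 }, false) // 'regex'
resolveWordCounting({ countType: 'words', maxlength: 150 }, true) // 'segmenter'
resolveWordCounting({ countType: 'words', maxlength: 150 }, false) // 'no-js'
```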

Hope that's still alright?


Regarding browser support

Knowing that Intl.Segmenter is in Baseline 2024, compare the following queries:

Note: There sadly isn't a feature query intl-segmenter like there is for intl-pluralrules

Contributor Author

Would definitely prefer a context as a parameter so that things are explicit rather than accessed through this, thanks. Feels more natural as second parameter. Not against caching it on the instance if you think that's an issue for performance to re-create it at each call.

✅ Done (pushed)

Member

So this is what I did originally—I think?

Where branching on this.segmenter for countType: "words" gave different results by browser

But it differs to the feelings Ollie set in an earlier comment where he said:

I think I didn't explain well. Found it easier to attach a comment to the countFunctionContext to explain 😊

Comment thread packages/govuk-frontend/src/govuk/components/character-count/character-count.mjs Outdated
@colinrotherham
Contributor Author

colinrotherham commented May 12, 2026

Thanks @romaricpascal

You might have missed that word counting retains the current approach if maxwords is set 🙌

Regarding maximum versus using maxlength, wouldn't the latter mean zero changes are necessary should segmenter become the default in a future major release?

✅ GOV.UK Frontend v2.2.0+

The maxlength option always works

{{ govukCharacterCount({
  label: {
    text: "Always works"
  },
  name: "example",
  maxlength: 200
}) }}

@colinrotherham
Contributor Author

Do think about a future opt-out though, like segmenter: false?

Or having the word separator regex as an option? If set, bypassing the segmenter

Keen to lock in the API so we can release this on NHS.UK frontend

@colinrotherham colinrotherham force-pushed the character-count-segmenter branch from 405410f to 579f202 Compare May 13, 2026 10:32
@colinrotherham colinrotherham force-pushed the character-count-segmenter branch from 579f202 to 29c44bf Compare May 14, 2026 11:42
@romaricpascal
Member

Regarding maximum versus using maxlength, wouldn't the latter mean zero changes are necessary should segmenter become the default in a future major release?

That's a great point, hadn't thought of that. 🙌🏻

You might have missed that word counting retains the current approach if maxwords is set 🙌

My worry was for after we remove maxwords in the next major release (as it's being rightly deprecated in this PR).
Both your propositions of a separator and a segmenter: false opt-out would be a way to work around that, so I think that decision can be delayed until v7.0.0. Both may be useful as well:

  • separator to offer arbitrary splitting
  • segmenter: false to avoid creating segmenters unnecessarily if your countFunction does not need one

Keen to lock in the API so we can release this on NHS.UK frontend

Appreciate that'd reduce the divergence between both our Design Systems. However, we can't guarantee our responsiveness when looking at a topic we're not currently focusing on (like what happened for this PR), so please don't stay stuck because of us.

Unless the (deprecated) `maxwords` option is used
…nter

@colinrotherham colinrotherham force-pushed the character-count-segmenter branch from 29c44bf to b18ddd1 Compare May 14, 2026 13:31
Comment on lines +167 to +170
this.countFunctionContext = {
  config: this.config,
  segmenter: this.segmenter
}
Member

Having that context in place makes it easier to explain what I was on about with using the segmenter when counting words.

The idea is still to throw on line 148 if the Segmenter API is not available, not to fall back on the other way of counting when the API is not there.

Because the component only continues initialising when the Segmenter API is available (whenever a segmenter is needed), we can reduce the context to the following. This keeps the initial public API narrower (less risk of a breaking change in the future) and keeps all config-related computations internal to the constructor.

Suggested change
- this.countFunctionContext = {
-   config: this.config,
-   segmenter: this.segmenter
- }
+ this.countFunctionContext = {
+   segmenter: this.segmenter
+ }

Then words can check if (this.segmenter) instead of if (this.config.maxwords).

Hope that makes more sense, aim is to control how much of a public API we offer at the start to avoid having to roll back on it down the line 😊

Contributor Author

Thanks, glad that made sense

Hmm you might want to hold off hiding the config for a major breaking release though?

Users that provide countFunction will use the config to:

  • Determine whether the (deprecated) maxwords option is used
  • Determine whether they're counting "length", "characters" or "words"
  • Provide their own non-segmenter fallback based on config.countType

Especially when passing JavaScript configuration via initAll() or createAll() because a single application-wide countFunction will at least need to know the config.countType?

Without a config they can't provide their own fallbacks should the non-JS fallback be unsuitable
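As an illustration of that use case, here is a sketch of a single application-wide count function. It assumes the context keeps the `{ config, segmenter }` shape shown earlier in this thread and that any segmenter passed in was created with the granularity matching `config.countType`. `appCountFunction` and its fallbacks are hypothetical.

```javascript
// Hypothetical application-wide count function shared by every instance
function appCountFunction(text, { config, segmenter }) {
  if (segmenter) {
    const segments = [...segmenter.segment(text)]
    return config.countType === 'words'
      ? segments.filter((s) => s.isWordLike).length
      : segments.length
  }

  // Own fallbacks, chosen per count type, when no segmenter is provided
  switch (config.countType) {
    case 'words':
      return text.match(/\S+/g)?.length ?? 0
    case 'characters':
      // Approximation: code points rather than graphemes
      return Array.from(text).length
    default:
      return text.length
  }
}

const words = new Intl.Segmenter('en', { granularity: 'word' })

appCountFunction('a - b', { config: { countType: 'words' }, segmenter: words }) // 2
appCountFunction('a - b', { config: { countType: 'words' }, segmenter: null }) // 3
```

Without `config` in the context, the `switch` above has nothing to branch on.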

Development

Successfully merging this pull request may close these issues.

Character count component counts code points, not characters

3 participants