[v10] Add character count countType: "words" option using Intl.Segmenter#1899

Open
colinrotherham wants to merge 4 commits into support/10.x from character-count-word-segmenter
Conversation

@colinrotherham colinrotherham commented Apr 22, 2026

Description

This PR extends #1895 to add character count support for countType: "words" using Intl.Segmenter

  {{ characterCount({
    label: {
      text: "Can you provide more detail?"
    },
    name: "more-detail",
-   maxwords: 200
+   maxlength: 200,
+   countType: "words"
  }) }}

Issues with separator regex

The existing word count behaviour uses /\S+/g, which counts each run of consecutive non-whitespace characters as one word

But consider this phrase where hyphens and em-dashes are used as separators:

My mother-in-law—Wait, what?

It matches only 3 words currently:

["My", "mother-in-law—Wait,", "what?"]

Yet using the segmenter with granularity: "word" it matches 6 words:

["My", "mother", "in", "law", "Wait", "what"]
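The difference can be reproduced with a short sketch. It is illustrative only and not the component's code: the variable names are made up here, and the "en" locale is an assumption. It counts the same phrase with the /\S+/g regex and with Intl.Segmenter, keeping only segments flagged as word-like.

```javascript
const phrase = 'My mother-in-law—Wait, what?'

// Existing behaviour: each run of non-whitespace characters is one word
const regexWords = phrase.match(/\S+/g) || []

// Segmenter behaviour: split on word boundaries, then keep only
// word-like segments (drops spaces, hyphens, dashes and punctuation)
const segmenter = new Intl.Segmenter('en', { granularity: 'word' })
const segmenterWords = Array.from(segmenter.segment(phrase))
  .filter((segment) => segment.isWordLike)
  .map((segment) => segment.segment)

console.log(regexWords.length, regexWords)
console.log(segmenterWords.length, segmenterWords)
```

The isWordLike filter matters: without it the segmenter also returns the separators themselves as segments, which would inflate the count.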

To solve this, but avoid breaking changes:

  • Setting countType: "words" will use Intl.Segmenter in modern browsers
  • Setting maxwords (deprecated) will use the existing word count behaviour

But what about older browsers that lack Intl.Segmenter support?

I've updated the word separator regex based on this comment: alphagov/govuk-frontend#1364 (comment)

@colinrotherham colinrotherham added the Enhancement: feature request New feature or request label Apr 22, 2026
@colinrotherham colinrotherham changed the base branch from main to character-count-custom-function April 22, 2026 13:40
@colinrotherham colinrotherham linked an issue Apr 28, 2026 that may be closed by this pull request
})
}

// Use improved word splitting if supported
@MatMoore MatMoore Apr 29, 2026

If we keep both regexes then we have 3 levels of support for this feature, which seems like it would complicate testing and responding to issues, even if 2 of them have the same behaviour

  1. Doesn't support Intl.Segmenter or the regex classes -> basic regex used by maxwords
  2. Doesn't support Intl.Segmenter but does support the regex classes -> good regex
  3. Supports Intl.Segmenter -> Intl.Segmenter

I think it would be fine to fall back to the basic regex whenever Intl.Segmenter is not available, rather than including a 3rd implementation. I.e. combine 1 & 2 into one implementation.

I might be wrong, but to me countType: 'words' feels like a soft limit that you might use if there is no hard constraint on the server side. So treating hyphenated words as one word probably won't break anything?

Another alternative might be to provide a polyfill that uses Unicode ranges [\u0009-\u000D...] instead of the character classes, and use that as the fallback for Intl.Segmenter.

If we decide to have behaviour that differs across browsers we should probably document that wherever we introduce the feature.

Contributor Author

Thanks for looking at this @MatMoore

I've added a reply about the Babel regex transform over in #1899 (comment)

These are the browsers that shipped type="module" without Unicode character class escape support:

  • Chrome 61–63
  • Firefox 60–77
  • Safari 11

We could simply upgrade the basic regex when opting in via countType: "words"?

- this.separator = /\s+/g;
+ this.separator = /[\s\-‑–—.,;:!\\/]+/g;

Whilst it was nice to use \p escapes (via the u flag) since so few browsers lacked support, this "good regex" is probably good enough for all other browsers lacking Intl.Segmenter
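As a rough check of the "good regex" suggested above, the sketch below splits the earlier example phrase on that separator. The variable names are hypothetical and the component may apply the separator differently. Note that "?" is not in the character class, so it stays attached to the final word, although the word count still comes out the same.

```javascript
// Separator from the suggestion above: whitespace, hyphens, dashes
// and common punctuation all end a word
const separator = /[\s\-‑–—.,;:!\\/]+/g

const sample = 'My mother-in-law—Wait, what?'

// Split on runs of separators and drop any empty strings
const words = sample.split(separator).filter(Boolean)

console.log(words.length, words)
```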

Contributor

yeah that seems reasonable to me, it avoids the question of whether \p escapes are supported or not, and then when we drop maxwords we can get rid of the old regex.

Contributor Author

I've removed the various levels of regex (by browser support) from this PR

This PR has been updated to apply the suggestion in alphagov/govuk-frontend#6995 (review) so that browsers without Intl.Segmenter fall back to the no-JS version

  • Current options maxlength and maxwords work as usual for backwards compatibility
  • New options countType: "characters" or countType: "words" use Intl.Segmenter
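A minimal sketch of how that opt-in path might look, assuming a hypothetical createCounter helper (not part of the component's real API). It feature-detects Intl.Segmenter and returns null when it is missing, so the caller can leave the no-JS version in place. Mapping "characters" to grapheme granularity is also an assumption, based on the linked issue about counting code points rather than characters.

```javascript
// Hypothetical helper: returns a counting function for the given
// countType, or null when Intl.Segmenter is unavailable
function createCounter(countType) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null // caller keeps the no-JS fallback
  }

  // Assumption: "words" maps to word segments, anything else to graphemes
  const granularity = countType === 'words' ? 'word' : 'grapheme'
  const segmenter = new Intl.Segmenter(undefined, { granularity })

  return (text) => {
    let count = 0
    for (const segment of segmenter.segment(text)) {
      // Every grapheme counts; word segments only count when word-like
      if (granularity === 'grapheme' || segment.isWordLike) {
        count++
      }
    }
    return count
  }
}

const countWords = createCounter('words')
const countCharacters = createCounter('characters')

if (countWords && countCharacters) {
  console.log(countWords('My mother-in-law—Wait, what?'))
  // "e" followed by a combining acute accent: two code points, one character
  console.log(countCharacters('e\u0301'), 'e\u0301'.length)
}
```

Returning null rather than a regex-based counter keeps a single counting behaviour for the new options, which is the simplification discussed in this thread.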

Comment thread eslint.config.mjs Outdated
@colinrotherham colinrotherham added this to the v10.5.0 milestone May 8, 2026
Base automatically changed from character-count-custom-function to support/10.x May 12, 2026 10:15
Development

Successfully merging this pull request may close these issues.

Character count component counts code points, not characters

2 participants