Clarify the difference between GRAPHEME_EXTR_COUNT and size-limited grapheme_extract()

### Affected page

https://www.php.net/manual/en/function.grapheme-extract.php

### Current issue

The current `grapheme_extract()` documentation describes the three extraction types:

 * `GRAPHEME_EXTR_COUNT`
 * `GRAPHEME_EXTR_MAXBYTES`
 * `GRAPHEME_EXTR_MAXCHARS`

It also says that the returned string ends on a default grapheme cluster boundary and conforms to the specified `size` and `type`.

However, the documentation does not explain the practical difference between grapheme-count extraction and size-limited extraction.

`GRAPHEME_EXTR_COUNT` is appropriate when the application wants to extract a given number of user-perceived characters, for example for display-oriented text such as UI counters, excerpts, or previews.

A grapheme cluster, though, has no small fixed maximum size. A single grapheme cluster may consist of many Unicode code points and may therefore require many bytes in UTF-8. In unusual or hostile input, a string may have a small grapheme count while still containing many code points or bytes.

For example, a string may contain a base character followed by a very long sequence of combining marks. Such a string may be treated as one or a few grapheme clusters, but it can still be large in bytes and expensive to process, store, transmit, render, or normalize.

Therefore, code like the following may not provide the size limit that users expect:

```php
$result = grapheme_extract($input, 100, GRAPHEME_EXTR_COUNT);
```

This limits the number of default grapheme clusters. It does not necessarily limit the returned byte length, the number of Unicode code points, the encoded length, memory usage, database storage size, HTTP request size, or rendering/normalization cost.
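To make the gap concrete, here is a minimal sketch. It assumes the intl extension is loaded; the variable names and the 1000-mark cluster are invented for illustration and are not taken from the manual. It builds a string whose clusters are each a base letter followed by many combining marks, then extracts 100 graphemes from it:

```php
<?php
// Illustrative input: each cluster is "a" followed by 1000 combining acute
// accents (U+0301, 2 bytes each in UTF-8), so one cluster is 2001 bytes.
$cluster = 'a' . str_repeat("\u{0301}", 1000);
$input   = str_repeat($cluster, 200);

$result = grapheme_extract($input, 100, GRAPHEME_EXTR_COUNT);

// The grapheme count is capped at 100, but the byte length is not.
printf("graphemes: %d\n", grapheme_strlen($result)); // 100
printf("bytes:     %d\n", strlen($result));          // 200100
```

The extracted string contains exactly the requested number of grapheme clusters, yet it is roughly 200 KB long.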

This distinction matters even for display-oriented processing. `GRAPHEME_EXTR_COUNT` is often the right choice when the goal is to avoid splitting user-perceived characters, but applications that process untrusted input may still need a separate resource limit. Some operating systems and applications have had rendering bugs triggered by unusual Unicode sequences, including text strings that caused iOS/macOS applications to crash. Grapheme-aware extraction should therefore not be treated as a complete robustness or security boundary by itself.
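One way to combine a display-oriented limit with a separate resource limit is sketched below. `bounded_excerpt()` is a hypothetical helper, not part of the intl extension, and the 5000-byte budget is an arbitrary assumption. It caps the grapheme count first and then applies a byte budget with `GRAPHEME_EXTR_MAXBYTES`, so the result still ends on a grapheme cluster boundary:

```php
<?php
// Hypothetical helper: cap the grapheme count, then enforce a byte budget.
function bounded_excerpt(string $input, int $maxGraphemes, int $maxBytes): string
{
    $excerpt = grapheme_extract($input, $maxGraphemes, GRAPHEME_EXTR_COUNT);
    if ($excerpt === false) {
        return ''; // extraction failed; treat as empty in this sketch
    }

    if (strlen($excerpt) > $maxBytes) {
        // Keep only as many whole clusters as fit into $maxBytes bytes.
        $excerpt = grapheme_extract($excerpt, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
        if ($excerpt === false) {
            return '';
        }
    }

    return $excerpt;
}

$plain   = "h\u{00E9}llo w\u{00F6}rld";
$hostile = str_repeat('a' . str_repeat("\u{0301}", 1000), 200);

$short  = bounded_excerpt($plain, 5, 5000);     // 5 graphemes, well under budget
$capped = bounded_excerpt($hostile, 100, 5000); // cut down to fit the byte budget

printf("%d bytes, %d bytes\n", strlen($short), strlen($capped));
```

For ordinary text the byte budget never triggers. For the hostile input only two 2001-byte clusters fit into 5000 bytes. An application with stricter requirements might instead reject oversized input with a plain `strlen()` check before doing any Unicode-aware processing.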

### Suggested improvement

Add a note explaining that the extraction type determines what `size` limits, and that grapheme-count extraction is different from size-limited extraction.

Suggested note:

Note:

`GRAPHEME_EXTR_COUNT` limits the number of default grapheme clusters. This is useful for display-oriented extraction because it helps avoid splitting user-perceived characters.

This is not the same as a byte-size or resource limit. A single grapheme cluster may contain multiple Unicode code points and may require many bytes in UTF-8. A result with a small grapheme count may still be large in bytes or expensive to process, render, normalize, store, or transmit.

For size-limited extraction, especially when processing untrusted input, consider `GRAPHEME_EXTR_MAXBYTES` or `GRAPHEME_EXTR_MAXCHARS`, and also enforce any limits required by the target system.

### Additional context (optional)

This clarification would help users understand that `GRAPHEME_EXTR_COUNT`, `GRAPHEME_EXTR_MAXBYTES`, and `GRAPHEME_EXTR_MAXCHARS` are not interchangeable.

`GRAPHEME_EXTR_COUNT` is suitable for display-oriented text processing where the desired unit is a user-perceived character.

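`GRAPHEME_EXTR_MAXBYTES` (where `size` is the maximum number of bytes returned) and `GRAPHEME_EXTR_MAXCHARS` (where `size` is the maximum number of UTF-8 characters returned) are useful when the caller needs to constrain the amount of returned text while still ending on a default grapheme cluster boundary.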
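The following sketch compares the three extraction types on the same input with the same `size`. It assumes the intl extension is loaded; the input string and the `$summary` array are illustrative only. Each cluster in the input is "e" followed by U+0301, which is 2 code points and 3 bytes:

```php
<?php
// Ten clusters of "e" + combining acute accent (2 code points, 3 bytes each).
$input = str_repeat("e\u{0301}", 10);

$types = [
    'GRAPHEME_EXTR_COUNT'    => GRAPHEME_EXTR_COUNT,
    'GRAPHEME_EXTR_MAXBYTES' => GRAPHEME_EXTR_MAXBYTES,
    'GRAPHEME_EXTR_MAXCHARS' => GRAPHEME_EXTR_MAXCHARS,
];

$summary = [];
foreach ($types as $name => $type) {
    // Same size (5) for every type; only its meaning changes.
    $out = grapheme_extract($input, 5, $type);

    $summary[$name] = [
        'graphemes'   => grapheme_strlen($out),
        'code_points' => preg_match_all('/./su', $out), // counts code points
        'bytes'       => strlen($out),
    ];

    printf(
        "%-24s graphemes=%d code_points=%d bytes=%d\n",
        $name,
        $summary[$name]['graphemes'],
        $summary[$name]['code_points'],
        $summary[$name]['bytes']
    );
}
```

With `size` set to 5, `GRAPHEME_EXTR_COUNT` returns 5 clusters (15 bytes), `GRAPHEME_EXTR_MAXBYTES` returns 1 cluster (3 bytes, because a second cluster would exceed 5 bytes), and `GRAPHEME_EXTR_MAXCHARS` returns 2 clusters (4 code points, because a third cluster would exceed 5 code points). In every case the result ends on a grapheme cluster boundary.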

This distinction is security-relevant because grapheme clusters do not have a small fixed maximum size. A small grapheme count does not guarantee a small byte length, a small encoded length, or low rendering/normalization cost.

The documentation already exposes this distinction through the three extraction types, but it does not currently explain why choosing the right extraction type matters.

Clarify the difference between GRAPHEME_EXTR_COUNT and size-limited grapheme_extract() #5565