Document that grapheme_strlen() must not be used as the only size check for untrusted input

### Affected page

https://www.php.net/manual/en/function.grapheme-strlen.php

### Current issue

The current `grapheme_strlen()` documentation says that it returns the string length in grapheme units, “not bytes or characters”. It also states that the input string must be valid UTF-8.

However, it does not explain an important security and robustness concern: **grapheme length is not an input size limit.**

A grapheme cluster approximates a single user-perceived character. This makes grapheme counts useful for user-visible text, such as UI counters, excerpt lengths, or display-oriented truncation.

However, a single grapheme cluster may consist of multiple Unicode code points and may require many bytes in UTF-8. In unusual or hostile input, a string may have a very small grapheme length while still containing many code points or bytes.

For example, a string may contain a base character followed by a very long sequence of combining marks. Such a string may be counted as one or a few grapheme clusters, but it can still be large in bytes and expensive to process, store, transmit, render, or normalize.
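As a rough illustration of the case above, the following sketch builds such a string and compares the three length units. It assumes the intl and mbstring extensions are loaded; the choice of U+0301 (COMBINING ACUTE ACCENT) and the repeat count of 1000 are arbitrary.

```php
<?php
// Illustrative only: one base letter followed by 1000 combining acute accents.
// U+0301 encodes to 2 bytes in UTF-8, so the string is 1 + 2 * 1000 bytes long.
$input = 'a' . str_repeat("\u{0301}", 1000);

var_dump(grapheme_strlen($input));      // int(1)    grapheme clusters
var_dump(mb_strlen($input, 'UTF-8'));   // int(1001) code points
var_dump(strlen($input));               // int(2001) bytes
```

A check such as `grapheme_strlen($input) <= 100` would accept this string, and the same holds if the repeat count is raised by several orders of magnitude.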

This means that code like the following may look like a size check, but it is not sufficient as a resource limit:

```php
if (grapheme_strlen($input) <= 100) {
    // accept input
}
```

This only limits the number of grapheme clusters. It does not limit byte length, code point length, encoded length, memory usage, database storage size, HTTP request size, or rendering/normalization cost.
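One possible way to combine the two kinds of limit is sketched below. The function name `isAcceptableInput()` and both default limits are hypothetical and chosen only for illustration; a real application would take the byte limit from whatever system ultimately stores or transmits the value. The sketch assumes the intl and mbstring extensions are loaded.

```php
<?php
/**
 * Hypothetical helper: enforce a byte limit first, then a grapheme limit.
 * The limits are placeholders, not recommendations.
 */
function isAcceptableInput(string $input, int $maxBytes = 400, int $maxGraphemes = 100): bool
{
    // Cheap resource check first: strlen() counts bytes.
    if (strlen($input) > $maxBytes) {
        return false;
    }

    // grapheme_strlen() expects valid UTF-8, so reject anything else up front.
    if (!mb_check_encoding($input, 'UTF-8')) {
        return false;
    }

    // Display-oriented check: number of user-perceived characters.
    $graphemes = grapheme_strlen($input);

    // Treat any non-integer result (failure) as a rejection.
    return is_int($graphemes) && $graphemes <= $maxGraphemes;
}
```

Checking the byte length before anything else keeps the more expensive Unicode-aware work bounded by the byte limit.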

This matters for byte-oriented or storage-oriented limits, such as:

 * database column or index limits
 * file size limits
 * HTTP header name or header value size limits
 * URL length limits
 * request body size limits
 * Cookie header size limits
 * API payload size limits

A string with a small grapheme length may still exceed these limits, especially after UTF-8 encoding, percent-encoding, JSON encoding, MIME encoding, or other transport/storage encoding.
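To make the encoding point concrete, this small sketch measures one grapheme cluster before and after two common encodings. The sample string is an arbitrary choice, `json_encode()` is called with its default flags, and the intl extension is assumed to be loaded.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: a single grapheme cluster.
$s = "e\u{0301}";

var_dump(grapheme_strlen($s));        // int(1) grapheme cluster
var_dump(strlen($s));                 // int(3) bytes of raw UTF-8
var_dump(strlen(rawurlencode($s)));   // int(7) bytes: "e%CC%81"
var_dump(strlen(json_encode($s)));    // int(9) bytes: "e\u0301" plus the two quotes
```

The grapheme count stays at 1 throughout, while the size that a URL or JSON payload limit would see is several times larger than the raw byte length.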

This concern is also related to `grapheme_extract()`. Unlike `grapheme_strlen()`, which only counts grapheme units, `grapheme_extract()` provides size modes such as `GRAPHEME_EXTR_MAXBYTES` and `GRAPHEME_EXTR_MAXCHARS`, allowing callers to extract text while applying a maximum size constraint and still ending on a default grapheme cluster boundary.
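A minimal sketch of size-limited extraction with `GRAPHEME_EXTR_MAXBYTES` follows. The sample text and the 8-byte budget are arbitrary, and the intl and mbstring extensions are assumed to be loaded.

```php
<?php
// Ten copies of "e" + U+0301: 10 grapheme clusters, 20 code points, 30 bytes.
$text = str_repeat("e\u{0301}", 10);

// Request at most 8 bytes. Each cluster is 3 bytes, so the result is expected
// to stop after a whole cluster instead of splitting "e" from its accent.
$chunk = grapheme_extract($text, 8, GRAPHEME_EXTR_MAXBYTES);

var_dump(strlen($chunk) <= 8);                  // the byte budget is respected
var_dump(mb_check_encoding($chunk, 'UTF-8'));   // the result is still valid UTF-8
var_dump(grapheme_strlen($chunk));              // number of whole clusters kept
```

By contrast, `substr($text, 0, 8)` would honor the same byte budget but could cut through a cluster or even through a multi-byte sequence.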

In other words, grapheme-aware processing and size-limited processing are separate concerns. `grapheme_strlen()` answers how many grapheme units a string contains; it does not answer how large the string is.

### Suggested improvement

Add a note explaining that `grapheme_strlen()` is useful for user-visible text length, but must not be used as the only size check for untrusted input.

Suggested note:

Note:

Grapheme length is useful for user-visible text because it approximates the number of user-perceived characters.

However, grapheme length is not an input size limit. A single grapheme cluster may consist of multiple Unicode code points and may require many bytes in UTF-8. Therefore, a string with a small grapheme length may still be large in bytes or expensive to process.

Do not use `grapheme_strlen()` as the only size check for untrusted input. For byte-oriented or storage-oriented limits, such as database column or index limits, file size limits, HTTP header name or header value size limits, URL length limits, request body size limits, Cookie header size limits, or API payload size limits, also check the length unit required by that system.

If size-limited grapheme-aware extraction is needed, consider `grapheme_extract()` with an appropriate extraction type, such as `GRAPHEME_EXTR_MAXBYTES` or `GRAPHEME_EXTR_MAXCHARS`.



### Additional context (optional)

This clarification would help users understand that `grapheme_strlen()` is not simply a safer or more correct replacement for `strlen()` or `mb_strlen()`.

It is appropriate when the goal is to count user-perceived characters in displayed text. For example, it can be useful for UI character counters, excerpt generation, or avoiding truncation in the middle of a user-perceived character.

However, it is not sufficient for validating the size of untrusted input. Because grapheme clusters can contain multiple code points, and because extremely long combining sequences are possible, applications should separately enforce byte length, storage length, encoded length, or other system-specific limits where those limits matter.

The current “See Also” section links to `strlen()`, `mb_strlen()`, and `iconv_strlen()`, but the page does not currently explain this practical and security-relevant distinction.

The `grapheme_extract()` documentation already exposes the distinction between grapheme cluster count and maximum returned size through `GRAPHEME_EXTR_COUNT`, `GRAPHEME_EXTR_MAXBYTES`, and `GRAPHEME_EXTR_MAXCHARS`. That distinction would be useful to mention from the `grapheme_strlen()` page as well.
