Document that grapheme_strlen() must not be used as the only size check for untrusted input #5564

@masakielastic

Affected page

https://www.php.net/manual/en/function.grapheme-strlen.php

Current issue

The current grapheme_strlen() documentation says that it returns the string length in grapheme units, “not bytes or characters”. It also states that the input string must be valid UTF-8.

However, it does not explain an important security and robustness concern: grapheme length is not an input size limit.

A grapheme cluster represents a user-perceived character boundary. This is useful for user-visible text, such as UI counters, excerpt lengths, or display-oriented truncation.

However, a single grapheme cluster may consist of multiple Unicode code points and may require many bytes in UTF-8. In unusual or hostile input, a string may have a very small grapheme length while still containing many code points or bytes.

For example, a string may contain a base character followed by a very long sequence of combining marks. Such a string may be counted as one or a few grapheme clusters, but it can still be large in bytes and expensive to process, store, transmit, render, or normalize.
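A minimal sketch of that case, assuming the intl and mbstring extensions are loaded. The choice of U+0301 and the repeat count of 1,000 are illustrative, not taken from any real attack:

```php
<?php
// "e" followed by 1,000 copies of U+0301 COMBINING ACUTE ACCENT.
// U+0301 is a Grapheme_Extend code point, so it attaches to the preceding base.
$input = 'e' . str_repeat("\u{0301}", 1000);

$graphemes  = grapheme_strlen($input);     // user-perceived characters
$codePoints = mb_strlen($input, 'UTF-8');  // Unicode code points
$bytes      = strlen($input);              // UTF-8 bytes (U+0301 takes 2 bytes)

printf("graphemes=%d codepoints=%d bytes=%d\n", $graphemes, $codePoints, $bytes);
// prints: graphemes=1 codepoints=1001 bytes=2001
```

The three functions measure three different things, and only the last two grow with the length of the combining sequence.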

This means that code like the following may look like a size check, but it is not sufficient as a resource limit:

if (grapheme_strlen($input) <= 100) {
    // accept input
}

This limits only the number of grapheme clusters. It does not limit byte length, code point count, encoded length, memory usage, database storage size, HTTP request size, or rendering/normalization cost.
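One way to layer the checks is sketched below. The function name `isAcceptableDisplayName()` and all three limits are hypothetical and chosen only for illustration; a real application would take the limits from the system that stores or transmits the value.

```php
<?php
// Hypothetical helper: the name and the default limits are assumptions.
function isAcceptableDisplayName(
    string $input,
    int $maxBytes = 400,
    int $maxCodePoints = 200,
    int $maxGraphemes = 100
): bool {
    // 1. Cheapest check first: byte length bounds memory and storage.
    if (strlen($input) > $maxBytes) {
        return false;
    }

    // 2. The grapheme_* functions expect valid UTF-8.
    if (!mb_check_encoding($input, 'UTF-8')) {
        return false;
    }

    // 3. Code point count bounds long combining sequences.
    if (mb_strlen($input, 'UTF-8') > $maxCodePoints) {
        return false;
    }

    // 4. Grapheme count is the user-visible limit.
    //    grapheme_strlen() may return false or null on failure.
    $graphemes = grapheme_strlen($input);

    return is_int($graphemes) && $graphemes <= $maxGraphemes;
}
```

The byte check runs first so that oversized input is rejected before any Unicode-aware function has to scan it.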

This matters for byte-oriented or storage-oriented limits, such as:

  • database column or index limits
  • file size limits
  • HTTP header name or header value size limits
  • URL length limits
  • request body size limits
  • Cookie header size limits
  • API payload size limits

A string with a small grapheme length may still exceed these limits, especially after UTF-8 encoding, percent-encoding, JSON encoding, MIME encoding, or other transport/storage encoding.
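As a rough illustration of encoding expansion, the sketch below measures one 30-grapheme string in three forms. The sample text is arbitrary, and `json_encode()` is called with its default flags, which escape non-ASCII characters as `\uXXXX`:

```php
<?php
// Ten repetitions of three CJK characters (U+65E5, U+672C, U+8A9E).
$input = str_repeat("\u{65E5}\u{672C}\u{8A9E}", 10);

$graphemes = grapheme_strlen($input);          // 30 user-perceived characters
$utf8Bytes = strlen($input);                   // 3 bytes per character
$urlBytes  = strlen(rawurlencode($input));     // each byte becomes %XX
$jsonBytes = strlen(json_encode($input));      // each character becomes \uXXXX, plus 2 quotes

printf(
    "graphemes=%d utf8=%d url=%d json=%d\n",
    $graphemes, $utf8Bytes, $urlBytes, $jsonBytes
);
// prints: graphemes=30 utf8=90 url=270 json=182
```

A limit that matters on the wire or in storage has to be checked against the encoded form, not against the grapheme count.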

This concern also relates to grapheme_extract(). Whereas grapheme_strlen() only counts grapheme units, grapheme_extract() provides extraction types such as GRAPHEME_EXTR_MAXBYTES and GRAPHEME_EXTR_MAXCHARS. These let callers cap the size of the extracted text while still ending on a default grapheme cluster boundary.
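A small sketch of size-limited extraction, assuming the intl extension is loaded. The input string and the 4-byte limit are illustrative only:

```php
<?php
// Three grapheme clusters, each a base letter plus U+0301 (3 bytes per cluster).
$input = "a\u{0301}b\u{0301}c\u{0301}";

// Take as many whole grapheme clusters as fit in at most 4 bytes.
$prefix = grapheme_extract($input, 4, GRAPHEME_EXTR_MAXBYTES);

// Only the first cluster (3 bytes) fits; the second would bring the total to 6.
var_dump($prefix === "a\u{0301}"); // bool(true)
var_dump(strlen($prefix));         // int(3)
```

The result stays within the byte budget and does not split a cluster, which neither substr() nor grapheme_strlen() can guarantee on its own.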

In other words, grapheme-aware processing and size-limited processing are separate concerns. grapheme_strlen() answers how many grapheme units a string contains; it does not answer how large the string is.

Suggested improvement

Add a note explaining that grapheme_strlen() is useful for user-visible text length, but must not be used as the only size check for untrusted input.

Suggested note:

Note:

Grapheme length is useful for user-visible text because it approximates the number of user-perceived characters.

However, grapheme length is not an input size limit. A single grapheme cluster may consist of multiple Unicode code points and may require many bytes in UTF-8. Therefore, a string with a small grapheme length may still be large in bytes or expensive to process.

Do not use grapheme_strlen() as the only size check for untrusted input. For byte-oriented or storage-oriented limits, such as database column or index limits, file size limits, HTTP header name or header value size limits, URL length limits, request body size limits, Cookie header size limits, or API payload size limits, also check the length unit required by that system.

If size-limited grapheme-aware extraction is needed, consider grapheme_extract() with an appropriate extraction type, such as GRAPHEME_EXTR_MAXBYTES or GRAPHEME_EXTR_MAXCHARS.

Additional context (optional)

This clarification would help users understand that grapheme_strlen() is not simply a safer or more correct replacement for strlen() or mb_strlen().

It is appropriate when the goal is to count user-perceived characters in displayed text. For example, it can be useful for UI character counters, excerpt generation, or avoiding truncation in the middle of a user-perceived character.
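For instance, the following sketch compares three ways of truncating a short string that contains a decomposed "é". The sample text and cut positions are illustrative, and the intl and mbstring extensions are assumed:

```php
<?php
// "cafe" + U+0301 COMBINING ACUTE ACCENT, followed by more text.
$text = "cafe\u{0301} au lait";

$byteCut     = substr($text, 0, 5);            // cuts inside the 2-byte U+0301
$codePtCut   = mb_substr($text, 0, 4, 'UTF-8'); // keeps "cafe", drops the accent
$graphemeCut = grapheme_substr($text, 0, 4);   // keeps the accent with its base

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): invalid UTF-8
var_dump($codePtCut === 'cafe');                // bool(true)
var_dump($graphemeCut === "cafe\u{0301}");      // bool(true)
```

This is the display-oriented use the grapheme functions are suited to. It says nothing about how many bytes the resulting excerpt occupies.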

However, it is not sufficient for validating the size of untrusted input. Because grapheme clusters can contain multiple code points, and because extremely long combining sequences are possible, applications should separately enforce byte length, storage length, encoded length, or other system-specific limits where those limits matter.

The current “See Also” section links to strlen(), mb_strlen(), and iconv_strlen(), but the page does not currently explain this practical and security-relevant distinction.

The grapheme_extract() documentation already exposes the distinction between grapheme cluster count and maximum returned size through GRAPHEME_EXTR_COUNT, GRAPHEME_EXTR_MAXBYTES, and GRAPHEME_EXTR_MAXCHARS. That distinction would be useful to mention from the grapheme_strlen() page as well.

Metadata

Labels: enhancement
