Affected page
https://www.php.net/manual/en/function.grapheme-extract.php
Current issue
The current grapheme_extract() documentation describes the three extraction types:
GRAPHEME_EXTR_COUNT
GRAPHEME_EXTR_MAXBYTES
GRAPHEME_EXTR_MAXCHARS
It also says that the returned string ends on a default grapheme cluster boundary and conforms to the specified size and type.
However, the documentation does not explain the practical difference between grapheme-count extraction and size-limited extraction.
GRAPHEME_EXTR_COUNT is appropriate when the application wants to extract a given number of user-perceived characters, for example for display-oriented text such as UI counters, excerpts, or previews.
A grapheme cluster, though, does not have a small fixed maximum size. A single grapheme cluster may consist of multiple Unicode code points and may require many bytes in UTF-8. In unusual or hostile input, a string may have a small grapheme count while still containing many code points or bytes.
For example, a string may contain a base character followed by a very long sequence of combining marks. Such a string may be treated as one or a few grapheme clusters, but it can still be large in bytes and expensive to process, store, transmit, render, or normalize.
Therefore, code like the following may not provide the size limit that users expect:
$result = grapheme_extract($input, 100, GRAPHEME_EXTR_COUNT);
This limits the number of default grapheme clusters. It does not necessarily limit the returned byte length, the number of Unicode code points, the encoded length, memory usage, database storage size, HTTP request size, or rendering/normalization cost.
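The gap between grapheme count and byte length can be reproduced with a short script. This is a minimal sketch, assuming the intl extension is loaded and PHP 7.0 or later (for the `\u{...}` escape); the input string is constructed purely for illustration.

```php
<?php
// Illustrative input: one base letter followed by 1000 combining acute
// accents (U+0301, 2 bytes each in UTF-8), then a second letter.
$input = "a" . str_repeat("\u{0301}", 1000) . "b";

// Ask for a single grapheme cluster.
$result = grapheme_extract($input, 1, GRAPHEME_EXTR_COUNT);

var_dump(grapheme_strlen($result)); // int(1)
var_dump(strlen($result));          // int(2001): 1 byte for "a" + 1000 * 2 bytes
```

A limit of one grapheme cluster returns about 2 KB here, and the size grows with the number of combining marks in the input.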
This distinction matters even for display-oriented processing. GRAPHEME_EXTR_COUNT is often the right choice when the goal is to avoid splitting user-perceived characters. However, applications that process untrusted input may still need a separate resource limit. Historically, some operating systems and applications have had rendering bugs triggered by unusual Unicode sequences, including text strings that caused iOS/macOS applications to crash. Therefore, grapheme-aware extraction should not be treated as a complete robustness or security boundary by itself.
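One way to combine a display-oriented limit with a separate resource limit is to apply a byte cap first and a grapheme count second. The following is only a sketch under stated assumptions: `boundedExcerpt()` and its parameters are hypothetical names, the intl extension is assumed to be loaded, and the explicit empty-string checks are there because the behaviour of `grapheme_extract()` on an empty string is not relied upon.

```php
<?php
// Hypothetical helper: returns at most $maxGraphemes grapheme clusters,
// and never more than $maxBytes bytes.
function boundedExcerpt(string $input, int $maxGraphemes, int $maxBytes): string
{
    if ($input === '') {
        return '';
    }

    // Step 1: resource limit. Keep only whole clusters that fit in $maxBytes.
    $bounded = grapheme_extract($input, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
    if ($bounded === false || $bounded === '') {
        return '';
    }

    // Step 2: display limit. Keep at most $maxGraphemes clusters of that.
    $excerpt = grapheme_extract($bounded, $maxGraphemes, GRAPHEME_EXTR_COUNT);

    return $excerpt === false ? '' : $excerpt;
}

// Ordinary text: 50 clusters of "e" + U+0301, 3 bytes each.
$normal = boundedExcerpt(str_repeat("e\u{0301}", 50), 10, 256);

// Hostile text: the first cluster alone is about 2 KB.
$hostile = boundedExcerpt("a" . str_repeat("\u{0301}", 1000) . "b", 10, 256);

var_dump(strlen($normal) <= 256, strlen($hostile) <= 256);
```

Applying the byte cap first means the second call only ever sees input that already fits the budget. Whether an oversized first cluster should produce an empty excerpt or an error is an application decision.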
Suggested improvement
Add a note explaining that the extraction type determines what the size parameter limits (grapheme clusters, bytes, or code points), and that grapheme-count extraction is different from size-limited extraction.
Suggested note:
Note:
GRAPHEME_EXTR_COUNT limits the number of default grapheme clusters. This is useful for display-oriented extraction because it helps avoid splitting user-perceived characters.
This is not the same as a byte-size or resource limit. A single grapheme cluster may contain multiple Unicode code points and may require many bytes in UTF-8. A result with a small grapheme count may still be large in bytes or expensive to process, render, normalize, store, or transmit.
For size-limited extraction, especially when processing untrusted input, consider GRAPHEME_EXTR_MAXBYTES or GRAPHEME_EXTR_MAXCHARS, and also enforce any limits required by the target system.
Additional context (optional)
This clarification would help users understand that GRAPHEME_EXTR_COUNT, GRAPHEME_EXTR_MAXBYTES, and GRAPHEME_EXTR_MAXCHARS are not interchangeable.
GRAPHEME_EXTR_COUNT is suitable for display-oriented text processing where the desired unit is a user-perceived character.
GRAPHEME_EXTR_MAXBYTES and GRAPHEME_EXTR_MAXCHARS are useful when the caller needs to constrain the amount of returned text while still ending on a default grapheme cluster boundary.
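The two size-limited types can be compared on a small input. This is a sketch, assuming the intl extension and PHP 7.0 or later; the strings and limits are chosen only for illustration.

```php
<?php
// Three clusters of "e" + U+0301; each cluster is 2 code points and 3 bytes.
$text = str_repeat("e\u{0301}", 3);

// At most 4 bytes: only the first whole cluster (3 bytes) fits.
$byBytes = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// At most 3 code points: only the first whole cluster (2 code points) fits.
$byChars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

var_dump(strlen($byBytes)); // int(3)
var_dump(strlen($byChars)); // int(3)

// With a very long first cluster, the byte limit still holds.
$hostile = "a" . str_repeat("\u{0301}", 1000) . "b";
$capped  = grapheme_extract($hostile, 100, GRAPHEME_EXTR_MAXBYTES);

var_dump(strlen($capped) <= 100); // bool(true)
```

Because the result must end on a grapheme cluster boundary, it can be shorter than the limit, and it can be empty when the first cluster alone exceeds it.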
This distinction is security-relevant because grapheme clusters do not have a small fixed maximum size. A small grapheme count does not guarantee a small byte length, a small encoded length, or low rendering/normalization cost.
The documentation already exposes this distinction through the three extraction types, but it does not currently explain why choosing the right extraction type matters.