Fix UTF8FirstLetterNumBytes to handle malformed UTF-8 correctly #1379

Open
kodareef5 wants to merge 1 commit into bufbuild:main from kodareef5:fix-utf8len-malformed

@kodareef5

UTF8FirstLetterNumBytes in validate/validate.h returns the byte count from OneCharLen without validating that the expected continuation bytes actually follow the leader byte. Malformed UTF-8 causes Utf8Len to undercount characters by 2-4x, bypassing string length validation constraints (min_len, max_len, len).

Example: 20 bytes of \xC0 (bare 2-byte leaders with no continuation bytes) yield Utf8Len=10 instead of 20, so a field with max_len = 10 incorrectly accepts this input.
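The undercount can be reproduced with a minimal sketch of a leader-only counter. This is not the code in validate/validate.h; `naive_lead_len`, `naive_utf8_len`, and `demo_undercount` are hypothetical names, and the lookup table is an assumption about how a length is typically derived from the top four bits of the leader byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a leader-only lookup: the sequence length is
// taken from the top four bits of the first byte and nothing else.
inline std::size_t naive_lead_len(unsigned char lead) {
  static const unsigned char kLen[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                                         1, 1, 1, 1, 2, 2, 3, 4};
  return kLen[lead >> 4];
}

// Counts characters by trusting the leader byte, without checking that
// continuation bytes actually follow it.
inline std::size_t naive_utf8_len(const std::string& s) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    i += naive_lead_len(static_cast<unsigned char>(s[i]));
    ++count;
  }
  return count;
}

// Usage: 20 bare 0xC0 bytes are consumed two at a time, so only 10 are counted.
inline void demo_undercount() {
  assert(naive_utf8_len(std::string(20, '\xC0')) == 10);
  assert(naive_utf8_len("hello") == 5);
}
```

Each 0xC0 byte claims a 2-byte sequence, so the loop skips every second byte and reports half the real count.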

This is exploitable when C++ protobuf deserialization doesn't enforce UTF-8 validity (the default), allowing malformed strings to reach pgv validation.

Fix:

  • Clamp consumed bytes to remaining buffer length (prevents reading past end)
  • Validate continuation bytes have the 10xxxxxx pattern
  • Return 1 for any invalid byte sequence (count as single character)
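The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the diff in this PR: `lead_len`, `checked_char_bytes`, `checked_utf8_len`, and `demo_checked` are hypothetical names, and treating a truncated sequence as a single invalid byte is one possible reading of the clamping step.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sequence length implied by the leader byte alone (hypothetical helper).
inline std::size_t lead_len(unsigned char lead) {
  static const unsigned char kLen[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                                         1, 1, 1, 1, 2, 2, 3, 4};
  return kLen[lead >> 4];
}

// Returns the number of bytes to consume for the next counted character.
// Precondition: remaining > 0.
inline std::size_t checked_char_bytes(const unsigned char* p,
                                      std::size_t remaining) {
  const std::size_t n = lead_len(p[0]);
  // Never look past the end of the buffer: a truncated sequence is
  // treated as invalid and counted as one byte (assumption).
  if (n > remaining) return 1;
  // Every continuation byte must match the bit pattern 10xxxxxx.
  for (std::size_t i = 1; i < n; ++i) {
    if ((p[i] & 0xC0) != 0x80) return 1;
  }
  return n;
}

// Counts characters, treating each invalid byte as one character.
inline std::size_t checked_utf8_len(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    i += checked_char_bytes(p + i, s.size() - i);
    ++count;
  }
  return count;
}

// Usage: malformed input is no longer undercounted, valid input is unchanged.
inline void demo_checked() {
  assert(checked_utf8_len(std::string(20, '\xC0')) == 20);
  assert(checked_utf8_len("caf\xC3\xA9") == 4);  // "café"
}
```

This sketch checks only the continuation-byte pattern named in the list. It does not reject overlong encodings or surrogate code points, which the description above does not mention.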

Valid UTF-8 counting is unchanged: "hello"=5, "café"=4, "你好"=2, "😀😀"=2.

UTF8FirstLetterNumBytes returns the byte count from OneCharLen
without validating that the expected continuation bytes actually
follow the leader byte. Malformed UTF-8 (e.g., bare leader bytes
without continuations) causes Utf8Len to undercount characters
by 2-4x, bypassing string length validation constraints.

For example, 20 bytes of 0xC0 (invalid 2-byte leaders) produces
Utf8Len=10 instead of 20, allowing a max_len=10 constraint to
accept 20 bytes of data.

Fix:
- Clamp consumed bytes to remaining buffer length
- Validate continuation bytes have the 10xxxxxx pattern
- Return 1 for any invalid byte (count as single character)

Valid UTF-8 counting is unchanged.
@CLAassistant

CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.
