UTF8Encoding should support encoding/decoding of unpaired surrogates #14785

Open
@tmat

Description

According to RFC 3629, encoding or decoding unpaired surrogates should be disallowed:

"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."

However, real-world encoders and decoders have not followed this. For example, the ECMA-335 standard encodes string arguments of custom attributes as UTF-8, and the compilers have allowed unpaired surrogates in attribute arguments. Another example is PDB: file paths in a PDB are stored as UTF-8 encoded strings, and unpaired surrogates are allowed there as well. The same applies to the values of local string constants (e.g. const string surrogate = "\ud800").

To avoid breaking changes, Roslyn needs to allow unpaired surrogates in the above cases, and the MetadataReader should also use a variant of UTF-8 encoding that is able to decode them. Currently, Roslyn has a custom implementation of a UTF-8 encoder originating from CCI. More generally, UTF-16 to UTF-8 round-tripping is desirable in certain scenarios, and UTF8Encoding should support it.
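
As a rough illustration of why round-tripping fails today, the sketch below feeds a lone high surrogate to the existing UTF8Encoding class in both of its modes. It uses only the existing two-argument constructor; the variable names and the use of C# 9 top-level statements are choices made for this example, not part of the proposal.

```csharp
using System;
using System.Text;

// Assumption: compiled as a C# 9+ program with top-level statements.
string unpaired = "\ud800"; // lone high surrogate, as in the local constant example

// throwOnInvalidBytes: false -> invalid input is replaced with U+FFFD.
var lenient = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                               throwOnInvalidBytes: false);
byte[] replaced = lenient.GetBytes(unpaired);
Console.WriteLine(BitConverter.ToString(replaced)); // EF-BF-BD, the UTF-8 form of U+FFFD

// The original code unit is gone, so decoding cannot restore the string.
string roundTripped = lenient.GetString(replaced);
Console.WriteLine(roundTripped == unpaired); // False

// throwOnInvalidBytes: true -> the same input is rejected outright.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                              throwOnInvalidBytes: true);
bool rejected = false;
try
{
    strict.GetBytes(unpaired);
}
catch (EncoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected); // True
```

Neither mode preserves the original UTF-16 code unit, which is the gap that a custom encoder has to fill.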

I propose adding a constructor to UTF8Encoding that takes a bool allowUnpairedSurrogates parameter (false by default), which both Roslyn and MetadataReader could use.
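
To make the intended behavior concrete, here is a minimal sketch of what encoding with such a flag might look like. EncodeUtf8 is a hypothetical helper written for this example only; it is not an existing .NET API, and it is not the Roslyn/CCI implementation. It assumes an unpaired surrogate would be written in the ordinary three-byte form used for other code points in that range, and it covers encoding only.

```csharp
using System;
using System.Collections.Generic;

// Assumption: compiled as a C# 9+ program with top-level statements.
byte[] lone = EncodeUtf8("\ud800", allowUnpairedSurrogates: true);
Console.WriteLine(BitConverter.ToString(lone)); // ED-A0-80

byte[] pair = EncodeUtf8("\ud83d\ude00", allowUnpairedSurrogates: false);
Console.WriteLine(BitConverter.ToString(pair)); // F0-9F-98-80 (U+1F600)

// Hypothetical helper: UTF-16 to UTF-8 with optional pass-through of
// unpaired surrogates. With the flag off it rejects them, as RFC 3629 requires.
static byte[] EncodeUtf8(string s, bool allowUnpairedSurrogates)
{
    var bytes = new List<byte>(s.Length * 3);
    for (int i = 0; i < s.Length; i++)
    {
        int cp = s[i];
        if (char.IsSurrogate(s[i]))
        {
            bool paired = char.IsHighSurrogate(s[i])
                          && i + 1 < s.Length
                          && char.IsLowSurrogate(s[i + 1]);
            if (paired)
            {
                // Well-formed pair: combine into one supplementary code point.
                cp = char.ConvertToUtf32(s[i], s[i + 1]);
                i++;
            }
            else if (!allowUnpairedSurrogates)
            {
                throw new ArgumentException($"Unpaired surrogate at index {i}.", nameof(s));
            }
            // Otherwise cp keeps the raw surrogate value and takes the 3-byte path.
        }

        if (cp < 0x80)
        {
            bytes.Add((byte)cp);
        }
        else if (cp < 0x800)
        {
            bytes.Add((byte)(0xC0 | (cp >> 6)));
            bytes.Add((byte)(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            bytes.Add((byte)(0xE0 | (cp >> 12)));
            bytes.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
            bytes.Add((byte)(0x80 | (cp & 0x3F)));
        }
        else
        {
            bytes.Add((byte)(0xF0 | (cp >> 18)));
            bytes.Add((byte)(0x80 | ((cp >> 12) & 0x3F)));
            bytes.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
            bytes.Add((byte)(0x80 | (cp & 0x3F)));
        }
    }
    return bytes.ToArray();
}
```

A matching decoder would need to accept the ED A0 80 through ED BF BF byte range and map it back to the single UTF-16 code unit for the round trip to hold.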

Metadata

    Labels

    api-needs-work: API needs work before it is approved, it is NOT ready for implementation
    area-System.Text.Encoding
    help wanted: [up-for-grabs] Good issue for external contributors
