Skip to content

YAML parser rejects Unicode surrogates. #2206

Open
@lhmathies

Description

Describe the bug

The YAML parser rejects Unicode surrogates, but the JSON parser accepts them. This breaks the expectation that you can parse JSON as if it were YAML.

I have a real world example where the JSON response body from an HTTP call to an API endpoint contains surrogates (in string values), but I'll illustrate it with a single non-BMP character (u+10336 Gothic Letter Iuja, 𐌶):

$ yq --version
yq (https://github.com/mikefarah/yq/) version v4.44.5
$ echo -n '"\ud800\udf36"' | yq -py
Error: bad file '-': yaml: found invalid Unicode character escape code
$ echo -n '"\ud800\udf36"' | yq -pj
𐌶
$ echo -n '"\ud800\udf36"' | yq -pj | tr -d \\n | iconv -t utf-32le | od -t x4
0000000 00010336
0000004

So arguably the author of the JSON should have used \U{10336}, but that only works for EcmaScript strings. (Tested in the Firefox console, but yq -pj doesn't grok it. Firefox also accepts surrogates).

YAML supports \U00010336, but that only works with yq -py. FWIW, the YAML 1.2 spec doesn't mention surrogates, but you can argue that they aren't "characters". I just need them to work...

(This sort of proves that YAML doesn't have JSON as a subset if you use the full EcmaScript string definition; but the JSON spec only has \uxxxx, so it's cool but you do need surrogates to reach outside the BMP).

Version of yq: 4.44.5
Operating system: linux amd64
Installed via: binary release

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions