Describe the bug
The YAML parser rejects Unicode surrogates, but the JSON parser accepts them. This breaks the expectation that you can parse JSON as if it were YAML.
I have a real-world example where the JSON response body from an HTTP call to an API endpoint contains surrogates (in string values), but I'll illustrate it with a single non-BMP character (U+10336 GOTHIC LETTER IUJA, 𐌶):
$ yq --version
yq (https://github.com/mikefarah/yq/) version v4.44.5
$ echo -n '"\ud800\udf36"' | yq -py
Error: bad file '-': yaml: found invalid Unicode character escape code
$ echo -n '"\ud800\udf36"' | yq -pj
𐌶
$ echo -n '"\ud800\udf36"' | yq -pj | tr -d \\n | iconv -t utf-32le | od -t x4
0000000 00010336
0000004
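For reference, this is how the surrogate pair above maps to U+10336. It is a minimal sketch in POSIX shell arithmetic; `surrogates_to_codepoint` is a hypothetical helper name, not part of yq:

```shell
# Hypothetical helper (not a yq feature): combine a UTF-16 surrogate pair,
# given as two hex strings, into the Unicode code point it encodes.
surrogates_to_codepoint() {
  hi=$((0x$1))   # high surrogate, expected in D800..DBFF
  lo=$((0x$2))   # low surrogate, expected in DC00..DFFF
  printf 'U+%X\n' $(( 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) ))
}

surrogates_to_codepoint d800 df36   # prints U+10336
```

This matches the `00010336` that `od` reports for the `yq -pj` output.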
So arguably the author of the JSON should have used \u{10336}, but that only works for ECMAScript strings. (Tested in the Firefox console; yq -pj doesn't grok it. Firefox also accepts surrogates.)
YAML supports \U00010336, but that only works with yq -py. FWIW, the YAML 1.2 spec doesn't mention surrogates, but you can argue that they aren't "characters". I just need them to work...
(This sort of proves that YAML doesn't have JSON as a subset if you use the full ECMAScript string definition. The JSON spec only has \uXXXX, so that part is fine, but it means you need surrogate pairs to reach outside the BMP.)
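To illustrate that last point: a JSON producer has to split a non-BMP code point into two \uXXXX escapes. This is a rough sketch in POSIX shell arithmetic; `codepoint_to_surrogates` is a hypothetical helper name, not a yq feature:

```shell
# Hypothetical helper (not a yq feature): split a non-BMP code point,
# given as a hex string, into the JSON \uXXXX surrogate-pair escapes.
codepoint_to_surrogates() {
  v=$(( 0x$1 - 0x10000 ))   # 20-bit offset above the BMP
  printf '\\u%04x\\u%04x\n' $(( 0xD800 + (v >> 10) )) $(( 0xDC00 + (v & 0x3FF) ))
}

codepoint_to_surrogates 10336   # prints \ud800\udf36
```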
Version of yq: 4.44.5
Operating system: linux amd64
Installed via: binary release