feat(std.zon): add escape_unicode options to zon.serializer #23596

nurulhudaapon · 2025-04-17T06:42:06Z

Currently std.zon.stringify.serialize always escapes Unicode characters, while std.json.stringify by default does not. This change adds an escape_unicode option that matches the JSON serializer's behavior. To maintain backward compatibility, the default value is true, preserving the current behavior of escaping Unicode.

Change

Before

std.zon.stringify.serialize(buff, .{ .whitespace = true }, writer)

test "std.zon.stringify.serialize escape_unicode = true (default)" {
    const buff = .{ .name = "Test", .description = "⚡ Lightning Bolt", .emoji = "⚡" };
    var buff_str = std.ArrayList(u8).init(std.testing.allocator);
    defer buff_str.deinit();
    try std.zon.stringify.serialize(buff, .{ .whitespace = true }, buff_str.writer());
    std.debug.print("\n{s}\n", .{buff_str.items});
}

Output:

.{
    .name = "Test",
    .description = "\xe2\x9a\xa1 Lightning Bolt",
    .emoji = "\xe2\x9a\xa1",
}

After

std.zon.stringify.serialize(buff, .{ .escape_unicode = false, .whitespace = true }, writer)

test "std.zon.stringify.serialize escape_unicode = false (added option)" {
    const buff = .{ .name = "Test", .description = "⚡ Lightning Bolt", .emoji = "⚡" };
    var buff_str = std.ArrayList(u8).init(std.testing.allocator);
    defer buff_str.deinit();
    try std.zon.stringify.serialize(buff, .{ .escape_unicode = false, .whitespace = true }, buff_str.writer());
    std.debug.print("\n{s}\n", .{buff_str.items});
}

Output:

.{
    .name = "Test",
    .description = "⚡ Lightning Bolt",
    .emoji = "⚡",
}

Test

const std = @import("std");

test "std.zon.stringify.serialize escape_unicode = false" {
    var buf = std.ArrayList(u8).init(std.testing.allocator);
    defer buf.deinit();

    try std.zon.stringify.serialize(
        .{ .char = "abc⚡" },
        .{ .escape_unicode = false },
        buf.writer(),
    );
    try std.testing.expectEqualStrings(".{ .char = \"abc⚡\" }", buf.items);
    buf.clearRetainingCapacity();
}

Use Case

I was trying to store Unicode data in a ZON file, which I previously did in JSON. When converting from JSON to ZON using the JSON parser and ZON serializer, the Unicode characters were always escaped. This made the ZON file hard to read, which defeats its purpose as a human-readable format.

Closes #23535

Currently std.zon.stringify.serialize always escapes Unicode, whereas std.json.stringify does not escape it by default. This adds an escape_unicode option matching the JSON serializer; it defaults to true (the current behaviour) to stay backward compatible.

```zig
const std = @import("std");

test "std.zon.stringify.serialize escape_unicode = false" {
    var buf = std.ArrayList(u8).init(std.testing.allocator);
    defer buf.deinit();

    try std.zon.stringify.serialize(
        .{ .char = 'অ' },
        .{ .escape_unicode = false },
        buf.writer(),
    );
    try std.testing.expectEqualStrings(".{ .char = \"অ\" }", buf.items);
    buf.clearRetainingCapacity();
}
```

alexrp · 2025-04-17T17:38:41Z

cc @MasonRemaley

MasonRemaley · 2025-04-17T17:42:31Z

Thanks for the PR!

I'll take a look at this and the other Unicode related issue today. In particular, I want to look into whether or not it's necessary to maintain backwards compatibility with the current behavior.

[EDIT] Sorry for the delay. I haven't forgotten about this and will get to it soon!

nurulhudaapon · 2025-04-17T19:33:21Z

> Thanks for the PR!
>
> I'll take a look at this and the other Unicode related issue today. In particular, I want to look into whether or not it's necessary to maintain backwards compatibility with the current behavior.

Yeah, I feel like it doesn't need to be backward compatible and should not escape Unicode by default, since that is the usual behavior in most serializers and zon.serializer has not been widely adopted yet.

MasonRemaley · 2025-04-24T03:24:36Z

Apologies for the delay on this!

Looking it over, there was no good reason for me to escape everything by default. Adding escape_unicode as an option is good, and it should be false by default.

However there's one important case that needs to be addressed before this can be merged. Unless I'm missing something, the implementation here now doesn't escape \ or " which is necessary for correctness.

You can see how std.json handles this here. I think escaping these two characters is sufficient to guarantee that the output is a valid Zig string, but it's worth double checking stringEscape to make sure it's not doing anything else necessary.
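For reference, here is a minimal sketch of the point about `\` and `"`. The helper name `escapeStringByte`, its `escape_non_ascii` parameter, and the scratch buffer are hypothetical; this is not the code in this PR or in stringEscape. It only illustrates that quotes and backslashes have to be escaped whether or not non-ASCII bytes are.

```zig
const std = @import("std");

// Hypothetical helper for illustration; not the std.zon implementation.
// Returns the text to emit for one byte inside a double-quoted string literal.
// `buf` is scratch space for results that are not constant strings.
fn escapeStringByte(b: u8, escape_non_ascii: bool, buf: *[4]u8) []const u8 {
    const hex = "0123456789abcdef";
    switch (b) {
        // These must always be escaped, or the output is not a valid literal.
        '\\' => return "\\\\",
        '"' => return "\\\"",
        '\n' => return "\\n",
        '\r' => return "\\r",
        '\t' => return "\\t",
        else => {},
    }
    const is_control = b < 0x20 or b == 0x7f;
    if (is_control or (escape_non_ascii and b >= 0x80)) {
        // Emit a \xNN escape for this byte.
        buf.* = .{ '\\', 'x', hex[b >> 4], hex[b & 0x0f] };
        return buf;
    }
    // Everything else passes through unchanged.
    buf[0] = b;
    return buf[0..1];
}

test "quotes and backslashes are escaped even when non-ASCII is not" {
    var scratch: [4]u8 = undefined;
    try std.testing.expectEqualStrings("\\\"", escapeStringByte('"', false, &scratch));
    try std.testing.expectEqualStrings("\\\\", escapeStringByte('\\', false, &scratch));
    try std.testing.expectEqualStrings("\\xe2", escapeStringByte(0xe2, true, &scratch));
    try std.testing.expectEqualStrings("\xe2", escapeStringByte(0xe2, false, &scratch));
}
```

With `escape_non_ascii` set to false, the three bytes of `⚡` pass through untouched, while a `"` inside the same string still comes out as `\"`.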

MasonRemaley · 2025-04-24T05:30:19Z

Linking the issue you filed #23535 here since it's related to this PR in that it's an example of a character that can't really be printed the way you'd expect right now. We probably want to figure out how to address this as well.

…de when needed

Previously `⚡` -> `'\xe2\x9a\xa1'` (note that the hex escape is single-quoted, which is not valid Zig/ZON syntax).
Now `⚡` -> `"\xe2\x9a\xa1"` and `127` -> `'\x7f'` (a single-quoted hex escape is still emitted when possible).

nurulhudaapon · 2025-05-21T05:06:32Z

> Apologies for the delay on this!
>
> Looking it over, there was no good reason for me to escape everything by default. Adding escape_unicode as an option is good, and it should be false by default.
>
> However there's one important case that needs to be addressed before this can be merged. Unless I'm missing something, the implementation here now doesn't escape \ or " which is necessary for correctness.
>
> You can see how std.json handles this here. I think escaping these two characters is sufficient to guarantee that the output is a valid Zig string, but it's worth double checking stringEscape to make sure it's not doing anything else necessary.

Updated the code to escape the characters that need to be escaped, following what stringEscape does. I'm wondering, though, whether it is okay to have this almost-duplicated string escape logic here, or whether stringEscape should instead get an option to not escape Unicode so it can be re-used here. Otherwise everything should be good.

nurulhudaapon · 2025-05-30T10:24:25Z

Hi @MasonRemaley, let me know if the current changes look good! Thanks.

MasonRemaley · 2025-06-03T19:49:04Z

This looks good, but I don't think pub fn codePoint is correct. As written it will sometimes output a double quoted value, and sometimes output a single quoted value. While both of these are valid outputs, they aren't actually interchangeable with each other since one is a value and one is an array.

As you pointed out in #23535 the way I had it before wasn't correct either. One option is to just not change codePoint in this PR since the rest of the PR doesn't depend on it, and then merge otherwise as is. I can figure out fixing codePoint in a separate PR.

On the other hand, if you want to resolve #23535 in this PR, we could use the \u{hex number} syntax for characters that don't have recognizable escape codes like \n. Though maybe this should also respect the option to use escapes or not.
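To make the \u{hex number} idea concrete, here is a rough sketch under stated assumptions. The helper `codePointLiteral` is hypothetical and is not the codePoint function in this PR; it renders a code point as a single-quoted literal, using a named escape where one exists, the character itself for printable ASCII, and \u{...} for everything else.

```zig
const std = @import("std");

// Hypothetical helper for illustration; not the std.zon implementation.
// Writes a Zig character literal for `cp` into `buf` and returns the slice used.
fn codePointLiteral(buf: []u8, cp: u21) ![]u8 {
    return switch (cp) {
        // Characters with a recognizable escape code keep it.
        '\n' => std.fmt.bufPrint(buf, "'\\n'", .{}),
        '\r' => std.fmt.bufPrint(buf, "'\\r'", .{}),
        '\t' => std.fmt.bufPrint(buf, "'\\t'", .{}),
        '\\' => std.fmt.bufPrint(buf, "'\\\\'", .{}),
        '\'' => std.fmt.bufPrint(buf, "'\\''", .{}),
        // Printable ASCII other than ' (0x27) and \ (0x5c) is written as-is.
        0x20...0x26, 0x28...0x5b, 0x5d...0x7e => std.fmt.bufPrint(
            buf,
            "'{c}'",
            .{@as(u8, @intCast(cp))},
        ),
        // Everything else uses the \u{hex} form; "{{" and "}}" are literal braces.
        else => std.fmt.bufPrint(buf, "'\\u{{{x}}}'", .{cp}),
    };
}

test "code points without a named escape use the \\u form" {
    var buf: [16]u8 = undefined;
    try std.testing.expectEqualStrings("'\\n'", try codePointLiteral(&buf, '\n'));
    try std.testing.expectEqualStrings("'a'", try codePointLiteral(&buf, 'a'));
    try std.testing.expectEqualStrings("'\\u{26a1}'", try codePointLiteral(&buf, 0x26a1));
    try std.testing.expectEqualStrings("'\\u{7f}'", try codePointLiteral(&buf, 0x7f));
}
```

Unlike the mixed single/double quoted output discussed above, every result here is a single-quoted literal, so the output always denotes a value rather than an array. Whether non-ASCII code points such as U+26A1 should be escaped at all when escapes are turned off is left open in this sketch.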

…d hex code when needed" This reverts commit b9783ef.

nurulhudaapon · 2025-06-04T05:37:51Z

> This looks good, but I don't think pub fn codePoint is correct. As written it will sometimes output a double quoted value, and sometimes output a single quoted value. While both of these are valid outputs, they aren't actually interchangeable with each other since one is a value and one is an array.
>
> As you pointed out in #23535 the way I had it before wasn't correct either. One option is to just not change codePoint in this PR since the rest of the PR doesn't depend on it, and then merge otherwise as is. I can figure out fixing codePoint in a separate PR.
>
> On the other hand, if you want to resolve #23535 in this PR, we could use the \u{hex number} syntax for characters that don't have recognizable escape codes like \n. Though maybe this should also respect the option to use escapes or not.

Thank you for reviewing. I'm going with the first option and not changing codePoint here, since this PR doesn't depend on it and it isn't currently blocking my use case at least. I just reverted the codePoint changes.

Yes, \u{hex} is probably the best way to resolve this, as it aligns with json.stringify as well.

MasonRemaley

Looking at this a little more carefully, it's still not quite right, because some characters need to be escaped even if escapes are "off", for example, quotes.

I went ahead and fixed it locally, and also fixed #23535 by printing either \x or \u escapes as appropriate. However GitHub's "suggest changes" feature is broken for me today.

Here's a gist with my fixes, you should be able to paste this over your file locally to see the diff.

…oint

nurulhudaapon · 2025-06-06T04:08:57Z

> Looking at this a little more carefully, it's still not quite right, because some characters need to be escaped even if escapes are "off", for example, quotes.
>
> I went ahead and fixed it locally, and also fixed #23535 by printing either \x or \u escapes as appropriate. However GitHub's "suggest changes" feature is broken for me today.
>
> Here's a gist with my fixes, you should be able to paste this over your file locally to see the diff.

Thank you so much @MasonRemaley, applied the suggested changes!

MasonRemaley · 2025-06-06T04:14:31Z

No prob!

MasonRemaley

LGTM!

nurulhudaapon · 2025-06-12T06:01:27Z

Can we get the workflow run to be approved for this PR?

MasonRemaley · 2025-06-24T19:12:28Z

@mlugg checked over this for me (I don't have power to merge) and found a couple problems with my fix.

I'll take care of these myself, but I'm listing them here for the record/so that I don't have to dig through chat to find them later:

- Right now both single and double quotes are always escaped. It would be more natural to escape single quotes only inside a single-quoted literal, and double quotes only inside a double-quoted literal.
- Codepoint literals should always use the \u escape, since they are Unicode codepoint literals by definition, but strings should not, since the string may not actually be Unicode. Instead, strings should be conservative and always use the \x escapes.
- catch return error.InvalidCodepoint: this code is incorrect; surrogates (0xD800...0xDFFF) are valid codepoints and can be represented with \u escapes. He suggested something like this.
- There are some places where the code is a bit complicated in an effort to move the conditional out of the loop, which may be overkill.
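The first item in the list above can be sketched as follows. The `Quote` enum and `quoteNeedsEscape` are hypothetical names used only for illustration, not the eventual fix: a quote character needs a backslash only when it matches the delimiter of the literal it appears in.

```zig
const std = @import("std");

// Hypothetical types for illustration; not the std.zon implementation.
const Quote = enum { single, double };

// Reports whether byte `c` must be backslash-escaped purely because it is a
// quote character, given the kind of literal it is being written into.
fn quoteNeedsEscape(c: u8, delimiter: Quote) bool {
    return switch (delimiter) {
        .single => c == '\'',
        .double => c == '"',
    };
}

test "only the matching quote is escaped" {
    // Inside "..." only the double quote needs a backslash.
    try std.testing.expect(quoteNeedsEscape('"', .double));
    try std.testing.expect(!quoteNeedsEscape('\'', .double));
    // Inside '...' only the single quote needs a backslash.
    try std.testing.expect(quoteNeedsEscape('\'', .single));
    try std.testing.expect(!quoteNeedsEscape('"', .single));
}
```

Under this rule a string such as `"it's"` would be emitted without any backslash, and the character literal `'"'` likewise.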

(Also side note: escape unicode isn't exactly a correct name since not all unicode characters are escaped, escape_non_ascii or ascii_only may make more sense. That being said I think the intended meaning is clear from context and the json parser uses the same convention so we may just leave that alone for now.)

[EDIT]

Also, I should update the incorrect doc comment on fmtEscapes that led me astray.

nurulhudaapon added 3 commits May 9, 2025 10:52

wip: make escape_unicode = false by default

066ea42

fix: escape ",\r,\n,\t,\

7f6586b

Merge branch 'master' into zon/serializer-unicode-escaping

a277bc8

Merge branch 'master' into zon/serializer-unicode-escaping

3cc48cc

Revert "fix: make emit_codepoint_literals = .always emit double quote…

e8dacd5

…d hex code when needed" This reverts commit b9783ef.

Merge branch 'master' into zon/serializer-unicode-escaping

758be84

MasonRemaley suggested changes Jun 6, 2025

fix: always escape some char, move escaping logic and re-use in codep…

6de0089

…oint

nurulhudaapon requested a review from MasonRemaley June 6, 2025 04:09

Merge branch 'master' into zon/serializer-unicode-escaping

f16c6bb

MasonRemaley approved these changes Jun 6, 2025

nurulhudaapon added 2 commits June 6, 2025 17:53

Merge branch 'master' into zon/serializer-unicode-escaping

a3b9f3e

Merge branch 'master' into zon/serializer-unicode-escaping

0144c53

Merge branch 'master' into zon/serializer-unicode-escaping

7761c02
