Skip to content

feat(std.zon): add escape_unicode options to zon.serializer #23596

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

nurulhudaapon
Copy link

@nurulhudaapon nurulhudaapon commented Apr 17, 2025

Currently std.zon.stringify.serialize always escapes Unicode characters, while std.json.stringify by default does not. This change adds an escape_unicode option that matches the JSON serializer's behavior. To maintain backward compatibility, the default value is true, preserving the current behavior of escaping Unicode.

Change

Before

std.zon.stringify.serialize(buff, .{ .whitespace = true }, writer)

test "std.zon.stringify.serialize escape_unicode = true (default)" {
    const buff = .{ .name = "Test", .description = "⚡ Lightning Bolt", .emoji = "⚡" };
    var buff_str = std.ArrayList(u8).init(std.testing.allocator);
    defer buff_str.deinit();
    try std.zon.stringify.serialize(buff, .{ .whitespace = true }, buff_str.writer());
    std.debug.print("\n{s}\n", .{buff_str.items});
}

Output:

.{
    .name = "Test",
    .description = "\xe2\x9a\xa1 Lightning Bolt",
    .emoji = "\xe2\x9a\xa1",
}

After

std.zon.stringify.serialize(buff, .{ .escape_unicode = false, .whitespace = true }, writer)

test "std.zon.stringify.serialize escape_unicode = false (added option)" {
    const buff = .{ .name = "Test", .description = "⚡ Lightning Bolt", .emoji = "⚡" };
    var buff_str = std.ArrayList(u8).init(std.testing.allocator);
    defer buff_str.deinit();
    try std.zon.stringify.serialize(buff, .{ .escape_unicode = false, .whitespace = true }, buff_str.writer());
    std.debug.print("\n{s}\n", .{buff_str.items});
}

Output:

.{
    .name = "Test",
    .description = "⚡ Lightning Bolt",
    .emoji = "⚡",
}

Test

const std = @import("std");

test "std.zon.stringify.serialize escape_unicode = false" {
    var buf = std.ArrayList(u8).init(std.testing.allocator);
    defer buf.deinit();

    try std.zon.stringify.serialize(
        .{ .char = "abc⚡" },
        .{ .escape_unicode = false },
        buf.writer(),
    );
    try std.testing.expectEqualStrings(".{ .char = \"abc⚡\" }", buf.items);
    buf.clearRetainingCapacity();
}

Use Case

I was trying to store Unicode data in a ZON file, which I previously did in JSON. When converting from JSON to ZON using the JSON parser and ZON serializer, the Unicode characters were always escaped. This made the ZON file hard to read, which defeats its purpose as a human-readable format.

Closes #23535

Currently std.zon.stringify.serialize will always produce unicode to be escaped, whereas in std.json.stringify by default doesn't escape unicode. Adding escape_unicode option matching with the json serializer but by default it is false (as the current behaviour) to keep things backward compatible.

```zig
const std = @import("std");

test "std.zon.stringify.serialize escape_unicode = false" {
    var buf = std.ArrayList(u8).init(std.testing.allocator);
    defer buf.deinit();

    try std.zon.stringify.serialize(
        .{ .char = 'অ' },
        .{ .escape_unicode = false },
        buf.writer(),
    );
    try std.testing.expectEqualStrings(".{ .char = \"অ\" }", buf.items);
    buf.clearRetainingCapacity();
}
```
@alexrp
Copy link
Member

alexrp commented Apr 17, 2025

cc @MasonRemaley

@MasonRemaley
Copy link
Contributor

MasonRemaley commented Apr 17, 2025

Thanks for the PR!

I'll take a look at this and the other Unicode related issue today. In particular, I want to look into whether or not it's necessary to maintain backwards compatibility with the current behavior.

[EDIT] Sorry for the delay, haven't forgotten about this though will get to it soon!

@nurulhudaapon
Copy link
Author

Thanks for the PR!

I'll take a look at this and the other Unicode related issue today. In particular, I want to look into whether or not it's necessary to maintain backwards compatibility with the current behavior.

Yeah, I feel like it doesn't need to be backward compatible and should by default not escape unicode since this is usual behavior in most serializer and zon.serializer has not been adopted that much yet.

@MasonRemaley
Copy link
Contributor

MasonRemaley commented Apr 24, 2025

Apologies for the delay on this!

Looking it over, there was no good reason for me to escape everything by default. Adding escape_unicode as an option is good, and it should be false by default.

However there's one important case that needs to be addressed before this can be merged. Unless I'm missing something, the implementation here now doesn't escape \ or " which is necessary for correctness.

You can see how std.json handles this here. I think escaping these two characters is sufficient to guarantee that the output is a valid Zig string, but it's worth double checking stringEscape to make sure it's not doing anything else necessary.

@MasonRemaley
Copy link
Contributor

Linking the issue you filed #23535 here since it's related to this PR in that it's an example of a character that can't really be printed the way you'd expect right now. We probably want to figure out how to address this as well.

…de when needed

Previousely
`⚡` -> `'\xe2\x9a\xa1'` (Notice the hex code is single quoted which is not valid Zig/ZON syntax)

Now
`⚡` -> `"\xe2\x9a\xa1"`

`127` -> `'\x7f'` (Will still emit single quoted hex when possible)
@nurulhudaapon
Copy link
Author

Apologies for the delay on this!

Looking it over, there was no good reason for me to escape everything by default. Adding escape_unicode as an option is good, and it should be false by default.

However there's one important case that needs to be addressed before this can be merged. Unless I'm missing something, the implementation here now doesn't escape \ or " which is necessary for correctness.

You can see how std.json handles this here. I think escaping these two characters is sufficient to guarantee that the output is a valid Zig string, but it's worth double checking stringEscape to make sure it's not doing anything else necessary.

Updated the code to escape items that needs to be escaped following what stringEscape does. I'm wondering though if it is okay to have this almost duplicated string escape logic here or just update stringEscape to have options to not escape unicode and re-use the same here. Otherwise everything should be good.

@nurulhudaapon
Copy link
Author

Hi @MasonRemaley, let me know if the current changes look good! Thanks.

@MasonRemaley
Copy link
Contributor

MasonRemaley commented Jun 3, 2025

This looks good, but I don't think pub fn codePoint is correct. As written it will sometimes output a double quoted value, and sometimes output a single quoted value. While both of these are valid outputs, they aren't actually interchangeable with eachother since one is a value and one is an array.

As you pointed out in #23535 the way I had it before wasn't correct either. One option is to just not change codePoint in this PR since the rest of the PR doesn't depend on it, and then merge otherwise as is. I can figure out fixing codePoint in a separate PR.

On the other hand, if you want to resolve #2353 in this PR, we could use the \u{hex number} syntax for characters that don't have recognizable escape codes like \n. Though maybe this should also respect the option to use escapes or not.

@nurulhudaapon
Copy link
Author

This looks good, but I don't think pub fn codePoint is correct. As written it will sometimes output a double quoted value, and sometimes output a single quoted value. While both of these are valid outputs, they aren't actually interchangeable with eachother since one is a value and one is an array.

As you pointed out in #23535 the way I had it before wasn't correct either. One option is to just not change codePoint in this PR since the rest of the PR doesn't depend on it, and then merge otherwise as is. I can figure out fixing codePoint in a separate PR.

On the other hand, if you want to resolve #2353 in this PR, we could use the \u{hex number} syntax for characters that don't have recognizable escape codes like \n. Though maybe this should also respect the option to use escapes or not.

Thank you for reviewing. Going with the first option to not change codePoint here as this PR doesn't depend on it nor it is currently blocking at least my use case. Just reverted the codePoint changes.

Yes, \u{hex} should probably the best way to resolve this as it aligns with json.stringify as well.

Copy link
Contributor

@MasonRemaley MasonRemaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a little more carefully, it's still not quite right, because some characters need to be escaped even if escapes are "off", for example, quotes.

I went ahead and fixed it locally, and also fixed #23535 by printing either \x or \u escapes as appropriate. However GitHub's "suggest changes" feature is broken for me today.

Here's a gist with my fixes, you should be able to paste this over your file locally to see the diff.

@nurulhudaapon
Copy link
Author

Looking at this a little more carefully, it's still not quite right, because some characters need to be escaped even if escapes are "off", for example, quotes.

I went ahead and fixed it locally, and also fixed #23535 by printing either \x or \u escapes as appropriate. However GitHub's "suggest changes" feature is broken for me today.

Here's a gist with my fixes, you should be able to paste this over your file locally to see the diff.

Thank you so much @MasonRemaley, applied the suggested changes!

@MasonRemaley
Copy link
Contributor

No prob!

Copy link
Contributor

@MasonRemaley MasonRemaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nurulhudaapon
Copy link
Author

Can we get the workflow run to be approved for this PR?

@MasonRemaley
Copy link
Contributor

MasonRemaley commented Jun 24, 2025

@mlugg checked over this for me (I don't have power to merge) and found a couple problems with my fix.

I'll take care of these myself, but I'm listing them here for the record/so that I don't have to dig through chat to find them later:

  • Right now both single and double quotes are always escaped, it would be more natural to only escape single quotes when in a single quoted literal, and double quotes when in a double quoted literal.
  • Codepoint literals should always use the \u escape since they are unicode codepoint literals by definition, but strings should not since the string may not actually be unicode. Instead strings should be conservative and alway suse the \x escapes.
  • catch return error.InvalidCodepoint -- this code is incorrect, surrogates (0xD800...0xDFFF)are valid codepoints and can be represented with \u escapes. He suggested something like this.
  • There are some places the code is a bit complicated in an effort to move the conditional out of the loop which may be overkill.

(Also side note: escape unicode isn't exactly a correct name since not all unicode characters are escaped, escape_non_ascii or ascii_only may make more sense. That being said I think the intended meaning is clear from context and the json parser uses the same convention so we may just leave that alone for now.)

[EDIT]

  • Also I should updated the incorrect doc comment on fmtEscapes that lead me astray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

std.zon.stringify.serialize is escaping integer literals when emit_codepoint_literals = .always
3 participants