Skip to content

UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore #4552

Open
@gentooise

Description

@gentooise

Description

According to this: https://json.nlohmann.me/api/basic_json/dump/#parameters , when passing error_handler_t::ignore to dump() function, invalid UTF-8 characters should be ignored and copied as-is into the final string.

However, I'm debugging the following minimal code:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

and the final test_dump string contains test\005 (byte \334 is gone).

image

Is this expected? Am I missing something?

Reproduction steps

Just try to run/debug the following:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Expected vs. actual results

Actual: test_dump contains test\005
Expected: test_dump contains test\334\005

Minimal code example

std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Error messages

No response

Compiler and operating system

gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924

Library version

3.11.2

Validation

  • The bug also occurs if the latest version from the develop branch is used.
    I can successfully compile and run the unit tests.

Activity

nlohmann

nlohmann commented on Dec 17, 2024

@nlohmann
Owner

Yes, there seems to be a logic error. \334 is written to the string buffer, but then overwritten by \u0005. The reason for this is line

bytes = bytes_after_last_accept;

but bytes_after_last_accept was not incremented after reading \334.

jordan-hoang

jordan-hoang commented on Dec 17, 2024

@jordan-hoang
Contributor

Interesting, I'll take a look perhaps.

nlohmann

nlohmann commented on Dec 18, 2024

@nlohmann
Owner

I had another look at the issue and this needs some discussion:

  • Right now, the error_handler_t::ignore option has few test cases, but these in connection with the current implementation all ignore invalid UTF-8 sequences and do not follow the documentation "all bytes are copied to the output unchanged".
  • Therefore, I think fixing the semantics of the error_handler_t::ignore function to meet that sentence from the documentation would be very surprising for existing implementations as after the fix, suddenly more bytes appear in the output.
  • Instead, I would propose adding a new enumerator, maybe error_handler_t::copy or error_handler_t::keep, that implements an approach that meets the sentence in the documentation and actually preserves input even in case of UTF-8 errors.

What do you think?

added a commit that references this issue on Dec 18, 2024
jordan-hoang

jordan-hoang commented on Dec 18, 2024

@jordan-hoang
Contributor

I think adding a new enumerator would be the best in this case then. Fixing the semantics of error_handler_t::ignore would make anyone who currently depends on it's current behavior unhappy. and we can probably update the docs to make it more accurate.

t-b

t-b commented on Dec 18, 2024

@t-b
Contributor

Yes I also think adding a new enumerator which preserves invalid UTF8 makes sense.

self-assigned this
on Dec 18, 2024
nlohmann

nlohmann commented on Dec 19, 2024

@nlohmann
Owner

Thanks for the input. Any preference on the name? error_handler_t::keep?

t-b

t-b commented on Dec 19, 2024

@t-b
Contributor

Sounds good!

jordan-hoang

jordan-hoang commented on Dec 19, 2024

@jordan-hoang
Contributor

Yeah keep sounds good.

nlohmann

nlohmann commented on Dec 22, 2024

@nlohmann
Owner

@gentooise @t-b @jordan-hoang Please take a look at #4555.

added
solution: proposed fixa fix for the issue has been proposed and waits for confirmation
and removed
state: please discussplease discuss the issue or vote for your favorite option
on Dec 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Participants

@nlohmann@t-b@gentooise@jordan-hoang

Issue actions

    UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore · Issue #4552 · nlohmann/json