Open
Description
Description
According to this: https://json.nlohmann.me/api/basic_json/dump/#parameters , when passing error_handler_t::ignore
to dump()
function, invalid UTF-8 characters should be ignored and copied as-is into the final string.
However, I'm debugging the following minimal code:
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
and the final test_dump
string contains test\005
(byte \334
is gone).
Is this expected? Am I missing something?
Reproduction steps
Just try to run/debug the following:
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
Expected vs. actual results
Actual: test_dump
contains test\005
Expected: test_dump
contains test\334\005
Minimal code example
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
Error messages
No response
Compiler and operating system
gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924
Library version
3.11.2
Validation
- The bug also occurs if the latest version from the
develop
branch is used.I can successfully compile and run the unit tests.
Activity
nlohmann commentedon Dec 17, 2024
Yes, there seems to be a logic error.
\334
is written to the string buffer, but then overwritten by\u0005
. The reason for this is linebut
bytes_after_last_accept
was not incremented after reading\334
.jordan-hoang commentedon Dec 17, 2024
Interesting, I'll take a look perhaps.
nlohmann commentedon Dec 18, 2024
I had another look at the issue and this needs some discussion:
error_handler_t::ignore
option has few test cases, but these in connection with the current implementation all ignore invalid UTF-8 sequences and do not follow the documentation "all bytes are copied to the output unchanged".error_handler_t::ignore
function to meet that sentence from the documentation would be very surprising for existing implementations as after the fix, suddenly more bytes appear in the output.error_handler_t::copy
orerror_handler_t::keep
, that implements an approach that meets the sentence in the documentation and actually preserves input even in case of UTF-8 errors.What do you think?
🚧 WIP for #4552
jordan-hoang commentedon Dec 18, 2024
I think adding a new enumerator would be the best in this case then. Fixing the semantics of error_handler_t::ignore would make anyone who currently depends on it's current behavior unhappy. and we can probably update the docs to make it more accurate.
t-b commentedon Dec 18, 2024
Yes I also think adding a new enumerator which preserves invalid UTF8 makes sense.
nlohmann commentedon Dec 19, 2024
Thanks for the input. Any preference on the name?
error_handler_t::keep
?t-b commentedon Dec 19, 2024
Sounds good!
jordan-hoang commentedon Dec 19, 2024
Yeah keep sounds good.
nlohmann commentedon Dec 22, 2024
@gentooise @t-b @jordan-hoang Please take a look at #4555.