Description
Hi,
utf8proc_map_custom takes a void *custom_data parameter. However, if custom_data is modified by the custom function, utf8proc_map_custom might not work as expected, and could possibly corrupt memory. The reason is that utf8proc_map_custom calls utf8proc_decompose_custom twice and keeps only the results of the second call, but there is no way to "reset" the custom data to its initial state before that second call.
As an example, imagine I use a custom transformation to replace the first character with 'A', but keep the rest of the string. Using utf8proc_map_custom would seem easy enough:
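struct ctx {
    char start_of_string; /* nonzero until the first codepoint has been seen */
};

/* The callback has to match utf8proc_custom_func, so the state arrives as void *. */
static utf8proc_int32_t replaceFirstCharWithA(utf8proc_int32_t codepoint, void *data) {
    struct ctx *ctx = data;
    if (ctx->start_of_string)
        codepoint = 'A';
    ctx->start_of_string = 0;
    return codepoint;
}

void test(void) {
    struct ctx ctx;
    ctx.start_of_string = 1;
    utf8proc_map_custom(..., replaceFirstCharWithA, &ctx);
}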
However, this will not actually work, because utf8proc_map_custom only keeps the results of its second utf8proc_decompose_custom call, by which time ctx->start_of_string will already have been set to 0.
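To make the failure mode concrete without depending on the library, here is a minimal self-contained model of the two-pass structure described above. It does not call utf8proc at all: custom_func_t, model_pass, model_map and demo_first_codepoint are hypothetical names invented for this sketch, and the "sizing pass, then kept pass" shape is my reading of utf8proc_map_custom rather than a copy of its code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the callback type: codepoint in, codepoint out. */
typedef int32_t (*custom_func_t)(int32_t codepoint, void *data);

struct ctx {
    char start_of_string; /* nonzero until the first codepoint has been seen */
};

static int32_t replace_first_char_with_a(int32_t codepoint, void *data) {
    struct ctx *ctx = data;
    if (ctx->start_of_string)
        codepoint = 'A';
    ctx->start_of_string = 0;
    return codepoint;
}

/* One pass over the input, calling the callback once per codepoint.
 * With out == NULL nothing is stored (sizing pass).
 * Returns the number of codepoints the pass produces. */
static size_t model_pass(const int32_t *in, size_t n, int32_t *out,
                         custom_func_t func, void *data) {
    size_t i;
    for (i = 0; i < n; i++) {
        int32_t cp = func(in[i], data);
        if (out)
            out[i] = cp;
    }
    return n;
}

/* Model of the two-pass structure: the SAME data pointer is handed to
 * both passes, so any state the callback changed in the sizing pass is
 * still changed when the kept pass starts. */
static size_t model_map(const int32_t *in, size_t n, int32_t *out,
                        size_t out_cap, custom_func_t func, void *data) {
    size_t needed = model_pass(in, n, NULL, func, data); /* results discarded */
    assert(needed <= out_cap);
    return model_pass(in, n, out, func, data);           /* results kept */
}

/* Runs the model on "hello" and returns the first output codepoint. */
int32_t demo_first_codepoint(void) {
    const int32_t in[] = { 'h', 'e', 'l', 'l', 'o' };
    int32_t out[5];
    struct ctx ctx = { 1 };
    model_map(in, 5, out, 5, replace_first_char_with_a, &ctx);
    return out[0];
}
```

In this model demo_first_codepoint() returns 'h' rather than 'A': the flag is consumed during the sizing pass, so the pass whose output is kept never sees it set.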
I believe this could also lead to memory corruption if the above example were run with an input string whose first character is multi-byte. In that case the first run of utf8proc_decompose_custom would compute a length assuming a single-byte first character, but the second run would write a multi-byte first character.
It would be nice if there were a way to fix this issue inside utf8proc_map_custom without changing its signature, but that does not seem possible. At a minimum, it would seem safer and more accurate if custom_data were a const void * instead of a void * (arguably, the current void * form is a bug as it stands).
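For illustration, this is roughly what "resetting the custom data before the second call" could look like, again as a self-contained model rather than a patch to the library. model_map_with_reset and its data_size parameter are hypothetical; the sketch assumes the state is plain data that can be copied with memcpy and fits in a small fixed buffer, which would not hold for every caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the callback type: codepoint in, codepoint out. */
typedef int32_t (*custom_func_t)(int32_t codepoint, void *data);

struct ctx {
    char start_of_string; /* nonzero until the first codepoint has been seen */
};

static int32_t replace_first_char_with_a(int32_t codepoint, void *data) {
    struct ctx *ctx = data;
    if (ctx->start_of_string)
        codepoint = 'A';
    ctx->start_of_string = 0;
    return codepoint;
}

/* One pass over the input; with out == NULL nothing is stored. */
static size_t model_pass(const int32_t *in, size_t n, int32_t *out,
                         custom_func_t func, void *data) {
    size_t i;
    for (i = 0; i < n; i++) {
        int32_t cp = func(in[i], data);
        if (out)
            out[i] = cp;
    }
    return n;
}

/* Two-pass mapping that snapshots data_size bytes of *data before the
 * sizing pass and copies them back before the kept pass.
 * Assumption: the state is trivially copyable and at most 64 bytes. */
static size_t model_map_with_reset(const int32_t *in, size_t n, int32_t *out,
                                   size_t out_cap, custom_func_t func,
                                   void *data, size_t data_size) {
    unsigned char snapshot[64];
    size_t needed;

    assert(data_size <= sizeof snapshot);
    memcpy(snapshot, data, data_size);            /* remember initial state */
    needed = model_pass(in, n, NULL, func, data); /* sizing pass */
    assert(needed <= out_cap);
    memcpy(data, snapshot, data_size);            /* restore initial state */
    return model_pass(in, n, out, func, data);    /* kept pass */
}

/* Runs the resetting model on "hello" and returns the first output codepoint. */
int32_t demo_first_codepoint_with_reset(void) {
    const int32_t in[] = { 'h', 'e', 'l', 'l', 'o' };
    int32_t out[5];
    struct ctx ctx = { 1 };
    model_map_with_reset(in, 5, out, 5, replace_first_char_with_a,
                         &ctx, sizeof ctx);
    return out[0];
}
```

Here demo_first_codepoint_with_reset() returns 'A'. The extra size argument is exactly the kind of signature change that utf8proc_map_custom cannot absorb today. A caller who needs stateful callbacks could presumably get the same effect by calling utf8proc_decompose_custom twice themselves, resetting their own state in between, and then calling utf8proc_reencode, though I have not tested that.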