The status of UTF-16 vs UTF-8 output #13806
Replies: 5 comments
-
The goal is that either
The CRT functions often do their own internal conversions before writing to the console. See information on
It what? The CRT? Probably not for compatibility reasons. You will probably have to always declare your intent to the CRT to distinguish yourself from a classic application that just assumed. Flaws in Conhost and Terminal that lead to the mishandling of any particular Unicode character or sequence? Yes, hopefully, eventually. We tend to make slow and steady progress over time improving the buffers, renderers, and translators to cover top-requested issues on this tracker.
They store things internally as The PTY mechanism always submits between the processes as UTF-8. Data is converted back and forth from rest to transit to rest again, but because UTF8/UTF16 is an algorithmic conversion, we believe this is lossless.
For a conhost acting in PTY, something is always translated as the rest buffer is ideally UTF-16 and the PTY communication channel is UTF-8. Though again, algorithmic conversions SHOULD be no problem.
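To make "algorithmic conversion" concrete, here is a minimal, portable sketch of the UTF-16 to UTF-8 round trip. The helper names (`append_utf8`, `utf16_to_utf8`, `utf8_to_utf16`) are made up for illustration and are not the conhost implementation, and the decoder assumes well-formed input. It also shows the one case where the conversion cannot be lossless: a lone surrogate has no UTF-8 encoding, so this sketch substitutes U+FFFD, as converters commonly do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not the conhost code.

// Append one Unicode code point to a UTF-8 string.
void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// UTF-16 -> UTF-8. A surrogate without its partner becomes U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i; // consumed the low surrogate as well
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate: not representable in UTF-8
        }
        append_utf8(out, cp);
    }
    return out;
}

// UTF-8 -> UTF-16. Assumes well-formed input (no validation).
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        const int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? lead : lead & (0x3F >> (len - 1));
        for (int k = 1; k < len && i + k < in.size(); ++k)
        {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000)
        {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

With the sample string used in this thread (U+2002C U+1F495), the 4 UTF-16 code units become 8 UTF-8 bytes and convert back unchanged, while the first code unit on its own converts to the bytes EF BF BD (U+FFFD). Whether that is what happens inside conhost when a pair is split is a guess on my part, not something confirmed here.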
-
shouldn't
-
I was always under the impression that wide CRT functions are lossless and do not locale-convert (This is AFAIK true for BMP).
Conhost as well as Terminal fail to display/output non-BMP characters in the majority of UTF-16 scenarios. Here is the sample code example (compile with
#define _CRT_SECURE_NO_WARNINGS
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <Windows.h>
#include <iostream>
#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <cstring>
#include <tuple>

// Two supplementary-plane characters (U+2002C, U+1F495): 4 UTF-16 code units,
// or 8 bytes when the narrow literal is encoded as UTF-8.
constexpr auto str_utf16 = L"\U0002002C\U0001F495";
constexpr auto str_utf8 = "\U0002002C\U0001F495";
constexpr auto bmp_utf16 = L"BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος\n";
constexpr auto bmp_utf8 = "BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος\n";

// Usage: pass one argument selecting the test.
// '1'-'6' use UTF-16 (wide) output, 'a'-'f' use UTF-8 (narrow) output.
int main(int argc, char** argv)
{
    if (argc == 2)
    {
        switch (argv[1][0])
        {
        case '1': // Single WriteConsoleW
        {
            DWORD nWritten;
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            WriteConsoleW(hOut, bmp_utf16, 46, &nWritten, NULL);
            WriteConsoleW(hOut, L"Single WriteConsoleW call: ", 27, &nWritten, NULL);
            WriteConsoleW(hOut, str_utf16, 4, &nWritten, NULL);
            WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);
            SetConsoleMode(hOut, mode);
            break;
        }
        case '2': // Multiple WriteConsoleW
        {
            DWORD nWritten;
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            WriteConsoleW(hOut, bmp_utf16, 46, &nWritten, NULL);
            // 1st split: the first surrogate pair is cut between two calls
            WriteConsoleW(hOut, L"Multiple WriteConsoleW calls (1st split): ", 42, &nWritten, NULL);
            WriteConsoleW(hOut, str_utf16, 1, &nWritten, NULL);
            WriteConsoleW(hOut, str_utf16 + 1, 3, &nWritten, NULL);
            WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);
            // 2nd split: the second surrogate pair is cut between two calls
            WriteConsoleW(hOut, L"Multiple WriteConsoleW calls (2nd split): ", 42, &nWritten, NULL);
            WriteConsoleW(hOut, str_utf16, 3, &nWritten, NULL);
            WriteConsoleW(hOut, str_utf16 + 3, 1, &nWritten, NULL);
            WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);
            SetConsoleMode(hOut, mode);
            break;
        }
        case '3': // Single fwrite
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            fwrite(bmp_utf16, 2, 46, stdout);
            fwrite(L"Single fwrite call: ", 2, 20, stdout);
            fwrite(str_utf16, 2, 4, stdout);
            fwrite(L"\n", 2, 1, stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case '4': // Multiple fwrite
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            // The embedded L'\0' marks where each string is split
            wchar_t str1[] = { str_utf16[0], L'\0', str_utf16[1], str_utf16[2], str_utf16[3], L'\0' };
            wchar_t str2[] = { str_utf16[0], str_utf16[1], str_utf16[2], L'\0', str_utf16[3], L'\0' };
            fwrite(bmp_utf16, 2, 46, stdout);
            fwrite(L"Multiple fwrite calls (1st split): ", 2, 35, stdout);
            fwrite(str1, 2, 1, stdout);
            fwrite(str1 + 2, 2, 3, stdout);
            fwrite(L"\n", 2, 1, stdout);
            fwrite(L"Multiple fwrite calls (2nd split): ", 2, 35, stdout);
            fwrite(str2, 2, 3, stdout);
            fwrite(str2 + 4, 2, 1, stdout);
            fwrite(L"\n", 2, 1, stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case '5': // Single fputws
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            fputws(bmp_utf16, stdout);
            fputws(L"Single fputws call: ", stdout);
            fputws(str_utf16, stdout);
            fputws(L"\n", stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case '6': // Multiple fputws
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            fflush(stdout);
            std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
            std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            wchar_t str1[] = { str_utf16[0], L'\0', str_utf16[1], str_utf16[2], str_utf16[3], L'\0' };
            wchar_t str2[] = { str_utf16[0], str_utf16[1], str_utf16[2], L'\0', str_utf16[3], L'\0' };
            fputws(bmp_utf16, stdout);
            fputws(L"Multiple fputws calls (1st split): ", stdout);
            fputws(str1, stdout);
            fputws(str1 + 2, stdout);
            fputws(L"\n", stdout);
            fputws(L"Multiple fputws calls (2nd split): ", stdout);
            fputws(str2, stdout);
            fputws(str2 + 4, stdout);
            fputws(L"\n", stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'a': // Single WriteConsoleA
        {
            DWORD nWritten;
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            WriteConsoleA(hOut, bmp_utf8, strlen(bmp_utf8), &nWritten, NULL);
            WriteConsoleA(hOut, "Single WriteConsoleA call: ", 27, &nWritten, NULL);
            WriteConsoleA(hOut, str_utf8, strlen(str_utf8), &nWritten, NULL);
            WriteConsoleA(hOut, "\n", 1, &nWritten, NULL); // narrow literal (was L"\n")
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'b': // Multiple WriteConsoleA
        {
            DWORD nWritten;
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            WriteConsoleA(hOut, bmp_utf8, 13, &nWritten, NULL);
            WriteConsoleA(hOut, bmp_utf8 + 13, strlen(bmp_utf8) - 13, &nWritten, NULL);
            // 1st split: the first 4-byte sequence is cut after its lead byte
            WriteConsoleA(hOut, "Multiple WriteConsoleA calls (1st split): ", 42, &nWritten, NULL);
            WriteConsoleA(hOut, str_utf8, 1, &nWritten, NULL);
            WriteConsoleA(hOut, str_utf8 + 1, 7, &nWritten, NULL);
            WriteConsoleA(hOut, "\n", 1, &nWritten, NULL);
            // 2nd split: the second 4-byte sequence is cut after its lead byte
            WriteConsoleA(hOut, "Multiple WriteConsoleA calls (2nd split): ", 42, &nWritten, NULL);
            WriteConsoleA(hOut, str_utf8, 5, &nWritten, NULL);
            WriteConsoleA(hOut, str_utf8 + 5, 3, &nWritten, NULL);
            WriteConsoleA(hOut, "\n", 1, &nWritten, NULL);
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'c': // Single fwrite
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            fwrite(bmp_utf8, 1, strlen(bmp_utf8), stdout);
            fwrite("Single fwrite call: ", 1, 20, stdout);
            fwrite(str_utf8, 1, 8, stdout);
            fwrite("\n", 1, 1, stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'd': // Multiple fwrite
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            // The embedded '\0' marks where each string is split
            char str1[] = { str_utf8[0], '\0', str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
            char str2[] = { str_utf8[0], str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], '\0', str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
            fwrite(bmp_utf8, 1, strlen(bmp_utf8), stdout);
            fwrite("Multiple fwrite calls (1st split): ", 1, 35, stdout);
            fwrite(str1, 1, 1, stdout);
            fwrite(str1 + 2, 1, 7, stdout);
            fwrite("\n", 1, 1, stdout);
            fwrite("Multiple fwrite calls (2nd split): ", 1, 35, stdout);
            fwrite(str2, 1, 5, stdout);
            fwrite(str2 + 6, 1, 3, stdout);
            fwrite("\n", 1, 1, stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'e': // Single fputs
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            fputs(bmp_utf8, stdout);
            fputs("Single fputs call: ", stdout);
            fputs(str_utf8, stdout);
            fputs("\n", stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        case 'f': // Multiple fputs
        {
            DWORD mode;
            HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
            SetConsoleOutputCP(CP_UTF8);
            GetConsoleMode(hOut, &mode);
            SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
            char str1[] = { str_utf8[0], '\0', str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
            char str2[] = { str_utf8[0], str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], '\0', str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
            fputs(bmp_utf8, stdout);
            fputs("Multiple fputs calls (1st split): ", stdout);
            fputs(str1, stdout);
            fputs(str1 + 2, stdout);
            fputs("\n", stdout);
            fputs("Multiple fputs calls (2nd split): ", stdout);
            fputs(str2, stdout);
            fputs(str2 + 6, stdout);
            fputs("\n", stdout);
            SetConsoleMode(hOut, mode);
            break;
        }
        default:
            break;
        }
    }
}
-
Redirecting to a file results in correct output in all scenarios. So it is not the CRT functions that corrupt the output, but specifically the rendering / screen text buffer part shared between Conhost and Terminal, as the behavior is exactly analogous. Copying from the Terminal / Conhost window results in � REPLACEMENT CHARACTERs:
The fact that it handles UTF-8 correctly suggests to me that this issue is specific to the wide interface. It blocks outputting text without detecting surrogate pairs in it, for surrogate pairs split between
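Until the wide path handles this itself, an application could avoid ever handing over half a pair. Below is a rough, portable sketch of that idea: a small buffer that holds back a trailing high surrogate until the next chunk arrives. `SurrogateJoiner` is a hypothetical name of mine, not an existing API, and it does nothing about other malformed input such as a stray low surrogate.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: never releases a chunk that ends in the first half
// of a surrogate pair.
class SurrogateJoiner
{
public:
    // Returns the code units that are safe to write now. A trailing high
    // surrogate is kept back and prepended to the next chunk.
    std::u16string push(const std::u16string& chunk)
    {
        std::u16string ready;
        if (pending_ != 0)
        {
            ready += pending_;
            pending_ = 0;
        }
        ready += chunk;
        if (!ready.empty() && is_high(ready.back()))
        {
            pending_ = ready.back();
            ready.pop_back();
        }
        // Invariant: what we hand out never ends in a dangling high surrogate.
        assert(ready.empty() || !is_high(ready.back()));
        return ready;
    }

    // True while a high surrogate is waiting for its partner.
    bool has_pending() const { return pending_ != 0; }

private:
    static bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }

    char16_t pending_ = 0; // 0 means "nothing held back"
};
```

On Windows, where `wchar_t` is 16 bits wide, the string returned by `push` is what one would pass on to `WriteConsoleW`; that last step is left out so the sketch stays portable.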
-
I too was puzzled at what's going on 🤔. Coming from Linux, I've spent a day deep in the docs to figure out the available options to print international text (surrogate pairs included).
(many similar examples on other sites) While his test works for single That realization took me an extreme amount of time, as the MS documentation makes really sure you know that everything Windows is UTF-16. Well, this and the fact that
I'd say the Windows Terminal team didn't bother to fix what was already broken, and moved on to UTF-8. I fully agree with this decision. I've even found a lone docs page suggesting the migration😳. The issue here is that the MS documentation is unusually vague on this particular detail (despite the long, troubled history of Windows character handling, I think the docs put much effort into explaining its quirks, though maybe in a slightly too scattered way). In my case, my dependencies rely heavily on
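For reference, the UTF-8 route boils down to something like the sketch below. `use_utf8_console`, `write_utf8` and `kSampleUtf8` are names I made up; `SetConsoleOutputCP(CP_UTF8)` is the same call used in the test program earlier in this thread. The sample text is spelled as explicit bytes so it does not depend on the compiler's execution character set, and the non-Windows branch simply assumes the terminal already expects UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// U+2002C U+1F495 as explicit UTF-8 bytes (4 bytes each).
const std::string kSampleUtf8 = "\xF0\xA0\x80\xAC\xF0\x9F\x92\x95";

// Switch the console output code page to UTF-8 (Windows only).
// Returns false if the call fails.
bool use_utf8_console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true; // assumption: other platforms already expect UTF-8
#endif
}

// Write UTF-8 bytes to the stream unchanged; returns the number of bytes written.
std::size_t write_utf8(const std::string& bytes, std::FILE* stream = stdout)
{
    assert(stream != nullptr);
    const std::size_t written = std::fwrite(bytes.data(), 1, bytes.size(), stream);
    std::fflush(stream);
    return written;
}
```

A caller would run `use_utf8_console()` once at startup and then send everything through `write_utf8`. Restoring the previous code page on exit is left out here.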
-
Windows Terminal version (or Windows build number)
Terminal: 1.7.1033.0, Windows: 10.0.19041.928
Other Software
cmd.exe
Steps to reproduce
WriteConsoleW
fputws, fwrite, wprintf)
SetConsoleOutputCP(CP_UTF8)
WriteConsoleA or any CRT facility
Expected Behavior
2/3. Supplementary Plane characters are correctly displayed in Windows Terminal regardless of printing function used. The behavior of "normal" conhost is consistent between functions.
5. The behavior is the same as UTF-16 / wchar_t one.
Actual Behavior
UTF-16 / Wide output
Windows Terminal
Supplementary Plane characters are displayed correctly by Windows Terminal only if both elements of the surrogate pair are printed by a single WriteConsoleW call. Emitting a surrogate pair by two consecutive WriteConsoleW calls results in REPLACEMENT CHARACTER (U+FFFD) being displayed. CRT functions result in U+FFFD being displayed by Windows Terminal in any scenario I've tested.
When copying, U+FFFD characters get copied.
Redirecting the output to a text file, however, produces correct/uncorrupted UTF-16 (no BOM) text for all output functions used. Subsequently printing it to Windows Terminal by pwsh -c "get-content -encoding Unicode output.txt" displays the text correctly.
Saving it with the BOM in any capable text editor and subsequently printing to the console via type works as well.
"Normal" conhost
In "normal" conhost, printing Supplementary Plane characters via a single WriteConsoleW call results in "wide" � being displayed. However, copying from the cmd.exe window produces uncorrupted characters.
The rest of the behavior is analogous.
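For readers less familiar with surrogate pairs, the arithmetic behind "both elements of the surrogate pair" is sketched below. The function names are hypothetical and the code is plain portable C++, independent of any console API.

```cpp
#include <cassert>
#include <utility>

// Split a supplementary-plane code point into its UTF-16 surrogate pair.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF); // BMP code points need no pair
    const char32_t v = cp - 0x10000;         // 20-bit value
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));   // top 10 bits
    const char16_t low = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return { high, low };
}

// Combine a high and a low surrogate back into one code point.
char32_t from_surrogate_pair(char16_t high, char16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
        + ((static_cast<char32_t>(high) - 0xD800) << 10)
        + (static_cast<char32_t>(low) - 0xDC00);
}
```

For the two test characters, U+2002C maps to D840 DC2C and U+1F495 maps to D83D DC95. Neither half is a character on its own, which is why a writer that receives only one of them has nothing valid to display.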
UTF-8 char output
Windows Terminal
Supplementary Plane characters are displayed correctly regardless of function employed.
Redirecting produces UTF-8 encoded files.
"Normal" conhost
The behavior is analogous.
Clarify the status of Unicode beyond UCS-2
What method of outputting Unicode text should be used by newly written software? Why do wide CRT functions (and writing one element of a surrogate pair at a time via WriteConsoleW) result in incorrect Windows Terminal behavior?
Is it expected to be fixed?
What is the internal encoding used by Windows Terminal and "normal" conhost? Which output method would allow an application to avoid conversions being performed by conhost?