You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
avoid encoding errors with unicode content piped through stdio on Windows
Consider this trivial file (with a trailing LF):
print('This is a unicode character: ≠'.encode("UTF-8"))
This command worked in cmd.exe or an MSYS terminal, and printed ≠ correctly:
$ cat test.py | pyupgrade.exe --py38-plus -
This crashed with an encoding error:
$ cat test.py | pyupgrade.exe --py38-plus - > reformated.py
Traceback (most recent call last):
File "C:\hgdev\python39-x64\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\hgdev\python39-x64\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "c:\Users\Matt\.local\bin\pyupgrade.exe\__main__.py", line 7, in <module>
File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 389, in main
ret |= _fix_file(filename, args)
File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 330, in _fix_file
print(contents_text, end='')
File "C:\hgdev\python39-x64\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2260' in position 36: character maps to <undefined>
Since bytes are read from `stdin.buffer` and decoded as UTF-8 when the input
file is '-', it makes sense to write UTF-8 bytes to `stdout.buffer`, and avoid
using the default codepage. The use case here is wiring this up to the `hg fix`
extension, which writes content to the tool's stdin and reads it back from its
stdout to reformat files. That shouldn't change the encoding.
A workaround using the existing code is to set `PYTHONUTF8=1` in the environment,
but that's not obvious or always easily done. This change also has the nice side
effect of no longer changing LF input to CRLF output. (You'd think that
`print(..., end='')` would avoid printing the EOL, but that's apparently baked
into the `TextIO` object that is `sys.stdout`, and not something the print
function can override.)
0 commit comments