Better Unicode handling plan
Siddharth Agarwal edited this page May 20, 2016 · 3 revisions
Watchman is currently entirely oblivious to Unicode: on POSIX platforms, it treats all filenames as raw bytes internally, and on Windows it converts to and from UTF-8.
This has served us well so far, but has a number of issues:
- Keys don't follow this rule: they're always UTF-8 (and in many cases ASCII-only).
- Warnings are logically text and are typically ASCII, but we can include filenames in them, and we don't know what encoding those filenames are in. This is generally referred to as the makefile problem and has no general, portable solution.
- Python 2 is a big consumer of Watchman, and while it is efficient at representing ASCII text, it has an inefficient Unicode type. It's likely that any attempt to use Unicode strings will cause a noticeable performance regression.
- Python 3 forces consumers to treat text and bytes as completely separate entities. That causes problems for things like warnings, which are partly text and partly bytes.
- Reasonable programs in Python 3 will want to either:
  - Receive filenames in results as Unicode strings with `surrogateescape`, similar to what `os.listdir('directory')` would return.
  - Receive filenames in results as raw bytes, similar to what `os.listdir(b'directory')` would return.
- Reasonable programs in Python 3 will want to either:
  - Receive warnings as a valid Unicode string.
  - Receive warnings as bytestrings, which might or might not make sense in a given encoding.
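To make the two filename options concrete, here is a minimal Python 3 sketch of how `surrogateescape` relates the text and bytes views of the same name. The filename `raw_name` is a made-up example, and UTF-8 is assumed as the encoding purely for illustration; real systems use whatever the filesystem encoding happens to be.

```python
# Hypothetical filename containing a byte (0xff) that can never occur in valid UTF-8.
raw_name = b"caf\xff.txt"

# With surrogateescape, each undecodable byte becomes a lone surrogate code point
# (0xff -> U+DCFF), so decoding never fails.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")

# A strict encode rejects the lone surrogate, so such a string is not
# "valid Unicode" in the sense that a warning message would need to be.
try:
    text_name.encode("utf-8")
    strict_encodable = True
except UnicodeEncodeError:
    strict_encodable = False
```

The round trip is lossless, which is why this form suits filenames, but the resulting string cannot be strictly encoded, which is why it is a poor fit for warnings that are meant to be valid text.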
The BSER layer in the clients doesn't currently have enough information to figure out that filenames and warnings need to be decoded in different ways. Only the Watchman server has enough context.
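The sketch below illustrates why per-field context matters. It is not the pywatchman API or the design this plan proposes: `decode_response`, the response layout, and the choice of error handlers are all assumptions made for illustration. It decodes an already-parsed response in which every string is still bytes, applying a different policy to keys, filenames, and warnings.

```python
def decode_response(raw):
    """Decode a hypothetical parsed response whose keys and strings are bytes.

    The field names and policies here are illustrative assumptions.
    """
    out = {}
    for key, value in raw.items():
        name = key.decode("utf-8")  # keys are always UTF-8
        if name == "files":
            # Filenames: lossless text form, may contain lone surrogates.
            out[name] = [f.decode("utf-8", "surrogateescape") for f in value]
        elif name == "warning":
            # Warnings: must be valid Unicode, so undecodable bytes are replaced.
            out[name] = value.decode("utf-8", "replace")
        else:
            out[name] = value
    return out


raw = {
    b"files": [b"ok.txt", b"bad\xff.txt"],
    b"warning": b"cannot stat bad\xff.txt",
}
decoded = decode_response(raw)
```

A generic decoder that sees only byte strings cannot make this distinction on its own; the sketch hard-codes knowledge of which field holds what, which is exactly the context that only the server reliably has.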