Avoid non-ASCII source code to work around encoding problems with webpack #162
Conversation
The cp437-to-unicode translation string can become mangled when the library source is read as plain ASCII (in specific cases of bundling with webpack, for example). Writing the cp437 string with Unicode escapes avoids this.
cp437 in unicode escapes instead of literals
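For illustration (a minimal sketch of the idea, not the PR's literal diff; only the first few of the 256 CP437 entries are shown):

```js
// Before: raw non-ASCII literals, correct only while the source file is
// decoded as UTF-8.
var cp437Literal = '\u0000☺☻♥♦♣♠';

// After: the same string value spelled with ASCII-only escapes, so the
// charset the source file is read with no longer matters.
var cp437Escaped = '\u0000\u263A\u263B\u2665\u2666\u2663\u2660';

console.log(cp437Literal === cp437Escaped); // true
```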
Thanks for contributing! Can you explain the situation where webpack interprets the source as plain ASCII? I thought UTF-8 would be pretty well supported these days.
I don't have a minimal repro case, only my rather large Electron app. Normally, Webpack's html-webpack-plugin generates the starting HTML files properly, injecting the JS bundles into them or just linking them via <script>; I'm not sure what exactly happened here. But the moment I added a templateContent setting to my webpack configs, apparently forcing the template engine to start from a provided string instead of a blank buffer, it caused zip filename corruption, which I traced to the cp437 string being interpreted "as is".
Sounds like a really tricky issue. I'm not comfortable removing the assumption of UTF-8 from this package's source code to work around an obscure issue in webpack, although I'm considering removing CP437 entirely in a future version. (I'm currently in the midst of a months-long research project to determine how correct or safe that would be.) I want to believe that the real problem is somewhere in your setup. Are you sure you're not getting the sometimes-default
The rest of your code is free from string literals and doesn't depend on UTF-8 being decoded correctly, so I'm a little surprised by your reluctance to switch an encoding string to a file-encoding-agnostic version, bit-for-bit equal to your version running under UTF-8. And even if the "real problem" were indeed in my setup, and some files were using an archaic ISO encoding, why would it be acceptable for an unzipping library to be affected by it, especially when its core tenet is to work "no matter what", bad zipfiles or not? It's only likely to send more unsuspecting coders down rabbit holes, trying to figure out why their ASCII-based app (for whatever reason that choice got made) is suddenly not just displaying accented characters in a weird way, but failing to unpack zipfiles (getting filenames like "–=+36â9-()7â-);)6¨0%77-'††¨–â>=+36¼1)8%"). I'll try to whip up a simplified case to see what exactly causes the encoding mayhem.
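For illustration (a minimal Node sketch of this class of corruption, with latin1 standing in for whatever single-byte charset was actually used; not the exact Webpack pipeline):

```js
// Decoding UTF-8 source bytes as a single-byte encoding splits every
// non-ASCII character of the cp437 table into two or three wrong ones,
// which ruins the table's index-to-character mapping.
const excerpt = '☺☻♥♦'; // a few entries of the 256-character table
const mangled = Buffer.from(excerpt, 'utf8').toString('latin1');
console.log(mangled.length); // 12, not 4: every lookup past this is garbage
```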
OK, we have a problem. I just tested again, and Webpack bundling converts the nice escaped string back into literals anyway. This PR is void. Fix coming up.
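For illustration (one line showing why the escaped spelling cannot be expected to survive bundling):

```js
// '\u263A' and '☺' denote the identical string value, so a bundler's
// minifier is free to emit either spelling in its output.
console.log('\u263A' === '☺'); // true
```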
Take that, Webpack.
Sorry about the close/reopen; I assumed I'd have to make a new PR, and yet another commit to my master branch got included here. So, here we are: a webpack-proof, encoding-proof, charcode-based CP437 array.
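For illustration (a sketch of the charcode-array approach described above, assuming its general shape; the codepoints shown are the first few CP437 entries, and the names are illustrative, not the PR's actual code):

```js
// Numeric codepoints keep the file pure ASCII, so no re-encoding of the
// source can alter the table.
const cp437Codes = [
  0x0000, 0x263a, 0x263b, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022,
  // ...248 more entries in the real 256-entry table...
];
const cp437 = String.fromCharCode.apply(null, cp437Codes);
console.log(cp437.charCodeAt(1).toString(16)); // "263a"
```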
Is it possible to add

There's something that's not quite right about your proposed explanation of what's going on. If the source file was interpreted as "plain ASCII", then the non-ASCII characters in the string literal would result in encoding errors, not corrupt strings; there'd be no way to get the corrupted characters you're seeing. This problem would effectively result in mapping from the first row of characters to the second:

```
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'
```

and the second row roughly looks like the alphabet that your example corrupted string contains. This isn't a perfect hypothesis, because there are some broken characters in this mapping that I had to ignore, but perhaps some browser is also doing something like that. Here's the code I used to come up with this:

```python
>>> import re
>>> "".join(chr(x) for x in range(32, 127))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
>>> re.sub(rb'[\x81\x8d\x8f\x90\x9d]', b'?', '\u0000☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ '.encode("utf8")[32:127]).decode("cp1252")
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'
```

If this is what's happening, then the problem is that a UTF-8 source file is being interpreted as ISO-8859-1, or rather cp1252. I don't think it's yauzl's responsibility to put in workarounds for interpreting the source code in the wrong charset.
Admittedly, my bad, I was using "ASCII" as a vague synonym for "some sort of single-byte encoding". It was likely ISO-8859-1 or CP1252, as you're saying. It could have been plain ASCII too, though; the result would have been similar: decodeBuffer picking characters out of a 510-byte string full of unrecognized UTF-8 sequences instead of producing the correct CP437-based Unicode characters. I'd think resilience against packers, bundlers, or weird parsers' choice of encoding would be good to have, seeing as this can happen with a popular bundler and without much effort. But if not, I'll just keep my version for my own use, happy with the open source. Cheers! :)
Intriguing, though, what I found once I set up this simple repro. While it's not a surprise that Webpack packs the library three times into the three different bundles I use (for my own reasons), two of them are, expectedly, 508 bytes when the UTF-8 string is read using a single-byte encoding, BUT one somehow manages to... I don't know, read it as single bytes, interpret that as UTF-8, then reinterpret it as single bytes again? I'm lost, and just glad the array approach is absolutely immune to whatever madness Webpack subjects the poor source to.
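For illustration (a sketch of the double re-encoding speculated above; this is only a guess at the mechanism, with latin1 standing in for the single-byte reading):

```js
// One mangling pass: a string's UTF-8 bytes read back as latin1.
const mangle = (s) => Buffer.from(s, 'utf8').toString('latin1');

const table = '☺☻♥♦'; // stand-in for the 256-character cp437 string
const once = mangle(table); // what two of the three bundles would show
const twice = mangle(once); // the speculated second pass
console.log(once.length, twice.length); // 12 24: it grows on every pass
```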