
Conversation

SinusPi commented May 8, 2025

The cp437-to-unicode translation string can become mangled when the library source is read as plain ASCII (in specific cases of bundling with webpack, for example). Writing the cp437 string with Unicode escapes avoids this.
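As a rough illustration (not yauzl's actual code; the real table is 256 characters long and these variable names are made up), the escaped form stays pure ASCII in the source file while producing the same string at runtime:

// literal form: stored as multi-byte UTF-8 in the source file, so it can be mangled
var cp437Literal = "\u0000☺☻♥♦";
// escaped form: plain ASCII bytes in the source, the same string once parsed
var cp437Escaped = "\u0000\u263A\u263B\u2665\u2666";
console.log(cp437Literal === cp437Escaped); // true when the source is decoded correctly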

SinusPi added 2 commits May 8, 2025 01:55
cp437 in unicode escapes instead of literals
thejoshwolfe (Owner) commented

Thanks for contributing! Can you explain the situation where webpack interprets the source as plain ASCII? I thought UTF-8 would be pretty well supported these days.

SinusPi (Author) commented May 8, 2025 via email

thejoshwolfe (Owner) commented

Sounds like a really tricky issue. I'm not comfortable removing the UTF-8 assumption from this package's source code to work around an obscure issue in webpack, although I'm considering removing CP437 entirely in a future version. (I'm currently in the midst of a months-long research project to determine how correct or safe that would be.)

I want to believe that the real problem is somewhere in your setup. Are you sure you're not getting the sometimes-default ISO-8859-1 charset (which is really cp1252) that browsers and/or HTML sometimes use? Is there an opportunity to configure the charset explicitly to UTF-8?

SinusPi (Author) commented May 8, 2025

The rest of your code is free of string literals and doesn't depend on UTF-8 being read correctly, so I'm a little surprised by your reluctance to switch one encoding string to a file-encoding-agnostic version that is bit-for-bit identical to your version when the source is read as UTF-8. And even if the "real problem" were indeed in my setup, with some files using an archaic ISO encoding, why would it be acceptable for an unzipping library to be affected by that, especially when its core tenet is to work "no matter what", bad zipfiles or not? It's only likely to send more unsuspecting coders down rabbit holes, trying to figure out why their ASCII-based app (for whatever reason that choice got made) is suddenly not just displaying accented characters in a weird way, but failing to unpack zipfiles (getting filenames like "–=+36â9-()7â-);)6¨0%77-'††¨–â>=+36¼1)8%").

I'll try to whip up a simplified case, to see what exactly causes the encoding mayhem.

SinusPi (Author) commented May 8, 2025

OK, we have a problem. I just tested again and Webpack bundling converts the nice escaped string into literals anyway (the line result += cp437[buffer[i]] gets minified to i+="\0â�şâ�»â........."[e[o]]), nullifying my change entirely. I didn't think to retest it, as the patch resulted in 1:1 the same string, but clearly Webpack knows better. The resulting bundle has no UTF8 BOM, and loading it via <script src> into an HTML file that lacks <meta charset="utf8"> causes it to be read as plain ASCII.

This PR is void. Fix coming up.

@SinusPi SinusPi closed this May 8, 2025
@SinusPi SinusPi reopened this May 8, 2025
SinusPi (Author) commented May 8, 2025

Sorry about the close/reopen; I assumed I'd have to make a new PR, and yet another commit to my master branch got included here. So here we are: a webpack-proof, encoding-proof, charcode-based CP437 array.
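A minimal sketch of that approach, with illustrative names and only the first eight CP437 positions (the real table covers all 256):

// the code points are written as plain ASCII digits, so no charset can mangle them
var cp437Codes = [0x0000, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022];
var cp437 = cp437Codes.map(function (code) { return String.fromCharCode(code); });
console.log(cp437[1]); // "☺" no matter how the bundle's bytes were decoded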

thejoshwolfe (Owner) commented May 8, 2025

The resulting bundle has no UTF8 BOM, and loading it via <script src> into an HTML file that lacks <meta charset="utf8"> causes it to be read as plain ASCII.

Is it possible to add <meta charset="utf8">? That seems to be the generally recommended best practice, although there's something I don't quite understand about HTML5 declaring UTF-8 to be the default charset, so I'm not super confident about my analysis here.

There's something that's not quite right about your proposed explanation of what's going on. If the source file was interpreted as "plain ASCII", then the non-ascii characters in the string literal would result in encoding errors, not corrupt strings. There'd be no way to get â from your example from just plain ASCII. I suspect that the bytes are actually being interpreted as ISO-8859-1, which is really cp1252 in this case, which means that all the indexes would be shifted for each multi-byte character that comes before it.

This problem would effectively result in mapping from the first row of characters to the second:

' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'

and the second row roughly looks like the alphabet that your example's corrupted string contains. This isn't a perfect hypothesis, because there are some broken characters in this mapping that I had to ignore, but perhaps some browser is also doing something like that. Here's the code I used to come up with this:

>>> "".join(chr(x) for x in range(32, 127))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
>>> import re
>>> re.sub(rb'[\x81\x8d\x8f\x90\x9d]', b'?', '\u0000☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ '.encode("utf8")[32:127]).decode("cp1252")
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'

If this is what's happening, then the problem is that a UTF-8 source file is being interpreted as ISO-8859-1, or rather cp1252. I don't think it's yauzl's responsibility to put in workarounds for the source code being interpreted in the wrong charset.

SinusPi (Author) commented May 8, 2025

Admittedly, my bad: I was using "ASCII" as a vague synonym for "some sort of single-byte encoding". It was likely ISO-8859-1 or CP1252, as you say. It could just as well have been plain ASCII, though; the result would have been similar: decodeBuffer picking characters out of a 510-byte string full of unrecognized UTF-8 sequences instead of producing the correct CP437-based Unicode characters. I'd think resilience against packers, bundlers, or weird parsers' choice of encoding would be good to have, seeing as the problem can happen with a popular bundler and without much effort. But if not, I'll just keep my version for my own use, happy with the open source. Cheers! :)
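A rough Node.js illustration of that failure mode, with made-up names and only a few table entries, using a Buffer round-trip to stand in for the browser reading the UTF-8 bundle with a single-byte charset:

// the correct table: one JS character per CP437 byte value (only five shown here)
var table = "\u0000\u263A\u263B\u2665\u2666";
// simulate serving the UTF-8 source but decoding it as a single-byte charset
var misread = Buffer.from(table, "utf8").toString("latin1");
console.log(table.length);   // 5
console.log(misread.length); // 13: each non-ASCII character became three characters
console.log(misread[1]);     // "â" instead of "☺", and every later index is shifted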

SinusPi (Author) commented May 8, 2025

Intriguing, though: once I set up this simple "templateContent":"" webpack setting, turning the resulting HTML into nothing more than <script src='bundle1.js'></script><script src='bundle2.js'></script><script src='bundle3.js'></script>, and added lines logging the lengths of my array and of the original string when loading the library, I'm getting this:

cp437array length 256
cp437string length 508
cp437array length 256
cp437string length 1040
cp437array length 256
cp437string length 508

While it's no surprise that Webpack packs the library three times into the three different bundles I use (for my own reasons), two of them come out at 508, as expected when the UTF-8 string is read with a single-byte encoding, BUT one somehow manages to... I don't know, read it as single bytes, interpret those as UTF-8, then reinterpret the result as single bytes again? I'm lost, and just glad the array approach is absolutely immune to whatever madness Webpack subjects the poor source to.

@thejoshwolfe thejoshwolfe changed the title CP437 in Unicode escapes instead of literals Avoid non-ASCII source code to workaround encoding problems with webpack May 8, 2025