
Conversation

SinusPi commented May 8, 2025

The cp437-to-unicode translation string can become mangled when the library source is read as plain ASCII (in specific cases of bundling with webpack, for example). Writing the cp437 string with Unicode escapes avoids this.
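As a rough illustration (not yauzl's actual code; the real table is 256 characters long and these variable names are made up), the escaped form stays pure ASCII in the source file while producing the same string at runtime:

// literal form: stored as multi-byte UTF-8 in the source file, so it can be mangled
var cp437Literal = "\u0000☺☻♥♦";
// escaped form: plain ASCII bytes in the source, the same string once parsed
var cp437Escaped = "\u0000\u263A\u263B\u2665\u2666";
console.log(cp437Literal === cp437Escaped); // true when the source is decoded correctly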

SinusPi added 2 commits May 8, 2025 01:55
cp437 in unicode escapes instead of literals
thejoshwolfe (Owner) commented

Thanks for contributing! Can you explain the situation where webpack interprets the source as plain ASCII? I thought UTF-8 would be pretty well supported these days.

SinusPi (Author) commented May 8, 2025 via email

thejoshwolfe (Owner) commented

Sounds like a really tricky issue. I'm not comfortable removing the UTF-8 assumption from this package's source code to work around an obscure issue in webpack, although I'm considering removing CP437 entirely in a future version. (I'm currently in the midst of a months-long research project to determine how correct or safe that would be.)

I want to believe that the real problem is somewhere in your setup. Are you sure you're not getting the sometimes-default ISO-8859-1 charset (which is really cp1252) that browsers and/or HTML sometimes use? Is there an opportunity to configure the charset explicitly to UTF-8?

SinusPi (Author) commented May 8, 2025

The rest of your code is free of string literals and doesn't depend on UTF-8 being read correctly, so I'm a little surprised by your reluctance to switch one encoding string to a file-encoding-agnostic version that is bit-for-bit identical to your version when the source is read as UTF-8. And even if the "real problem" were indeed in my setup, with some files using an archaic ISO encoding, why would it be acceptable for an unzipping library to be affected by that, especially when its core tenet is to work "no matter what", bad zipfiles or not? It's only likely to send more unsuspecting coders down rabbit holes, trying to figure out why their ASCII-based app (for whatever reason that choice got made) is suddenly not just displaying accented characters in a weird way, but failing to unpack zipfiles (getting filenames like "–=+36â9-()7â-);)6¨0%77-'††¨–â>=+36¼1)8%").

I'll try to whip up a simplified case, to see what exactly causes the encoding mayhem.

SinusPi (Author) commented May 8, 2025

OK, we have a problem. I just tested again and Webpack bundling converts the nice escaped string into literals anyway (the line result += cp437[buffer[i]] gets minified to i+="\0â�şâ�»â........."[e[o]]), nullifying my change entirely. I didn't think to retest it, as the patch resulted in 1:1 the same string, but clearly Webpack knows better. The resulting bundle has no UTF8 BOM, and loading it via <script src> into an HTML file that lacks <meta charset="utf8"> causes it to be read as plain ASCII.

This PR is void. Fix coming up.

@SinusPi SinusPi closed this May 8, 2025
@SinusPi SinusPi reopened this May 8, 2025
SinusPi (Author) commented May 8, 2025

Sorry about the close/reopen; I assumed I'd have to make a new PR, and yet another commit to my master branch got included here. So here we are: a webpack-proof, encoding-proof, charcode-based CP437 array.
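A minimal sketch of that approach, with illustrative names and only the first eight CP437 positions (the real table covers all 256):

// the code points are written as plain ASCII digits, so no charset can mangle them
var cp437Codes = [0x0000, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022];
var cp437 = cp437Codes.map(function (code) { return String.fromCharCode(code); });
console.log(cp437[1]); // "☺" no matter how the bundle's bytes were decoded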

thejoshwolfe (Owner) commented May 8, 2025

The resulting bundle has no UTF8 BOM, and loading it via <script src> into an HTML file that lacks <meta charset="utf8"> causes it to be read as plain ASCII.

Is it possible to add <meta charset="utf8">? That seems to be the generally recommended best practice, although there's something I don't quite understand about HTML5 declaring UTF-8 to be the default charset, so I'm not super confident about my analysis here.

There's something that's not quite right about your proposed explanation of what's going on. If the source file was interpreted as "plain ASCII", then the non-ascii characters in the string literal would result in encoding errors, not corrupt strings. There'd be no way to get â from your example from just plain ASCII. I suspect that the bytes are actually being interpreted as ISO-8859-1, which is really cp1252 in this case, which means that all the indexes would be shifted for each multi-byte character that comes before it.

This problem would effectively result in mapping from the first row of characters to the second:

' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'

and the second row roughly looks like the alphabet that your example's corrupted string contains. This isn't a perfect hypothesis, because there are some broken characters in this mapping that I had to ignore, but perhaps some browser is also doing something like that. Here's the code I used to come up with this:

>>> "".join(chr(x) for x in range(32, 127))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
>>> import re
>>> re.sub(rb'[\x81\x8d\x8f\x90\x9d]', b'?', '\u0000☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ '.encode("utf8")[32:127]).decode("cp1252")
'™‚♀♪♫☼►◄↕‼¶§▬↨↑↓→â†?∟↔▲▼ !"#$%&\'()*+,-./0123456789:;<=>?@AB'

If this is what's happening, then the problem is that a UTF-8 source file is being interpreted as ISO-8859-1, or rather cp1252. I don't think it's yauzl's responsibility to put in workarounds for the source code being interpreted in the wrong charset.

SinusPi (Author) commented May 8, 2025

Admittedly, my bad: I was using "ASCII" as a vague synonym for "some sort of single-byte encoding". It was likely ISO-8859-1 or CP1252, as you say. It could just as well have been plain ASCII, though; the result would have been similar: decodeBuffer picking characters out of a 510-byte string full of unrecognized UTF-8 sequences instead of producing the correct CP437-based Unicode characters. I'd think resilience against packers, bundlers, or weird parsers' choice of encoding would be good to have, seeing as the problem can happen with a popular bundler and without much effort. But if not, I'll just keep my version for my own use, happy with the open source. Cheers! :)
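A rough Node.js illustration of that failure mode, with made-up names and only a few table entries, using a Buffer round-trip to stand in for the browser reading the UTF-8 bundle with a single-byte charset:

// the correct table: one JS character per CP437 byte value (only five shown here)
var table = "\u0000\u263A\u263B\u2665\u2666";
// simulate serving the UTF-8 source but decoding it as a single-byte charset
var misread = Buffer.from(table, "utf8").toString("latin1");
console.log(table.length);   // 5
console.log(misread.length); // 13: each non-ASCII character became three characters
console.log(misread[1]);     // "â" instead of "☺", and every later index is shifted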

SinusPi (Author) commented May 8, 2025

Intriguing, though: once I set up this simple "templateContent":"" webpack setting, turning the resulting HTML into nothing more than <script src='bundle1.js'></script><script src='bundle2.js'></script><script src='bundle3.js'></script>, and added lines logging the lengths of my array and of the original string when loading the library, I'm getting this:

cp437array length 256
cp437string length 508
cp437array length 256
cp437string length 1040
cp437array length 256
cp437string length 508

While it's no surprise that Webpack packs the library three times into the three different bundles I use (for my own reasons), two of them come out at 508, as expected when the UTF-8 string is read with a single-byte encoding, BUT one somehow manages to... I don't know, read it as single bytes, interpret those as UTF-8, then reinterpret the result as single bytes again? I'm lost, and just glad the array approach is absolutely immune to whatever madness Webpack subjects the poor source to.

@thejoshwolfe thejoshwolfe changed the title CP437 in Unicode escapes instead of literals Avoid non-ASCII source code to workaround encoding problems with webpack May 8, 2025