Results are inconsistent

Hi!

Friendly ping from [@exodus/bytes](https://github.com/ExodusOSS/bytes#exodusbytes) which also provides an impl (which is fast, consistent, spec-compliant, and tested to match on all platforms).

When cross-testing, I found a number of issues. To avoid spam, I'm filing them as a single list rather than individually (it can be split into subtasks later).


1. UTF-8:
    1. BOM handling is inconsistent.
       On some platforms, `textDecode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x42))` is `'B'`; on others, it's `'\uFEFFB'`.
    2. `textDecode` fallback fails to polyfill replacement (or fatal mode) and produces garbage output:
        ```js
        import { textDecode } from '@borewit/text-codec'
        console.log(escape(textDecode(Uint8Array.of(0, 254, 255))))
        console.log(escape(textDecode(Uint8Array.of(0x80))))
        console.log(escape(textDecode(Uint8Array.of(0xf0, 0x90, 0x80))))
        console.log(escape(textDecode(Uint8Array.of(0xf0, 0x80, 0x80))))
        ```
    
        Should be:
        ```
        %00%uFFFD%uFFFD
        %uFFFD
        %uFFFD
        %uFFFD%uFFFD%uFFFD
        ```
    
        But the polyfill produces:
        ```
        %00%uDABC%uDC00
        %00
        %uD800%uDC00
        %uDBC0%uDC00
        ```
  
        This behavior is platform-dependent, so the same input gives different results on different platforms.
    3. Same for `textEncode`.

        `textEncode('\ud800')` is `Uint8Array(3) [ 239, 191, 189 ]` in native but `Uint8Array(3) [ 237, 160, 128 ]` in fallback.
2. Same for UTF-16: wrong output on non-well-formed input.

    Moreover, the utf-16le decoder can return non-well-formed strings, which can have a security impact in some setups due to how hashing and signatures behave on those.
3. The fallback is slow overall.
     1. On Node.js, this is ~20x slower on utf-16le, ~50x slower on iso-8859-1, and 15x-190x slower on windows-1252.
     2. This lib is documented to be a polyfill for React Native, but a performant polyfill for RN is ~5x-10x faster.
4. Documentation is off. `windows-1252` is mentioned as implemented in Node.js, as is UTF-16 in Node.js small-icu. In fact, in most Node.js versions released in 2025, `windows-1252` was broken and returned wrong output, as was UTF-16 in without-intl Node.js. The docs also read as if this lib adds something on top of those, but in fact it doesn't use native impls for anything except utf-8.
5. `token-types` internal documentation is off.
    https://github.com/Borewit/token-types/blob/1692c1cca988da14e28fc19fde24653e36f3db90/lib/index.ts#L434 says:
    > Supports all encodings supported by TextDecoder, plus 'windows-1252'.
   
    But it doesn't: only a few encodings are supported, only utf-8 goes through TextDecoder, and there is no transparent fallback.
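
For reference, the expected UTF-8 results from item 1 can be reproduced with the native `TextDecoder`/`TextEncoder` on a runtime where those are spec-compliant (e.g. a recent Node.js). This is only a minimal sketch for comparison, not code from either library; `hex` is a hypothetical helper introduced here.

```js
// Reference behavior per the WHATWG Encoding Standard, using native APIs.
// Assumption: the runtime ships a spec-compliant TextDecoder/TextEncoder.
const dec = new TextDecoder() // utf-8, non-fatal, BOM stripped by default
const enc = new TextEncoder()

// Hypothetical helper: print each UTF-16 code unit as 4 hex digits.
const hex = (s) =>
  Array.from({ length: s.length }, (_, i) =>
    s.charCodeAt(i).toString(16).padStart(4, '0')
  ).join(' ')

// 1.1: a leading BOM is dropped unless `ignoreBOM: true` is passed.
const bom = dec.decode(Uint8Array.of(0xef, 0xbb, 0xbf, 0x42))

// 1.2: invalid sequences are replaced with U+FFFD in non-fatal mode.
const invalid = [
  Uint8Array.of(0, 254, 255),
  Uint8Array.of(0x80),
  Uint8Array.of(0xf0, 0x90, 0x80),
  Uint8Array.of(0xf0, 0x80, 0x80),
].map((bytes) => hex(dec.decode(bytes)))

// 1.2 (fatal mode): the same kind of input must throw instead.
let fatalThrew = false
try {
  new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.of(0x80))
} catch {
  fatalThrew = true
}

// 1.3: a lone surrogate is encoded as U+FFFD (EF BF BD).
const loneSurrogate = Array.from(enc.encode('\ud800'))

console.log({ bom, invalid, fatalThrew, loneSurrogate })
```

A fallback that matches these values on every input would behave the same regardless of whether the native implementation is present.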


Alternatively, just reuse `@exodus/bytes` or copy the impl from it 🙃 (I can help point to the imports that you'll likely want)
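
To illustrate the well-formedness point from item 2: a rough sketch of a utf-16le decoder that never returns unpaired surrogates. `decodeUtf16leLossy` is a hypothetical name, this is not taken from `@exodus/bytes` or `@borewit/text-codec`, it skips BOM handling, and it is not tuned for speed. It aims to follow the replacement rules of the Encoding Standard but I'd treat it as a starting point for tests rather than a drop-in.

```js
// Hypothetical sketch: decode UTF-16LE bytes and replace every unpaired
// surrogate with U+FFFD, so the returned string is always well-formed.
// BOM handling is intentionally omitted.
function decodeUtf16leLossy(bytes) {
  const isHigh = (u) => u >= 0xd800 && u <= 0xdbff
  const isLow = (u) => u >= 0xdc00 && u <= 0xdfff

  // Combine byte pairs into 16-bit code units (little-endian).
  const units = []
  const even = bytes.length - (bytes.length % 2)
  for (let i = 0; i < even; i += 2) units.push(bytes[i] | (bytes[i + 1] << 8))

  let out = ''
  for (let i = 0; i < units.length; i++) {
    const u = units[i]
    const next = units[i + 1] // undefined at the end, fails both range checks
    if (isHigh(u) && isLow(next)) {
      out += String.fromCharCode(u, next) // valid surrogate pair
      i++
    } else if (isHigh(u) || isLow(u)) {
      out += '\ufffd' // unpaired surrogate
    } else {
      out += String.fromCharCode(u)
    }
  }

  // A dangling odd byte yields one U+FFFD, unless a trailing high
  // surrogate was still pending and already produced it.
  const danglingByte = bytes.length % 2 !== 0
  if (danglingByte && !isHigh(units[units.length - 1])) out += '\ufffd'
  return out
}

// Lone high surrogate followed by 'A', and a valid pair (U+1F600).
const lone = decodeUtf16leLossy(Uint8Array.of(0x00, 0xd8, 0x41, 0x00))
const pair = decodeUtf16leLossy(Uint8Array.of(0x3d, 0xd8, 0x00, 0xde))
console.log(lone === '\ufffdA', pair === '\u{1F600}')
```

With output like this, hashing or signing the decoded string no longer depends on how a given platform serializes lone surrogates.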

Results are inconsistent #30

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Results are inconsistent #30

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions