utf8 encoder/decoder is not strict

and might not be suitable for e.g. hashing

i.e.:
* there exist non-equal JS strings being encoded into equal Uint8Array buffers
* there exist non-equal Uint8Array buffers being decoded into equal JS strings

Demo:
```js
import { utf8, hex } from './index.ts'
import { sha256 } from '@noble/hashes/sha2.js'

// Naming as used below: utf8.decode is string -> Uint8Array, utf8.encode is Uint8Array -> string
const h = hex.encode

// Unpaired surrogates
{
  const s0 = 'what\ud800ever'
  const s1 = 'what\ud820ever'
  const u0 = utf8.decode(s0)
  const u1 = utf8.decode(s1)
  console.log(`1. Strings equal: ${s0 === s1}, u8 equal: ${h(u0) === h(u1)}`) // expect false or throw
  console.log(`   Bonus: hashes equal: ${h(sha256(u0)) === h(sha256(u1))}`)
}

// Invalid utf-8
{
  const u0 = Uint8Array.of(0x80)
  const u1 = Uint8Array.of(0x81)
  const s0 = utf8.encode(u0)
  const s1 = utf8.encode(u1)
  console.log(`2. u8 equal: ${h(u0) === h(u1)}, strings equal: ${s0 === s1}`) // expect false or throw
}

// BOM
{
  const s0 = '\uFEFFHello, world!'
  const u0 = utf8.decode(s0)
  const s1 = utf8.encode(u0)
  const u1 = utf8.decode(s1)
  console.log(`3. Strings equal: ${s0 === s1}, u8 equal: ${h(u0) === h(u1)}`) // expect true
}

```


To build a strict impl:
1. Use `new TextDecoder('utf8', { ignoreBOM: true, fatal: true })` to preserve BOM and throw on errors
2. `TextEncoder` has no `fatal` option. Use `string.isWellFormed()` when available. When it is not, check the output for the bytes `EFBFBD` (the encoding of U+FFFD), and if they are present, recheck that the output decodes back to the same string.
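
A minimal sketch of the two steps above, not a proposed final implementation. The names `strictUtf8` and `hasReplacementBytes` are hypothetical and used only for illustration. The method directions follow the demo (`decode`: string -> bytes, `encode`: bytes -> string).

```js
// Hypothetical sketch: strictUtf8 and hasReplacementBytes are illustrative names only
const encoder = new TextEncoder()
// ignoreBOM: true keeps a leading BOM in the output, fatal: true throws on invalid input
const decoder = new TextDecoder('utf-8', { ignoreBOM: true, fatal: true })

// True if the bytes contain EF BF BD, the UTF-8 encoding of U+FFFD
function hasReplacementBytes(bytes) {
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0xef && bytes[i + 1] === 0xbf && bytes[i + 2] === 0xbd) return true
  }
  return false
}

const strictUtf8 = {
  // string -> Uint8Array, throws on unpaired surrogates
  decode(str) {
    if (typeof str !== 'string') throw new TypeError('expected a string')
    const canCheck = typeof str.isWellFormed === 'function'
    if (canCheck && !str.isWellFormed()) throw new TypeError('string is not well-formed')
    const bytes = encoder.encode(str)
    // Fallback: U+FFFD in the output is either genuine or a replaced surrogate,
    // so recheck the round trip only when it is present
    if (!canCheck && hasReplacementBytes(bytes) && decoder.decode(bytes) !== str) {
      throw new TypeError('string is not well-formed')
    }
    return bytes
  },
  // Uint8Array -> string, throws on invalid UTF-8 and preserves a leading BOM
  encode(bytes) {
    return decoder.decode(bytes)
  },
}
```

With this sketch, cases 1 and 2 from the demo should throw instead of colliding, and the BOM string in case 3 should round-trip unchanged.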

Replacing the current impl with a strict one would be a breaking change.
An alternative is to export the strict version under a different name.

utf8 encoder/decoder is not strict #48

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

utf8 encoder/decoder is not strict #48

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions