utf8 encoder/decoder is not strict #48

@ChALkeR

and might not be suitable for e.g. hashing

i.e.:

  • there exist non-equal JS strings that encode to equal Uint8Array buffers
  • there exist non-equal Uint8Array buffers that decode to equal JS strings

Demo:

import { utf8, hex } from './index.ts'
import { sha256 } from '@noble/hashes/sha2.js'

const h = hex.encode

// Unpaired surrogates
{
  const s0 = 'what\ud800ever'
  const s1 = 'what\ud820ever'
  const u0 = utf8.decode(s0)
  const u1 = utf8.decode(s1)
  console.log(`1. Strings equal: ${s0 === s1}, u8 equal: ${h(u0) === h(u1)}`) // expect false or throw
  console.log(`   Bonus: hashes equal: ${h(sha256(u0)) === h(sha256(u1))}`)
}

// Invalid utf-8
{
  const u0 = Uint8Array.of(0x80)
  const u1 = Uint8Array.of(0x81)
  const s0 = utf8.encode(u0)
  const s1 = utf8.encode(u1)
  console.log(`2. u8 equal: ${h(u0) === h(u1)}, strings equal: ${s0 === s1}`) // expect false or throw
}

// BOM
{
  const s0 = '\uFEFFHello, world!'
  const u0 = utf8.decode(s0)
  const s1 = utf8.encode(u0)
  const u1 = utf8.decode(s1)
  console.log(`3. Strings equal: ${s0 === s1}, u8 equal: ${h(u0) === h(u1)}`) // expect true
}

To build a strict impl:

  1. Use new TextDecoder('utf8', { ignoreBOM: true, fatal: true }) to preserve BOM and throw on errors
  2. For TextEncoder, there is no fatal option. Use string.isWellFormed() when available. When it is not, check the output for the byte sequence EF BF BD (the UTF-8 encoding of U+FFFD) and, if it is present, recheck that the output decodes back to the same string.
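
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `strictStringToBytes`, `strictBytesToString` and `hasReplacementBytes` are hypothetical and not part of the library, and the sketch relies only on the standard `TextEncoder`, `TextDecoder` and (where present) `String.prototype.isWellFormed`.

```typescript
// Hypothetical helper names; not part of the library's current API.
const encoder = new TextEncoder()
// ignoreBOM: true keeps a leading U+FEFF in the decoded string;
// fatal: true makes decode() throw a TypeError on invalid UTF-8.
const decoder = new TextDecoder('utf-8', { ignoreBOM: true, fatal: true })

// True if the bytes contain EF BF BD, the UTF-8 encoding of U+FFFD.
function hasReplacementBytes(bytes: Uint8Array): boolean {
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0xef && bytes[i + 1] === 0xbf && bytes[i + 2] === 0xbd) return true
  }
  return false
}

// String -> bytes, rejecting unpaired surrogates instead of replacing them.
function strictStringToBytes(str: string): Uint8Array {
  const s = str as unknown as { isWellFormed?(): boolean }
  if (typeof s.isWellFormed === 'function') {
    if (!s.isWellFormed()) throw new TypeError('string contains unpaired surrogates')
    return encoder.encode(str)
  }
  // Fallback for runtimes without isWellFormed(): a replacement sequence in the
  // output is only acceptable if the input really contained U+FFFD, which is
  // the case exactly when the output decodes back to the same string.
  const bytes = encoder.encode(str)
  if (hasReplacementBytes(bytes) && decoder.decode(bytes) !== str) {
    throw new TypeError('string contains unpaired surrogates')
  }
  return bytes
}

// Bytes -> string, throwing on invalid UTF-8 and preserving a leading BOM.
function strictBytesToString(bytes: Uint8Array): string {
  return decoder.decode(bytes)
}
```

With helpers like these, the three cases from the demo would be expected to throw for the unpaired surrogates, throw for the invalid bytes, and round-trip the BOM string unchanged.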

Replacing the current impl with a strict one would be a breaking change.
An alternative is to export it under a different name.
