http: percent-decode should follow WHATWG for invalid % sequences #3304

@cuiweixie

Problem

url_decode / path_decode treat any % followed by two characters as a hex escape and decode it with hex_to_byte(). That has two failure modes:

  1. Historical bug: a–z / A–Z were accepted as hex “digits”, so e.g. g was mapped to nibble 16 (g - 'a' + 10), which is not a valid hex nibble (must be 0–15). That produced wrong decoded bytes: %2g was decoded as if it were a deliberate encoding instead of being left alone as invalid input.

  2. After narrowing to a–f: non-hex characters (e.g. z) fall through to c - '0', which is still not a hex nibble but an arbitrary value (e.g. 'z' - '0' = 74). So invalid input still decodes to wrong bytes, just different ones.
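The arithmetic behind both failure modes can be reproduced with a small sketch. The two functions below are hypothetical reconstructions written only from the description above; they are not the project's actual hex_to_byte(), and their names are made up for illustration.

```c
/* Hypothetical reconstructions of the two lenient mappings described
 * above. These are NOT the project's real code; names are assumptions. */

/* Failure mode 1: every ASCII letter is accepted as a hex "digit". */
static int nibble_accepting_a_to_z(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a' + 10;  /* 'g' -> 16 */
    if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
    return c - '0';
}

/* Failure mode 2: letters narrowed to a-f, but everything else still
 * falls through to c - '0' without any validation. */
static int nibble_with_fallthrough(char c) {
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return c - '0';                                 /* 'z' -> 74 */
}
```

Both return values outside 0–15 for non-hex input, so whatever byte is assembled from them is meaningless.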

Neither matches the URL Standard.

Expected behavior

WHATWG URL, “Percent-encoded bytes”: a percent-encoded byte is % followed by two ASCII hex digits. When a % is not followed by two ASCII hex digits, percent-decoding appends the % as-is and continues with the next byte (see the spec’s example: %25%s%1G → %%s%1G).

Suggested fix

  • Only decode when the next two code points are ASCII hex digits.
  • Otherwise emit a literal % (and do not consume the following characters as hex).
  • Trailing % or %X with fewer than two following bytes should not fail the whole decode; treat like the spec (literal %).
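A minimal sketch of the three rules above, assuming a byte-oriented decoder that writes into a caller-provided buffer. The names hex_nibble and percent_decode and their signatures are illustrative assumptions, not the project's existing API.

```c
#include <stddef.h>

/* Returns 0-15 for an ASCII hex digit, -1 for anything else. */
static int hex_nibble(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes `len` bytes from `in` into `out` and returns the number of
 * bytes written. `out` must have room for at least `len` bytes, since
 * decoding never grows the input. Hypothetical helper, not project code. */
static size_t percent_decode(const char *in, size_t len, char *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        /* Only look ahead when two more bytes actually exist. */
        if (c == '%' && len - i > 2) {
            int hi = hex_nibble((unsigned char)in[i + 1]);
            int lo = hex_nibble((unsigned char)in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out[n++] = (char)((hi << 4) | lo);
                i += 2;  /* consume the two hex digits */
                continue;
            }
        }
        /* Invalid or truncated escape: emit the byte unchanged and let
         * the following bytes be processed normally. */
        out[n++] = (char)c;
    }
    return n;
}
```

With this sketch, %25%s%1G decodes to %%s%1G, %2g stays %2g, and a trailing % or %4 passes through unchanged instead of failing the decode.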

Regression tests should lock the WHATWG example and cases like %2g → literal %2g (not a bogus byte).
