Skip to content

Make safe option for SURT actually safe #1217

@Mr0grog

Description

@Mr0grog

In #1215, I introduced a safe: true option for Surt.canonicalize with the intention of using it to normalize URLs. However, there are a few unsafe things that turned out not to have options (and some good safe things that don’t have options to turn on).

We should get that option working correctly. The main thing here is turning off deep/repeated unescaping. This might need some care — I have a feeling there are some places where we do need to unescape once and re-escape (e.g. hostname cleanup), but other places where we should not at all (e.g. paths…?). We may also need a different set of characters to escape for the safe case (see also this note about needing to investigate the Java vs. Python escaping:

# TODO: Internet Archive's SURT uses this crazy character set, but only one
# test fails if we just use Addressable's standard set. Maybe drop this?
# Also validate against the original Java version:
# https://github.com/iipc/webarchive-commons/blob/5fb641b7ff3c63ade0eb39782fc8b1982ea27330/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L219-L275
URL_SPECIAL_CHARACTERS = '!"$&\'()*+,-./:;<=>?@[\]^_`{|}~'.freeze
SAFE_CHARACTERS = "0-9a-zA-Z#{Regexp.escape(URL_SPECIAL_CHARACTERS)}".freeze
)

While doing this, it might also be good to add an option for upper- vs. lower-casing escape sequences. SURT always lower-cases, but RFC 3986 recommends upper-case as the normalized form (section 2.1, which is the intended use case for “safe” canonicalization.

Metadata

Metadata

Assignees

No one assigned

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions