-
-
Notifications
You must be signed in to change notification settings - Fork 25
Description
In #1215, I introduced a safe: true option for Surt.canonicalize with the intention of using it to normalize URLs. However, there are a few unsafe things that turned out not to have options (and some good safe things that don’t have options to turn on).
We should get that option working correctly. The main thing here is turning off deep/repeated unescaping. This might need some care — I have a feeling there are some places where we do need to unescape once and re-escape (e.g. hostname cleanup), but other places where we should not at all (e.g. paths…?). We may also need a different set of characters to escape for the safe case (see also this note about needing to investigate the Java vs. Python escaping:
web-monitoring-db/app/lib/surt/canonicalize.rb
Lines 78 to 83 in 9b38946
| # TODO: Internet Archive's SURT uses this crazy character set, but only one | |
| # test fails if we just use Addressable's standard set. Maybe drop this? | |
| # Also validate against the original Java version: | |
| # https://github.com/iipc/webarchive-commons/blob/5fb641b7ff3c63ade0eb39782fc8b1982ea27330/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L219-L275 | |
| URL_SPECIAL_CHARACTERS = '!"$&\'()*+,-./:;<=>?@[\]^_`{|}~'.freeze | |
| SAFE_CHARACTERS = "0-9a-zA-Z#{Regexp.escape(URL_SPECIAL_CHARACTERS)}".freeze |
While doing this, it might also be good to add an option for upper- vs. lower-casing escape sequences. SURT always lower-cases, but RFC 3986 recommends upper-case as the normalized form (section 2.1, which is the intended use case for “safe” canonicalization.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status