Make `safe` option for SURT actually safe

In #1215, I [introduced a `safe: true` option for `Surt.canonicalize`](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/9b389461f832bb26c8fcc8020d51f578c6b10374/app/lib/surt/canonicalize.rb#L36-L37) with the intention of using it to normalize URLs. However, there are a few unsafe things that turned out not to have options (and some good safe things that don’t have options to turn on).

We should get that option working correctly. The main thing here is turning off deep/repeated unescaping. This might need some care — I have a feeling there are some places where we do need to unescape once and re-escape (e.g. hostname cleanup), but other places where we should not at all (e.g. paths…?). We may also need a different set of characters to escape for the safe case (see also this note about needing to investigate the Java vs. Python escaping: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/9b389461f832bb26c8fcc8020d51f578c6b10374/app/lib/surt/canonicalize.rb#L78-L83)

While doing this, it might also be good to add an option for upper- vs. lower-casing escape sequences. SURT always lower-cases, but RFC 3986 recommends upper-case as the normalized form ([section 2.1](https://datatracker.ietf.org/doc/html/rfc3986#section-2.1), which is the intended use case for “safe” canonicalization.

	# TODO: Internet Archive's SURT uses this crazy character set, but only one
	# test fails if we just use Addressable's standard set. Maybe drop this?
	# Also validate against the original Java version:
	# https://github.com/iipc/webarchive-commons/blob/5fb641b7ff3c63ade0eb39782fc8b1982ea27330/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L219-L275
	URL_SPECIAL_CHARACTERS = '!"$&\'()*+,-./:;<=>?@[\]^_`{\|}~'.freeze
	SAFE_CHARACTERS = "0-9a-zA-Z#{Regexp.escape(URL_SPECIAL_CHARACTERS)}".freeze

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Make `safe` option for SURT actually safe #1217

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Make safe option for SURT actually safe #1217

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Make `safe` option for SURT actually safe #1217