Extract SURT into a separate gem #767

@Mr0grog

This project has a nearly complete Ruby port of the Internet Archive’s SURT Python package buried in the app/lib/ directory:

# Tools for canonicalizing and formatting URLs according to the Internet
# Archive's "Sort-friendly URI Reordering Transform" (SURT) format:
# http://crawler.archive.org/articles/user_manual/glossary.html#surt
#
# For example:
#
# URL: https://energy.gov/eere/sunshot/downloads/
# SURT: gov,energy)/eere/sunshot/downloads
#
# The implementations primarily live in submodules (Canonicalize and Format),
# while the methods here serve as public entry points. See each implementation
# module for a list of options and default values (at the top of each module).
#
# Code in the submodules is generally based on the Internet Archive's Python
# SURT module: https://github.com/internetarchive/surt
# With some added inspiration from Purell: https://github.com/PuerkitoBio/purell
# and normalize_url: https://github.com/rwz/normalize_url
module Surt

I wrote it because we needed URL canonicalization tools and none of the existing Ruby libraries I could find quite met our needs. Having a method that roughly matched the Internet Archive’s was also advantageous, and nobody had written a Ruby port of SURT.
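For readers unfamiliar with the format, here is a rough sketch of the transform shown in the comment above. It is not the code in app/lib/; `simple_surt` is a hypothetical helper that only drops the scheme, reverses the host labels, and trims trailing slashes. The real module (like the Python original) applies many more canonicalization rules.

```ruby
require "uri"

# Hypothetical, simplified illustration of the SURT format.
# NOT the project's Surt module: it skips most canonicalization steps.
def simple_surt(url)
  uri = URI.parse(url)

  # Reverse the host labels and join with commas: energy.gov -> gov,energy
  host = uri.host.downcase.split(".").reverse.join(",")

  # Trim trailing slashes from the path, but keep a bare "/" for the root.
  path = uri.path.to_s.sub(%r{/+\z}, "")
  path = "/" if path.empty?

  surt = "#{host})#{path}"
  surt += "?#{uri.query}" if uri.query
  surt
end

puts simple_surt("https://energy.gov/eere/sunshot/downloads/")
# prints "gov,energy)/eere/sunshot/downloads"
```

Reversing the host puts the most significant label first, so URLs from the same domain and its subdomains sort next to each other.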

Since we have generally been working to break more reusable, abstract pieces out of the web monitoring projects, this is probably a really good candidate for that on the Ruby side. It might be nice to extract it and publish it as a Ruby gem. (Gem name: SURT, repo name: edgi-govdata-archiving/ruby-surt)
