Description
This project has a nearly complete Ruby port of the Internet Archive’s SURT Python package buried in the app/lib/ directory:
web-monitoring-db/app/lib/surt.rb
Lines 3 to 20 in 3bb7e8a
    # Tools for canonicalizing and formatting URLs according to the Internet
    # Archive's "Sort-friendly URI Reordering Transform" (SURT) format:
    # http://crawler.archive.org/articles/user_manual/glossary.html#surt
    #
    # For example:
    #
    #   URL: https://energy.gov/eere/sunshot/downloads/
    #   SURT: gov,energy)/eere/sunshot/downloads
    #
    # The implementations primarily live in submodules (Canonicalize and Format),
    # while the methods here serve as public entry points. See each implementation
    # module for a list of options and default values (at the top of each module).
    #
    # Code in the submodules is generally based on the Internet Archive's Python
    # SURT module: https://github.com/internetarchive/surt
    # With some added inspiration from Purell: https://github.com/PuerkitoBio/purell
    # and normalize_url: https://github.com/rwz/normalize_url
    module Surt
I wrote it because we needed URL canonicalization tools, none of the existing Ruby ones I could find fully met our needs, and it was advantageous to have a method that roughly matched the Internet Archive’s. Nobody had written a Ruby port of SURT.
Since we have generally been working to break the more reusable, abstract pieces out of the web monitoring projects, this is a good candidate for that on the Ruby side. It might be nice to extract it and publish it as a Ruby gem. (Gem name: SURT, repo name: edgi-govdata-archiving/ruby-surt)
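For anyone unfamiliar with the format, here is a rough sketch of the transform described in the header comment above. This is not the code in `app/lib/surt.rb` and not its API: `simple_surt` is a hypothetical helper that only reverses the host labels and trims trailing slashes, and it skips the canonicalization options the real module and the Python package handle.

```ruby
require 'uri'

# Hypothetical helper for illustration only; the real Surt module has
# separate Canonicalize and Format steps with many more options.
def simple_surt(url)
  uri = URI.parse(url)

  # Reverse the host labels and join them with commas:
  # "energy.gov" becomes "gov,energy".
  reversed_host = uri.host.downcase.split('.').reverse.join(',')

  # Drop trailing slashes, but keep a bare "/" for the site root.
  path = uri.path.sub(%r{/+\z}, '')
  path = '/' if path.empty?

  # The scheme is dropped; ")" separates the host part from the path.
  query = uri.query ? "?#{uri.query}" : ''
  "#{reversed_host})#{path}#{query}"
end

puts simple_surt('https://energy.gov/eere/sunshot/downloads/')
# prints "gov,energy)/eere/sunshot/downloads"
```

Because the host labels are reversed, a plain string sort groups URLs from the same domain and its subdomains together, which is the main reason to match the Internet Archive's format.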