Open
Description
As we hit more corner cases for URL extraction, our regex approach will become a limiting factor, and we should switch to an external library. We would need to port our regression tests to the external library.
https://pypi.python.org/pypi/urlextract might be useful.
Or we may find a 'recursive' web scraper which handles file types other than .html.
I.e. there may be a function in https://scrapy.org which handles plain text documents.
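
To make the "port our regression tests" step concrete, here is a rough sketch of a library-agnostic harness: each candidate extractor is a callable taking text and returning a list of URLs, and the same cases run against all of them. The case list, `regex_extract`, `run_regressions` and `load_urlextract` are hypothetical names made up for illustration, and the regex is a stand-in rather than our actual pattern. The only external call assumed is `URLExtract().find_urls(text)` from the urlextract package, loaded lazily so the sketch still runs without it.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical regression cases: (input text, expected URLs in order).
# Real cases would come from our existing regression test suite.
REGRESSION_CASES: List[Tuple[str, List[str]]] = [
    ("See https://example.com/docs for details", ["https://example.com/docs"]),
    ("No links here", []),
    (
        "Two: http://a.example/x and https://b.example/y",
        ["http://a.example/x", "https://b.example/y"],
    ),
]

# Stand-in for the current regex approach (not the project's real pattern).
_URL_RE = re.compile(r"https?://[^\s<>\"']+")


def regex_extract(text: str) -> List[str]:
    """Baseline extractor: return every http(s) URL matched by the regex."""
    return _URL_RE.findall(text)


def run_regressions(
    extract: Callable[[str], Iterable[str]],
    cases: List[Tuple[str, List[str]]],
) -> List[Tuple[str, List[str], List[str]]]:
    """Run every case through `extract`; return (text, expected, actual) per failure."""
    failures = []
    for text, expected in cases:
        actual = list(extract(text))
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def load_urlextract() -> Optional[Callable[[str], List[str]]]:
    """Return urlextract's find_urls if the package is installed, else None."""
    try:
        from urlextract import URLExtract
    except ImportError:
        return None
    return URLExtract().find_urls


if __name__ == "__main__":
    candidates = {"regex": regex_extract}
    external = load_urlextract()
    if external is not None:
        candidates["urlextract"] = external
    for name, extractor in candidates.items():
        failures = run_regressions(extractor, REGRESSION_CASES)
        print(f"{name}: {len(failures)} failing case(s)")
```

Keeping the extractor behind a plain callable means a scrapy-based approach could be compared the same way, provided it can be wrapped to take text and return a list of URLs.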