Releases: edgi-govdata-archiving/wayback
Version 0.3.0
This release marks a major update we’re really excited about: WaybackClient.get_memento no longer returns a Response object from the Requests package that takes a lot of extra work to interpret correctly. Instead, it returns a new Memento object. It’s really similar to the Response we used to return, but doesn’t mix up current and historical data — it represents the historical, archived HTTP response that is stored in the Wayback Machine. This is a big change to the API, so we’ve bumped the version number to 0.3.x.
Notable Changes
- Breaking change: WaybackClient.get_memento takes new parameters and has a new return type. More details below.
- Breaking change: memento_url_data now returns 3 values instead of 2. The last value is a string representing the playback mode (see the description below of the new mode parameter on WaybackClient.get_memento for more about playback modes).
- Requests to the Wayback Machine now have a default timeout of 60 seconds. This was important because we’ve seen many recent issues where the Wayback Machine servers don’t always close connections.
  If needed, you can disable this by explicitly setting timeout=None when creating a WaybackSession. Please note this is not a timeout on how long a whole request takes, but on the time between bytes received.
- WaybackClient.get_memento now raises NoMementoError when the requested URL has never been archived by the Wayback Machine. It no longer raises requests.exceptions.HTTPError under any circumstances.
You may notice that removing APIs from the Requests package is a theme here. Under the hood, Wayback still uses Requests for HTTP requests, but we expect to change that in order to ensure this package is thread-safe. We will bump the version to v0.4.x when doing so.
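To make the three-value return of memento_url_data concrete, here is a minimal standard-library sketch that splits a Wayback URL into its original URL, capture time, and playback mode. It is an illustration only, not the package's implementation: split_memento_url and its regular expression are hypothetical names made up for this example, and the value order (URL, time, mode) is an assumption based on the note above that the mode comes last.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern for this sketch: a 14-digit timestamp, an optional
# two-letter mode ending in "_" (for example "id_"), then the original URL.
MEMENTO_URL_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})([a-z]{2}_)?/(.+)$'
)


def split_memento_url(memento_url):
    """Return (original_url, capture_time, mode) for a Wayback URL."""
    match = MEMENTO_URL_PATTERN.match(memento_url)
    if match is None:
        raise ValueError(f'Not a Wayback memento URL: {memento_url!r}')
    timestamp, mode, original_url = match.groups()
    capture_time = datetime.strptime(timestamp, '%Y%m%d%H%M%S').replace(
        tzinfo=timezone.utc
    )
    # A URL without a mode segment is represented here as an empty string.
    return original_url, capture_time, mode or ''


url, time, mode = split_memento_url(
    'http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/'
)
print(url, time.isoformat(), mode)
```

Code written against v0.2.x that unpacks two values from memento_url_data would need a third name added to the unpacking in the same way.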
get_memento() Parameters
The parameters in WaybackClient.get_memento have been re-organized. The method signature is now:
def get_memento(self,
url, # Accepts new types of values.
datetime=None, # New parameter.
mode=Mode.original, # New parameter.
*, # Everything below is keyword-only.
exact=True,
exact_redirects=None,
target_window=24 * 60 * 60,
follow_redirects=True) # New parameter.

- All parameters except url (the first parameter) from v0.2.x must now be specified with keywords, and cannot be specified positionally.
  If you previously used keywords, your code will be fine and no changes are necessary:

      # This still works great!
      client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/', exact=False, exact_redirects=False, target_window=3600)

  However, positional parameters like the following will now cause problems, and you should switch to the above keyword form:

      # This will now cause you some trouble :(
      client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/', False, False, 3600)
- The url parameter can now be a normal, non-Wayback URL or a CdxRecord, and new datetime and mode parameters have been added.
  Previously, if you wanted to get a memento of what http://www.noaa.gov/ looked like on August 1, 2018, you would have had to construct a complex string to pass to get_memento():

      client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov/')

  Now you can pass the URL and time you want as separate parameters:

      client.get_memento('http://www.noaa.gov/', datetime.datetime(2018, 8, 1))

  If the datetime parameter does not specify a timezone, it will be treated as UTC (not local time).
  You can also pass a CdxRecord that you received from WaybackClient.search instead of a URL and time:

      for record in client.search('http://www.noaa.gov/'):
          client.get_memento(record)

  Finally, you can now specify the playback mode of a memento using the mode parameter:

      client.get_memento('http://www.noaa.gov/', datetime=datetime.datetime(2018, 8, 1), mode=wayback.Mode.view)

  The default mode is Mode.original, which returns the exact HTTP response body as was originally archived. Other modes reformat the response body so it’s more friendly for browsing by changing the URLs of links, images, etc. and by adding informational content to the page about the memento you are viewing. They are the modes typically used when you view the Wayback Machine in a web browser.
  Don’t worry, though: complete Wayback URLs are still supported. This code still works fine:

      client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov/')

- A new follow_redirects parameter specifies whether to follow historical redirects (i.e. redirects that happened when the requested memento was captured). It defaults to True, which matches the old behavior of this method.
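The relationship between the separate url, datetime, and mode values and the old single-string form can be sketched with the standard library alone. This is a hedged illustration rather than the package's code: build_memento_url is a hypothetical helper, and it assumes the 'id_' string shown in these notes corresponds to the default original mode. It also shows the rule that a datetime without a timezone is read as UTC.

```python
from datetime import datetime, timedelta, timezone


def build_memento_url(url, when, mode='id_'):
    """Compose a Wayback URL from an original URL, a time, and a mode string."""
    if when.tzinfo is None:
        # Naive datetimes are interpreted as UTC, not local time.
        when = when.replace(tzinfo=timezone.utc)
    else:
        when = when.astimezone(timezone.utc)
    timestamp = when.strftime('%Y%m%d%H%M%S')
    return f'http://web.archive.org/web/{timestamp}{mode}/{url}'


naive = build_memento_url('http://www.noaa.gov/', datetime(2018, 8, 1))
# 02:00 at UTC+2 is the same instant as midnight UTC.
aware = build_memento_url(
    'http://www.noaa.gov/',
    datetime(2018, 8, 1, 2, 0, tzinfo=timezone(timedelta(hours=2))),
)
print(naive)
print(aware == naive)
```

Passing a timezone-aware datetime avoids any ambiguity about which instant is meant, which is why the second call produces the same URL as the first.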
get_memento() Returns a Memento Object
get_memento() no longer returns a response object from the Requests package. Instead it returns a specialized Memento object, which is similar, but provides more useful information about the Memento than just the HTTP response from Wayback. For example, memento.url is the original URL the memento is a capture of (e.g. http://www.noaa.gov/) rather than the Wayback URL (e.g. http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/). You can still get the full Wayback URL from memento.memento_url.
You can check out the full API documentation for Memento, but here’s a quick guide to what’s available:
memento = client.get_memento('http://www.noaa.gov/home',
                             datetime(2018, 8, 16, 11, 19, 11),
                             exact=False)
# These values were previously not available except by parsing
# `memento.url`. The old `memento.url` is now `memento.memento_url`.
memento.url == 'http://www.noaa.gov/'
memento.timestamp == datetime(2018, 8, 29, 8, 8, 49, tzinfo=timezone.utc)
memento.mode == 'id_'
# Used to be `memento.url`:
memento.memento_url == 'http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/'
# Used to be a list of `Response` objects, now a *tuple* of Mementos. It
# lists only the redirects that are actual Mementos and not part of
# Wayback's internal machinery:
memento.history == (Memento<url='http://noaa.gov/home'>,)
# Used to be a list of `Response` objects, now a *tuple* of URL strings:
memento.debug_history == ('http://web.archive.org/web/20180816111911id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov/')
# Headers now only lists headers from the original archived response, not
# additional headers from the Wayback Machine itself. (If there's
# important information you needed in the headers, file an issue and let
# us know! We'd like to surface that kind of information as attributes on
# the Memento now.)
memento.headers == {'header_name': 'header_value',
                    'another_header': 'another_value',
                    'and': 'so on'}
# Same as before:
memento.status_code
memento.ok
memento.is_redirect
memento.encoding
memento.content
memento.text
Version 0.2.6
Fix a major bug where a session's timeout would not actually be applied to most requests. HUGE thanks to @LionSzl for discovering this issue and addressing it. (#68)
Version 0.3.0 Beta 1
wayback.WaybackClient.get_memento now raises wayback.exceptions.NoMementoError when the requested URL has never been archived. It also now raises wayback.exceptions.MementoPlaybackError in all other cases where an error was returned by the Wayback Machine (so you should never see a requests.exceptions.HTTPError). However, you may still see other network-level errors (e.g. ConnectionError).
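The three kinds of failure described above can be handled separately. The sketch below is a minimal illustration under stated assumptions: the two exception classes are local stand-ins so the snippet runs without the package installed (real code would import them from wayback.exceptions), and fetch_or_explain is a hypothetical helper. NoMementoError is caught first so the order works whether or not it is a subclass of MementoPlaybackError.

```python
class MementoPlaybackError(Exception):
    """Stand-in for wayback.exceptions.MementoPlaybackError."""


class NoMementoError(Exception):
    """Stand-in for wayback.exceptions.NoMementoError."""


def fetch_or_explain(fetch):
    """Call fetch() and return (result, problem); problem is None on success."""
    try:
        return fetch(), None
    except NoMementoError:
        return None, 'never archived'
    except MementoPlaybackError:
        return None, 'playback error'
    except OSError as error:
        # Network-level errors from Requests derive from OSError.
        return None, f'network error: {error}'


def never_archived():
    raise NoMementoError('no captures')


def dropped_connection():
    raise ConnectionError('connection reset')


print(fetch_or_explain(lambda: 'memento'))
print(fetch_or_explain(never_archived))
print(fetch_or_explain(dropped_connection))
```

With the real package, the callable passed in would be something like lambda: client.get_memento(url, when), where client is a WaybackClient.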
Version 0.3.0 Alpha 3
Fixes a bug in the new Memento type where header parsing would fail for mementos with schemeless Location headers. (#61)
Version 0.3.0 Alpha 2
Fixes a bug in the new Memento type where header parsing would fail for mementos with path-based Location headers. (#60)
Version 0.3.0 Alpha 1
This release focuses on wayback.WaybackClient.get_memento and makes major, breaking changes to its parameters and return type. They’re all improvements, though, we promise!
get_memento() Parameters
The parameters in wayback.WaybackClient.get_memento have been re-organized. The method signature is now:
def get_memento(self,
url, # Accepts new types of values.
datetime=None, # New parameter.
mode=Mode.original, # New parameter.
*, # Everything below is keyword-only.
exact=True,
exact_redirects=None,
target_window=24 * 60 * 60,
follow_redirects=True) # New parameter.

- All parameters except url (the first parameter) from v0.2.x must now be specified with keywords, and cannot be specified positionally.
  If you previously used keywords, your code will be fine and no changes are necessary:

      # This still works great!
      client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov', exact=False, exact_redirects=False, target_window=3600)

  However, positional parameters like the following will now cause problems, and you should switch to the above keyword form:

      # This will now cause you some trouble :(
      client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov', False, False, 3600)
- The url parameter can now be a normal, non-Wayback URL or a wayback.CdxRecord, and new datetime and mode parameters have been added.
  Previously, if you wanted to get a memento of what http://www.noaa.gov/ looked like on August 1, 2018, you would have had to construct a complex string to pass to get_memento():

      client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov')

  Now you can pass the URL and time you want as separate parameters:

      client.get_memento('http://www.noaa.gov', datetime.datetime(2018, 8, 1))

  If the datetime parameter does not specify a timezone, it will be treated as UTC (not local time).
  You can also pass a wayback.CdxRecord that you received from wayback.WaybackClient.search instead of a URL and time:

      for record in client.search('http://www.noaa.gov'):
          client.get_memento(record)

  Finally, you can now specify the playback mode of a memento using the mode parameter:

      client.get_memento('http://www.noaa.gov', datetime=datetime.datetime(2018, 8, 1), mode=wayback.Mode.view)

  The default mode is wayback.Mode.original, which returns the exact HTTP response body as was originally archived. Other modes reformat the response body so it’s more friendly for browsing by changing the URLs of links, images, etc. and by adding informational content to the page about the memento you are viewing. They are the modes typically used when you view the Wayback Machine in a web browser.
  Don’t worry, though: complete Wayback URLs are still supported. This code still works fine:

      client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov')

- A new follow_redirects parameter specifies whether to follow historical redirects (i.e. redirects that happened when the requested memento was captured). It defaults to True, which matches the old behavior of this method.
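The effect of the keyword-only marker in the signature can be tried without the package installed. The function below is a stand-in with the same parameter layout that only echoes its arguments; get_memento_stand_in is a hypothetical name, and the string 'id_' is used in place of Mode.original so the snippet is self-contained.

```python
def get_memento_stand_in(url, datetime=None, mode='id_', *,
                         exact=True, exact_redirects=None,
                         target_window=24 * 60 * 60, follow_redirects=True):
    """Stand-in with the same parameter layout; it only echoes its arguments."""
    return {'url': url, 'datetime': datetime, 'mode': mode, 'exact': exact,
            'exact_redirects': exact_redirects, 'target_window': target_window,
            'follow_redirects': follow_redirects}


wayback_url = 'http://web.archive.org/web/20180816111911id_/http://www.noaa.gov'

# Keyword form: each value binds to the parameter it names.
ok = get_memento_stand_in(wayback_url, exact=False, exact_redirects=False,
                          target_window=3600)

# Old positional form: four positional values, but only three positional slots.
try:
    get_memento_stand_in(wayback_url, False, False, 3600)
    positional_error = None
except TypeError as error:
    positional_error = error

print(ok['target_window'])
print(type(positional_error).__name__)
```

Note that in this stand-in, a call with only two extra positional values would not raise: they would bind to datetime and mode instead of exact and exact_redirects. That is another reason to switch old positional calls to keywords rather than rely on an error to flag them.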
get_memento() Returns a New Memento Type
get_memento() no longer returns a response object from the Requests package. Instead it returns a specialized wayback.Memento object, which is similar, but provides more useful information about the Memento than just the HTTP response from Wayback. For example, memento.url is the original URL the memento is a capture of (e.g. http://www.noaa.gov/) rather than the Wayback URL (e.g. http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/). You can still get the full Wayback URL from memento.memento_url.
You can check out the full API docs for wayback.Memento, but here’s a quick guide to what’s available:
memento = client.get_memento('http://www.noaa.gov/home',
                             datetime(2018, 8, 16, 11, 19, 11),
                             exact=False)
# These values were previously not available except by parsing
# `memento.url`. The old `memento.url` is now `memento.memento_url`.
memento.url == 'http://www.noaa.gov'
memento.timestamp == datetime(2018, 8, 29, 8, 8, 49, tzinfo=timezone.utc)
memento.mode == 'id_'
# Used to be `memento.url`:
memento.memento_url == 'http://web.archive.org/web/20180816111911id_/http://www.noaa.gov'
# Used to be a list of `Response` objects, now a *tuple* of Mementos. It
# still lists only the redirects that are actual Mementos and not part of
# Wayback's internal machinery:
memento.history == (Memento<url='http://noaa.gov/home'>,)
# Used to be a list of `Response` objects, now a *tuple* of URL strings:
memento.debug_history == ('http://web.archive.org/web/20180816111911id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov')
# Headers now only lists headers from the original, archived response, not
# additional headers from the Wayback Machine itself. (If there's
# important information you needed in the headers, file an issue and let
# us know! We'd like to surface that kind of information as attributes on
# the Memento now.)
memento.headers == {'header_name': 'header_value',
                    'another_header': 'another_value',
                    'and': 'so on'}
# Same as before:
memento.status_code
memento.ok
memento.is_redirect
memento.encoding
memento.content
memento.text
Under the hood, Wayback still uses Requests for HTTP requests, but we expect to change that soon to ensure this package is thread-safe.
Other Breaking Changes
Finally, wayback.memento_url_data now returns 3 values instead of 2. The last value is a string representing the playback mode (see above description of the new mode parameter on wayback.WaybackClient.get_memento for more about playback modes).
Version 0.2.5
This release fixes a bug where the target_window parameter for WaybackClient.get_memento() did not work correctly if the memento you were redirected to was off by more than a day from the requested time. See #53 for more details.
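As a rough illustration of what target_window is meant to constrain, the sketch below checks whether a capture time falls within a given number of seconds of the requested time. It assumes target_window is a maximum difference in seconds (consistent with the 24 * 60 * 60 default) and that the boundary is inclusive; within_target_window is a hypothetical helper, not the package's internal check.

```python
from datetime import datetime, timezone


def within_target_window(requested, actual, target_window=24 * 60 * 60):
    """True if actual is within target_window seconds of requested."""
    return abs((actual - requested).total_seconds()) <= target_window


requested = datetime(2018, 8, 16, 11, 19, 11, tzinfo=timezone.utc)
near = datetime(2018, 8, 16, 12, 0, 0, tzinfo=timezone.utc)
far = datetime(2018, 8, 29, 9, 29, 26, tzinfo=timezone.utc)

print(within_target_window(requested, near))
print(within_target_window(requested, far))
print(within_target_window(requested, near, target_window=60))
```

The second case, a capture almost two weeks away from the requested time, is the kind of redirect that the fix in this release concerns.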
Version 0.2.4
This release is focused on improved error handling.
Breaking Changes:
- The timestamps in CdxRecord objects returned by wayback.WaybackClient.search now include timezone information. (They are always in the UTC timezone.)
Updates:
- The history attribute of a memento now only includes redirects that were mementos (i.e. redirects that would have been seen when browsing the recorded site at the time it was recorded). Other redirects involved in working with the memento API are still available in debug_history, which includes all redirects, whether or not they were mementos.
- Wayback’s CDX search API sometimes returns repeated, identical results. These are now filtered out, so repeat search results will not be yielded from wayback.WaybackClient.search.
- wayback.exceptions.RateLimitError will now be raised as an exception any time you breach the Wayback Machine's rate limits. This would previously have been wayback.exceptions.WaybackException, wayback.exceptions.MementoPlaybackError, or regular HTTP responses, depending on the method you called. It has a retry_after property that indicates how many seconds you should wait before trying again (if the server sent that information, otherwise it will be None).
- wayback.exceptions.BlockedSiteError will now be raised any time you search for a URL or request a memento that has been blocked from access (for example, in situations where the Internet Archive has received a takedown notice).
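One way to use retry_after is a small retry loop that waits for the indicated time, or for a fallback when the value is None. The sketch below is an assumption-laden illustration: RateLimitError here is a local stand-in with a simplified constructor (real code would catch wayback.exceptions.RateLimitError), and call_with_rate_limit_retry, attempts, and default_wait are hypothetical names chosen for this example.

```python
import time


class RateLimitError(Exception):
    """Stand-in for wayback.exceptions.RateLimitError."""

    def __init__(self, retry_after=None):
        super().__init__('rate limited')
        self.retry_after = retry_after


def call_with_rate_limit_retry(action, attempts=3, default_wait=1.0,
                               sleep=time.sleep):
    """Run action(), waiting and retrying when it raises RateLimitError."""
    for attempt in range(attempts):
        try:
            return action()
        except RateLimitError as error:
            if attempt == attempts - 1:
                raise
            # retry_after is None when the server did not say how long to wait.
            wait = default_wait if error.retry_after is None else error.retry_after
            sleep(wait)


# Simulated call: rate limited twice, then successful.
waits = []
responses = iter([RateLimitError(retry_after=2), RateLimitError(), 'results'])


def flaky_search():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_rate_limit_retry(flaky_search, sleep=waits.append)
print(result, waits)
```

The sleep function is injectable so the loop can be exercised without actually waiting; in normal use the default time.sleep applies.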
Version 0.2.3
This release lowers the minimum supported Python version to 3.6! You can now use Wayback in places like Google Colab.
The from_date and to_date arguments for wayback.WaybackClient.search can now be datetime.date instances in addition to datetime.datetime.
Huge thanks to @edsu for implementing both of these!
Version 0.2.2
When errors were raised or redirects were involved in WaybackClient.get_memento(), it was previously possible for connections to be left hanging open. Wayback now works harder to make sure connections aren't left open.
This release also updates the default user agent string to include the repo URL. It now looks like: wayback/0.2.2 (+https://github.com/edgi-govdata-archiving/wayback).