Remove default different check from API #1208

Closed
@Mr0grog

When looking at versions, we only return records where the different column is true by default. Setting ?different=false in the API disables this check; it does not look for versions where different is false. This is confusing, but more importantly, it’s based on a bad assumption we made early on: that most of the time, you’d only want a list of versions that were really “versions,” where something had changed. We realized very quickly that this was wrong, and almost all our tools and scripts have to go out of their way to set different=false. When writing new tools, this often catches people unaware (including me, who created this situation!) and causes problems when they forget to set it to false. Fixing this is long overdue.
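To make the semantics concrete, here is a minimal Python sketch of the filter as described above, next to the proposed default. It is an illustration only, not the actual API code: `Version`, `filter_current`, and `filter_proposed` are hypothetical names, and the treatment of an explicit `different=true` in the proposed behavior is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    # Hypothetical stand-in for a version record; only `different` matters here.
    uuid: str
    different: bool


def filter_current(versions: List[Version], different: Optional[bool] = None) -> List[Version]:
    """Behavior described in this issue: filter on different=True by default."""
    if different is False:
        # ?different=false only disables the check. It does NOT select
        # records where different == False.
        return list(versions)
    return [v for v in versions if v.different]


def filter_proposed(versions: List[Version], different: Optional[bool] = None) -> List[Version]:
    """Proposed behavior: return everything when ?different is not set."""
    if different is True:
        # Assumption: an explicit ?different=true still applies the filter.
        return [v for v in versions if v.different]
    return list(versions)
```

With the proposed default, a tool that forgets to pass the parameter gets every version back instead of a silently filtered list.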

We should:

  1. Just return all versions if ?different is not set in a request.

  2. Consider removing this field from the database entirely. It imposes a lot of extra work on imports that come out of order, which is extremely common now! We originally thought this would be rare and unusual.

    The theory behind this field was making it easy to know when there was a new “version,” or rather, to know when the page changed. This field doesn’t actually do that very well:

    1. Knowing how closely the timestamp of the “different” version matches the actual time of a change requires knowing when the previous version was captured. But the previous version may not have been different from its own predecessor, so it could be hidden in a normal query. If we really want a shortcut for this, this field should probably be the time of the preceding version instead of a boolean.
    2. In practice, a lot of pages have some unique session ID or other random or frequently changing value in the HTML that has no impact on the page content but causes the page to be considered “different” on every check (since “different” is based on the hash of the response body). That makes this field largely pointless on a lot of pages. Mostly we just see it working “correctly” for PDF files.
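Both problems can be illustrated with a small Python sketch. It is not the real import code: `Capture` and `recompute` are hypothetical names, SHA-256 is an assumed hash function, and treating the first capture as “different” is also an assumption. The `previous_time` field shows the alternative suggested in item 1.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Capture:
    # Hypothetical record: capture time plus the raw response body.
    capture_time: datetime
    body: bytes
    different: bool = True
    previous_time: Optional[datetime] = None  # alternative to the boolean

    @property
    def body_hash(self) -> str:
        # Assumption: SHA-256 stands in for whatever hash is actually used.
        return hashlib.sha256(self.body).hexdigest()


def recompute(captures: List[Capture]) -> List[Capture]:
    """Sort by capture time and recompute both fields for every capture.

    An out-of-order import changes the predecessor of the record that
    follows it, so that record's flag must be recomputed too. This sketch
    simply recomputes everything.
    """
    ordered = sorted(captures, key=lambda c: c.capture_time)
    previous = None
    for capture in ordered:
        capture.different = previous is None or capture.body_hash != previous.body_hash
        capture.previous_time = previous.capture_time if previous else None
        previous = capture
    return ordered
```

A byte-identical PDF captured three times yields one “different” record, while an HTML page whose body embeds a fresh session ID on each request is flagged as “different” every time, even though its visible content never changes.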

(2) is more speculative and should probably go in a separate PR and release from (1), which is more straightforward.
