
Remove default different check from API #1208

Closed
@Mr0grog

Description


When looking at versions, we only return records where the different column is true by default. Setting ?different=false in the API disables this check; it does not look for versions where different is false. This is confusing, but more importantly, it’s based on a bad assumption we made early on: that most of the time, you’d only want a list of versions that were really “versions,” where something had changed. We realized very quickly that this was wrong, and almost all our tools and scripts have to go out of their way to set different=false. When writing new tools, this default often catches people unaware (including me, who created this situation!) and causes problems when they forget to set it to false. Fixing this is long overdue.
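As a rough illustration of the semantics described above, here is a minimal Python sketch. It is a model of the behavior, not the actual API code: the `Version` class and both helper functions are hypothetical names, and treating any value other than `false` like the default is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    uuid: str
    different: bool


def filter_versions_old(versions: List[Version],
                        different_param: Optional[str] = None) -> List[Version]:
    """Hypothetical model of the old default behavior."""
    # ?different=false only switches the filter off. It does NOT select
    # versions whose `different` value is false.
    if different_param == "false":
        return list(versions)
    # Parameter missing: only versions flagged as different are returned.
    # (Assumption: any other value is treated like the default.)
    return [v for v in versions if v.different]


def filter_versions_new(versions: List[Version]) -> List[Version]:
    """Hypothetical model of the proposed behavior: no implicit filtering."""
    return list(versions)


versions = [Version("a", True), Version("b", False), Version("c", True)]
default_result = [v.uuid for v in filter_versions_old(versions)]
unfiltered_result = [v.uuid for v in filter_versions_old(versions, "false")]
proposed_result = [v.uuid for v in filter_versions_new(versions)]
```

In this model a tool that forgets the parameter silently loses version "b", which is the kind of surprise described above.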

We should:

  1. Just return all versions if ?different is not set in a request.

  2. Consider removing this field from the database entirely. It imposes a lot of extra work on imports that come out of order, which is extremely common now! We originally thought this would be rare and unusual.

    The theory behind this field was making it easy to know when there was a new “version,” or rather, to know when the page changed. This field doesn’t actually do that very well:

    1. Knowing how closely the timestamp of the “different” version matches the actual time of a change requires knowing when the previous version was captured. But the previous version may not have been different from its own preceding version, so it could be hidden in a normal query. If we really want a shortcut for this, this field should probably be the time of the preceding version instead of a boolean.
    2. In practice, a lot of pages will have some unique session ID or other random or frequently changing value in the HTML that has no impact on the page content, but causes the page to be considered “different” on every check (since “different” is based on the hash of the response body). That makes this field largely pointless on a lot of pages. Mostly we just see it working “correctly” for PDF files.
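The second point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash algorithm (SHA-256 here), the rule that the first capture counts as different, and the `mark_different` helper are all hypothetical and not taken from the actual codebase.

```python
import hashlib
from typing import List


def body_hash(body: bytes) -> str:
    # Assumption: SHA-256; the issue only says "the hash of the response body".
    return hashlib.sha256(body).hexdigest()


def mark_different(bodies: List[bytes]) -> List[bool]:
    """Return one `different` flag per capture, in capture order."""
    flags = []
    previous = None
    for body in bodies:
        current = body_hash(body)
        # Assumption: the first capture of a page counts as different.
        flags.append(previous is None or current != previous)
        previous = current
    return flags


# A PDF that only changes once: the flag tracks the real change.
pdf_flags = mark_different([b"%PDF report v1", b"%PDF report v1", b"%PDF report v2"])

# An HTML page whose visible content never changes, but which embeds a
# new session ID in every response: every capture looks "different".
html_flags = mark_different([
    b"<p>Hello</p><!-- session=1 -->",
    b"<p>Hello</p><!-- session=2 -->",
    b"<p>Hello</p><!-- session=3 -->",
])
```

Because each flag depends on the capture before it, importing a capture out of order changes `previous` for the capture that follows, so that capture's flag would need to be recomputed as well.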

(2) is more speculative, and should probably be in a separate PR and release from (1), which is more straightforward.

Activity

moved this from Inbox to Prioritized in Web Monitoring on Feb 28, 2025

Mr0grog commented on Apr 2, 2025


Did a quick search and it looks like almost all our projects that interface with the API always set different=false. The exception is versionista-scraper, which is not really actively maintained, and hails from a time when we hadn’t yet realized how poorly this works.

Seems like good evidence this behavior in the API should be turned off.

added a commit that references this issue on Apr 4, 2025

Mr0grog commented on Apr 7, 2025


I more-or-less took care of step 1 in #1232. However, I did retain usage of the different field (not the query param — those are all gone) in one place: when getting /pages?capture_time=<from_time>...<to_time>, it returns pages that have versions with different=true that were captured in that time range. This was the default before (you could disable with ?different=false) and is now the only way it works.

For now, I think that’s reasonable — most active pages will have captures at the same times, and those captures should be regular and frequent, so without the different field being involved, the capture_time query on pages doesn’t seem very meaningful. Without different, the capture_time query would mostly just be a proxy for active.
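To make the reasoning above concrete, here is a small Python sketch of the `capture_time` query on pages. It is a simplified model, not the real query: the `Version` class, the `pages_captured_between` helper, the `require_different` switch, and the inclusive time bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Version:
    page_id: str
    capture_time: datetime
    different: bool


def pages_captured_between(versions: List[Version], from_time: datetime,
                           to_time: datetime,
                           require_different: bool = True) -> List[str]:
    """Return IDs of pages with a matching version in the time range."""
    page_ids = set()
    for version in versions:
        # Assumption: both ends of the range are inclusive.
        if not (from_time <= version.capture_time <= to_time):
            continue
        if require_different and not version.different:
            continue
        page_ids.add(version.page_id)
    return sorted(page_ids)


# Two active pages captured on the same schedule; only page-a changed.
versions = [
    Version("page-a", datetime(2025, 4, 1), True),
    Version("page-a", datetime(2025, 4, 2), True),
    Version("page-b", datetime(2025, 4, 1), False),
    Version("page-b", datetime(2025, 4, 2), False),
]
start, end = datetime(2025, 4, 1), datetime(2025, 4, 3)

changed_pages = pages_captured_between(versions, start, end)
all_captured = pages_captured_between(versions, start, end, require_different=False)
```

With the `different` check, only the page that changed is returned. Without it, every regularly captured page matches, so the result is close to a list of active pages.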

So we can’t really get rid of the column completely (i.e. we can’t do step 2) based on the way things work right now. Maybe that will change in the future, but for now I think this task is as complete as it will be, so I’m closing this as complete.

moved this from Prioritized to Done in Web Monitoring on Apr 7, 2025
          Remove default `different` check from API · Issue #1208 · edgi-govdata-archiving/web-monitoring-db