Description
When looking at versions, we only return records where the `different` column is `true` by default. Setting `?different=false` in the API disables this check; it does not filter for versions where `different` is `false`. This is confusing, but more importantly, it’s based on a bad assumption we made early on: that most of the time, you’d only want a list of versions that were really “versions,” where something had changed. We realized really quickly this was wrong, and almost all our tools and scripts have to go out of their way to set `different=false`. When writing new tools, this often catches people (including me, who created this situation!) unaware and causes problems when they forget to set it to `false`. Fixing this is long overdue.
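To make the confusing part concrete, here is a minimal Python sketch of the parameter semantics described above. The `Version` class and `filter_versions` helper are hypothetical stand-ins for illustration, not the actual API code.

```python
from dataclasses import dataclass


@dataclass
class Version:
    uuid: str
    different: bool  # True when this capture differs from the preceding one


def filter_versions(versions, params):
    """Mimic the behavior described above (hypothetical helper, not the real code)."""
    if params.get("different") == "false":
        # The check is switched off entirely: every version comes back,
        # not just the ones where `different` is false.
        return list(versions)
    # Default: only versions flagged as different are returned.
    return [v for v in versions if v.different]


versions = [Version("a", True), Version("b", False), Version("c", True)]
default_result = [v.uuid for v in filter_versions(versions, {})]
disabled_result = [v.uuid for v in filter_versions(versions, {"different": "false"})]
```

Here `default_result` silently omits version `b`, while `disabled_result` contains all three versions rather than only `b`, which is the opposite of what a literal reading of `different=false` suggests.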
We should:
1. Just return all versions if `?different` is not set in a request.
2. Consider removing this field from the database entirely. It imposes a lot of extra work on imports that come out of order, which is extremely common now! We originally thought this would be rare and unusual.
The theory behind this field was making it easy to know when there was a new “version,” or rather, to know when the page changed. This field doesn’t actually do that very well:
- Knowing how closely the timestamp of the “different” version matches the actual time of a change requires knowing when the previous version was captured. But the previous version may not have been different from its own preceding version, so it could be hidden in a normal query. If we really want a shortcut for this, this field should probably be the time of the preceding version instead of a boolean.
- In practice, a lot of pages have some unique session ID or other random or frequently changing value in the HTML that has no impact on the page content, but causes the page to be considered “different” on every check (since “different” is based on the hash of the response body). That makes this field largely pointless on a lot of pages. Mostly we just see it working “correctly” for PDF files.
(2) is more speculative, and should probably be in a separate PR and release from (1), which is more straightforward.
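The hashing problem described above can be sketched roughly as follows. This is only an illustration: the use of SHA-256, the `mark_different` helper, the sample bodies, and the choice to treat the first capture as different are all assumptions, not details taken from the real import code.

```python
import hashlib


def body_hash(body: bytes) -> str:
    # Assumption: SHA-256 of the raw response body.
    return hashlib.sha256(body).hexdigest()


def mark_different(bodies):
    """Return one flag per capture: True if its hash differs from the preceding capture."""
    flags = []
    previous = None
    for body in bodies:
        current = body_hash(body)
        # Assumption: the first capture has nothing to compare to, so it counts as different.
        flags.append(previous is None or current != previous)
        previous = current
    return flags


# Same visible content, but a made-up session token changes on every capture.
captures = [
    b"<html><!-- session=111 --><p>Hello</p></html>",
    b"<html><!-- session=222 --><p>Hello</p></html>",
    b"<html><!-- session=333 --><p>Hello</p></html>",
]
noisy_flags = mark_different(captures)

# A byte-stable document (for example a PDF) is only flagged when it really changes.
static_flags = mark_different([b"%PDF-1", b"%PDF-1", b"%PDF-2"])
```

Every capture in `noisy_flags` is flagged even though the visible content never changed, whereas `static_flags` only flags the first capture and the one whose bytes actually changed.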
Activity
Mr0grog commented on Apr 2, 2025
Did a quick search and it looks like almost all our projects that interface with the API always set `different=false`. The exception is `versionista-scraper`, which is not really actively maintained, and hails from a time when we hadn’t yet realized how poorly this works.
Seems like good evidence this behavior in the API should be turned off.
Drop support for `?different` query param (#1232)
Mr0grog commented on Apr 7, 2025
I more-or-less took care of step 1 in #1232. However, I did retain usage of the `different` field (not the query param; those are all gone) in one place: when getting `/pages?capture_time=<from_time>...<to_time>`, it returns pages that have versions with `different=true` that were captured in that time range. This was the default before (you could disable it with `?different=false`) and is now the only way it works.
For now, I think that’s reasonable: most active pages will have captures at the same times, and those captures should be regular and frequent, so without the `different` field being involved, the `capture_time` query on pages doesn’t seem very meaningful. Without `different`, the `capture_time` query would mostly just be a proxy for `active`.
So we can’t really get rid of the column completely (i.e. we can’t do step 2) based on the way things work right now. Maybe that will change in the future, but for now I think this task is as complete as it will be, so I’m closing this as complete.
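As a rough illustration of the remaining use of the `different` field, the `capture_time` filter on pages could be modeled like this in Python. The `Version` class, the `pages_captured_between` helper, the sample data, and the inclusive range bounds are all assumptions made for the sketch, not the actual query.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Version:
    page_id: str
    capture_time: datetime
    different: bool


def pages_captured_between(versions, from_time, to_time):
    """Return IDs of pages with at least one `different` version captured in the range."""
    return sorted({
        v.page_id
        for v in versions
        # Assumption: both ends of the range are inclusive.
        if v.different and from_time <= v.capture_time <= to_time
    })


versions = [
    Version("p1", datetime(2025, 4, 1), True),
    Version("p2", datetime(2025, 4, 1), False),  # captured in range, but unchanged
    Version("p2", datetime(2025, 3, 1), True),   # changed, but outside the range
]
in_range = pages_captured_between(versions, datetime(2025, 3, 15), datetime(2025, 4, 15))
```

Only `p1` is returned here. If the `different` condition were dropped, `p2` would match as well simply because it was captured in the range, which is why the query would then behave mostly like a proxy for `active`.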