Content-hash-based change tracking for data imports#3199
jonathangreen wants to merge 11 commits into `main`.
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #3199   +/-   ##
=======================================
  Coverage   93.31%   93.32%
=======================================
  Files         502      502
  Lines       46178    46227    +49
  Branches     6315     6319     +4
=======================================
+ Hits        43093    43143    +50
+ Misses       2001     2000     -1
  Partials     1084     1084
```
Force-pushed: f58ea44 → 8135dd0
Force-pushed: 8135dd0 → f5a7e26
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

Problems fixed:
- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

Changes by file:

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; the inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache the `calculate_hash()` result on the instance via `PrivateAttr` and invalidate it on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove the redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles the same conditions.
- Fix a misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
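The branch structure this commit fixes can be sketched with stand-in types. Everything here is illustrative (the real task signatures and queueing machinery are not shown); the point is the `needs_apply` guard on the `elif` branch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeCirculation:
    """Stand-in for CirculationData; `changed` fakes the hash comparison."""
    changed: bool

    def needs_apply(self) -> bool:
        return self.changed


def import_one(
    bib_changed: bool,
    circulation: FakeCirculation,
    apply_bibliographic: Callable[[], None],
    apply_circulation: Callable[[], None],
) -> str:
    if bib_changed:
        # Bibliographic apply re-applies circulation downstream.
        apply_bibliographic()
        return "bibliographic"
    elif circulation.needs_apply():
        # Previously this branch queued apply_circulation unconditionally
        # (or, worse, apply_circulation was never passed in at all); the
        # guard avoids redundant tasks for unchanged circulation data.
        apply_circulation()
        return "circulation"
    return "skipped"


print(import_one(False, FakeCirculation(changed=False), lambda: None, lambda: None))
# → skipped
```

With the guard in place, an unchanged re-import falls through to "skipped" instead of queueing a no-op circulation task.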
Force-pushed: f5a7e26 → d17bf96
Claude finished @dbernstein's task in 12m 36s — View job

PR Review: Content-hash-based change tracking for data imports

The overall design is solid. The two-stage guard (timestamp short-circuit → hash comparison) is the right primitive, backed by the canonicalization in `util/json.py`. A few cross-cutting observations:
- Timestamp/hash coupling after force-apply (see the inline comments below)
- Migration / bootstrap: the migration comment is thorough and accurate — all existing rows start with a null hash
- ODL handling: pools with licenses are always reprocessed, since expiry is time-dependent

No bugs found. The four inline comments are all minor.
```python
# The data we have is strictly older than what is stored, no update needed.
return False

return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.
```python
    as_of=self.as_of_timestamp,
)

pool.updated_at = self.as_of_timestamp
```
Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.
```python
# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
```
Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but it means [1, 1.0] and [1.0, 1] canonicalise to the same sorted output [1, 1.0] (since 1 == 1.0 in Python, the sort is stable and preserves input order). If that distinction ever matters, a separate precedence for float would be needed.
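The stable-sort behavior in this nit is easy to demonstrate directly. The `key` function below stands in for a sort key that gives int and float the same precedence:

```python
def key(v):
    # Both int and float map to precedence 1, as in the nit above.
    return (1, v)


print(sorted([1, 1.0], key=key))  # → [1, 1.0] (int first: stable sort keeps input order)
print(sorted([1.0, 1], key=key))  # → [1.0, 1] (float first, for the same reason)
print(sorted([1, 1.0], key=key) == sorted([1.0, 1], key=key))  # → True, since 1 == 1.0
```

The two outputs differ in repr but compare equal, so the canonical hash is identical either way; that is why the shared precedence is harmless here.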
```python
if edition.updated_at is None or edition.updated_at < updated_at:
    edition.updated_at = updated_at

edition.updated_at_data_hash = self.calculate_hash()
```
Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.
Description
Replaces the timestamp-only change-detection logic in the data import pipeline with a content-hash-based system. Previously, `BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems:
- A data source could re-publish identical content with a newer timestamp, triggering redundant database writes and work creation.
- Changed content published with the same timestamp could be skipped entirely.
This branch reverts the original timestamp-throttling PR (Fix BibliographicData.has_changed to throttle updates when data sourc… #3198) and replaces it
with a proper content-hash approach. A SHA-256 hash of the canonical, serialized
form of the incoming data is stored on the database record after each import.
Subsequent imports compare both the timestamp and the hash before deciding whether
to apply an update.
Key changes:
- `json_hash()` / `json_canonical()` utilities (`util/json.py`) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- ODL-style pools are re-applied even when the hash matches, because license availability can change as licenses expire independently of feed content.
- Migration `f98e4049c87d` adds all four new columns.

Motivation and Context
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.
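As a stdlib-only sketch of that primitive (this is not the project's `util/json.py`, which additionally handles mixed-type list ordering and float precision):

```python
import hashlib
import json


def json_hash_sketch(value) -> str:
    """Order-independent SHA-256 fingerprint of a JSON-serializable value.

    sort_keys makes dict key order irrelevant; compact separators keep the
    canonical form whitespace-independent.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not affect the hash; a value change does.
assert json_hash_sketch({"a": 1, "b": 2}) == json_hash_sketch({"b": 2, "a": 1})
assert json_hash_sketch({"a": 1}) != json_hash_sketch({"a": 2})
```

Storing this digest next to the row and comparing it on the next import is the whole mechanism: identical content hashes identically regardless of when (or in what key order) the source re-publishes it.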
How Has This Been Tested?
- Unit tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, and the hash-match skip.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- Existing suites pass with the updated field names (`updated_at` in place of `data_source_last_updated`). Full run via `tox -e py312-docker -- --no-cov`.

Checklist