Content-hash-based change tracking for data imports#3199
jonathangreen wants to merge 11 commits into `main`.
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #3199   +/-   ##
=======================================
  Coverage   93.31%   93.32%
=======================================
  Files         502      502
  Lines       46178    46227    +49
  Branches     6315     6319     +4
=======================================
+ Hits        43093    43143    +50
+ Misses       2001     2000     -1
  Partials     1084     1084
```
Force-pushed: f58ea44 → 8135dd0
Force-pushed: 8135dd0 → f5a7e26
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

Problems fixed:
- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

Changes by file:

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; the inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache the `calculate_hash()` result on the instance via `PrivateAttr` and invalidate it on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove the redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles the same conditions.
- Fix a misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
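The branch structure this commit fixes can be sketched with stand-in types. Everything here is illustrative (the real task signatures and queueing machinery are not shown); the point is the `needs_apply` guard on the `elif` branch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeCirculation:
    """Stand-in for CirculationData; `changed` fakes the hash comparison."""
    changed: bool

    def needs_apply(self) -> bool:
        return self.changed


def import_one(
    bib_changed: bool,
    circulation: FakeCirculation,
    apply_bibliographic: Callable[[], None],
    apply_circulation: Callable[[], None],
) -> str:
    if bib_changed:
        # Bibliographic apply re-applies circulation downstream.
        apply_bibliographic()
        return "bibliographic"
    elif circulation.needs_apply():
        # Previously this branch queued apply_circulation unconditionally
        # (or, worse, apply_circulation was never passed in at all); the
        # guard avoids redundant tasks for unchanged circulation data.
        apply_circulation()
        return "circulation"
    return "skipped"


print(import_one(False, FakeCirculation(changed=False), lambda: None, lambda: None))
# → skipped
```

With the guard in place, an unchanged re-import falls through to "skipped" instead of queueing a no-op circulation task.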
Force-pushed: f5a7e26 → d17bf96
Claude finished @dbernstein's task in 12m 36s — View job

PR Review: Content-hash-based change tracking for data imports

The overall design is solid. The two-stage guard (timestamp short-circuit → hash comparison) is the right primitive, backed by the canonicalization in `util/json.py`. A few cross-cutting observations:
- Timestamp/hash coupling after force-apply (see the inline comments below)
- Migration / bootstrap: the migration comment is thorough and accurate — all existing rows start with a null hash
- ODL handling: pools with licenses are always reprocessed, since expiry is time-dependent

No bugs found. The four inline comments are all minor.
```python
# The data we have is strictly older than what is stored, no update needed.
return False

return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.
```python
    as_of=self.as_of_timestamp,
)

pool.updated_at = self.as_of_timestamp
```
Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.
```python
# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
```
Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but it means [1, 1.0] and [1.0, 1] canonicalise to the same sorted output [1, 1.0] (since 1 == 1.0 in Python, the sort is stable and preserves input order). If that distinction ever matters, a separate precedence for float would be needed.
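The stable-sort behavior in this nit is easy to demonstrate directly. The `key` function below stands in for a sort key that gives int and float the same precedence:

```python
def key(v):
    # Both int and float map to precedence 1, as in the nit above.
    return (1, v)


print(sorted([1, 1.0], key=key))  # → [1, 1.0] (int first: stable sort keeps input order)
print(sorted([1.0, 1], key=key))  # → [1.0, 1] (float first, for the same reason)
print(sorted([1, 1.0], key=key) == sorted([1.0, 1], key=key))  # → True, since 1 == 1.0
```

The two outputs differ in repr but compare equal, so the canonical hash is identical either way; that is why the shared precedence is harmless here.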
```python
if edition.updated_at is None or edition.updated_at < updated_at:
    edition.updated_at = updated_at

edition.updated_at_data_hash = self.calculate_hash()
```
Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.
Description
Replaces the timestamp-only change-detection logic in the data import pipeline with a content-hash-based system. Previously, `BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems:
- A data source could re-publish identical content with a newer timestamp, triggering redundant database writes and work creation.
- Changed content published with the same timestamp could be skipped entirely.
This branch reverts the original timestamp-throttling PR (Fix BibliographicData.has_changed to throttle updates when data sourc… #3198) and replaces it
with a proper content-hash approach. A SHA-256 hash of the canonical, serialized
form of the incoming data is stored on the database record after each import.
Subsequent imports compare both the timestamp and the hash before deciding whether
to apply an update.
Key changes:
- `json_hash()` / `json_canonical()` utilities (`util/json.py`) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- ODL-style pools are re-applied even when the hash matches, because license availability can change as licenses expire independently of feed content.
- Migration `f98e4049c87d` adds all four new columns.

Motivation and Context
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.
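As a stdlib-only sketch of that primitive (this is not the project's `util/json.py`, which additionally handles mixed-type list ordering and float precision):

```python
import hashlib
import json


def json_hash_sketch(value) -> str:
    """Order-independent SHA-256 fingerprint of a JSON-serializable value.

    sort_keys makes dict key order irrelevant; compact separators keep the
    canonical form whitespace-independent.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not affect the hash; a value change does.
assert json_hash_sketch({"a": 1, "b": 2}) == json_hash_sketch({"b": 2, "a": 1})
assert json_hash_sketch({"a": 1}) != json_hash_sketch({"a": 2})
```

Storing this digest next to the row and comparing it on the next import is the whole mechanism: identical content hashes identically regardless of when (or in what key order) the source re-publishes it.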
How Has This Been Tested?
- Unit tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, and the hash-match skip.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- Existing suites pass with the updated field names (`updated_at` in place of `data_source_last_updated`). Full run via `tox -e py312-docker -- --no-cov`.

Checklist