
Content-hash-based change tracking for data imports#3199

Draft
jonathangreen wants to merge 11 commits into main from feature/change-tracking

Conversation


@jonathangreen jonathangreen commented Apr 2, 2026

Description

Replaces the timestamp-only change-detection logic in the data import pipeline
with a content-hash-based system. Previously, BibliographicData and
CirculationData used only the data source's "last updated" timestamp to decide
whether to re-apply incoming data to an Edition or LicensePool. This caused
two problems:

  • Re-publishing the same content with a newer timestamp triggered unnecessary
    writes and work creation.
  • A real change arriving with the same (or missing) timestamp was silently skipped.

This branch reverts the original timestamp-throttling PR (Fix
BibliographicData.has_changed to throttle updates when data sourc… #3198) and
replaces it with a proper content-hash approach. A SHA-256 hash of the
canonical, serialized form of the incoming data is stored on the database
record after each import. Subsequent imports compare both the timestamp and
the hash before deciding whether to apply an update.

Key changes:
  • New json_hash() / json_canonical() utilities (util/json.py) produce a
    stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
  • BaseMutableData gains updated_at, created_at, as_of_timestamp,
    calculate_hash(), and should_apply_to(). The should_apply_to() method
    is now the single decision point for both bibliographic and circulation data.
  • BibliographicData.has_changed() and CirculationData.has_changed() are
    removed and replaced by the shared should_apply_to() logic.
  • Edition and LicensePool each gain an updated_at_data_hash column.
    LicensePool also gains created_at and updated_at columns to track when
    its CirculationData was first and most recently imported.
  • Individual-license pools (e.g. ODL) always re-apply availability even when the
    hash matches, because license availability can change as licenses expire
    independently of feed content.
  • Database migration f98e4049c87d adds all four new columns.

Motivation and Context

The original has_changed() implementation only compared timestamps, which is
insufficient: a data source can re-publish identical content with a newer timestamp,
or publish changed content with the same timestamp. Content hashing is the correct
primitive for detecting genuine data changes and avoiding redundant imports.
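
A simplified sketch of the hashing primitive (not the project's util/json.py, which additionally sorts list items and normalizes float precision):

```python
import hashlib
import json


def json_canonical(value) -> str:
    # Sorted keys and fixed separators make the serialization stable
    # regardless of dict insertion order.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def json_hash(value) -> str:
    # SHA-256 over the canonical form: equal structures hash equally,
    # so a re-published identical payload produces the same fingerprint.
    return hashlib.sha256(json_canonical(value).encode("utf-8")).hexdigest()
```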

How Has This Been Tested?

  • Updated unit tests for BibliographicData and CirculationData cover the
    new should_apply_to() logic, including the null-hash bootstrap case, the
    timestamp-is-older short-circuit, and the hash-match skip.
  • New unit tests for json_canonical() and json_hash() verify ordering
    stability across dict keys, list items, and float precision.
  • All existing integration tests for Boundless, OPDS, ODL, and Overdrive importers
    pass with the updated field names (updated_at in place of
    data_source_last_updated).
  • Full test suite run via tox -e py312-docker -- --no-cov.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen jonathangreen added the feature New feature label Apr 2, 2026
codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.32%. Comparing base (45176e0) to head (d17bf96).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3199   +/-   ##
=======================================
  Coverage   93.31%   93.32%           
=======================================
  Files         502      502           
  Lines       46178    46227   +49     
  Branches     6315     6319    +4     
=======================================
+ Hits        43093    43143   +50     
+ Misses       2001     2000    -1     
  Partials     1084     1084           

@dbernstein dbernstein changed the title from "WIP: Content-hash-based change tracking for data imports" to "Content-hash-based change tracking for data imports" on Apr 6, 2026
@dbernstein dbernstein force-pushed the feature/change-tracking branch 6 times, most recently from f58ea44 to 8135dd0 on April 13, 2026 16:50
@dbernstein dbernstein force-pushed the feature/change-tracking branch from 8135dd0 to f5a7e26 on April 15, 2026 00:44
dbernstein and others added 11 commits April 17, 2026 11:13
Fixes all broken tests, mypy errors, and incomplete source changes from
the initial WIP commit (bde0829).

This commit contains all Claude-authored work.

- LicensePool model was missing `updated_at` and `created_at` columns
  referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py,
  and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls
  in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py,
  two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when
  hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because
  license expiry is time-dependent, not detectable by content hash
src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool
src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps
  still trigger a hash check rather than an unconditional skip
src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode
  binary bytes in the `bytes | str | None` union field
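  The serializer's behavior can be sketched as a plain function (in the real
  code this logic is attached to LinkData.content via Pydantic's
  @field_serializer; the function name here is illustrative):

  ```python
  import base64


  def serialize_content(value: bytes | str | None) -> str | None:
      # Raw bytes (e.g. embedded cover images) are base64-encoded so JSON
      # serialization, and therefore hashing, never attempts a UTF-8 decode.
      if isinstance(value, bytes):
          return base64.b64encode(value).decode("ascii")
      return value
  ```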
src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() /
  apply_edition_only(); `_update_edition_timestamp` now also stores
  `updated_at_data_hash` on the edition
src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None`
  (ODL-style pools) so time-expired licenses are always reprocessed;
  inner availability block gets the same treatment
src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters
src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`
src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`
alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools,
  `created_at` / `updated_at` on licensepools
tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite
  test_apply_no_changes_needed for hash-based semantics; rename
  test_data_source_last_updated_updates_timestamp
tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None
tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked
  to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash`
  so that identical content with different timestamps does not trigger
  spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing
  multiple `None` values (`None < None` raises TypeError in Python).
  Use a stable sentinel `""` as the second element of the sort key instead.
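  The fix can be sketched as follows. The precedence values are illustrative
  (the real table is _CANONICALIZE_TYPE_ORDER and covers more types):

  ```python
  # Illustrative precedence table; None gets its own slot, and bool must
  # precede int because bool is a subclass of int in Python.
  _TYPE_ORDER: dict[type, int] = {type(None): -1, bool: 0, int: 1, float: 1, str: 2}


  def canonical_sort_key(value):
      if value is None:
          # None supports no ordering comparisons (None < None raises
          # TypeError), so use a stable sentinel "" as the tiebreaker.
          return (_TYPE_ORDER[type(None)], "")
      return (_TYPE_ORDER[type(value)], value)
  ```

  With the sentinel in place, sorting a sequence containing several None
  values compares "" against "" instead of None against None.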
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid
  rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and
  invalidate on field mutation, avoiding a redundant SHA-256 computation
  per `apply()` cycle.
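  The caching pattern, sketched here without Pydantic (the real code stores
  the cache in a PrivateAttr and hooks the model's __setattr__):

  ```python
  import hashlib
  import json


  class HashCachingSketch:
      """Cache calculate_hash() and invalidate on any field mutation."""

      def __init__(self, **fields):
          self.__dict__["_cached_hash"] = None
          self.__dict__.update(fields)

      def __setattr__(self, name, value):
          # Any field mutation invalidates the cached fingerprint.
          self.__dict__[name] = value
          self.__dict__["_cached_hash"] = None

      def calculate_hash(self) -> str:
          # Recompute only when the cache has been invalidated.
          if self._cached_hash is None:
              fields = {k: v for k, v in self.__dict__.items() if not k.startswith("_")}
              canonical = json.dumps(fields, sort_keys=True)
              self.__dict__["_cached_hash"] = hashlib.sha256(
                  canonical.encode("utf-8")
              ).hexdigest()
          return self._cached_hash
  ```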
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`;
  the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and
  `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported
  type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy
  performance impact.
…ction

The `opds_import_task` was not passing `apply_circulation` to
`importer.import_feed`, making the fallback path for "bibliographic
unchanged, circulation changed" completely dead code. Pass
`apply.circulation_apply.delay` to restore that path.
Add a `needs_apply` guard to the `elif` branch in
`import_feed_from_response` so `apply_circulation` is only queued when
the circulation data has actually changed, preventing redundant tasks on
every re-import of unchanged content.
Fix `CirculationData.needs_apply` to always return `True` when
`self.licenses` is not None (ODL-style pools). License expiry is
time-dependent and cannot be detected by content hashing alone; this
mirrors the existing exception already present in the `apply()` early-
return guard.
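
The guard from the last commit above can be sketched like this (a hypothetical free-function form of CirculationData.needs_apply, with the hash comparison standing in for the full timestamp-and-hash check):

```python
def needs_apply(circulation_data, pool) -> bool:
    # ODL-style pools carry per-license state whose expiry is purely
    # time-dependent, so a matching content hash must not skip them.
    if circulation_data.licenses is not None:
        return True
    # Otherwise fall back to the content-fingerprint comparison.
    return circulation_data.calculate_hash() != pool.updated_at_data_hash
```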
@dbernstein dbernstein force-pushed the feature/change-tracking branch from f5a7e26 to d17bf96 on April 18, 2026 01:39

claude bot commented Apr 18, 2026

Claude finished @dbernstein's task in 12m 36s


PR Review: Content-hash-based change tracking for data imports

  • Gather context and understand the request
  • Read core implementation files
  • Read test files and migration
  • Post inline comments for bugs/issues
  • Write summary review

The overall design is solid. The two-stage guard (timestamp short-circuit → hash comparison) is the right primitive, the canonicalization in json_canonical handles all the tricky edge cases (heterogeneous type ordering, multiple None values, bool/int disambiguation), and the hash cache invalidation via __setattr__ is clean.

A few cross-cutting observations:

Timestamp/hash coupling after force-apply (even_if_not_apparently_updated=True): The Edition and LicensePool paths handle this differently — _update_edition_timestamp only advances updated_at while circulation.apply always overwrites it. As noted in the inline comments, this creates an edge case on the Edition path where a force-applied stale hash combined with a pinned-at-high-water updated_at can cause subsequent imports with intermediate timestamps to be rejected by the < check even though their hash would differ. In practice this is harmless (data sources advance timestamps monotonically), but a comment in _update_edition_timestamp documenting the intentional decoupling between updated_at and updated_at_data_hash would help future readers.

Migration / bootstrap: The migration comment is thorough and accurate — all existing rows start with updated_at_data_hash = NULL, which should_apply_to correctly treats as "never imported", so the first full import after deployment re-establishes all hashes as expected.

ODL handling: The self.licenses is not None bypass is correctly placed in both CirculationData.apply (inner guard) and needs_apply (outer guard at import time), so ODL pools are always re-applied without the hash check.

No bugs found. The four inline comments are all Minor or Nit level.

# The data we have is strictly older than what is stored, no update needed.
return False

return self.calculate_hash() != db_object.updated_at_data_hash

Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.

as_of=self.as_of_timestamp,
)

pool.updated_at = self.as_of_timestamp

Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.

# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
bool: 0,
int: 1,

Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but because 1 == 1.0 in Python and the sort is stable, [1, 1.0] and [1.0, 1] each preserve their input order and therefore canonicalise to different outputs despite containing equal values. If that distinction ever matters, a separate precedence for float would be needed.

if edition.updated_at is None or edition.updated_at < updated_at:
edition.updated_at = updated_at

edition.updated_at_data_hash = self.calculate_hash()

Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.
