incremental: use content hash to ignore spurious mtime bumps #593

Open
cod3warrior wants to merge 1 commit into safishamsi:v5 from cod3warrior:fix/content-hash-manifest

@cod3warrior

Problem

detect_incremental re-extracts files whose mtime was bumped by sync tools without any actual content change.

On vaults with active sync (Obsidian plugins, Nextcloud, daily git backup), 80%+ of files have their mtime touched by background processes without any content change. Previously these were all flagged as new and re-extracted via expensive semantic subagents.

Concrete measurement on a 1287-file Obsidian vault: all 1287 files were flagged as "new", but only ~201 had actually been edited. The other ~1086 re-extractions were unnecessary, i.e. 84% wasted tokens.

Fix

Manifest now stores {path: {"mtime": float, "hash": str}} instead of {path: float}.

Algorithm:

  • Fast path: mtime unchanged → unchanged (no hash computed, free)
  • Slow path: mtime bumped → compute md5, compare. Same hash → treat as unchanged. Different hash → re-extract.
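
The two paths above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `needs_reextract` and `md5_of` are hypothetical names, the manifest is assumed to be already loaded as a dict in the new format, and treating an untracked path as changed is an assumption.

```python
import hashlib
import os
from pathlib import Path


def md5_of(path: Path) -> str:
    # Whole-file read keeps this sketch short; the PR streams in 64KB chunks.
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_reextract(path: Path, manifest: dict) -> bool:
    """Return True if `path` should be re-extracted (hypothetical helper)."""
    entry = manifest.get(str(path))
    if entry is None:
        # Assumption: a path missing from the manifest counts as new.
        return True

    # Fast path: mtime matches the manifest, so no hash is computed.
    if os.stat(path).st_mtime == entry["mtime"]:
        return False

    # Slow path: mtime was bumped, so compare content hashes.
    # Same hash means a spurious bump; a different hash means a real edit.
    return md5_of(path) != entry["hash"]
```

Only files whose mtime moved ever reach the hashing step, so a vault with no sync activity pays nothing beyond the `stat` calls.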

Backwards compat: legacy manifests whose entries are bare float mtime values are recognized via an isinstance(entry, (int, float)) check and fall back to mtime-only behavior. The first save after upgrade rewrites the manifest in the new format, so the cost is paid once.
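
A minimal sketch of how both manifest shapes could be read through one code path. `split_entry` is a hypothetical name and the `None` return for a missing hash is an assumption. Only the `isinstance(entry, (int, float))` check comes from this PR.

```python
from typing import Optional, Tuple, Union


def split_entry(entry: Union[int, float, dict]) -> Tuple[float, Optional[str]]:
    """Return (mtime, hash) for a manifest entry; hash is None for legacy entries."""
    if isinstance(entry, (int, float)):
        # Legacy format {path: mtime}: no hash was ever stored.
        return float(entry), None
    # New format {path: {"mtime": float, "hash": str}}.
    return float(entry["mtime"]), entry["hash"]
```

A `None` hash tells the caller to compare mtimes only, which matches the fallback behavior described above.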

md5 is streamed in 64KB chunks. It is not used for security here: for change detection on local files we only need accidental collisions to be vanishingly unlikely, not resistance to deliberately crafted ones.
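
For reference, chunked hashing of this kind usually looks like the sketch below. `file_md5` and `CHUNK_SIZE` are illustrative names, not necessarily the ones used in the diff.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB, as described above


def file_md5(path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file without loading it into memory all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

On Python 3.9+, `hashlib.md5(usedforsecurity=False)` states the non-security intent explicitly and can help on builds that restrict md5.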

Test scenario

# Fresh corpus
graphify ./test_corpus

# Simulate sync touching mtime without changing content
find ./test_corpus -name "*.md" -exec touch {} +

# Before this PR: re-extracts everything
# After this PR: detect_incremental returns 0 new files
graphify ./test_corpus --update

Verified on

A 1773-file Obsidian vault (~3.8M words). Subsequent --update runs that previously re-extracted 1338 files now correctly identify only the ~50 actually edited.

Note for users with existing manifests

The first --update after upgrading to this version will compute hashes for all tracked files (a one-time cost, the same as a fresh run). Subsequent updates use the fast path.
