incremental: use content hash to ignore spurious mtime bumps#593
Open
cod3warrior wants to merge 1 commit intosafishamsi:v5from
Open
incremental: use content hash to ignore spurious mtime bumps#593cod3warrior wants to merge 1 commit intosafishamsi:v5from
cod3warrior wants to merge 1 commit intosafishamsi:v5from
Conversation
On vaults with active sync (Obsidian plugins, Nextcloud, daily git
backup), 80%+ of files have their mtime touched by background
processes without any content change. Previously detect_incremental
treated all of these as new and re-extracted them via expensive
semantic subagents. On a 1287-file vault this caused 84% wasted
tokens (1287 "new" but only ~201 actually edited).
Manifest now stores {path: {mtime, hash}} instead of {path: mtime}.
Algorithm:
- Fast path: mtime unchanged -> unchanged (no hash computed, free)
- Slow path: mtime bumped -> compute md5, compare. Same hash ->
treat as unchanged. Different hash -> re-extract.
Backwards compat: legacy manifests with float mtime values are
recognized via isinstance check and fall back to mtime-only behavior.
First save after upgrade rewrites the manifest in the new format,
so the cost is paid once.
md5 streamed in 64KB chunks. Not security-relevant - we only need
collision-resistance for change detection on local files.
Tested on a 1773-file Obsidian vault (~3.8M words). Subsequent
--update runs that previously re-extracted 1338 files now correctly
identify only the ~50 actually edited.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
detect_incrementalre-extracts files whose mtime was bumped by sync tools without any actual content change.On vaults with active sync (Obsidian plugins, Nextcloud, daily git backup), 80%+ of files have their mtime touched by background processes without any content change. Previously these were all flagged as new and re-extracted via expensive semantic subagents.
Concrete measurement on a 1287-file Obsidian vault: 1287 files flagged as "new" but only ~201 actually edited. 84% wasted tokens.
Fix
Manifest now stores
{path: {"mtime": float, "hash": str}}instead of{path: float}.Algorithm:
Backwards compat: legacy manifests with float mtime values are recognized via
isinstance(entry, (int, float))check and fall back to mtime-only behavior. First save after upgrade rewrites the manifest in the new format, so the cost is paid once.md5 is streamed in 64KB chunks. Not security-relevant — we only need collision-resistance for change detection on local files.
Test scenario
Verified on
A 1773-file Obsidian vault (~3.8M words). Subsequent
--updateruns that previously re-extracted 1338 files now correctly identify only the ~50 actually edited.Note for users with existing manifests
First
--updateafter upgrading to this version will compute hashes for all tracked files (one-time cost — same as a fresh run). Subsequent updates use the fast path.