feat: Faster scans with --assume-unchanged-sources#366
Open
Peeja wants to merge 9 commits intofeat/log-scan-progressfrom
Open
feat: Faster scans with --assume-unchanged-sources#366Peeja wants to merge 9 commits intofeat/log-scan-progressfrom
Peeja wants to merge 9 commits intofeat/log-scan-progressfrom
Conversation
This was referenced Feb 25, 2026
volmedo
requested changes
Feb 27, 2026
Comment on lines
+319
to
+324
| for _, path := range paths { | ||
| _, err := stmt.ExecContext(ctx, path, util.DbID(&sourceID), util.DbDID(&spaceDID)) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to delete FS entries for path %s: %w", path, err) | ||
| } | ||
| } |
Member
There was a problem hiding this comment.
I think this should happen in a transaction, right? If something goes wrong in the middle of the operation and some ancestor is left, paths below it won't be re-scanned because it will seem it is there with its children.
Peeja
added a commit
that referenced
this pull request
Mar 2, 2026
I'm tired of seeing log messages like `failed to add blob Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: failed to get reader for blob Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: iterating nodes in shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: failed to get sizes of blocks in shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: context canceled` and not knowing what actually canceled the context. Unfortunately, Go doesn't currently make it terribly easy to see that. This pattern should make it more obvious what's happening. 1. Rather than use `ctx.Err()` in our error messages, use `ctxutil.CausedError()`. It's similar to `context.Cause()`, except that simply returns the cause directly. This returns a wrapped error which shows the `ctx.Err()` and that it was *caused* by the `context.Cause()`. 2. Rather than use errors coming back from library functions directly in our wrapped errors, use `ctxutil.ErrorWithCause()`. This accounts for library functions which themselves return the `ctx.Err()` rather than noticing the `context.Cause()` (likely because they were written before `context.Cause()` existed). It checks to see if the returned error was the context's `ctx.Err()`, and if so, and if there's a `context.Cause()` available, wraps them like `ctxutil.CausedError()` for readability. This means touching *a lot* of call sites, but with a very clear pattern, and hopefully valuable results. #### PR Dependency Tree * **PR #339** 👈 * **PR #344** * **PR #358** * **PR #359** * **PR #360** * **PR #361** * **PR #362** * **PR #363** * **PR #364** * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
551ac0b to
893a493
Compare
This was referenced Mar 2, 2026
Merged
Peeja
added a commit
that referenced
this pull request
Mar 2, 2026
Adds an option to assume the filesystem hasn't changed when resuming, skipping the filesystem rescan. This makes it much faster to retry when you know the source hasn't changed since the last scan. #### PR Dependency Tree * **PR #344** 👈 * **PR #358** * **PR #359** * **PR #360** * **PR #361** * **PR #362** * **PR #363** * **PR #364** * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
893a493 to
5548cc2
Compare
Peeja
added a commit
that referenced
this pull request
Mar 4, 2026
We were indexing shards (sometimes) before they were closed, potentially when they were still empty, which meant an infinite number could fit in one index. Then the shards would be filled up and the index would cover too much. Now, shards don't get indexed until after they're closed and can no longer grow. #### PR Dependency Tree * **PR #359** 👈 * **PR #360** * **PR #361** * **PR #362** * **PR #363** * **PR #364** * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
The commP hasher runs goroutines which have to be cleaned up. In the happy paths, they're already cleaned up, either by `marshal()` calling `Reset()` or `finalize()` calling `Sum()`. But in error cases, we return early and never get to the cleanup. This ensures the hasher is always cleaned up.
Proof of life when scans run long (which they're prone to do).
When `--assume-unchanged-sources` is used, previously we'd look for an existing complete scan and use it if we found it, because we don't need to rescan if we assume the source hasn't changed. But if we *didn't* find a complete scan, we'd start one from scratch. With this change, we do our best to recover even a *partial* scan. As long as we're assuming the source hasn't changed, it's safe to skip any path we know we've already scanned on a prior run. This means even the filesystem scan can be incremental and restartable--as long as the user takes responsibility for knowing that the source data hasn't changed.
When rescanning with `--assume-unchanged-sources`, if we come across an empty directory, it could be a real empty directory, *or* it could be a directory we never had a chance to add children to. Assume it's the latter and rescan it. Worst case, we do a cheap read of an empty directory, but we definitely don't miss the files. Of course, if files were *added*, we could have >0 directory children and still have files missing, but here we've been told to assume the source hasn't changed.
5548cc2 to
b0541f1
Compare
Peeja
added a commit
that referenced
this pull request
Mar 9, 2026
Especially for indexes with many shards, this is dramatically faster. We were finding the shards in the index, and then for each shard, looping over the nodes/slices. Now we find all of the nodes in a single query. It doesn't even matter what order they come in in, as long as we see them all. #### PR Dependency Tree * **PR #360** 👈 * **PR #361** * **PR #362** * **PR #363** * **PR #364** * **PR #365** * **PR #366** * **PR #375** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
Peeja
added a commit
that referenced
this pull request
Mar 9, 2026
These were never filled in, meaning shards from before the column was created had an incorrect `slice_count` of 0. #### PR Dependency Tree * **PR #361** 👈 * **PR #362** * **PR #363** * **PR #364** * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
Peeja
added a commit
that referenced
this pull request
Mar 9, 2026
The commP hasher runs goroutines which have to be cleaned up. In the happy paths, they're already cleaned up, either by `marshal()` calling `Reset()` or `finalize()` calling `Sum()`. But in error cases, we return early and never get to the cleanup. This ensures the hasher is always cleaned up. #### PR Dependency Tree * **PR #363** 👈 * **PR #364** * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
Peeja
added a commit
that referenced
this pull request
Mar 9, 2026
Useful to see in logging. #### PR Dependency Tree * **PR #364** 👈 * **PR #365** * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
b0541f1 to
23648e5
Compare
Peeja
added a commit
that referenced
this pull request
Mar 9, 2026
Proof of life when scans run long (which they're prone to do). #### PR Dependency Tree * **PR #365** 👈 * **PR #366** This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When
--assume-unchanged-sourcesis used, previously we'd look for an existing complete scan and use it if we found it, because we don't need to rescan if we assume the source hasn't changed. But if we didn't find a complete scan, we'd start one from scratch.With this change, we do our best to recover even a partial scan. As long as we're assuming the source hasn't changed, it's safe to skip any path we know we've already scanned on a prior run. This means even the filesystem scan can be incremental and restartable--as long as the user takes responsibility for knowing that the source data hasn't changed.
PR Dependency Tree
slice_countonshards#361This tree was auto-generated by Charcoal