Skip to content

feat: Faster scans with --assume-unchanged-sources#366

Open
Peeja wants to merge 9 commits intofeat/log-scan-progressfrom
feat/faster-scans
Open

feat: Faster scans with --assume-unchanged-sources#366
Peeja wants to merge 9 commits intofeat/log-scan-progressfrom
feat/faster-scans

Conversation

@Peeja
Copy link
Member

@Peeja Peeja commented Feb 25, 2026

When --assume-unchanged-sources is used, previously we'd look for an existing complete scan and use it if we found it, because we don't need to rescan if we assume the source hasn't changed. But if we didn't find a complete scan, we'd start one from scratch.

With this change, we do our best to recover even a partial scan. As long as we're assuming the source hasn't changed, it's safe to skip any path we know we've already scanned on a prior run. This means even the filesystem scan can be incremental and restartable--as long as the user takes responsibility for knowing that the source data hasn't changed.

PR Dependency Tree

This tree was auto-generated by Charcoal

Comment on lines +319 to +324
for _, path := range paths {
_, err := stmt.ExecContext(ctx, path, util.DbID(&sourceID), util.DbDID(&spaceDID))
if err != nil {
return fmt.Errorf("failed to delete FS entries for path %s: %w", path, err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should happen in a transaction, right? If something goes wrong in the middle of the operation and some ancestor is left, paths below it won't be re-scanned because it will seem it is there with its children.

Peeja added a commit that referenced this pull request Mar 2, 2026
I'm tired of seeing log messages like `failed to add blob
Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: failed to get reader for
blob Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: iterating nodes in
shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: failed to get sizes of
blocks in shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: context canceled`
and not knowing what actually canceled the context. Unfortunately, Go
doesn't currently make it terribly easy to see that. This pattern should
make it more obvious what's happening.

1. Rather than use `ctx.Err()` in our error messages, use
`ctxutil.CausedError()`. It's similar to `context.Cause()`, except that
simply returns the cause directly. This returns a wrapped error which
shows the `ctx.Err()` and that it was *caused* by the `context.Cause()`.

2. Rather than use errors coming back from library functions directly in
our wrapped errors, use `ctxutil.ErrorWithCause()`. This accounts for
library functions which themselves return the `ctx.Err()` rather than
noticing the `context.Cause()` (likely because they were written before
`context.Cause()` existed). It checks to see if the returned error was
the context's `ctx.Err()`, and if so, and if there's a `context.Cause()`
available, wraps them like `ctxutil.CausedError()` for readability.

This means touching *a lot* of call sites, but with a very clear
pattern, and hopefully valuable results.








#### PR Dependency Tree


* **PR #339** 👈
  * **PR #344**
    * **PR #358**
      * **PR #359**
        * **PR #360**
          * **PR #361**
            * **PR #362**
              * **PR #363**
                * **PR #364**
                  * **PR #365**
                    * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@Peeja Peeja force-pushed the feat/log-scan-progress branch from 551ac0b to 893a493 Compare March 2, 2026 23:24
@Peeja Peeja force-pushed the feat/faster-scans branch from b3aaa6b to 366bd05 Compare March 2, 2026 23:24
Peeja added a commit that referenced this pull request Mar 2, 2026
Adds an option to assume the filesystem hasn't changed when resuming,
skipping the filesystem rescan. This makes it much faster to retry when
you know the source hasn't changed since the last scan.








#### PR Dependency Tree


* **PR #344** 👈
  * **PR #358**
    * **PR #359**
      * **PR #360**
        * **PR #361**
          * **PR #362**
            * **PR #363**
              * **PR #364**
                * **PR #365**
                  * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
@Peeja Peeja force-pushed the feat/log-scan-progress branch from 893a493 to 5548cc2 Compare March 2, 2026 23:43
@Peeja Peeja force-pushed the feat/faster-scans branch from 366bd05 to b5fd4f9 Compare March 2, 2026 23:43
Peeja added a commit that referenced this pull request Mar 4, 2026
We were indexing shards (sometimes) before they were closed, potentially
when they were still empty, which meant an infinite number could fit in
one index. Then the shards would be filled up and the index would cover
too much.

Now, shards don't get indexed until after they're closed and can no
longer grow.









#### PR Dependency Tree


* **PR #359** 👈
  * **PR #360**
    * **PR #361**
      * **PR #362**
        * **PR #363**
          * **PR #364**
            * **PR #365**
              * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added 9 commits March 4, 2026 17:44
The commP hasher runs goroutines which have to be cleaned up. In the
happy paths, they're already cleaned up, either by `marshal()`
calling `Reset()` or `finalize()` calling `Sum()`. But in error cases,
we return early and never get to the cleanup. This ensures the hasher is
always cleaned up.
Proof of life when scans run long (which they're prone to do).
When `--assume-unchanged-sources` is used, previously we'd look for an
existing complete scan and use it if we found it, because we don't need
to rescan if we assume the source hasn't changed. But if we *didn't*
find a complete scan, we'd start one from scratch.

With this change, we do our best to recover even a *partial* scan. As
long as we're assuming the source hasn't changed, it's safe to skip any
path we know we've already scanned on a prior run. This means even the
filesystem scan can be incremental and restartable--as long as the user
takes responsibility for knowing that the source data hasn't changed.
When rescanning with `--assume-unchanged-sources`, if we come across an
empty directory, it could be a real empty directory, *or* it could be a
directory we never had a chance to add children to. Assume it's the
latter and rescan it. Worst case, we do a cheap read of an empty
directory, but we definitely don't miss the files.

Of course, if files were *added*, we could have >0 directory children
and still have files missing, but here we've been told to assume the
source hasn't changed.
@Peeja Peeja force-pushed the feat/log-scan-progress branch from 5548cc2 to b0541f1 Compare March 4, 2026 22:44
@Peeja Peeja force-pushed the feat/faster-scans branch from b5fd4f9 to 3f179b6 Compare March 4, 2026 22:44
Peeja added a commit that referenced this pull request Mar 9, 2026
Especially for indexes with many shards, this is dramatically faster. We
were finding the shards in the index, and then for each shard, looping
over the nodes/slices. Now we find all of the nodes in a single query.
It doesn't even matter what order they come in in, as long as we see
them all.











#### PR Dependency Tree


* **PR #360** 👈
  * **PR #361**
    * **PR #362**
      * **PR #363**
        * **PR #364**
          * **PR #365**
            * **PR #366**
  * **PR #375**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added a commit that referenced this pull request Mar 9, 2026
These were never filled in, meaning shards from before the column was
created had an incorrect `slice_count` of 0.














#### PR Dependency Tree


* **PR #361** 👈
  * **PR #362**
    * **PR #363**
      * **PR #364**
        * **PR #365**
          * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added a commit that referenced this pull request Mar 9, 2026
Provides some proof of life while sharding a large number of nodes.














#### PR Dependency Tree


* **PR #362** 👈
  * **PR #363**
    * **PR #364**
      * **PR #365**
        * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added a commit that referenced this pull request Mar 9, 2026
The commP hasher runs goroutines which have to be cleaned up. In the
happy paths, they're already cleaned up, either by `marshal()` calling
`Reset()` or `finalize()` calling `Sum()`. But in error cases, we return
early and never get to the cleanup. This ensures the hasher is always
cleaned up.














#### PR Dependency Tree


* **PR #363** 👈
  * **PR #364**
    * **PR #365**
      * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Peeja added a commit that referenced this pull request Mar 9, 2026
Useful to see in logging.

















#### PR Dependency Tree


* **PR #364** 👈
  * **PR #365**
    * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
@Peeja Peeja force-pushed the feat/log-scan-progress branch from b0541f1 to 23648e5 Compare March 9, 2026 19:32
Peeja added a commit that referenced this pull request Mar 9, 2026
Proof of life when scans run long (which they're prone to do).














#### PR Dependency Tree


* **PR #365** 👈
  * **PR #366**

This tree was auto-generated by
[Charcoal](https://github.com/danerwilliams/charcoal)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants