feat: Faster scans with --assume-unchanged-sources by Peeja · Pull Request #366 · storacha/guppy

Peeja · 2026-02-25T21:21:51Z

When --assume-unchanged-sources is used, previously we'd look for an existing complete scan and use it if we found it, because we don't need to rescan if we assume the source hasn't changed. But if we didn't find a complete scan, we'd start one from scratch.

With this change, we do our best to recover even a partial scan. As long as we're assuming the source hasn't changed, it's safe to skip any path we know we've already scanned on a prior run. This means even the filesystem scan can be incremental and restartable--as long as the user takes responsibility for knowing that the source data hasn't changed.

PR Dependency Tree

- PR feat: Generate indexes *much* faster #360
  - PR fix: Backfill slice_count on shards #361
    - PR feat: Log progress while sharding nodes #362
      - PR fix: Don't leak goroutines from hashers #363
        - PR feat: Log if blob PUT stalls #364
          - PR feat: Log scan progress #365
            - PR feat: Faster scans with --assume-unchanged-sources #366 👈

This tree was auto-generated by Charcoal

volmedo · 2026-02-27T12:05:11Z

pkg/preparation/sqlrepo/scans.go

+	for _, path := range paths {
+		_, err := stmt.ExecContext(ctx, path, util.DbID(&sourceID), util.DbDID(&spaceDID))
+		if err != nil {
+			return fmt.Errorf("failed to delete FS entries for path %s: %w", path, err)
+		}
+	}


I think this should happen in a transaction, right? If something goes wrong partway through the operation and some ancestor is left behind, the paths below it won't be re-scanned, because the ancestor will appear to be present along with its children.

I'm tired of seeing log messages like `failed to add blob Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: failed to get reader for blob Shard[id=1cb59615-1d52-4821-a260-2b2840c043a7]: iterating nodes in shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: failed to get sizes of blocks in shard b0dc805a-7c2b-40db-9d4c-a4d869f66718: context canceled` and not knowing what actually canceled the context. Unfortunately, Go doesn't currently make it terribly easy to see that. This pattern should make it more obvious what's happening.

1. Rather than use `ctx.Err()` in our error messages, use `ctxutil.CausedError()`. It's similar to `context.Cause()`, except that `context.Cause()` simply returns the cause directly, whereas this returns a wrapped error which shows the `ctx.Err()` and that it was *caused* by the `context.Cause()`.
2. Rather than use errors coming back from library functions directly in our wrapped errors, use `ctxutil.ErrorWithCause()`. This accounts for library functions which themselves return the `ctx.Err()` rather than noticing the `context.Cause()` (likely because they were written before `context.Cause()` existed). It checks to see if the returned error was the context's `ctx.Err()`, and if so, and if there's a `context.Cause()` available, wraps them like `ctxutil.CausedError()` for readability.

This means touching *a lot* of call sites, but with a very clear pattern, and hopefully valuable results.

#### PR Dependency Tree

* **PR #339** 👈
* **PR #344**
* **PR #358**
* **PR #359**
* **PR #360**
* **PR #361**
* **PR #362**
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Adds an option to assume the filesystem hasn't changed when resuming, skipping the filesystem rescan. This makes it much faster to retry when you know the source hasn't changed since the last scan.

#### PR Dependency Tree

* **PR #344** 👈
* **PR #358**
* **PR #359**
* **PR #360**
* **PR #361**
* **PR #362**
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

We were indexing shards (sometimes) before they were closed, potentially when they were still empty, which meant an infinite number could fit in one index. Then the shards would be filled up and the index would cover too much. Now, shards don't get indexed until after they're closed and can no longer grow.

#### PR Dependency Tree

* **PR #359** 👈
* **PR #360**
* **PR #361**
* **PR #362**
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

The commP hasher runs goroutines which have to be cleaned up. In the happy paths, they're already cleaned up, either by `marshal()` calling `Reset()` or `finalize()` calling `Sum()`. But in error cases, we return early and never get to the cleanup. This ensures the hasher is always cleaned up.
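The general shape of that fix can be sketched with `defer`. The `worker` type below is a made-up stand-in for anything that owns a background goroutine; it is not the commP hasher, and `process` is not the project's `marshal()` or `finalize()`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// worker owns a background goroutine that must be stopped explicitly.
// It is a hypothetical stand-in for a hasher with internal goroutines.
type worker struct {
	in   chan []byte
	done chan struct{}
	once sync.Once
}

func newWorker() *worker {
	w := &worker{in: make(chan []byte), done: make(chan struct{})}
	go func() {
		defer close(w.done)
		for range w.in {
			// Consume input until the channel is closed.
		}
	}()
	return w
}

func (w *worker) Write(p []byte) { w.in <- p }

// Close stops the goroutine and waits for it to exit. It is safe to call
// more than once, so happy paths that already cleaned up are unaffected.
func (w *worker) Close() {
	w.once.Do(func() { close(w.in) })
	<-w.done
}

// process feeds blocks to the worker. The deferred Close runs on every
// return path, including the early error return.
func process(w *worker, blocks [][]byte) error {
	defer w.Close()
	for i, b := range blocks {
		if len(b) == 0 {
			return fmt.Errorf("block %d: %w", i, errors.New("empty block"))
		}
		w.Write(b)
	}
	return nil
}

func main() {
	w := newWorker()
	err := process(w, [][]byte{{1, 2}, {}})
	fmt.Println("error:", err)
}
```

Making the cleanup idempotent is what lets a single deferred call coexist with paths that already clean up on their own.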

Proof of life when scans run long (which they're prone to do).

When rescanning with `--assume-unchanged-sources`, if we come across an empty directory, it could be a real empty directory, *or* it could be a directory we never had a chance to add children to. Assume it's the latter and rescan it. Worst case, we do a cheap read of an empty directory, but we definitely don't miss the files. Of course, if files were *added*, we could have >0 directory children and still have files missing, but here we've been told to assume the source hasn't changed.
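The rule reduces to a small decision, sketched here with hypothetical names (`dirRecord`, `shouldRescanDir`); the real schema and lookup are not shown in this PR excerpt.

```go
package main

import "fmt"

// dirRecord is a hypothetical view of what a prior scan stored for a
// directory: its path and how many children were recorded under it.
type dirRecord struct {
	Path     string
	Children int
}

// shouldRescanDir reports whether a directory must be read again when
// resuming under the assumption that the source has not changed.
func shouldRescanDir(rec *dirRecord) bool {
	if rec == nil {
		// Never recorded: it has to be scanned.
		return true
	}
	// Zero children is ambiguous: truly empty, or interrupted before any
	// child was recorded. Rescanning is cheap, so take the safe option.
	return rec.Children == 0
}

func main() {
	fmt.Println(shouldRescanDir(nil))
	fmt.Println(shouldRescanDir(&dirRecord{Path: "photos", Children: 0}))
	fmt.Println(shouldRescanDir(&dirRecord{Path: "docs", Children: 12}))
}
```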

Especially for indexes with many shards, this is dramatically faster. We were finding the shards in the index, and then for each shard, looping over the nodes/slices. Now we find all of the nodes in a single query. It doesn't even matter what order they come in, as long as we see them all.

#### PR Dependency Tree

* **PR #360** 👈
* **PR #361**
* **PR #362**
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**
* **PR #375**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

These were never filled in, meaning shards from before the column was created had an incorrect `slice_count` of 0.

#### PR Dependency Tree

* **PR #361** 👈
* **PR #362**
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

Provides some proof of life while sharding a large number of nodes.

#### PR Dependency Tree

* **PR #362** 👈
* **PR #363**
* **PR #364**
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

Useful to see in logging.

#### PR Dependency Tree

* **PR #364** 👈
* **PR #365**
* **PR #366**

This tree was auto-generated by [Charcoal](https://github.com/danerwilliams/charcoal)

Peeja requested review from alanshaw, hannahhoward and volmedo as code owners February 25, 2026 21:21

This was referenced Feb 25, 2026

- feat: Show cause on context cancel #339 (Merged)
- feat: Add --assume-unchanged-sources #344 (Merged)

volmedo requested changes Feb 27, 2026

Peeja force-pushed the feat/log-scan-progress branch from 551ac0b to 893a493 on March 2, 2026 23:24

Peeja force-pushed the feat/faster-scans branch from b3aaa6b to 366bd05 on March 2, 2026 23:24

Peeja force-pushed the feat/log-scan-progress branch from 893a493 to 5548cc2 on March 2, 2026 23:43

Peeja force-pushed the feat/faster-scans branch from 366bd05 to b5fd4f9 on March 2, 2026 23:43

Peeja added 9 commits March 4, 2026 17:44

- feat: Use single DB query to iterate for index (2aad49c)
- feat: Log during index generation (67683a6)
- fix: Backfill slice_count on shards (9b86571)
- feat: Log progress while sharding nodes (9b935e5)
- feat: Log if blob PUT stalls (7bfe831)
- feat: Log scan progress (b0541f1)

Peeja force-pushed the feat/log-scan-progress branch from 5548cc2 to b0541f1 on March 4, 2026 22:44

Peeja force-pushed the feat/faster-scans branch from b5fd4f9 to 3f179b6 on March 4, 2026 22:44

Peeja added a commit that referenced this pull request Mar 9, 2026

feat: Log if blob PUT stalls (#364) (6bd890f)

Peeja force-pushed the feat/log-scan-progress branch from b0541f1 to 23648e5 on March 9, 2026 19:32

Peeja wants to merge 9 commits into feat/log-scan-progress from feat/faster-scans
