Add an option to limit scope of an upload run #381

@Peeja

Description

(Description adapted from a Claude plan.)

Context

A customer has a very large dataset and wants to upload it in smaller chunks (subdirectories) while still ending up with a single root CID. They could upload the pieces as separate uploads, but even within a space, each upload has its own shards and its own scans, so that wouldn't help. Instead, this approach works within a single upload, scoping each run to a subdirectory, then doing a final full run that builds the root.

How It Works

This builds on #336, in which --assume-unchanged-sources is extended to work per FSEntry, not just at the root as a whole.

# Run 1: Upload only subdir1
guppy upload --only subdir1/ --assume-unchanged-sources

# Run 2: Upload only subdir2
guppy upload --only subdir2/ --assume-unchanged-sources

# Final run: Complete the upload (builds root, calls upload/add)
guppy upload --assume-unchanged-sources

Each --only run:

  1. Scans only the targeted subtree (walks FS from subdir1/ instead of "."; skips things already scanned, if --assume-unchanged-sources is used)
  2. DAG-scans only the files in that subtree (DAGScans created only for new FSEntries)
  3. Shards, indexes, and uploads those nodes to the network
  4. Skips upload/add because the root CID isn't known yet

The final run (no --only):

  1. Scans from "." (again, skips things already scanned via SkipEntry, if --assume-unchanged-sources is used)
  2. Creates FSEntries/DAGScans only for the root directory and any new top-level files
  3. Root directory's DAGScan completes because HasIncompleteChildren finds all children's DAGScans already have CIDs from previous runs
  4. Shards, indexes, and uploads only the new node(s)
  5. Calls upload/add with ALL shards (from all runs — they're all the same upload)

Note that --assume-unchanged-sources is optional in each case, but useful if there's a lot of data to scan, which is generally when you'd use this.

Implementation details from Claude's plan

(I haven't closely reviewed this yet; it's mostly here to keep track of it. Feel free to ignore it while considering the proposal.)

Implementation

1. Add --only <path> flag to upload command

File: cmd/upload/root.go (or wherever upload start is defined)

Add a --only string flag that specifies a subdirectory path relative to the source root. Pass it through to the upload execution.
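As a rough illustration of the flag's shape, here is a minimal sketch using only the standard library flag package; guppy's actual CLI framework and wiring may differ, and normalizeOnly is a hypothetical helper for making "subdir1/" and "subdir1" equivalent before passing the value down to the scan layer:

```go
package main

import (
	"flag"
	"fmt"
	"path/filepath"
	"strings"
)

// normalizeOnly cleans the --only value into a path relative to the
// source root, returning "." when the flag is unset (full upload).
// Hypothetical helper; the real CLI wiring may differ.
func normalizeOnly(only string) string {
	if only == "" {
		return "."
	}
	// Trim a trailing separator so "subdir1/" and "subdir1" are equivalent.
	return filepath.Clean(strings.TrimSuffix(only, "/"))
}

func main() {
	only := flag.String("only", "", "limit this run to a subdirectory of the source root")
	flag.Parse()
	fmt.Println(normalizeOnly(*only))
}
```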

2. Scope the scan to the subtree

File: pkg/preparation/scans/scans.go

Change executeScan to accept an optional subtree path. Instead of always passing "." to WalkDir, pass the subtree path:

func (a API) executeScan(ctx context.Context, upload *uploadmodel.Upload, subtree string, fsEntryCb func(model.FSEntry) error) (model.FSEntry, error) {
    fsys, err := a.SourceAccessor(ctx, upload.SourceID())
    if err != nil {
        return nil, err
    }
    root := "."
    if subtree != "" {
        root = subtree
    }
    fsEntry, err := a.WalkerFn(fsys, root, visitor.NewScanVisitor(...))
    if err != nil {
        return nil, err
    }
    return fsEntry, nil
}

Key detail: When subtree is set, ExecuteScan should not set rootFSEntryID on the upload, because the returned FSEntry is for the subtree root, not the source root. Only set rootFSEntryID when doing a full scan (no --only).
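The key detail above can be sketched in isolation. The types and method names below are illustrative stand-ins for the real models in pkg/preparation, not the actual API:

```go
package main

import "fmt"

// Minimal stand-in for the real upload model; names are illustrative only.
type Upload struct {
	rootFSEntryID string // empty means unset
}

func (u *Upload) SetRootFSEntryID(id string) { u.rootFSEntryID = id }
func (u *Upload) HasRootFSEntryID() bool     { return u.rootFSEntryID != "" }

// recordScanRoot captures the rule: only a full scan (subtree == "") may
// set the upload's root FS entry, because a subtree scan returns the
// subtree's entry, not the source root's.
func recordScanRoot(u *Upload, subtree, scannedEntryID string) {
	if subtree != "" {
		return // partial run: leave rootFSEntryID unset
	}
	u.SetRootFSEntryID(scannedEntryID)
}

func main() {
	u := &Upload{}
	recordScanRoot(u, "subdir1", "entry-123")
	fmt.Println(u.HasRootFSEntryID()) // false: partial run leaves it unset
	recordScanRoot(u, "", "entry-root")
	fmt.Println(u.HasRootFSEntryID()) // true: full scan sets it
}
```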

3. Skip upload/add when root CID is not set

File: pkg/preparation/uploads/uploads.go, runPostProcessShardWorker finalize step

Currently AddStorachaUploadForUpload is always called in the finalize step. Change to skip it when rootCID is unset:

// finalize
func() error {
    upload, err := api.Repo.GetUploadByID(ctx, uploadID)
    if err != nil { return err }
    if upload.RootCID() == cid.Undef {
        log.Infow("Skipping upload/add: root CID not yet set (partial upload)", "upload", uploadID)
        return nil
    }
    return api.AddStorachaUploadForUpload(ctx, uploadID, spaceDID)
}

4. Skip root CID finalization when subtree-only

File: pkg/preparation/uploads/uploads.go, runDAGScanWorker finalize step

The finalize step currently sets the root CID by looking up CIDForFSEntry(upload.RootFSEntryID()). When rootFSEntryID is unset (partial run), skip this:

// finalize
func() error {
    upload, err := api.Repo.GetUploadByID(ctx, uploadID)
    if err != nil { return err }
    if !upload.HasRootFSEntryID() {
        log.Infow("Skipping root CID finalization: no root FS entry (partial upload)", "upload", uploadID)
        close(nodeUploadsAvailable)
        return nil
    }
    // ... existing root CID logic ...
}

5. Handle the --assume-unchanged-sources check

File: pkg/preparation/uploads/uploads.go, runScanWorker

runScanWorker currently skips the entire scan when HasRootFSEntryID() is true. With --only, rootFSEntryID isn't set after partial runs, so the check already does the right thing: it runs the scan because there's no root entry yet.

On the final full run, rootFSEntryID is still unset (from partial runs), so the scan runs. Combined with --assume-unchanged-sources, SkipEntry skips already-scanned subdirectories.

No changes needed here.

6. Pass subtree path through the API

Files:

  • pkg/preparation/uploads/uploads.go: ExecuteUpload and the API struct need a Subtree string field
  • pkg/preparation/scans/scans.go: the API struct or ExecuteScan needs the subtree parameter
  • Thread --only flag value from CLI → upload API → scan API
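The threading can be pictured with a toy version of the two API layers. The struct shapes below are assumptions for illustration; the real API structs in pkg/preparation differ:

```go
package main

import "fmt"

// Toy scan layer: the subtree parameter decides where the walk starts.
type ScanAPI struct{}

func (ScanAPI) ExecuteScan(subtree string) string {
	root := "."
	if subtree != "" {
		root = subtree
	}
	return "walking from " + root
}

// Toy upload layer: carries the --only value down to the scan layer.
type UploadAPI struct {
	Scans   ScanAPI
	Subtree string // populated from the --only flag
}

func (u UploadAPI) ExecuteUpload() string { return u.Scans.ExecuteScan(u.Subtree) }

func main() {
	fmt.Println(UploadAPI{Subtree: "subdir1"}.ExecuteUpload())
	fmt.Println(UploadAPI{}.ExecuteUpload())
}
```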

Files to Modify

| File | Change |
| --- | --- |
| cmd/upload/root.go | Add --only flag |
| pkg/preparation/scans/scans.go | Accept optional subtree path, pass to WalkDir |
| pkg/preparation/uploads/uploads.go | Thread subtree through; skip root CID finalization and upload/add for partial runs |
| pkg/preparation/storacha/storacha.go | (Maybe) Make AddStorachaUploadForUpload gracefully handle missing root CID |

Edge Cases

  • Overlapping subtrees: If the user runs --only a/ and then --only a/b/, the second run would find existing FSEntries for a/b/ and its contents via FindOrCreate. DAGScans already exist too. This is safe — duplicate creation is idempotent.
  • Subtree doesn't exist: Walker would error on stat. Standard filesystem error handling.
  • Running final without all subtrees: Works fine — HasIncompleteChildren would block the root directory's DAGScan until all children are complete. The user would need to run again after uploading missing subtrees. (Or: the pipeline would process whatever's complete and skip directories with incomplete children, just like today.)
  • Concurrent partial runs: Two --only runs on different subtrees could run concurrently. Shard creation is already per-upload with open shard tracking. Concurrent runs could contend on the same open shard — this is an existing concern, not new. Could document as "run subtrees sequentially."
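Why overlapping subtrees are safe can be shown with a toy find-or-create over an in-memory map: re-scanning a path returns the existing entry instead of creating a duplicate. The real repo is database-backed and its API differs; this only illustrates the idempotency argument:

```go
package main

import "fmt"

// Toy repository keyed by path; stand-in for the real FSEntry store.
type repo struct{ entries map[string]int }

// FindOrCreateFSEntry returns the existing entry ID for a path, creating
// one only on first sight. A second run over the same subtree therefore
// reuses entries rather than duplicating them.
func (r *repo) FindOrCreateFSEntry(path string) (id int, created bool) {
	if id, ok := r.entries[path]; ok {
		return id, false
	}
	id = len(r.entries) + 1
	r.entries[path] = id
	return id, true
}

func main() {
	r := &repo{entries: map[string]int{}}
	a, created := r.FindOrCreateFSEntry("a/b/file.txt") // first run: --only a/
	fmt.Println(a, created)                             // 1 true
	b, created := r.FindOrCreateFSEntry("a/b/file.txt") // later run: --only a/b/
	fmt.Println(b, created)                             // 1 false
}
```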

Verification

  1. Create a test directory: mkdir -p testdata/{a,b,c} with files in each
  2. Add source: guppy upload source add test ./testdata
  3. Run partial: guppy upload start --only a/ --assume-unchanged-sources
    • Verify: FSEntries and DAGScans created only for a/
    • Verify: Shards created and uploaded
    • Verify: No upload/add call (no root CID)
  4. Run partial: guppy upload start --only b/ --assume-unchanged-sources
    • Same verifications
  5. Run full: guppy upload start --assume-unchanged-sources
    • Verify: Only root dir and c/ are scanned (a/ and b/ skipped)
    • Verify: Root CID is set
    • Verify: upload/add is called with shards from all runs
  6. Run make test

Depends On

  • feat/faster-scans branch (for SkipEntry / per-entry scan skipping)
