Description
(Description adapted from a Claude plan.)
Context
A customer has a very large dataset and wants to upload it in smaller chunks (subdirectories), but end up with a single root CID. They could upload pieces separately, but even within a space, each upload has its own shards and its own scans, so that wouldn't help. Instead, this approach is to work within a single upload, scoping each run to a subdirectory, then doing a final full run that builds the root.
How It Works
This builds on #336, in which `--assume-unchanged-sources` is extended to work for each FS Entry, not just at the root as a whole.
```sh
# Run 1: Upload only subdir1
guppy upload --only subdir1/ --assume-unchanged-sources

# Run 2: Upload only subdir2
guppy upload --only subdir2/ --assume-unchanged-sources

# Final run: Complete the upload (builds root, calls upload/add)
guppy upload --assume-unchanged-sources
```
Each `--only` run:
- Scans only the targeted subtree (walks the FS from `subdir1/` instead of `"."`; skips things already scanned, if `--assume-unchanged-sources` is used)
- DAG-scans only the files in that subtree (DAGScans created only for new FSEntries)
- Shards, indexes, and uploads those nodes to the network
- Skips `upload/add` because the root CID isn't known yet
The final run (no `--only`):
- Scans from `"."` (again, `SkipEntry` skips things already scanned, if `--assume-unchanged-sources` is used)
- Creates FSEntries/DAGScans only for the root directory and any new top-level files
- Root directory's DAGScan completes because `HasIncompleteChildren` finds that all children's DAGScans already have CIDs from previous runs
- Shards, indexes, and uploads only the new node(s)
- Calls `upload/add` with ALL shards (from all runs; they're all the same upload)
Note that `--assume-unchanged-sources` is optional in each case, but useful if there's a lot of data to scan, which is generally when you'd use this.
Implementation details from Claude's plan
(I haven't closely reviewed this yet, it's mostly here to keep track of it. Feel free to ignore it while considering the proposal.)
Implementation
1. Add `--only <path>` flag to the upload command
File: cmd/upload/root.go (or wherever upload start is defined)
Add a `--only` string flag that specifies a subdirectory path relative to the source root, and pass it through to the upload execution.
2. Scope the scan to the subtree
File: pkg/preparation/scans/scans.go
Change `executeScan` to accept an optional subtree path. Instead of always passing `"."` to `WalkDir`, pass the subtree path:
```go
func (a API) executeScan(ctx context.Context, upload *uploadmodel.Upload, subtree string, fsEntryCb func(model.FSEntry) error) (model.FSEntry, error) {
	fsys, err := a.SourceAccessor(ctx, upload.SourceID())
	if err != nil {
		return nil, err
	}
	root := "."
	if subtree != "" {
		root = subtree
	}
	fsEntry, err := a.WalkerFn(fsys, root, visitor.NewScanVisitor(...))
	if err != nil {
		return nil, err
	}
	return fsEntry, nil
}
```

Key detail: When `subtree` is set, `ExecuteScan` should not set `rootFSEntryID` on the upload, because the returned FSEntry is for the subtree root, not the source root. Only set `rootFSEntryID` when doing a full scan (no `--only`).
3. Skip `upload/add` when the root CID is not set

File: pkg/preparation/uploads/uploads.go — `runPostProcessShardWorker` finalize

Currently `AddStorachaUploadForUpload` is always called in the finalize step. Change it to skip when the root CID is unset:
```go
// finalize
func() error {
	upload, err := api.Repo.GetUploadByID(ctx, uploadID)
	if err != nil {
		return err
	}
	if upload.RootCID() == cid.Undef {
		log.Infow("Skipping upload/add: root CID not yet set (partial upload)", "upload", uploadID)
		return nil
	}
	return api.AddStorachaUploadForUpload(ctx, uploadID, spaceDID)
}
```

4. Skip root CID finalization when subtree-only
File: pkg/preparation/uploads/uploads.go — runDAGScanWorker finalize
The finalize step currently sets the root CID by looking up `CIDForFSEntry(upload.RootFSEntryID())`. When `rootFSEntryID` is unset (partial run), skip this:
```go
// finalize
func() error {
	upload, err := api.Repo.GetUploadByID(ctx, uploadID)
	if err != nil {
		return err
	}
	if !upload.HasRootFSEntryID() {
		log.Infow("Skipping root CID finalization: no root FS entry (partial upload)", "upload", uploadID)
		close(nodeUploadsAvailable)
		return nil
	}
	// ... existing root CID logic ...
}
```

5. Handle the `--assume-unchanged-sources` check
File: pkg/preparation/uploads/uploads.go — runScanWorker
Currently this skips the entire scan if `HasRootFSEntryID()`. With `--only`, `rootFSEntryID` isn't set after partial runs, so the check already does the right thing: it runs the scan because there's no root entry yet.

On the final full run, `rootFSEntryID` is still unset (from the partial runs), so the scan runs. Combined with `--assume-unchanged-sources`, `SkipEntry` skips already-scanned subdirectories.
No changes needed here.
6. Pass subtree path through the API
Files:
- `pkg/preparation/uploads/uploads.go`: `ExecuteUpload` and the `API` struct need a `Subtree string` field
- `pkg/preparation/scans/scans.go`: the `API` struct or `ExecuteScan` needs the subtree parameter
- Thread the `--only` flag value from CLI → upload API → scan API
Files to Modify
| File | Change |
|---|---|
| `cmd/upload/root.go` | Add `--only` flag |
| `pkg/preparation/scans/scans.go` | Accept optional subtree path, pass to `WalkDir` |
| `pkg/preparation/uploads/uploads.go` | Thread subtree through; skip root CID finalization and `upload/add` for partial runs |
| `pkg/preparation/storacha/storacha.go` | (Maybe) Make `AddStorachaUploadForUpload` gracefully handle a missing root CID |
Edge Cases
- Overlapping subtrees: If the user runs `--only a/` then `--only a/b/`, the second run would find existing FSEntries for `a/b/` and its contents via `FindOrCreate`. DAGScans already exist too. This is safe; duplicate creation is idempotent.
- Subtree doesn't exist: The walker would error on stat. Standard filesystem error handling.
- Running final without all subtrees: Works fine; `HasIncompleteChildren` would block the root directory's DAGScan until all children are complete. The user would need to run again after uploading the missing subtrees. (Or: the pipeline would process whatever's complete and skip directories with incomplete children, just like today.)
- Concurrent partial runs: Two `--only` runs on different subtrees could run concurrently. Shard creation is already per-upload with open shard tracking, so concurrent runs could contend on the same open shard; this is an existing concern, not a new one. Could document as "run subtrees sequentially."
Verification
- Create a test directory: `mkdir -p testdata/{a,b,c}` with files in each
- Add source: `guppy upload source add test ./testdata`
- Run partial: `guppy upload start --only a/ --assume-unchanged-sources`
  - Verify: FSEntries and DAGScans created only for `a/`
  - Verify: Shards created and uploaded
  - Verify: No `upload/add` call (no root CID)
- Run partial: `guppy upload start --only b/ --assume-unchanged-sources`
  - Same verifications
- Run full: `guppy upload start --assume-unchanged-sources`
  - Verify: Only the root dir and `c/` are scanned (`a/` and `b/` skipped)
  - Verify: Root CID is set
  - Verify: `upload/add` is called with shards from all runs
- Run `make test`
Depends On
The `feat/faster-scans` branch (for `SkipEntry` / per-entry scan skipping)