Skip to content

Conversation

@lalinsky
Copy link
Member

@lalinsky lalinsky commented Oct 5, 2025

Summary

Implements async index restoration from remote tar snapshots via HTTP/HTTPS URLs.

  • Added restore_from parameter to CreateIndexRequest
  • Indexes marked as being_restored during async download/restore
  • Background task uses Scheduler.runOnce() for fire-and-forget execution
  • Returns ready: false in API response when restoration in progress
  • Safety checks prevent operations on indexes being restored
  • Cluster mode rejects restore (incompatible with NATS synchronization)

Implementation Details

MultiIndex.zig:

  • New being_restored flag and restoration_complete condition variable
  • restoreIndexTask() handles async download, restore, and error cleanup
  • waitForIndexReady() allows blocking wait with timeout
  • Safety assertion in borrowIndex() prevents race conditions

snapshot.zig:

  • Refactored to work with pre-initialized Index instances
  • downloadAndRestoreSnapshot() downloads and extracts into existing index

HTTP API:

  • Returns {ready: false} for restore operations
  • IndexBeingRestored error mapped to 404 (GET) or 409 (DELETE)
  • Cluster mode returns 400 for restore attempts

Test Coverage

All existing unit tests pass. Integration tests for the full restore flow would be beneficial but are not included in this PR.

Implements async index restoration via restore_from parameter in
CreateIndexRequest. Indexes can now be created from remote tar
snapshots accessible via HTTP/HTTPS URLs.

Key changes:
- Added restore_from field to CreateIndexRequest and IndexOptions
- New being_restored flag and restoration_complete condition variable
- Background restoration using Scheduler.runOnce() for async execution
- waitForIndexReady() method for clients to wait on restoration
- Safety checks to prevent operations on restoring indexes
- Cluster mode explicitly rejects restore (not compatible with NATS sync)
- Refactored snapshot.zig to work with pre-initialized Index instances

HTTP API returns ready: false when restoration is in progress.
Clients must either poll or use waitForIndexReady() until complete.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 5, 2025

Walkthrough

Adds restore-from support for single-node indices, surfaces restoration state in index lifecycle, refactors snapshot restore to operate on pre-initialized Index objects, extends the create-index API with restore_from, and makes cluster-mode creation reject restore_from. Server error handling updated to expose restoration-related errors.

Changes

Cohort / File(s) Summary of edits
Restore lifecycle & index state
src/MultiIndex.zig
Added IndexOptions.restore_from; added restoration state to IndexRef (being_restored, restore_error, restoration_complete); introduced restoreIndexTask, waitForIndexReady; created/borrow/delete/list/get flows updated to treat being-restored indices specially and return IndexBeingRestored where appropriate; createNewIndex updated to accept restore_from.
Cluster-mode guard
src/ClusterMultiIndex.zig
Added early guard in createIndex to reject request.restore_from != null with error.RestoreNotSupportedInClusterMode.
API request model
src/api.zig
Added optional restore_from: ?[]const u8 = null to pub const CreateIndexRequest.
Server error mapping & handlers
src/server.zig
Map IndexBeingRestored into 404/409 paths where appropriate; new handling in handlePutIndex for RestoreNotSupportedInClusterMode (400) and IndexBeingRestored (409); handleDeleteIndex treats IndexBeingRestored as a 409 conflict.
Snapshot restore refactor
src/snapshot.zig
restoreSnapshot and download helper changed to accept an existing *Index and return void (restoreSnapshot(reader, index, allocator) / downloadAndRestoreSnapshot(url, index, allocator)); extraction now writes into index.dir; callers updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly and accurately describes the main feature of the changeset—adding support for restoring indexes from HTTP tar snapshots—so it clearly reflects the primary change introduced by the PR.
Description Check ✅ Passed The description directly outlines the implementation of asynchronous index restoration, details key components like the new restore_from parameter and background task behavior, and ties closely to the actual code changes, making it relevant and on-topic.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch add-restore-from-url

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb32e8d and 86a40a6.

📒 Files selected for processing (1)
  • src/MultiIndex.zig (12 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/MultiIndex.zig

📄 CodeRabbit inference engine (CLAUDE.md)

Support managing multiple named indexes via the service

Files:

  • src/MultiIndex.zig
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:57:04.144Z
Learnt from: lalinsky
PR: acoustid/acoustid-index#114
File: src/server.zig:520-536
Timestamp: 2025-08-22T19:57:04.144Z
Learning: In src/server.zig handlePutIndex: PUT /index without restore options should be idempotent (not fail if index exists), but PUT /index with restore options should fail if index already exists (return 409). The fail_if_exists flag should only be set when restore options are provided.

Applied to files:

  • src/MultiIndex.zig
📚 Learning: 2025-08-17T10:37:12.955Z
Learnt from: CR
PR: acoustid/acoustid-index#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T10:37:12.955Z
Learning: Applies to src/MultiIndex.zig : Support managing multiple named indexes via the service

Applied to files:

  • src/MultiIndex.zig
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: fpindex
🔇 Additional comments (10)
src/MultiIndex.zig (10)

34-34: LGTM!

The restore_from field in IndexOptions is properly defined as optional and integrates cleanly with the existing options structure.


63-66: LGTM!

The restoration state fields in IndexRef are well-designed: being_restored for lifecycle state, restore_error for error propagation, and restoration_complete for signaling. This provides a solid foundation for async restoration coordination.


253-276: LGTM!

The restoration initialization correctly addresses the previous critical review: the errdefer ref.being_restored = false at line 256 ensures the flag is reset if scheduler.runOnce() or any subsequent operation fails. The duplication of strings for the background task is proper memory management for fire-and-forget execution.


329-337: LGTM with a caveat.

The deinit logic now waits for all restore tasks to complete by blocking on restoration_complete, which addresses the previous critical concern about dangling pointers after shutdown.

Note: If a download hangs indefinitely (network stall, no timeout in the snapshot module), deinit will block forever. This is probably acceptable behavior—you don't want to tear down while data is in-flight—but it's worth documenting or considering whether the snapshot download should enforce its own timeout.


498-566: LGTM!

The restoreIndexTask implementation is robust:

  • Line 519 includes the safety assertion for index.has_value (addressing the previous review).
  • Lock is dropped during the long-running download (line 521) to avoid blocking other operations.
  • The index pointer remains valid because deleteIndex rejects restoring indexes (lines 721-723) and deinit waits for completion (lines 329-337).
  • Error handling properly cleans up state and broadcasts on both failure (lines 527-537) and redirect-write failure (lines 549-557).
  • Success path writes the redirect file and broadcasts completion (lines 560-563).

584-584: LGTM!

The assertion prevents borrowing an index that's still being restored, catching logic bugs where operations might attempt to use an incomplete index.


605-607: LGTM!

Both exit points correctly return IndexBeingRestored error—once when checking an existing active index (605-607) and once after creating a new index for restoration (640-642). This ensures callers can't accidentally operate on an index mid-restore.

Also applies to: 637-642


651-693: LGTM!

The waitForIndexReady implementation provides a blocking wait with timeout for restore completion. The logic is sound:

  • Checks if already ready and returns immediately (659-663).
  • Waits on restoration_complete with timeout tracking (666-678).
  • Validates the index still exists after waking (681-688).
  • The empty catch block at line 676 is fine—timedWait errors (timeout) just continue the loop, and the elapsed-time check handles the actual timeout logic.

One minor point: there's a small window between unlocking (line 661) and calling getIndex (line 662), but if the index is deleted in that window, getIndex will correctly return IndexNotFound.


721-723: LGTM!

Rejecting deletion while a restore is in progress prevents the race where restoreIndexTask holds an index pointer that gets freed. Good coordination with the restore lifecycle.


908-930: LGTM!

The restore-specific creation path addresses the previous critical review concern: it checks under lock whether the index already exists and is live (lines 914-922), returning IndexBeingRestored or IndexAlreadyExists appropriately. This prevents overwriting a live index or starting duplicate restores. The response correctly returns ready: false (line 927) to signal async restoration.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3147b and 06d4530.

📒 Files selected for processing (5)
  • src/ClusterMultiIndex.zig (1 hunks)
  • src/MultiIndex.zig (11 hunks)
  • src/api.zig (1 hunks)
  • src/server.zig (3 hunks)
  • src/snapshot.zig (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/MultiIndex.zig

📄 CodeRabbit inference engine (CLAUDE.md)

Support managing multiple named indexes via the service

Files:

  • src/MultiIndex.zig
🧠 Learnings (3)
📚 Learning: 2025-08-17T10:37:12.955Z
Learnt from: CR
PR: acoustid/acoustid-index#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T10:37:12.955Z
Learning: Applies to src/MultiIndex.zig : Support managing multiple named indexes via the service

Applied to files:

  • src/ClusterMultiIndex.zig
  • src/MultiIndex.zig
📚 Learning: 2025-08-22T19:57:04.144Z
Learnt from: lalinsky
PR: acoustid/acoustid-index#114
File: src/server.zig:520-536
Timestamp: 2025-08-22T19:57:04.144Z
Learning: In src/server.zig handlePutIndex: PUT /index without restore options should be idempotent (not fail if index exists), but PUT /index with restore options should fail if index already exists (return 409). The fail_if_exists flag should only be set when restore options are provided.

Applied to files:

  • src/ClusterMultiIndex.zig
  • src/server.zig
  • src/MultiIndex.zig
📚 Learning: 2025-09-14T10:39:11.033Z
Learnt from: lalinsky
PR: acoustid/acoustid-index#134
File: src/ClusterMultiIndex.zig:245-261
Timestamp: 2025-09-14T10:39:11.033Z
Learning: In src/ClusterMultiIndex.zig, the stopping flag effectively prevents map mutations during graceful shutdown - handleMetaMessage and handleUpdateMessage both check stopping.load(.acquire) early and return, preventing processMetaOperation calls that would add/remove updaters from index_updaters map, making iterator-across-unlock pattern safe.

Applied to files:

  • src/ClusterMultiIndex.zig
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: fpindex

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/MultiIndex.zig (1)

982-1013: Consider exposing restoration state in list.

listIndexes() currently returns deleted status but not being_restored. Since CreateIndexResponse exposes ready: false for restoring indexes, exposing restoration state here would help clients track restore progress across the fleet without polling individual index info endpoints.

Add a being_restored field to IndexInfo and populate it:

 pub const IndexInfo = struct {
     name: []const u8,
     generation: u64,
     deleted: bool,
+    being_restored: bool = false,
 };

Then in listIndexes():

         try result.append(IndexInfo{
             .name = name,
             .generation = generation,
             .deleted = deleted,
+            .being_restored = index_ref.being_restored,
         });
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06d4530 and cb32e8d.

📒 Files selected for processing (1)
  • src/MultiIndex.zig (11 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/MultiIndex.zig

📄 CodeRabbit inference engine (CLAUDE.md)

Support managing multiple named indexes via the service

Files:

  • src/MultiIndex.zig
🧠 Learnings (2)
📚 Learning: 2025-08-22T19:57:04.144Z
Learnt from: lalinsky
PR: acoustid/acoustid-index#114
File: src/server.zig:520-536
Timestamp: 2025-08-22T19:57:04.144Z
Learning: In src/server.zig handlePutIndex: PUT /index without restore options should be idempotent (not fail if index exists), but PUT /index with restore options should fail if index already exists (return 409). The fail_if_exists flag should only be set when restore options are provided.

Applied to files:

  • src/MultiIndex.zig
📚 Learning: 2025-08-17T10:37:12.955Z
Learnt from: CR
PR: acoustid/acoustid-index#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T10:37:12.955Z
Learning: Applies to src/MultiIndex.zig : Support managing multiple named indexes via the service

Applied to files:

  • src/MultiIndex.zig
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: fpindex
🔇 Additional comments (7)
src/MultiIndex.zig (7)

34-34: LGTM!

Clean addition of the restore_from parameter to IndexOptions.


63-66: LGTM!

Restoration state fields are well-chosen for coordinating async restoration. The condition variable enables blocking waits via waitForIndexReady().


572-572: LGTM!

The assertion ensures no code path can borrow an index that's being restored, which would violate the restoration invariant.


589-595: LGTM!

Proper guards prevent operations on indexes being restored. The check after createNewIndex at lines 628-630 correctly handles the restore-from case where the index isn't ready to borrow yet.

Also applies to: 625-630


639-681: LGTM!

Solid implementation of a blocking wait with timeout. The re-check at lines 669-676 after each wakeup correctly handles concurrent index deletion or recreation, preventing confusion from identically named but different IndexRefs.


709-711: LGTM!

Correct guard prevents deletion of indexes undergoing restoration.


897-918: Guard against concurrent restore correctly implemented.

The past review comment flagged the risk of overwriting a live IndexRef during restore. Lines 902-910 now correctly prevent this by checking if the index exists and is live before calling createNewIndex, returning IndexBeingRestored for ongoing restores or IndexAlreadyExists for completed indexes. This aligns with the learning that PUT with restore_from should not be idempotent.

Based on learnings

Validates the invariant that index_ref.index.has_value is true before
dereferencing index_ref.index.value when being_restored is true.
Ensures being_restored is cleared if index name/URL duplication or
scheduler.runOnce fails during index creation with restore_from.
Ensures all in-progress restore operations finish before cleaning up
indexes to prevent use-after-free or incomplete restore state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant