When Piri receives a UCAN invocation to allocate a replica, here is what happens:
1. The node receives the transfer invocation and creates an async task to allocate a replica:
piri/pkg/service/storage/ucan/replica_allocate.go
Lines 128 to 137 in c50ea6b
2. The transfer request is enqueued into a replication job queue:
piri/pkg/service/replicator/replicate.go
Lines 86 to 88 in c50ea6b
2.a. Note: this queue is configured to retry the request up to 10 times before giving up:
piri/pkg/service/replicator/replicate.go
Line 63 in c50ea6b
2.b. This queue allows as many parallel transfer requests as the node has CPUs:
piri/pkg/service/replicator/replicate.go
Line 64 in c50ea6b
3. When the node has capacity to execute the transfer request, the `Transfer` method is called:
piri/pkg/service/replicator/replicate.go
Lines 69 to 82 in c50ea6b
4. The `Transfer` method pulls the data from the source node it is replicating from:
piri/pkg/service/storage/handlers/replica/transfer.go
Lines 135 to 172 in c50ea6b
4.a. If any part of this transfer fails due to network issues, `Transfer` returns an error, requeueing the transfer task. This is good: we want to make sure we pull the data we are supposed to replicate.
5. Once the transfer of data is complete, the node creates a receipt for the transfer invocation and fires it off to the upload service:
piri/pkg/service/storage/handlers/replica/transfer.go
Lines 174 to 235 in c50ea6b
Problems arise at step 5.
Once the node successfully performs the transfer and accepts the blob (`blob/accept`), the piece is stored in Piri. However, if a failure occurs while accepting the blob or while issuing the receipt to the upload service, an error is returned and the entire transfer operation starts over. This can lead to multiple redundant transfers when post-transfer failures occur. We actually saw this play out in our replication tests: some nodes repeatedly allocated the same blob multiple times when communicating with the upload service.
This is a classic idempotency problem. The transfer operation could instead be made idempotent: if the data has already been transferred successfully, we shouldn't transfer it again just because the blob accept or the receipt delivery failed.
I think there are (at least) two issues here:

1. `blobhandler.Accept` is not idempotent: when called N times with the same blob it will create N allocations. I believe this is intended by design.
2. Transfers (moving the bytes between nodes) may occur multiple times during failure cases.

Together these issues compound into a larger problem, e.g.:

- First attempt: transfer succeeds, accept succeeds, but receipt sending fails.
- Retry:
  - Transfer happens AGAIN (wasteful but succeeds)
  - Accept is called AGAIN (might fail or create duplicates)
  - Receipt sending is attempted again
The goal here is to discuss solutions to these issues.
- Should `blobhandler.Accept` be made idempotent, such that it doesn't actually "store" the same blob twice?
- How can the entire `Transfer` operation be made idempotent, such that it doesn't perform repeated transfers between nodes and doesn't allocate blobs that have already been allocated?