
Add pull flow testing to validate SP piece retrieval from external sources #300

@rvagg

Description


DealBot should test the new SP "pull" pathway: the flow where an SP fetches a piece from an external URL rather than receiving it via direct upload. This is a critical path for multi-copy durability (where a secondary SP pulls from the primary or another source) and needs validation alongside existing data storage checks.

Background

The current data storage check tests the direct upload path: DealBot uploads a piece to an SP, then verifies onchain confirmation, discoverability, and retrieval. But the SDK's multi-copy durability flow uses a different Curio pathway where the SP is told to pull a piece from a URL. This path has distinct failure modes and performance characteristics that aren't currently covered.

Curio Pull Implementation

Curio exposes POST /pdp/piece/pull (see curio/pdp/handlers_pull.go). Key details:

Request format:

{
  "extraData": "0x...",
  "dataSetId": 0,
  "recordKeeper": "0x...",
  "pieces": [
    {
      "pieceCid": "baga...",
      "sourceUrl": "https://example.com/api/piece/baga..."
    }
  ]
}

Synapse should have utilities for this soon.
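In the meantime, a minimal sketch of issuing the request shown above. Only the POST /pdp/piece/pull path and body shape come from this issue (curio/pdp/handlers_pull.go); the type names and use of fetch are illustrative assumptions, not Synapse's eventual API:

```typescript
// Hypothetical client for Curio's pull endpoint. Field names mirror the
// request format shown above; `requestPull` and `curioBase` are made up.
interface PullPiece {
  pieceCid: string;
  sourceUrl: string;
}

interface PullRequestBody {
  extraData: string;
  dataSetId: number;
  recordKeeper: string;
  pieces: PullPiece[];
}

async function requestPull(curioBase: string, body: PullRequestBody): Promise<Response> {
  return fetch(`${curioBase}/pdp/piece/pull`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
}
```

Because the request is idempotent (see polling below), the same body can be re-sent to check status.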

There is some additional source URL validation (curio/pdp/pull_types.go):

  • Path must end in /piece/{pieceCid}
  • CID in path must match the pieceCid field in the request
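The two rules above can be mirrored client-side so DealBot rejects a bad source URL before Curio does. A sketch (the function name is hypothetical; the rules come from curio/pdp/pull_types.go as described above):

```typescript
// Client-side mirror of Curio's source URL validation: the path must
// end in /piece/{pieceCid}, and that CID must equal the request's
// pieceCid field.
function validateSourceUrl(sourceUrl: string, pieceCid: string): boolean {
  const path = new URL(sourceUrl).pathname;
  const parts = path.split("/").filter((p) => p.length > 0);
  // The last two path segments must be "piece" and the expected CID.
  return (
    parts.length >= 2 &&
    parts[parts.length - 2] === "piece" &&
    parts[parts.length - 1] === pieceCid
  );
}
```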

The API returns a per-piece status that progresses through pending => inProgress => retrying and terminates at complete or failed.

You can poll the API by repeating the same POST /pdp/piece/pull request: it's idempotent, keyed on (service, sha256(extraData), dataSetId, recordKeeper). The SP returns the current status for each piece, so you can hit it repeatedly and wait for success. This is built into synapse-core as a "waitFor" function.
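The real implementation is synapse-core's "waitFor"; as an illustration, the polling loop looks roughly like this (names and the `getStatuses` callback, which stands in for re-sending the idempotent POST, are assumptions):

```typescript
// Statuses from the progression described above.
type PullStatus = "pending" | "inProgress" | "retrying" | "complete" | "failed";

// Repeatedly fetch per-piece statuses until every piece reaches a
// terminal state (complete or failed), or the polling budget runs out.
async function waitForPull(
  getStatuses: () => Promise<PullStatus[]>,
  intervalMs = 1000,
  maxAttempts = 60,
): Promise<PullStatus[]> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const statuses = await getStatuses();
    if (statuses.every((s) => s === "complete" || s === "failed")) {
      return statuses;
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("pull did not settle within the polling budget");
}
```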

Pull task mechanics (curio/tasks/pdp/task_pull_piece.go):

  • Background task polls every 10s for pending items
  • Downloads piece from source URL
  • Computes CommP and verifies against expected CID
  • Stores in StashStore, creates parked_pieces entry, StorePiece task then moves to "long-term" storage
  • Max 5 retries with exponential backoff (10s base, 2x factor, 5min cap)
  • 1 hour download timeout
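The retry schedule above implies the following delays; a small sketch (constants mirror the description of task_pull_piece.go in this issue, and whether the first retry waits the base delay or base × factor is an assumption on my part):

```typescript
// Retry delays for the pull task as described: 10s base, 2x factor,
// 5 minute cap, at most 5 retries.
function retryDelaysMs(
  maxRetries = 5,
  baseMs = 10_000,
  factor = 2,
  capMs = 300_000,
): number[] {
  return Array.from({ length: maxRetries }, (_, i) =>
    Math.min(baseMs * factor ** i, capMs),
  );
}
```

With these constants the cap is never reached within 5 retries: the delays are 10s, 20s, 40s, 80s, 160s.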

Discrete Testable Units

The pull flow decomposes into three independently testable operations:

  1. SP can receive + park an uploaded piece: already tested by data storage check
  2. SP can pull from a URL + park: new, this issue
  3. SP can do add-pieces on chain for a parked piece: already tested by data storage check

Unit 2 is the gap. Once a piece is parked via pull, the add-pieces flow is identical to direct upload.

Possible Approach: DealBot-Hosted Piece Endpoint

(From chat with @SgtPooki) DealBot hosts a temporary piece retrieval endpoint on the existing backend.

  • Serve via /api/piece/{pieceCid}. This already routes through Caddy to the Node backend (the Caddy container can't dynamically serve new assets, but /api/* forwards to the backend)
  • Generate the random piece, convert to CAR, compute pieceCID
  • Enable the endpoint for a specific pieceCID, limited time
  • Tell the SP to pull from https://dealbot.filoz.org/api/piece/{pieceCid}
  • Poll the SP for status until complete or failed
  • Clean up the endpoint + stored piece data after confirmation
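The "enable for a specific pieceCID, limited time" and "clean up" steps above amount to a small gate in front of the piece route. A minimal sketch (PieceGate and its method names are hypothetical; the actual Caddy/backend wiring is out of scope):

```typescript
// Tracks which pieceCIDs the temporary endpoint may serve, and until when.
class PieceGate {
  private enabled = new Map<string, number>(); // pieceCid -> expiry (ms epoch)

  // Enable serving a piece for a limited time.
  enable(pieceCid: string, ttlMs: number, now = Date.now()): void {
    this.enabled.set(pieceCid, now + ttlMs);
  }

  // Decide whether GET /api/piece/{pieceCid} should serve bytes.
  shouldServe(pieceCid: string, now = Date.now()): boolean {
    const expiry = this.enabled.get(pieceCid);
    return expiry !== undefined && now < expiry;
  }

  // Clean up after confirmation (or on expiry).
  revoke(pieceCid: string): void {
    this.enabled.delete(pieceCid);
  }
}
```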

Lastly, we want to verify the piece is actually there. This is tricky: the SP can claim it has the piece, but do we believe it? We have two options: download the piece back from the SP and check that the bytes match the original (hash or plain byte compare), or proceed to AddPieces and make the SP prove it, which gives a level of assurance that's probably acceptable.
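The first option, comparing the retrieved bytes to the original, could look like this (the function name is made up; comparing SHA-256 digests rather than raw bytes is a design choice, so DealBot only needs to retain a digest of the original piece, not the piece itself):

```typescript
import { createHash } from "node:crypto";

// Verify the piece retrieved back from the SP matches the original
// by comparing SHA-256 digests.
function sameBytes(original: Uint8Array, retrieved: Uint8Array): boolean {
  const digest = (b: Uint8Array) => createHash("sha256").update(b).digest("hex");
  return digest(original) === digest(retrieved);
}
```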

An alternative flow exists where we use an existing SP and do a proper SP-to-SP pull. We wouldn't even need to do an AddPieces; we'd just use a second SP as a staging ground and expect it to GC the piece. But the DealBot-hosted endpoint has a few advantages over this:

  • No ambiguity: if the pull fails, the SP is the problem, not the source
  • No dependency on another SP being healthy
  • Tighter metrics: DealBot can measure time-to-first-byte and throughput from its own side

Metrics

We could add new metrics for pull checks; some ideas:

  • pullRequestMs: DealBot => SP pull request latency (initial POST)
  • pullCompletionMs: Pull request => SP reports complete
  • pullFirstByteMs: Time from SP connecting to DealBot endpoint to first byte served (if DealBot-hosted)
  • pullThroughputBps: Bytes served / time (if DealBot-hosted)
  • pullStatus: Counter with a value label: complete, failed, retrying (would need deeper control than the higher-level Synapse APIs offer)
