Proposal: IntactPack - storing UnixFS data in contiguous blocks for multi-mode Filecoin retrieval #513

@rvagg

This started as a Notion document but it's kind of hidden there, although I've now published it for public viewing and commenting. I don't want it to get lost as I start to focus on other things. I did intend to code up a version of this in go-car and work up a spec PR but I'm worried that it'll get lost in the wash so I'm recording this here for permanency and hopefully a bit more engagement with other folks.

Why Not Both? Packing Content for IPLD vs Piece (IntactPack)

TL;DR

Instead of packing and storing file / blob data like this:

(figure: standard UnixFS packing, raw leaves interleaved with reference-layer blocks in the CAR)

Pack and store it like this:

(figure: IntactPack layout, reference-layer blocks followed by the contiguous file bytes)

Background

There currently exist two views, and two separate uses, of Filecoin storage of content: IPLD-block focused, and Filecoin as an opaque block device. These lenses shape how the nature of data stored on Filecoin is viewed, but they also impact the technology choices as we move up the stack.

Vision 1: IPLD block focused

  1. Data preparation: packed into CARs, blocks are kept to the typical ~1MB maximum for compatibility and incremental verifiability, file data is encoded as UnixFS (or similar)
  2. Retrievals: Bitswap (per-block) or HTTP Trustless (per-block or a broad selector description for an IPLD DAG)

Vision 2: Filecoin is an opaque block-device

  1. Data preparation: however you like; it is currently typical to still pack into standard CARs, but it is very important for clients to store metadata about the layout of their pieces so they can fetch the blocks or ranges they care about straight out of pieces.
  2. Retrievals: HTTP Piece retrieval with range requests to slice and dice

Competing visions of Filecoin as an IPLD-block storage layer vs an opaque byte store are older than the network itself. Opaque bytes make everything on the Storage Provider side significantly simpler and cheaper. Yet deep IPLD integration provides significant value to the network (making this case is beyond the scope of this document, but it is a strongly held conviction for some).

Planning for rearchitecting miner and markets software takes the cost-and-complexity view that the IPLD layer is a value-add and should be considered an optional extra, to be added if/when it’s needed/demanded. Storing and proving pieces is the critical activity, retrieving pieces is significantly simpler and more efficient (in many definitions of “efficient”), and indexing at the IPLD-block level is complexity that has been persistently difficult to solve and is one of the largest costs and risks for scalability.

Finding “Why Not Both?” Approaches

The ideal for storage providers is a choose-your-complexity model, where the simplest form of the software stack has the lowest complexity and cost, but complexity can be added to increase the value-add to a storage provider’s customers where it is needed.


CommPact is one attempt at describing a world where piece data can also be viewed through an IPLD lens, by re-purposing the piece commitment proving tree over the raw piece data. This path is one form of in-between (“why not both?”) approach. In this form, the data is piece-first, in that it is prepared with the perspective of it being stored in a piece, and the IPLD lens can be applied after-the-fact as required (i.e. the IPLD lens is optional).

Why Not Both: IPLD-first

To serve both worlds, we could take an approach where data preparation still involves packing data in a CAR, including (some) UnixFS encoding, but lays out that data in such a way that it can also be retrieved in whole (or large) portions via piece retrieval. Markets software that is aware of this format can serve content both as standard, small, IPLD blocks, and/or as raw pieces. A storage provider who does not want or need to index IPLD blocks can still serve the content to clients that have stored metadata about content location, while a storage provider that wants to value-add with full Bitswap and HTTP retrieval can extract and serve blocks using the minimal metadata included in the packed form. Most importantly, this proposal does not break existing specifications for UnixFS or the CAR format, though it does add some challenges for existing use-cases that make assumptions about what CARs contain.

IntactPack

Input data can be opaque bytes, or it can be file data. In both cases, it is UnixFS encoded down to the reference layer; i.e. the “reference layer” is the UnixFS DAG prepared above the raw byte leaves, as if those leaves were also going to be stored in the CAR. This reference layer is then packed in a CAR, and then the whole byte blob is added to the CAR as a single (potentially very large) IPLD block, complete with its own CID prefix.

This leaves us with two separate parts to a CAR: the reference layer, which has links to blocks that don’t technically exist as sections in the CAR, and the large byte block, which has a CID that is not referenced by anything.

The process can be repeated for any number of blobs in order to fill up a CAR to the required size. Likewise, blobs that are larger than the required size can be split up into smaller chunks and treated as separate blobs (note there are some complexities in this case that require special handling, described below). A sketch of the resulting layout follows.
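
To make the layout concrete, here is a minimal Go sketch of writing this shape into a CARv1 payload. It is hedged: `refBlock` and `packIntact` are illustrative names, not real go-car or go-unixfsnode APIs, and CARv1 header writing is elided. Each section is `varint(len(CID) + len(data)) || CID || data` per the CARv1 spec.

```go
package intactpack

import (
	"io"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
	"github.com/multiformats/go-varint"
)

// refBlock is an illustrative stand-in for a UnixFS reference-layer block
// produced by an encoder; not a real go-car or go-unixfsnode type.
type refBlock struct {
	cid  cid.Cid
	data []byte
}

// writeSection writes one CARv1 section:
// varint(len(cid) + len(data)) || cid || data
func writeSection(w io.Writer, c cid.Cid, data []byte) error {
	cb := c.Bytes()
	if _, err := w.Write(varint.ToUvarint(uint64(len(cb) + len(data)))); err != nil {
		return err
	}
	if _, err := w.Write(cb); err != nil {
		return err
	}
	_, err := w.Write(data)
	return err
}

// packIntact writes the reference-layer blocks first, then the entire blob
// as one large raw-codec block with its own CID. (CARv1 header elided.)
func packIntact(w io.Writer, refLayer []refBlock, blob []byte) (cid.Cid, error) {
	for _, b := range refLayer {
		if err := writeSection(w, b.cid, b.data); err != nil {
			return cid.Undef, err
		}
	}
	// CID for the contiguous bytes: raw codec (0x55) + sha2-256.
	mh, err := multihash.Sum(blob, multihash.SHA2_256, -1)
	if err != nil {
		return cid.Undef, err
	}
	blobCid := cid.NewCidV1(cid.Raw, mh)
	return blobCid, writeSection(w, blobCid, blob)
}
```

The key property is that nothing here violates CARv1: the big raw block is an ordinary section, just an unusually large one.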

Standard UnixFS encoding and packing of files / blobs into CAR format introduces challenges for Piece retrievals:

  1. The file is chunked into small blocks, each of which is length- and CID-prefixed in the CAR
  2. Raw leaves are packed interleaved with the reference layer blocks that hold the graph together

Standard UnixFS encoding of a file / blob: Raw byte blocks chunked at bottom layer, reference layer above to contain the leaves in a graph.

Standard UnixFS file / blob packed in order in CAR format.

A full CAR with multiple files / blobs

Retrieving a file / blob using the HTTP piece protocol involves:

  • Keeping a reference to the location of the start of the file in the piece
  • Retrieving and decoding it as UnixFS
  • Partial retrievals require indexing individual block locations and their file offsets; the difficulty increases with the size of the range required

UnixFS encoding of a file / blob with the reference layer only, leaves are virtual and the UnixFS blocks contain the details of where they exist in the original stream of bytes

UnixFS reference layer blocks packed together in a CAR followed by single raw block containing original (non-chunked) bytes

A full CAR with multiple files / blobs with non-standard format, original files are fully contiguous

Retrieving a file / blob using this modified UnixFS packing requires less accounting (in its minimal form) and a simpler process:

  • In a minimal form, only the offset of the start of the file (or its CAR section) is required to perform full-file fetches
  • Decoding using UnixFS is optional: incremental verification is still possible if the reference layer is also fetched or is stored locally; likewise the full file is still content-addressed with its own CID for the full series of bytes and can be post-verified where incremental verification isn’t as critical
  • Partial retrievals are simpler, offsets are easily calculated (see the sketch after this list), and incremental verification is still possible if the reference layer is fetched or stored
  • The reference layer alone can be easily fetched separately as a “proving tree”, which can be useful for many purposes
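
As an illustration of how simple the partial-retrieval arithmetic becomes, here is a hedged Go sketch. `fileStart` (the byte offset of the blob within the piece payload, known from client-side metadata or from decoding the reference layer) and the shape of the piece-retrieval URL are assumptions.

```go
package intactpack

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange retrieves bytes [from, to] (inclusive) of a file stored
// contiguously in a piece. Because the file bytes are contiguous, the
// piece-level offsets are simply fileStart+from .. fileStart+to.
func fetchRange(pieceURL string, fileStart, from, to uint64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, pieceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", fileStart+from, fileStart+to))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}
```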

Handling Very-large Blobs

When packing within Filecoin sector constraints, CARs must be under the ~32G (or ~64G) threshold to make a storable piece. Data that is larger than this threshold can be handled by splitting the bytes into roughly sector-sized chunks and storing each chunk prefixed by a full, valid reference layer (only) for that chunk.

The first chunk CAR is straightforward to read and understand. Understanding subsequent chunk CARs requires inspecting the initial portion of the reference layer to find the first leaf block’s file offset. The length of the chunk can be inferred either from the raw block’s length or from the reference layer, by also inspecting the last leaf block’s offset and length.
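
The chunk-splitting arithmetic itself is trivial; a rough sketch follows, where `maxPayload` (the number of raw bytes that fit in one piece after CAR and padding overhead) is an assumption a data-prep tool would calculate for its target sector size.

```go
package intactpack

// chunkBoundaries splits an oversized blob into [start, end) ranges, each
// of which gets its own reference layer and its own CAR. maxPayload (raw
// bytes per piece after CAR and padding overhead) is an assumed input.
func chunkBoundaries(blobSize, maxPayload uint64) [][2]uint64 {
	var chunks [][2]uint64
	for start := uint64(0); start < blobSize; start += maxPayload {
		end := start + maxPayload
		if end > blobSize {
			end = blobSize
		}
		chunks = append(chunks, [2]uint64{start, end})
	}
	return chunks
}
```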

Tooling and Protocol Integration

This packing format does not strictly break any existing specifications:

  1. There currently exists within the IPFS and Filecoin stack no strict requirement for block ordering in CARs, except on the HTTP Trustless transport protocol (see below)
    1. Likewise: there currently exists no strict requirement for a CAR to contain a complete DAG
  2. The CAR format doesn’t have any strict limits on the block sizes allowed to be stored; it is theoretically possible to store a single 1TB IPLD block in a CAR and have it be considered valid
  3. Filecoin deals are optionally CARs which are optionally indexed by storage providers; this format will still be successfully indexed by current Boost implementations

CAR Format Integration

This format does not strictly require changes to the CAR format, however in some cases it is useful to signal that a CAR is packed using this format, with reference layers packed before the full contiguous bytes of a file / blob. The CARv2 format offers a lightweight “characteristics” signalling bitfield in its prefix which can be used to provide a boolean “is IntactPack format” signal to consumers. This minimally just requires a CARv2 header prefixing the CARv1 payload.

It is theoretically possible to determine whether a file is packed in this format by reading its contents, checking that the order of blocks is as expected and verifying that the raw block chunks match the expected raw leaves contained in the reference layer. However, this requires both a full ingest of the CAR and the hashing of many blocks, so it is far less efficient than explicit characteristics signalling.
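
As a sketch of what such signalling could look like at the byte level: the CARv2 prefix is an 11-byte pragma followed by a 40-byte header containing a 128-bit characteristics bitfield and three little-endian uint64 offsets. The specific bit used for “is IntactPack” below is purely hypothetical; no such characteristic is specified today and it would need to go through the CARv2 spec process.

```go
package intactpack

import "encoding/binary"

// carv2Prefix builds the fixed 11-byte CARv2 pragma plus the 40-byte v2
// header: a 128-bit characteristics bitfield, then the data offset, data
// size and index offset as little-endian uint64s.
func carv2Prefix(dataOffset, dataSize, indexOffset uint64, intactPack bool) []byte {
	buf := make([]byte, 0, 11+40)
	// Pragma: varint(10) || dag-cbor {"version": 2}
	buf = append(buf, 0x0a, 0xa1, 0x67, 'v', 'e', 'r', 's', 'i', 'o', 'n', 0x02)

	characteristics := make([]byte, 16)
	if intactPack {
		// Hypothetical "IntactPack" bit; the top bit is the spec's existing
		// fully-indexed characteristic, so this borrows the next one down.
		characteristics[0] |= 0x40
	}
	buf = append(buf, characteristics...)
	buf = binary.LittleEndian.AppendUint64(buf, dataOffset)
	buf = binary.LittleEndian.AppendUint64(buf, dataSize)
	buf = binary.LittleEndian.AppendUint64(buf, indexOffset)
	return buf
}
```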

Storage Provider Integration

HTTP piece retrieval is the minimum requirement, but a storage provider may optionally index and serve the individual raw leaf IPLD blocks of files within these CARs. Boost can be made IntactPack aware by (see the indexing sketch after this list):

  1. Reading CARs and looking for the IntactPack characteristic in the CARv2 header
  2. Indexing the reference layer blocks as they exist in the CAR, mapping byte offsets within the piece
  3. Indexing the virtual raw leaf chunk blocks as they are referenced in the reference layer, mapping byte offsets within the piece
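
Here is a sketch of the two index mappings such an indexer would build while scanning a CAR; `decodeLeafRefs` is a hypothetical stand-in for real UnixFS / dag-pb decoding, and the surrounding types are illustrative rather than Boost APIs.

```go
package intactpack

import "github.com/ipfs/go-cid"

// blockLocation records where a block's bytes live within the piece payload.
type blockLocation struct {
	pieceOffset uint64
	length      uint64
}

// leafRef describes one virtual raw leaf as recorded in the reference layer.
type leafRef struct {
	cid        cid.Cid
	fileOffset uint64 // offset within the original contiguous blob
	length     uint64
}

// decodeLeafRefs is a hypothetical stand-in for decoding a reference-layer
// block into its leaf links, deriving per-leaf offsets from the blocksizes.
func decodeLeafRefs(refBlockData []byte) []leafRef {
	// ... real implementation would decode dag-pb / UnixFS here
	return nil
}

// indexIntactPack builds the two mappings: reference-layer blocks map to
// their physical CAR sections (step 2), raw leaves map to slices of the
// contiguous blob (step 3).
func indexIntactPack(index map[cid.Cid]blockLocation, refCids []cid.Cid, refData [][]byte, refOffsets []uint64, blobStart uint64) {
	for i, data := range refData {
		// Reference-layer blocks physically exist in the CAR.
		index[refCids[i]] = blockLocation{pieceOffset: refOffsets[i], length: uint64(len(data))}
		// Raw leaves are virtual: their bytes live inside the contiguous blob.
		for _, leaf := range decodeLeafRefs(data) {
			index[leaf.cid] = blockLocation{pieceOffset: blobStart + leaf.fileOffset, length: leaf.length}
		}
	}
}
```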

Retrieval Tooling Integration

Retrievals can take multiple forms with this format, each of which can be built into Lassie or other retrieval clients:

  1. Piece retrievals using client-side knowledge of the piece layout: a client with a copy of the reference layer (or some form of it) may retrieve full or partial pieces of the byte blobs by:
    1. Knowing the piece offset of the start of the file / blob data
    2. Issuing HTTP range requests to fetch the bytes required
    3. Incrementally verifying bytes as they are retrieved, using the reference layer to verify the virtual blocks as they are received (see the verification sketch after this list)
  2. Piece retrievals with minimal client-side knowledge can bootstrap by fetching the beginning of a piece payload and decoding the reference layer as it is streamed; because this data is at the beginning of a payload (even for payloads with directories), a client can move on to retrieval as per the mode described above.
  3. Plain IPLD retrievals can be supported with Storage Provider integration described above, where the Storage Provider’s stack is aware of the virtual blocks and can serve them as plain IPLD blocks, via any supported existing protocol (Bitswap, Graphsync, HTTP Trustless)
  4. Client-side awareness of this packing can also speed up retrievals: by extending the HTTP Trustless protocol, with signalling via Accept and Content-Type headers to indicate that this packing format is used, a client could fetch the reference layer before the file / blob data and incrementally verify it using its knowledge of the UnixFS mapping of reference layer to the following bytes.
    1. Note: There already exists a “skip raw leaves” proposal for HTTP Trustless retrievals that describes a mode that would transfer what is described here as the “reference layer”. There is likely significant overlap in the implementation of that proposal and this proposal. IPIP-0445: Option to Skip Raw Blocks in Gateway Responses ipfs/specs#445
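
Sub-point 1.3 above (and equally modes 2 and 4) reduces incremental verification to one primitive: hash each received leaf’s bytes and compare against the CID recorded in the reference layer. A minimal Go sketch, assuming the leaf’s expected CID is at hand:

```go
package intactpack

import (
	"bytes"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

// verifyLeaf checks that the received bytes for one virtual leaf block
// hash to the multihash inside its CID from the reference layer.
func verifyLeaf(expected cid.Cid, data []byte) error {
	prefix := expected.Prefix()
	mh, err := multihash.Sum(data, prefix.MhType, prefix.MhLength)
	if err != nil {
		return err
	}
	if !bytes.Equal(mh, expected.Hash()) {
		return fmt.Errorf("leaf %s failed verification", expected)
	}
	return nil
}
```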
