FRC Proposal: Self-Describing Data Aggregation (Data Segment Index v2) #1216
---
Initial thoughts after a first pass:
We could consider an approach similar to PieceCIDv2, where the raw size is encoded as a relationship to the full size, rather than having two complete size values taking up ~64 bits each, if space is a concern at all.
Since we have 64 bits, maybe we shouldn't bother with uvarint encoding.
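As a hypothetical illustration of that trade-off (not part of the proposal): storing the padding delta `Size - RawSize` as a uvarint usually takes 1 byte, versus 8 bytes for a second fixed-width u64.

```python
# Illustrative sketch: encode the (usually tiny) padding delta as a
# LEB128 unsigned varint instead of a second full 64-bit size field.

def uvarint(n: int) -> bytes:
    """Standard LEB128 unsigned varint encoding."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

size, raw_size = 1 << 30, (1 << 30) - 100  # 1 GiB piece, 100 bytes of padding
delta = uvarint(size - raw_size)
assert len(delta) == 1  # vs. 8 bytes for a fixed u64
```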
Can you elaborate on how you imagine this being useful in the future? If we don't have a good imagined use case for this, then it might end up being as useless as the "capabilities" field in CARv2.
I'm not sure we have enough space here for robust ACL implementations, although I could imagine ACL multicodecs telling us that a certain other piece contains the extended ACL information, so maybe these fields can be used as a redirect there. E.g. the ACL node tells us that the ACL is defined by UCAN and has a CID for the UCAN; then one of the other pieces packaged is identified by a
---
Still reading and processing, but I will start commenting on things that jump out.
These properties are not verifiable by the client, as the client receives only the inclusion proof for their index entry. So while we can say that implementations MUST follow this, the reader implementation MUST NOT refuse to use these indexes.
---
Slides from FDS BA: https://docs.google.com/presentation/d/14jgME8p9O2FLx44930cOhE7riNDcWMV8jdrMHRp-Re0/edit?usp=sharing
Misaligned CommP inclusion proofs: red nodes are the path which needs to be proven; purple nodes are other data in the sector and zero-commitments in the CommP computation. The red merkle path just needs to prove the top-level node of each blue tree to prove inclusion of the entire misaligned piece, meaning the cost of the entire inclusion proof is only ~2x that of a normal single-subtree inclusion proof. That comes with the obvious downside that the client must receive offset information to compute the final CommP.
---
CommPat hasher code is now in filecoin-project/go-fil-commp-hashhash#30; it works under the assumption that
---
Another interesting use of Proof of Data Segment Inclusion v2 is that it can be applied to existing data packed in .car files, to prove that CIDs stored in a car file are actually stored in a given sector (possibly very interesting to FIDL).
---
Notes from a call about this:
---
The current proposal for CommPv2 also has the ability to work in a semi-tightly packed mode, where the data start offset is aligned to the start of data, giving users a CommP which matches CommP v1. Note the output CommP from filecoin-project/go-fil-commp-hashhash#30 for
This is visible in the (slightly jank) visualizer at https://ipfs.io/ipfs/bafkreiacuhuhzbfvplw7d3lzei5vg4uyczoebpq3ihhffplij4he3opply, in the bottom visualization, when sliding the 'Segment Offset' slider.
---
Alternate idea which makes client hash/CID calculation much simpler and removes the need for communication, coming from the insight that:
The structure would go from: To:
With that, the format becomes spiritually a .car file, except without a header which breaks data alignment, without metadata woven into the data which obliterates alignment, and with a real, useful index at the end of the payload. It is as easy to handle as a .car, since most tooling can easily be adjusted, and fundamentally the CIDs are just normal IPLD CIDs.

---
Starting this as a discussion so that it's easier to gather and respond to feedback. The overarching goals for this proposal are to support more advanced aggregation strategies and unlock innovation for onramps which have those needs.
Self-Describing Data Aggregation (Data Segment Index v2)
Simple Summary
This FRC extends FRC-0058 Verifiable Data Aggregation by adding content type information, raw data size, access control signaling, and flexible alignment to Data Segment Index entries. It also introduces Proof of Data Segment Inclusion v2 which supports arbitrary alignments and enables tighter data packing while maintaining verifiability.
Abstract
FRC-0058 established a Data Segment Index format for proving inclusion of client data within aggregated deals. This proposal extends the index entry format to include:
The proposal also introduces Proof of Data Segment Inclusion v2, which uses dual inclusion proofs (leftmost and rightmost leaves) to support various alignment strategies while maintaining compatibility with v1 for traditionally aligned segments.
Change Motivation
The original FRC-0058 Data Segment Index has several limitations:
By extending the index format and proof mechanism, we enable:
Specification
Data Segment Index Entry v2
The Data Segment Index Entry v2 extends the v1 format by expanding to 256 bytes post-Fr32-padding (4 nodes). All offset and size fields represent pre-Fr32-padding byte positions and lengths.
Each entry consists of 256 bytes after Fr32 bit padding, providing 1016 bits of usable space across four 254-bit nodes. Each entry is aligned to a 256-byte boundary after Fr32 bit padding.
Entry structure (1016 bits total):
Field Details
Offset (64 bits)
- The segment occupies the byte range `Offset` to `Offset + Size`

Size (64 bits)

RawSize (62 bits)
- The raw data occupies `Offset` to `Offset + RawSize`
- MUST satisfy `RawSize <= Size` (the trailing `Size - RawSize` bytes are padding)

Multicodec (64 bits)
- `0x55`: Raw binary data
- `0x0202`: CAR format (IPLD)

MulticodecDependent (254 bits)
- For raw (`0x55`) and CAR (`0x0202`) codecs: MUST be set to zero

ACLType (8 bits)
- `0`: No ACL, data is publicly retrievable

ACLData (64 bits)

Reserved (56 bits)

Checksum (126 bits)
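The field widths above can be sketched as a simple bit-packing routine. This is a hypothetical illustration only: the spec's actual serialization (bit order, Fr32 node boundaries) may differ, and the fields listed here account for 762 of the 1016 usable bits.

```python
# Hypothetical MSB-first packing of the v2 entry fields listed above.

FIELDS = [  # (name, width in bits)
    ("Offset", 64),
    ("Size", 64),
    ("RawSize", 62),
    ("Multicodec", 64),
    ("MulticodecDependent", 254),
    ("ACLType", 8),
    ("ACLData", 64),
    ("Reserved", 56),
    ("Checksum", 126),
]

def pack_entry(values: dict) -> int:
    """Concatenate field values into one big integer, first field highest."""
    acc = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        acc = (acc << width) | v
    return acc

def unpack_entry(acc: int) -> dict:
    """Inverse of pack_entry."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = acc & ((1 << width) - 1)
        acc >>= width
    return out

entry = pack_entry({"Offset": 127, "Size": 4096, "RawSize": 4000, "Multicodec": 0x55})
assert unpack_entry(entry)["RawSize"] == 4000
```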
Index Entry Validation
A Data Segment Index entry is defined as valid if:
- The entry falls within the index region (from the start of the index to the end of the deal)
- `RawSize <= Size`
- `Offset + Size` does not exceed the start of the index

Alignment Recommendations
While this specification allows arbitrary alignment, the following is RECOMMENDED:
Offset alignment: 127 bytes (pre-Fr32-padding)
Size padding: Minimal padding to maintain TreeD structure
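The recommended offset alignment can be sketched as follows (the helper name `align_up` is illustrative, not from the spec):

```python
# Round a pre-Fr32 byte offset up to the next 127-byte boundary,
# per the RECOMMENDED alignment above.

ALIGN = 127  # pre-Fr32-padding bytes; expands to 128 bytes post-Fr32

def align_up(offset: int, quantum: int = ALIGN) -> int:
    """Round offset up to the next multiple of quantum (ceiling division)."""
    return -(-offset // quantum) * quantum

assert align_up(0) == 0
assert align_up(1) == 127
assert align_up(127) == 127
assert align_up(128) == 254
```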
Storage Provider Data Processing
After receiving deal data, a Storage Provider SHOULD:
When serving retrievals:
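The entry-validation rules from the Index Entry Validation section can be sketched as a minimal check (field names follow the v2 entry layout; `index_start`, the byte offset where the index region begins, is an assumed parameter):

```python
# Sketch of the Index Entry Validation rules: RawSize must not exceed
# Size, and the segment must end before the index region begins.

def entry_is_valid(entry: dict, index_start: int) -> bool:
    return (
        entry["RawSize"] <= entry["Size"]
        and entry["Offset"] + entry["Size"] <= index_start
    )

assert entry_is_valid({"Offset": 0, "Size": 128, "RawSize": 100}, 1024)
assert not entry_is_valid({"Offset": 1000, "Size": 128, "RawSize": 100}, 1024)
```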
Proof of Data Segment Inclusion v2
Proof of Data Segment Inclusion v2 (PoDSIv2) extends the original PoDSI to support flexible alignment while maintaining verifiability. The proof structure adapts based on segment alignment characteristics.
Proof Structure
The proof consists of:
Proof Components
The aggregator provides:
Verification Algorithm
The client possesses $\mathrm{CommDS}$, $\mathrm{rawSize}$, and optionally the raw data.
Verification steps:
Verify index entry inclusion:
Verify data inclusion (method depends on alignment):
Case A: Power-of-two aligned with v1-compatible padding
If $\mathrm{size} = 2^{\lceil \log_2(\mathrm{rawSize}) \rceil}$ and the offset is suitably aligned:
Case B: Power-of-two aligned without padding
If the offset is power-of-two aligned but $\mathrm{rawSize} < \mathrm{size}$:
Case C: Arbitrary alignment
For non-power-of-two aligned segments:
Verify size consistency:
Verify on-chain:
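The three-way case split above can be sketched as a dispatch function; the exact alignment predicates are assumptions inferred from the case names, not normative definitions:

```python
# Illustrative selection between verification cases A, B, and C.

def next_pow2(n: int) -> int:
    """Smallest power of two >= n."""
    return 1 << max(0, (n - 1).bit_length())

def verification_case(offset: int, raw_size: int, size: int) -> str:
    pow2_aligned = size == next_pow2(size) and offset % size == 0
    if pow2_aligned and size == next_pow2(raw_size):
        return "A"  # v1-compatible padding: size is the ceil-pow2 of rawSize
    if pow2_aligned:
        return "B"  # power-of-two aligned, but rawSize < size without padding
    return "C"      # arbitrary alignment: dual boundary proofs required

assert verification_case(0, 100, 128) == "A"
assert verification_case(0, 100, 256) == "B"
assert verification_case(5, 100, 128) == "C"
```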
Piece CID v2 Computation
With the RawSize field, clients can compute Piece CID v2 (FRC-0069):
Where:
- `CommDS` is the commitment from the index
- `rawSize` determines the data size
- `padding` is the null bytes added
- `treeHeight` is derived from the padded size after Fr32 conversion

Compatibility with v1
PoDSIv2 maintains backward compatibility with v1:
Design Rationale
Why 4 Nodes (256 bytes)?
Expanding from 2 nodes (64 bytes) to 4 nodes (256 bytes) provides:
Why Multicodec-Dependent Space?
The 254-bit multicodec-dependent field (Node 3) enables:
Why ACL Signal (9 bytes)?
The ACL signal fields (ACLType + ACLData) provide:
Why Recommend 127-byte Alignment?
127 bytes (pre-Fr32-padding):
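A sketch of the arithmetic behind this recommendation: Fr32 padding inserts 2 zero bits per 254-bit field element, so every 127 bytes of pre-padding data expand to exactly 128 bytes post-padding, and the padded size and tree height follow directly. Helper names are illustrative.

```python
import math

def fr32_expand(pre_pad: int) -> int:
    """Fr32 expansion: 127 pre-padding bytes become 128 post-padding bytes."""
    assert pre_pad % 127 == 0
    return pre_pad // 127 * 128

def tree_height(raw_size: int) -> int:
    """Illustrative tree-height derivation from raw size (32-byte leaves)."""
    pre = -(-raw_size // 127) * 127                # null-pad to a 127-byte multiple
    padded = 1 << math.ceil(math.log2(fr32_expand(pre)))  # round up to power of two
    return int(math.log2(padded // 32))            # height over 32-byte merkle leaves

assert fr32_expand(127) == 128
assert tree_height(127) == 2  # 128 padded bytes -> 4 leaves -> height 2
```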
Storage Efficiency Trade-offs
Backwards Compatibility
Reading v1 Indexes
v2-aware implementations MUST support reading v1 indexes:
- `RawSize = Size`
- `Multicodec = 0x55`
- `ACLType = 0`
- `ACLData = 0`
- `Reserved = 0`

Index Version Detection
Implementations can detect index version by:
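The v1 defaulting rules above can be sketched as follows (the function name is hypothetical):

```python
# Lift a v1 index entry (which carries only offset and size) into the
# v2 field set using the mandated defaults.

def entry_from_v1(offset: int, size: int) -> dict:
    return {
        "Offset": offset,
        "Size": size,
        "RawSize": size,     # v1 has no raw size; treat the segment as fully used
        "Multicodec": 0x55,  # raw binary
        "ACLType": 0,        # no ACL, data is publicly retrievable
        "ACLData": 0,
        "Reserved": 0,
    }

e = entry_from_v1(0, 128)
assert e["RawSize"] == 128 and e["Multicodec"] == 0x55
```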
Test Cases
Test Case 1: v1-Compatible Segment
Expected: PoDSIv2 should produce functionally equivalent proof to v1
Test Case 2: Tightly Packed Segment
Expected: Leftmost and rightmost proofs verify correct boundaries, Piece CID v2 computable
Test Case 3: Arbitrary Alignment
Expected: Boundary proofs verify segment extent, no specific Piece CID v2 proof
Test Case 4: Unknown ACL Type
Expected: Entry is valid but Storage Provider MUST refuse retrievals for this piece
Security Considerations
Increased Entry Size
Expanding entries from 64 bytes to 256 bytes (4× increase):
Malicious Padding
An aggregator could set `RawSize < Size` to hide data in padding:

MulticodecDependent Field
The 254-bit multicodec-dependent field could be misused:
ACL Signal Fields
The ACL signal fields introduce access control considerations:
Proof Complexity
More complex proofs (arbitrary alignment) could introduce verification bugs:
Incentive Considerations
Storage Provider Benefits
Client Benefits
Product Considerations
Aggregator Implementation
Aggregators implementing v2 should:
Retrieval Integration
Retrieval systems can:
Implementation
Reference implementations:
Required implementations in ecosystem:
Future Work
ACL Type Definitions
Future FRCs can define specific ACL types and their semantics:
Index Sections
Special data sections which are indexes over data in other data sections
EC Section
Erasure coding sections enable efficient storage and retrieval of data with built-in redundancy:
Cross-Sector Striping: EC can be distributed across multiple sectors, allowing a single piece to contain erasure coding stripes from multiple related deals:
EC sections would specify:
Copyright
Copyright and related rights waived via CC0.