It looks like you are implementing the Kubo defaults. Those defaults are nearing 10 years old and lack the newest features we support; I want to change them, so I am poking around the ecosystem for places that rely on them.
https://github.com/ethereum/solidity/blob/develop/libsolutil/IpfsHash.cpp
Unixfs is an open format which allows multiple writer implementations to implement their own linking logic, such as append logs, content-aware chunking (cutting around logical boundaries in the content, such as iframes in video files, content in archive formats, ...), more packed representations, ... while all of those remain automatically compatible with all reader implementations.
By design, this leads to inconsistent hashes in the ecosystem. Examples of implementations that produce different CIDs:
- github.com/Jorropo/linux2ipfs uses 2MiB raw leaves with 2MiB roots (instead of 174 links).
- github.com/ipld/go-car/cmd/car uses a different TSize logic.
- github.com/ipfs/boxo/mfs (which is available in Kubo with `ipfs files ...`) has different defaults and can produce identical files with different CIDs if you use a different sequence of copy, write, append, ... operations.
- github.com/filecoin-project/lotus (I believe) uses raw leaves with 1MiB chunks and 1024 links with some variant of blake2.
- web3.storage & nft.storage use raw leaves with 1MiB chunks.
- github.com/bmwiedemann/ipfs-iso-jigsaw chunks each file in an ISO separately and then concatenates the resulting files with the ISO metadata in a unixfs root, allowing different versions of similar ISOs to share the blocks for the unchanged files (incremental file updates).
- ... more I can't remember off the top of my head
Hopefully this serves as a demonstration that unixfs is good at tailoring to use cases, not at repeatable hashing of data.
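To make this concrete, here is a minimal sketch (in Go, using go-cid and go-multihash) of how the chunk size alone already changes the set of leaf blocks, and therefore every CID above them. Real unixfs builders additionally wrap nodes in dag-pb with TSize and layout parameters, all omitted here:

```go
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// leafCIDs splits data into fixed-size chunks and returns one raw CIDv1 per
// chunk, mimicking (in simplified form) a unixfs chunker producing raw leaves.
func leafCIDs(data []byte, chunkSize int) []cid.Cid {
	var leaves []cid.Cid
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		h, err := mh.Sum(data[off:end], mh.SHA2_256, -1)
		if err != nil {
			panic(err)
		}
		leaves = append(leaves, cid.NewCidV1(cid.Raw, h))
	}
	return leaves
}

func main() {
	data := make([]byte, 3<<20) // 3 MiB stand-in for a file

	// Same bytes, two chunker configurations: 12 leaves vs 3 leaves.
	// Different leaf sets mean different parent nodes and a different root.
	fmt.Println("256KiB chunks:", len(leafCIDs(data, 256<<10)), "leaves")
	fmt.Println("1MiB chunks:  ", len(leafCIDs(data, 1<<20)), "leaves")
}
```

Both runs describe the same 3 MiB of data, yet no block of one DAG appears in the other.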
I see 3 potential fixes:
- Add an option to the compiler to output a `.car` file. Instead of relying on `ipfs add` magically outputting the exact same CID, you do not run 2 chunkers: the solc chunker outputs the blocks into an archive, and the user then runs `ipfs dag import` (which reads the blocks as-is instead of chunking). This is how chunkers are meant to work (this, or using some transport other than car). A minimal sketch follows after this list.
- Write a proposal and a new spec for repeatable unixfs chunkers inside ipfs/specs and implement it. You could then use a single-link inline CID with metadata to embed the parameters into the CID itself, so the CID would encode `unixfs-balanced-chunksize-256KiB-dag-pb-leaves-...` and could be fed into another implementation to reproduce the same result.
- Replace all the multiblock and dag-pb logic with a `raw-blake3` CID. The reason we use the unixfs merkle-dag format is that, unlike plain sha256, it supports easy incremental verification and seeking (downloading random parts of the file without having to download the full file), and has very high exponential fanout (allowing parallel multipeer downloads). All of those features are built into some well-specified hash functions, blake3 being one of them. This removes support for the most esoteric features such as custom chunking, but in exchange adding the same file multiple times always produces the same CID.
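To make the first fix concrete, here is a minimal sketch of the compiler-side block emission, assuming go-car v2's read-write blockstore API and, for simplicity, a metadata file small enough to fit in a single raw block (the file names are illustrative only):

```go
package main

import (
	"context"
	"os"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
	carbs "github.com/ipld/go-car/v2/blockstore"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Hypothetical input: the compiler's metadata JSON, small enough that the
	// whole "dag" is a single raw block.
	data, err := os.ReadFile("metadata.json")
	if err != nil {
		panic(err)
	}

	// The CID is fully determined here, by our chunker, not by `ipfs add`.
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	root := cid.NewCidV1(cid.Raw, h)
	blk, err := blocks.NewBlockWithCid(data, root)
	if err != nil {
		panic(err)
	}

	// Write the block into a CAR file with the root recorded in the header.
	bs, err := carbs.OpenReadWrite("metadata.car", []cid.Cid{root})
	if err != nil {
		panic(err)
	}
	if err := bs.Put(context.Background(), blk); err != nil {
		panic(err)
	}
	if err := bs.Finalize(); err != nil {
		panic(err)
	}

	// The user now runs `ipfs dag import metadata.car`, which loads these
	// blocks verbatim instead of re-chunking, so the CID is reproducible.
}
```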
Regarding the third fix: blake3 is also used by default by the new github.com/n0-computer/iroh implementation.
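And a corresponding sketch of the third fix, hashing the file with blake3 and wrapping the digest in a `raw` CIDv1 so the CID depends only on the file bytes (assuming lukechampine.com/blake3 together with go-cid and go-multihash):

```go
package main

import (
	"fmt"
	"os"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
	"lukechampine.com/blake3"
)

func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}

	// blake3 is internally a fixed merkle tree, so this one digest already
	// supports incremental verification and seeking by construction.
	sum := blake3.Sum256(data)

	// Wrap the digest in a multihash and a CIDv1 with the `raw` codec:
	// no unixfs envelope, no chunking parameters, nothing use-case dependent.
	h, err := mh.Encode(sum[:], mh.BLAKE3)
	if err != nil {
		panic(err)
	}
	fmt.Println(cid.NewCidV1(cid.Raw, h))
}
```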
TL;DR:
You implement unixfs, which is not a specified repeatable hash function (the same input can hash to different hashes depending on how the internal merkle data structure is built, which is use-case dependent).
Given that your use case is simple, usually small, text files, I believe you should switch to plain blake3 instead, which is a well-specified fixed merkle tree (instead of the loose merkle dag unixfs is).
Note 0
Out of all the IPFS implementations I know, only iroh knows how to handle blake3 incremental verification yet; the others, Kubo & friends, support blake3 but as a dumb hash, so they still use unixfs + blake3 to handle files above the block limit (1~4MiB). We are interested in adding this capability in the future.
Note 1
Even though there is a one-to-many `file bytes → CID` unixfs relationship, assuming cryptographically secure hash functions there is always a unique `CID → bytes` relationship.
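This unique direction is what keeps the mapping verifiable despite the writer-side freedom: given a CID and candidate bytes, anyone can re-hash and compare. A minimal check with go-cid (the `verify` helper is just for illustration):

```go
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// verify re-hashes data with the same hash function and codec encoded in c
// and checks the result matches: the unique CID → bytes direction.
func verify(c cid.Cid, data []byte) bool {
	got, err := c.Prefix().Sum(data)
	if err != nil {
		return false
	}
	return got.Equals(c)
}

func main() {
	data := []byte("hello world")
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	c := cid.NewCidV1(cid.Raw, h)

	fmt.Println(verify(c, data))               // true
	fmt.Println(verify(c, []byte("tampered"))) // false
}
```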
Note 2
Blake3 might not be the best solution; what I am sure of is that relying on random unspecified behaviours of some old piece of software is definitely wrong. :)