
IPFS hash feature uses a non-specified algorithm which is not widely compatible in the ecosystem #14389

Open

Description

@Jorropo

It looks like you are implementing the Kubo defaults. They are nearing 10 years old and lack the newest features we support, so I want to change them and am poking around to see where people in the ecosystem rely on these defaults.

https://github.com/ethereum/solidity/blob/develop/libsolutil/IpfsHash.cpp

Unixfs is an open format that allows multiple writer implementations to implement their own linking logic, such as append logs, content-aware chunking (cutting around logical boundaries in the content, such as iframes in video files, entries in archive formats, ...), more packed representations, and so on, while all of these stay automatically compatible with all reader implementations.
As designed, this leads to inconsistent hashes in the ecosystem; examples of implementations that produce different CIDs:

Hopefully this serves as a demonstration that unixfs is good at tailoring to use cases, not at repeatable hashing of data.
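
To make the variability concrete, here is a hedged illustration using Kubo's own options rather than separate implementations (big-file.bin is a placeholder for any file larger than one 256 KiB chunk): each command hashes the same bytes, yet they print different CIDs, because they change how the unixfs DAG is built.

    # Same bytes, different unixfs DAG construction => different CIDs (Kubo).
    ipfs add --only-hash big-file.bin                          # legacy defaults (CIDv0, dag-pb leaves, 256 KiB balanced DAG)
    ipfs add --only-hash --cid-version=1 big-file.bin          # CIDv1 (also switches leaves to raw blocks in Kubo)
    ipfs add --only-hash --chunker=size-1048576 big-file.bin   # 1 MiB fixed-size chunks
    ipfs add --only-hash --chunker=rabin big-file.bin          # content-defined chunking
    ipfs add --only-hash --trickle big-file.bin                # trickle DAG layout instead of balanced

Every one of these is a valid unixfs encoding of the same file, and any reader implementation can fetch all of them.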

I see 3 potential fixes:

  1. Add an option to the compiler to output a .car file. Instead of relying on ipfs add magically producing the exact same CID, you do not run two chunkers: the solc chunker would output the blocks into an archive, and the user would then run ipfs dag import (which reads the blocks as-is instead of chunking). This is how chunkers are meant to work (this, or using some other transport than car); see the first sketch after this list.
  2. Write a proposal for a new spec for repeatable unixfs chunkers inside ipfs/specs and implement it; you could then use a single-link inline CID with metadata to embed the chunking parameters into the CID. The CIDs would encode something like unixfs-balanced-chunksize-256KiB-dag-pb-leaves-... and could be fed into another implementation to reproduce the same result.
  3. Replace all the multiblock and dag-pb logic with a raw-blake3 CID. The reason we use the unixfs merkle-DAG format is that, unlike plain sha256, it supports easy incremental verification and seeking (downloading random parts of the file without having to download the full file), and it has a very high exponential fanout (which allows parallel multipeer downloads).
    All of these features are built into some well-specified hash functions, blake3 being one of them. This drops the most esoteric features such as custom chunking, but in exchange adding the same file multiple times always produces the same CID; see the second sketch after this list.
    Blake3 is also used by default by the new github.com/n0-computer/iroh implementation.
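
For fix 1, the consumer side already exists; only the compiler side that writes the .car would be new. A minimal sketch, assuming a hypothetical metadata.car that solc has written with its own chunker:

    # The blocks and the root CID were computed by solc, not by ipfs add.
    # ipfs dag import reads the blocks as-is instead of re-chunking,
    # so the CID stays exactly whatever the compiler produced.
    ipfs dag import metadata.car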
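
For fix 3, a hedged sketch of what the repeatable hash looks like today, assuming a Kubo build where the blake3 multihash is available (per Note 0 below, Kubo supports it as a plain hash) and a metadata file small enough to fit in a single block (metadata.json is a placeholder name):

    # Raw leaf + blake3 multihash: the CID is a pure function of the file bytes.
    ipfs add --only-hash --cid-version=1 --raw-leaves --hash=blake3 metadata.json

    # For a single-block file the digest inside that CID is the plain blake3
    # hash of the bytes, so it can be reproduced with no IPFS tooling at all:
    b3sum metadata.json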

TL;DR:

You implement unixfs, which is not a specified repeatable hash function (the same input can hash to different CIDs depending on how the internal merkle data structure is built, which is use-case dependent).
Given that your use case is simple, usually small text files, I believe you should switch to plain blake3 instead, which is a well-fixed merkle tree (instead of the loose merkle DAG that unixfs is).

Note 0

Out of all the IPFS implementations I know, only iroh can handle blake3 incremental verification yet; others (Kubo & friends) support blake3 but as a dumb hash, so they still use unixfs + blake3 to handle files above the 1~4 MiB block limit. We are interested in adding this capability in the future.

Note 1

Even though there is a one-to-many file bytes → CID relationship in unixfs, assuming cryptographically secure hash functions there is always a unique CID → bytes relationship.

Note 2

Blake3 might not be the best solution; what I am sure of is that relying on random unspecified behaviours of some old piece of software is definitely wrong. :)
