diff --git a/src/architecture/principles.md b/src/architecture/principles.md
index ea1367902..e85f0ab6c 100644
--- a/src/architecture/principles.md
+++ b/src/architecture/principles.md
@@ -12,10 +12,9 @@ editors:
     url: https://berjon.com/
     github: darobin
     twitter: robinberjon
-    mastodon: "@robin@mastodon.social"
     affiliation:
-      name: Protocol Labs
-      url: https://protocol.ai/
+      name: IPFS Foundation
+      url: https://ipfsfoundation.org/
 tags: ['architecture']
 order: 0
 xref:
diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md
new file mode 100644
index 000000000..ace3081cb
--- /dev/null
+++ b/src/data-formats/cid.md
@@ -0,0 +1,231 @@
+---
+title: CID (Content IDentifier)
+description: >
+  Self-describing content-addressed identifiers for distributed systems
+date: 2026-03-12
+maturity: permanent
+editors:
+  - name: Marcin Rataj
+    github: lidel
+    affiliation:
+      name: Interplanetary Shipyard
+      url: https://ipshipyard.com/
+  - name: Robin Berjon
+    email: robin@berjon.com
+    url: https://berjon.com/
+    github: darobin
+    twitter: robinberjon
+    affiliation:
+      name: IPFS Foundation
+      url: https://ipfsfoundation.org/
+former_editors:
+  - name: Juan Benet
+    github: jbenet
+thanks:
+  - name: Steven Allen
+    github: Stebalien
+  - name: Rod Vagg
+    github: rvagg
+  - name: bumblefudge
+    github: bumblefudge
+  - name: Volker Mische
+    github: vmx
+  - name: Joel Thorstensson
+    github: oed
+  - name: Oli Evans
+    github: olizilla
+
+tags: ['data-formats']
+order: 1
+---
+
+**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.tech).
+It leverages [content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage),
+[cryptographic hashing](https://simple.wikipedia.org/wiki/Cryptographic_hash_function), and
+[self-describing formats](https://github.com/multiformats/multiformats).
+It is the core identifier used by [IPFS](https://ipfs.tech) and [IPLD](https://ipld.io).
+It uses a [multicodec](https://github.com/multiformats/multicodec) to indicate its version, making it fully self-describing.
+
+## What is it?
+
+A CID is a self-describing content-addressed identifier.
+It uses cryptographic hashes to achieve content addressing. It uses several
+[multiformats](https://github.com/multiformats/multiformats) to achieve flexible self-description, namely:
+
+1. [multihash](https://github.com/multiformats/multihash) for content-addressed hashing, and
+2. [multicodec](https://github.com/multiformats/multicodec) to type that addressed content,
+to form a binary self-contained identifier, and optionally also
+3. [multibase](https://github.com/multiformats/multibase) to encode that binary CID as a string.
+
+Concretely, it's a *typed* content address: a tuple of `(content-type, content-address)`.
+
+## How does it work?
+
+Current version: CIDv1.
+
+CIDv1 is a **binary** format composed of [unsigned varints](https://github.com/multiformats/unsigned-varint)
+prefixing a hash digest to form a self-describing "content address":
+
+```text
+<cidv1> ::= <multicodec-cidv1><multicodec-content-type><multihash-content-address>
+# or, expanded:
+<cidv1> ::= <`0x01`, the code for `CIDv1`><multicodec-content-type><multihash-content-address>
+```
+
+Where
+
+- `<multicodec-cidv1>` is a [multicodec](https://github.com/multiformats/multicodec) representing the version of CID, here for upgradability purposes.
+- `<multicodec-content-type>` is a [multicodec](https://github.com/multiformats/multicodec) code representing the content type or format of the data being addressed.
+- `<multihash-content-address>` is a [multihash](https://github.com/multiformats/multihash) value, which uses a registry of hash function abbreviations to prefix a cryptographic hash of the content being addressed, thus making it self-describing.
+
+## Variant - Stringified Form
+
+Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded for different consumers or transports.
+In such applications, CIDs are expressed as a Unicode *string* with a [multibase](https://github.com/multiformats/multibase) prefix.
+The multibase prefix identifies the string encoding but is not part of the CID itself -- the same binary CID may be represented in different bases depending on context and needs such as string length and case-sensitivity.
+The full string form is:
+
+```text
+<cidv1-string> ::= <multibase-prefix><multibase-encoding(<cidv1>)>
+```
+
+Where
+
+- `<multibase-prefix>` is a [multibase prefix](https://github.com/multiformats/multibase/blob/master/multibase.csv) (1 Unicode code point) that makes the string self-describing for conversion back to binary.
+
+IPFS implementations SHOULD support at minimum `base58btc` (`z`), `base32` (`b`), `base16` (`f`), and `base36` (`k`, for ed25519 keys in [IPNS Records](https://specs.ipfs.tech/ipns/ipns-record/)).
+
+## Variant - Human-Readable Form
+
+It is often advantageous to translate a CID, which is already modular and self-describing, into a *human-readable* expansion of its self-describing parts, for purposes such as debugging, unit testing, and documentation.
+We can easily transform a stringified CID into a "Human-Readable CID" by translating and segmenting its constituent parts as follows:
+
+```text
+<hr-cid> ::= <hr-mbc> "-" <hr-cid-mc> "-" <hr-mc> "-" <hr-mh>
+```
+Where each sub-component is replaced with its own human-readable form from the relevant registry:
+
+- `<hr-mbc>` is the name of the multibase code (e.g. `z` --> `base58btc`)
+- `<hr-cid-mc>` is the name of the multicodec for the version of CID used (e.g. `0x01` --> `cidv1`)
+- `<hr-mc>` is the name of the multicodec code (e.g. `0x51` --> `cbor`)
+- `<hr-mh>` is the name of the multihash code (e.g. `sha2-256-256`), followed by a final dash and the hash itself (`-abcdef0123456789...`)
+
+For example:
+
+```text
+# example CID
+zb2rhe5P4gXftAwvA4eXQ5HJwsER2owDyS9sKaQRRVQPn93bA
+# corresponding human readable CID
+base58btc - cidv1 - raw - sha2-256-256-6e6ff7950a36187a801613426e858dce686cd7d7e3c0fc42ee0330072d245c95
+```
+See: https://cid.ipfs.io/#zb2rhe5P4gXftAwvA4eXQ5HJwsER2owDyS9sKaQRRVQPn93bA
+
+## Design Considerations
+
+The design of CIDs takes into account many difficult tradeoffs encountered while building
+[IPFS](https://ipfs.tech). Most of these come from the multiformats project.
+
+- Compactness: CIDs are binary so that they are as compact as possible, since they are meant to be part of longer path identifiers or URIs.
+- Transport friendliness (or "copy-pastability"): CIDs are encoded with multibase to allow choosing the best base for a given transport. For example, CIDs can be encoded in base58btc to yield shorter and easily copy-pastable hashes.
+- Versatility: CIDs are meant to be able to represent values of any format with any cryptographic hash.
+- Avoid Lock-in: CIDs prevent lock-in to old, potentially outdated decisions.
+- Upgradability: CIDs encode a version to ensure the CID format itself can evolve.
+
+## Versions
+
+### CIDv0
+
+CIDv0 is a backwards-compatible version, where:
+- the `multibase` of the string representation is always `base58btc` and implicit (prefix `z` not present)
+- the `multicodec` is always `dag-pb` (`0x70`) and implicit (not written)
+- the `cid-version` is always `cidv0` (`0`) and implicit (not written)
+- the `multihash` is written as is but is always a full (length 32) `sha2-256` (`0x12`) hash.
+
+```text
+cidv0 ::= <multihash-content-address>
+```
+
+### CIDv1
+
+See the section: [How does it work?](#how-does-it-work)
+
+```text
+<cidv1> ::= <multicodec-cidv1><multicodec-content-type><multihash-content-address>
+```
+
+## Decoding Algorithm
+
+To decode a CID, follow this algorithm:
+
+1. If it's a string (ASCII/UTF-8):
+    * If it is 46 characters long and starts with `Qm`, it's a CIDv0. Decode it as base58btc and continue to step 2.
+    * Otherwise, decode it according to the multibase spec and:
+        * If the first decoded byte is `0x12`, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (`0x12` = 18) to prevent ambiguity with decoded CIDv0s.
+        * Otherwise, you now have a binary CID. Continue to step 2.
+2. Given a (binary) CID (`cid`):
+    * If the first two bytes are `[0x12, 0x20]` (the `sha2-256` multihash function code followed by digest length 32), it's a CIDv0.
+        * The CID's multihash is `cid` (34 bytes: 2-byte prefix + 32-byte digest).
+        * The CID's multicodec is `dag-pb` (`0x70`), implicit.
+        * The CID's version is 0.
+    * Otherwise, read the first varint in `cid`. This is the CID's version.
+        * If `0x01` (CIDv1):
+            * The CID's multicodec is the second varint in `cid`.
+            * The CID's multihash is the rest of `cid` (after the second varint).
+            * The CID's version is 1.
+        * Otherwise, the CID is malformed.
+
+# Appendices
+
+:::warning
+These sections provide additional context. They are not part of the specification
+and are provided here only for reference.
+:::
+
+
+
+## FAQ
+
+> **Q. I have questions on multicodec, multibase, or multihash.**
+
+Please check their repositories: [multicodec](https://github.com/multiformats/multicodec), [multibase](https://github.com/multiformats/multibase), [multihash](https://github.com/multiformats/multihash).
+
+> **Q. Why does CID exist?**
+
+IPFS originally used base58btc-encoded multihashes, but the need to support multiple data formats via IPLD revealed limitations of bare multihashes as identifiers.
+CIDs were created to provide a self-describing, versioned, typed content address.
+The history of this format is documented at: https://github.com/ipfs/specs/issues/130
+
+> **Q. Is the use of multicodec similar to file extensions?**
+
+Yes. Like a file extension, the multicodec in a CID tells consumers how to interpret the bytes.
+And just like file extensions, most users will never change it, but it is technically possible to swap the codec to change how the same bytes behind a CID are parsed.
+
+> **Q. What formats (multicodec codes) does CID support?**
+
+CID can reference content of any type registered in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv).
+In practice, IPFS primarily uses [`dag-pb`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-pb/spec/) (`0x70`), [`raw`](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw) (`0x55`), [`dag-cbor`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-cbor/spec/) (`0x71`), [`dag-json`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-json/spec/) (`0x0129`), and [`libp2p-key`](https://github.com/libp2p/specs/blob/4e2c796bc77a2639136b277224468b7c48b9fff1/RFC/0001-text-peerid-cid.md) (`0x72`).
+
+> **Q. What is the process for updating the CID specification (e.g., adding a new version)?**
+
+CIDs are a well-established standard.
+IPFS uses CIDs for content-addressing and IPNS.
+Making changes to such a key protocol requires careful review, which should include feedback from implementers and stakeholders across the ecosystem.
+
+Due to this, changes to the CID specification MUST be submitted as an improvement proposal to the [ipfs/specs](https://github.com/ipfs/specs/tree/main/IPIP) repository (PR with [IPIP document](https://github.com/ipfs/specs/blob/main/IPIP/0000-template.md)), and follow the IPIP process described there.
+
+## Historical Design Decisions
+
+You can read an [in-depth discussion on why this format was needed in IPFS](https://github.com/ipfs/specs/issues/130) and the [original CIDv1 proposal](https://github.com/multiformats/cid/blob/f638ca68390758f0d4c7f90ac843091d3973cd02/original-rfc.md).
diff --git a/src/unixfs.md b/src/unixfs.md
index 2232bdb7b..9e18be0ca 100644
--- a/src/unixfs.md
+++ b/src/unixfs.md
@@ -66,7 +66,7 @@ thanks:
     github: bumblefudge
 
 tags: ['data-formats']
-order: 1
+order: 2
 ---
 
 # Node Types