From 66c80d907d03136a99751508595652cc59d9ed11 Mon Sep 17 00:00:00 2001 From: Robin Berjon Date: Thu, 12 Mar 2026 10:57:54 +0100 Subject: [PATCH 01/10] adding CIDs --- src/architecture/principles.md | 5 +- src/data-formats/cid.md | 214 +++++++++++++++++++++++++++++++++ src/unixfs.md | 2 +- 3 files changed, 217 insertions(+), 4 deletions(-) create mode 100644 src/data-formats/cid.md diff --git a/src/architecture/principles.md b/src/architecture/principles.md index ea1367902..e85f0ab6c 100644 --- a/src/architecture/principles.md +++ b/src/architecture/principles.md @@ -12,10 +12,9 @@ editors: url: https://berjon.com/ github: darobin twitter: robinberjon - mastodon: "@robin@mastodon.social" affiliation: - name: Protocol Labs - url: https://protocol.ai/ + name: IPFS Foundation + url: https://ipfsfoundation.org/ tags: ['architecture'] order: 0 xref: diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md new file mode 100644 index 000000000..ce9a0d57b --- /dev/null +++ b/src/data-formats/cid.md @@ -0,0 +1,214 @@ +--- +title: CID (Content IDentifier) +description: > + Self-describing content-addressed identifiers for distributed systems +date: 2026-03-12 +maturity: permanent +editors: + - name: Marcin Rataj + github: lidel + affiliation: + name: Interplanetary Shipyard + url: https://ipshipyard.com/ + - name: Robin Berjon + email: robin@berjon.com + url: https://berjon.com/ + github: darobin + twitter: robinberjon + affiliation: + name: IPFS Foundation + url: https://ipfsfoundation.org/ +former_editors: + - name: Juan Benet + github: jbenet + +tags: ['data-formats'] +order: 1 +--- + +**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.io). +It leverages [content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage), +[cryptographic hashing](https://simple.wikipedia.org/wiki/Cryptographic_hash_function), and +[self-describing formats](https://github.com/multiformats/multiformats). 
+It is the core identifier used by [IPFS](https://ipfs.io) and [IPLD](https://ipld.io). +It uses a [multicodec](https://github.com/multiformats/multicodec) to indicate its version, making it fully self describing. + +## What is it? + +A CID is a self-describing content-addressed identifier. +It uses cryptographic hashes to achieve content addressing. It uses several +[multiformats](https://github.com/multiformats/multiformats) to achieve flexible self-description, namely: + +1. [multihash](https://github.com/multiformats/multihash) to hash content addressed, and +2. [multicodec](https://github.com/multiformats/multicodec) to type that addressed content, +to form a binary self-contained identifier, and optionally also +3. [multibase](https://github.com/multiformats/multibase) to encode that binary CID as a string. + +Concretely, it's a *typed* content address: a tuple of `(content-type, content-address)`. + +## How does it work? + +Current version: CIDv1. + +CIDv1 is a **binary** format composed of [unsigned varints](https://github.com/multiformats/unsigned-varint) +prefixing a hash digest to form a self-describing "content address": + +```text +<cidv1> ::= <multicodec-cidv1><multicodec-content-type><multihash-content-address> +# or, expanded: +<cidv1> ::= <`0x01`, the code for `CIDv1`><multicodec-content-type><multihash-content-address> +``` + +Where + +- `<multicodec-cidv1>` is a [multicodec](https://github.com/multiformats/multicodec) representing the version of CID, here for upgradability purposes. +- `<multicodec-content-type>` is a [multicodec](https://github.com/multiformats/multicodec) code representing the content type or format of the data being addressed. +- `<multihash-content-address>` is a [multihash](https://github.com/multiformats/multihash) value, which uses a registry of hash function abbreviations to prefix a cryptographic hash of the content being addressed, thus making it self-describing. + +## Variant - Stringified Form + +Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded multiple ways for different consumers or for different transports. 
+In such applications, CIDs are often expressed as a Unicode *string* rather than a bytestring, which adds a single code-point prefix. +In these contexts, then, the full string form is: + +```text +<cidv1> ::= <multibase-prefix><multibase-encoding(<multicodec-cidv1><multicodec-content-type><multihash-content-address>)> +``` + +Where + +- `<multibase-prefix>` is a [multibase](https://github.com/multiformats/multibase) prefix (1 Unicode code point in length) that renders the base-encoded unicode string following it self-describing for simpler conversion back to binary. + +## Variant - Human-Readable Form + +It is often advantageous to translate a CID, which is already modular and self-describing, into a *human-readable* expansion of its self-describing parts, for purposes such as debugging, unit testing, and documentation. +We can easily transform a Stringified CID to a "Human-Readable CID" by translating and segmenting its constituent parts as follows: + +```text +<hr-cid> ::= <hr-mbc> "-" <hr-cid-mc> "-" <hr-mc> "-" <hr-mh> +``` +Where each sub-component is replaced with its own human-readable form from the relevant registry: + +- `<hr-mbc>` is the name of the multibase code (eg `z`--> `base58btc`) +- `<hr-cid-mc>` is the name of the multicodec for the version of CID used (eg `0x01` --> `cidv1`) +- `<hr-mc>` is the name of the multicodec code (eg `0x51` --> `cbor`) +- `<hr-mh>` is the name of the multihash code (eg `sha2-256-256`) followed by a final dash and the hash itself `-abcdef0123456789...`) + +For example: + +```text +# example CID +zb2rhe5P4gXftAwvA4eXQ5HJwsER2owDyS9sKaQRRVQPn93bA +# corresponding human readable CID +base58btc - cidv1 - raw - sha2-256-256-6e6ff7950a36187a801613426e858dce686cd7d7e3c0fc42ee0330072d245c95 +``` +See: https://cid.ipfs.io/#zb2rhe5P4gXftAwvA4eXQ5HJwsER2owDyS9sKaQRRVQPn93bA + +## Design Considerations + +CIDs design takes into account many difficult tradeoffs encountered while building [IPFS](https://ipfs.tech). These are mostly coming from the multiformats project. + +- Compactness: CIDs are binary in nature to ensure these are as compact as possible, as they're meant to be part of longer path identifiers or URIs. 
+- Transport friendliness (or "copy-pastability"): CIDs are encoded with multibase to allow choosing the best base for transporting. For example, CIDs can be encoded into base58btc to yield shorter and easily-copy-pastable hashes. +- Versatility: CIDs are meant to be able to represent values of any format with any cryptographic hash. +- Avoid Lock-in: CIDs prevent lock-in to old, potentially-outdated decisions. +- Upgradability: CIDs encode a version to ensure the CID format itself can evolve. + +## Versions + +### CIDv0 + +CIDv0 is a backwards-compatible version, where: +- the `multibase` of the string representation is always `base58btc` and implicit (not written) +- the `multicodec` is always `dag-pb` and implicit (not written) +- the `cid-version` is always `cidv0` and implicit (not written) +- the `multihash` is written as is but is always a full (length 32) sha256 hash. + +```text +cidv0 ::= +``` + +### CIDv1 + +See the section: [How does it work?](#how-does-it-work) + +```text + ::= +``` + +## Decoding Algorithm + +To decode a CID, follow the following algorithm: + +1. If it's a string (ASCII/UTF-8): + * If it is 46 characters long and starts with `Qm...`, it's a CIDv0. Decode it as base58btc and continue to step 2. + * Otherwise, decode it according to the multibase spec and: + * If the first decoded byte is 0x12, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (0x12 = 18) to prevent ambiguity with decoded CIDv0s. + * Otherwise, you now have a binary CID. Continue to step 2. +2. Given a (binary) CID (`cid`): + * If it's 34 bytes long with the leading bytes `[0x12, 0x20, ...]`, it's a CIDv0. + * The CID's multihash is `cid`. + * The CID's multicodec is DagProtobuf + * The CID's version is 0. + * Otherwise, let `N` be the first varint in `cid`. This is the CID's version. 
+ * If `N == 0x01` (CIDv1): + * The CID's multicodec is the second varint in `cid` + * The CID's multihash is the rest of the `cid` (after the second varint). + * The CID's version is 1. + * If `N == 0x02` (CIDv2), or `N == 0x03` (CIDv3), the CID version is reserved. + * If `N` is equal to some other multicodec, the CID is malformed. + +# Appendices + +:::warning +These sections provide additional context. This is not part of specification, +and is provided here only for extra context. +::: + +## Implementations + +- [go-cid](https://github.com/ipfs/go-cid) +- [java-cid](https://github.com/ipld/java-cid) +- [js-multiformats](https://github.com/multiformats/js-multiformats) +- [rust-cid](https://github.com/multiformats/rust-cid) +- [py-multiformats-cid](https://github.com/pinnaculum/py-multiformats-cid) +- [elixir-cid](https://github.com/nocursor/ex-cid) +- [dart_cid](https://github.com/dwyl/dart_cid) +- [zig_cid](https://github.com/zen-eth/multiformats-zig) +- [Add yours today!](https://github.com/multiformats/cid/edit/master/README.md) + +## FAQ + +> **Q. I have questions on multicodec, multibase, or multihash.** + +Please check their repositories: [multicodec](https://github.com/multiformats/multicodec), [multibase](https://github.com/multiformats/multibase), [multihash](https://github.com/multiformats/multihash). + +> **Q. Why does CID exist?** + +We were using base58btc encoded multihashes in IPFS, and then we needed to switch formats to IPLD. +We struggled with lots of problems of addressing data with different formats until we created CIDs. +You can read the history of this format here: https://github.com/ipfs/specs/issues/130 + +> **Q. Is the use of multicodec similar to file extensions?** + +Yes, kind of! like a file extension, the multicodec identifier establishes the format of the data. +Unlike file extensions, these are in the middle of the identifier and not meant to be changed by users. +There is also a short table of supported formats. + +> **Q. 
What formats (multicodec codes) does CID support?** + +We are figuring this out at this time. +It will likely be a subset of [multicodecs](https://github.com/multiformats/multicodec/blob/master/table.csv) for secure distributed systems. +So far, we want to address IPFS's UnixFS and raw blocks ([`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/), [`raw`](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw)), IPNS's [`libp2p-key`](https://github.com/libp2p/specs/blob/master/RFC/0001-text-peerid-cid.md), and IPLD's [`dag-json`](https://ipld.io/specs/codecs/dag-json/spec/)/[`dag-cbor`](https://ipld.io/specs/codecs/dag-cbor/spec/) formats. + +> **Q. What is the process for updating CID specification (e.g., adding a new version)?** + +CIDs are a well established standard. +IPFS uses CIDs for content-addressing and IPNS. +Making changes to such key protocol requires a careful review which should include feedback from implementers and stakeholders across ecosystem. + +Due to this, changes to CID specification MUST be submitted as an improvement proposal to [ipfs/specs](https://github.com/ipfs/specs/tree/main/IPIP) repository (PR with [IPIP document](https://github.com/ipfs/specs/blob/main/IPIP/0000-template.md)), and follow the IPIP process described there. + +## Historical Design Decisions + +You can read an [in-depth discussion on why this format was needed in IPFS](https://github.com/ipfs/specs/issues/130). 
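
Reviewer note, not part of the patch: the CIDv1 layout added in `cid.md` above (version varint, content-type varint, then a multihash) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `encode_varint`, `make_cidv1_raw`, and `to_base32_string` are hypothetical, the content type is assumed to be `raw` (`0x55`), the hash is assumed to be `sha2-256` (`0x12`, digest length `0x20`), and the string form assumes the multibase `base32` prefix `b` with lowercase, unpadded output.

```python
import base64
import hashlib

CIDV1 = 0x01      # multicodec code for the CID version
RAW_CODEC = 0x55  # multicodec code for `raw` (assumed content type)
SHA2_256 = 0x12   # multihash function code for sha2-256


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("unsigned varints cannot be negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, continuation bit clear
            return bytes(out)


def make_cidv1_raw(content: bytes) -> bytes:
    """Build a binary CIDv1: version, content type, then the multihash."""
    digest = hashlib.sha256(content).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(CIDV1) + encode_varint(RAW_CODEC) + multihash


def to_base32_string(cid: bytes) -> str:
    """Stringify a binary CID: multibase prefix `b` + lowercase unpadded base32."""
    body = base64.b32encode(cid).decode("ascii").lower().rstrip("=")
    return "b" + body


binary_cid = make_cidv1_raw(b"hello world")
text_cid = to_base32_string(binary_cid)
print(binary_cid[:4].hex(), text_cid)
```

The binary value is the CID; the string is only one of several possible encodings of it, which is why the multibase step lives in a separate helper.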
diff --git a/src/unixfs.md b/src/unixfs.md index 2232bdb7b..9e18be0ca 100644 --- a/src/unixfs.md +++ b/src/unixfs.md @@ -66,7 +66,7 @@ thanks: github: bumblefudge tags: ['data-formats'] -order: 1 +order: 2 --- # Node Types From b957285fe3157f9e0c44e6c5272ff29916505c83 Mon Sep 17 00:00:00 2001 From: Robin Berjon Date: Thu, 12 Mar 2026 11:58:58 +0100 Subject: [PATCH 02/10] lints --- src/data-formats/cid.md | 66 ++++++++++++++++++++--------------------- 1 file changed, 33 insertions(+), 33 deletions(-) diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md index ce9a0d57b..b0cba5137 100644 --- a/src/data-formats/cid.md +++ b/src/data-formats/cid.md @@ -26,17 +26,17 @@ tags: ['data-formats'] order: 1 --- -**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.io). -It leverages [content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage), -[cryptographic hashing](https://simple.wikipedia.org/wiki/Cryptographic_hash_function), and -[self-describing formats](https://github.com/multiformats/multiformats). -It is the core identifier used by [IPFS](https://ipfs.io) and [IPLD](https://ipld.io). +**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.io). +It leverages [content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage), +[cryptographic hashing](https://simple.wikipedia.org/wiki/Cryptographic_hash_function), and +[self-describing formats](https://github.com/multiformats/multiformats). +It is the core identifier used by [IPFS](https://ipfs.io) and [IPLD](https://ipld.io). It uses a [multicodec](https://github.com/multiformats/multicodec) to indicate its version, making it fully self describing. ## What is it? -A CID is a self-describing content-addressed identifier. -It uses cryptographic hashes to achieve content addressing. It uses several +A CID is a self-describing content-addressed identifier. 
+It uses cryptographic hashes to achieve content addressing. It uses several [multiformats](https://github.com/multiformats/multiformats) to achieve flexible self-description, namely: 1. [multihash](https://github.com/multiformats/multihash) to hash content addressed, and @@ -50,7 +50,7 @@ Concretely, it's a *typed* content address: a tuple of `(content-type, content-a Current version: CIDv1. -CIDv1 is a **binary** format composed of [unsigned varints](https://github.com/multiformats/unsigned-varint) +CIDv1 is a **binary** format composed of [unsigned varints](https://github.com/multiformats/unsigned-varint) prefixing a hash digest to form a self-describing "content address": ```text @@ -67,8 +67,8 @@ Where ## Variant - Stringified Form -Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded multiple ways for different consumers or for different transports. -In such applications, CIDs are often expressed as a Unicode *string* rather than a bytestring, which adds a single code-point prefix. +Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded multiple ways for different consumers or for different transports. +In such applications, CIDs are often expressed as a Unicode *string* rather than a bytestring, which adds a single code-point prefix. In these contexts, then, the full string form is: ```text @@ -81,7 +81,7 @@ Where ## Variant - Human-Readable Form -It is often advantageous to translate a CID, which is already modular and self-describing, into a *human-readable* expansion of its self-describing parts, for purposes such as debugging, unit testing, and documentation. +It is often advantageous to translate a CID, which is already modular and self-describing, into a *human-readable* expansion of its self-describing parts, for purposes such as debugging, unit testing, and documentation. 
We can easily transform a Stringified CID to a "Human-Readable CID" by translating and segmenting its constituent parts as follows: ```text @@ -141,22 +141,22 @@ See the section: [How does it work?](#how-does-it-work) To decode a CID, follow the following algorithm: 1. If it's a string (ASCII/UTF-8): - * If it is 46 characters long and starts with `Qm...`, it's a CIDv0. Decode it as base58btc and continue to step 2. - * Otherwise, decode it according to the multibase spec and: - * If the first decoded byte is 0x12, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (0x12 = 18) to prevent ambiguity with decoded CIDv0s. - * Otherwise, you now have a binary CID. Continue to step 2. +* If it is 46 characters long and starts with `Qm...`, it's a CIDv0. Decode it as base58btc and continue to step 2. +* Otherwise, decode it according to the multibase spec and: + * If the first decoded byte is 0x12, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (0x12 = 18) to prevent ambiguity with decoded CIDv0s. + * Otherwise, you now have a binary CID. Continue to step 2. 2. Given a (binary) CID (`cid`): - * If it's 34 bytes long with the leading bytes `[0x12, 0x20, ...]`, it's a CIDv0. - * The CID's multihash is `cid`. - * The CID's multicodec is DagProtobuf - * The CID's version is 0. - * Otherwise, let `N` be the first varint in `cid`. This is the CID's version. - * If `N == 0x01` (CIDv1): - * The CID's multicodec is the second varint in `cid` - * The CID's multihash is the rest of the `cid` (after the second varint). - * The CID's version is 1. - * If `N == 0x02` (CIDv2), or `N == 0x03` (CIDv3), the CID version is reserved. - * If `N` is equal to some other multicodec, the CID is malformed. +* If it's 34 bytes long with the leading bytes `[0x12, 0x20, ...]`, it's a CIDv0. + * The CID's multihash is `cid`. + * The CID's multicodec is DagProtobuf + * The CID's version is 0. 
+* Otherwise, let `N` be the first varint in `cid`. This is the CID's version. + * If `N == 0x01` (CIDv1): + * The CID's multicodec is the second varint in `cid` + * The CID's multihash is the rest of the `cid` (after the second varint). + * The CID's version is 1. + * If `N == 0x02` (CIDv2), or `N == 0x03` (CIDv3), the CID version is reserved. + * If `N` is equal to some other multicodec, the CID is malformed. # Appendices @@ -185,29 +185,29 @@ Please check their repositories: [multicodec](https://github.com/multiformats/mu > **Q. Why does CID exist?** -We were using base58btc encoded multihashes in IPFS, and then we needed to switch formats to IPLD. -We struggled with lots of problems of addressing data with different formats until we created CIDs. +We were using base58btc encoded multihashes in IPFS, and then we needed to switch formats to IPLD. +We struggled with lots of problems of addressing data with different formats until we created CIDs. You can read the history of this format here: https://github.com/ipfs/specs/issues/130 > **Q. Is the use of multicodec similar to file extensions?** -Yes, kind of! like a file extension, the multicodec identifier establishes the format of the data. +Yes, kind of! like a file extension, the multicodec identifier establishes the format of the data. Unlike file extensions, these are in the middle of the identifier and not meant to be changed by users. There is also a short table of supported formats. > **Q. What formats (multicodec codes) does CID support?** -We are figuring this out at this time. -It will likely be a subset of [multicodecs](https://github.com/multiformats/multicodec/blob/master/table.csv) for secure distributed systems. +We are figuring this out at this time. +It will likely be a subset of [multicodecs](https://github.com/multiformats/multicodec/blob/master/table.csv) for secure distributed systems. 
So far, we want to address IPFS's UnixFS and raw blocks ([`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/), [`raw`](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw)), IPNS's [`libp2p-key`](https://github.com/libp2p/specs/blob/master/RFC/0001-text-peerid-cid.md), and IPLD's [`dag-json`](https://ipld.io/specs/codecs/dag-json/spec/)/[`dag-cbor`](https://ipld.io/specs/codecs/dag-cbor/spec/) formats. > **Q. What is the process for updating CID specification (e.g., adding a new version)?** -CIDs are a well established standard. +CIDs are a well established standard. IPFS uses CIDs for content-addressing and IPNS. Making changes to such key protocol requires a careful review which should include feedback from implementers and stakeholders across ecosystem. -Due to this, changes to CID specification MUST be submitted as an improvement proposal to [ipfs/specs](https://github.com/ipfs/specs/tree/main/IPIP) repository (PR with [IPIP document](https://github.com/ipfs/specs/blob/main/IPIP/0000-template.md)), and follow the IPIP process described there. +Due to this, changes to CID specification MUST be submitted as an improvement proposal to [ipfs/specs](https://github.com/ipfs/specs/tree/main/IPIP) repository (PR with [IPIP document](https://github.com/ipfs/specs/blob/main/IPIP/0000-template.md)), and follow the IPIP process described there. ## Historical Design Decisions From e9db8534ca21c83307eaa3f91427236b1a82ca95 Mon Sep 17 00:00:00 2001 From: Robin Berjon Date: Thu, 12 Mar 2026 12:04:52 +0100 Subject: [PATCH 03/10] annoying lint --- src/data-formats/cid.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md index b0cba5137..76cabe873 100644 --- a/src/data-formats/cid.md +++ b/src/data-formats/cid.md @@ -145,7 +145,7 @@ To decode a CID, follow the following algorithm: * Otherwise, decode it according to the multibase spec and: * If the first decoded byte is 0x12, return an error. 
CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (0x12 = 18) to prevent ambiguity with decoded CIDv0s. * Otherwise, you now have a binary CID. Continue to step 2. -2. Given a (binary) CID (`cid`): +1. Given a (binary) CID (`cid`): * If it's 34 bytes long with the leading bytes `[0x12, 0x20, ...]`, it's a CIDv0. * The CID's multihash is `cid`. * The CID's multicodec is DagProtobuf From 8344694c611c31dbebffb4158b7fc8af1c15b7b5 Mon Sep 17 00:00:00 2001 From: Robin Berjon Date: Thu, 12 Mar 2026 07:17:00 -0400 Subject: [PATCH 04/10] Update src/data-formats/cid.md Co-authored-by: Volker Mische --- src/data-formats/cid.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md index 76cabe873..49d839b29 100644 --- a/src/data-formats/cid.md +++ b/src/data-formats/cid.md @@ -133,7 +133,7 @@ cidv0 ::= See the section: [How does it work?](#how-does-it-work) ```text - ::= + ::= ``` ## Decoding Algorithm From 3582173b746d5198043c0ed9c164a3021f9971a4 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Mon, 16 Mar 2026 21:58:43 +0100 Subject: [PATCH 05/10] fix: improve decoding algorithm clarity and list formatting - lead binary CIDv0 detection with two-byte check [0x12, 0x20] instead of "34 bytes long with leading bytes" - add explicit dag-pb (0x70) multicodec for CIDv0 - fix step numbering (both were "1.", text referenced "step 2") - fix sub-bullet indentation for CommonMark nesting - drop CIDv2/CIDv3 reservation, simplify to "otherwise malformed" --- src/data-formats/cid.md | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md index 49d839b29..8d0a86175 100644 --- a/src/data-formats/cid.md +++ b/src/data-formats/cid.md @@ -141,22 +141,21 @@ See the section: [How does it work?](#how-does-it-work) To decode a CID, follow the following algorithm: 1. 
If it's a string (ASCII/UTF-8): -* If it is 46 characters long and starts with `Qm...`, it's a CIDv0. Decode it as base58btc and continue to step 2. -* Otherwise, decode it according to the multibase spec and: - * If the first decoded byte is 0x12, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (0x12 = 18) to prevent ambiguity with decoded CIDv0s. - * Otherwise, you now have a binary CID. Continue to step 2. -1. Given a (binary) CID (`cid`): -* If it's 34 bytes long with the leading bytes `[0x12, 0x20, ...]`, it's a CIDv0. - * The CID's multihash is `cid`. - * The CID's multicodec is DagProtobuf - * The CID's version is 0. -* Otherwise, let `N` be the first varint in `cid`. This is the CID's version. - * If `N == 0x01` (CIDv1): - * The CID's multicodec is the second varint in `cid` - * The CID's multihash is the rest of the `cid` (after the second varint). - * The CID's version is 1. - * If `N == 0x02` (CIDv2), or `N == 0x03` (CIDv3), the CID version is reserved. - * If `N` is equal to some other multicodec, the CID is malformed. + * If it is 46 characters long and starts with `Qm`, it's a CIDv0. Decode it as base58btc and continue to step 2. + * Otherwise, decode it according to the multibase spec and: + * If the first decoded byte is `0x12`, return an error. CIDv0 CIDs may not be multibase encoded and there will be no CIDv18 (`0x12` = 18) to prevent ambiguity with decoded CIDv0s. + * Otherwise, you now have a binary CID. Continue to step 2. +2. Given a (binary) CID (`cid`): + * If the first two bytes are `[0x12, 0x20]` (the `sha2-256` multihash function code followed by digest length 32), it's a CIDv0. + * The CID's multihash is `cid` (34 bytes: 2-byte prefix + 32-byte digest). + * The CID's multicodec is `dag-pb` (`0x70`), implicit. + * The CID's version is 0. + * Otherwise, read the first varint in `cid`. This is the CID's version. + * If `0x01` (CIDv1): + * The CID's multicodec is the second varint in `cid`. 
+ * The CID's multihash is the rest of `cid` (after the second varint). + * The CID's version is 1. + * Otherwise, the CID is malformed. # Appendices From 8ec3f066ae6e8fff29765a45a23c772ec8da4b6e Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Mon, 16 Mar 2026 22:16:32 +0100 Subject: [PATCH 06/10] fix: update stale content from multiformats/cid migration - clarify multibase is not part of CID, list SHOULD-support bases - update FAQ tone from informal README to spec language - add multicodec hex values for commonly used IPFS codecs - use web.archive.org links for ipld.io specs - pin libp2p spec link to specific commit - link original CIDv1 RFC in historical design decisions - comment out implementations list pending conformance review --- src/data-formats/cid.md | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md index 8d0a86175..2fc5fb24f 100644 --- a/src/data-formats/cid.md +++ b/src/data-formats/cid.md @@ -79,6 +79,10 @@ Where - `` is a [multibase](https://github.com/multiformats/multibase) prefix (1 Unicode code point in length) that renders the base-encoded unicode string following it self-describing for simpler conversion back to binary. +A CID is fundamentally a binary value. +The multibase prefix identifies the string encoding but is not part of the CID itself -- the same CID may be encoded in different bases for different contexts. +IPFS implementations SHOULD support at minimum `base58btc`, `base32`, `base16`, and `base36` (the latter for ed25519 keys in [IPNS Records](https://specs.ipfs.tech/ipns/ipns-record/)). + ## Variant - Human-Readable Form It is often advantageous to translate a CID, which is already modular and self-describing, into a *human-readable* expansion of its self-describing parts, for purposes such as debugging, unit testing, and documentation. @@ -164,6 +168,9 @@ These sections provide additional context. 
This is not part of specification, and is provided here only for extra context. ::: + ## FAQ @@ -184,21 +192,19 @@ Please check their repositories: [multicodec](https://github.com/multiformats/mu > **Q. Why does CID exist?** -We were using base58btc encoded multihashes in IPFS, and then we needed to switch formats to IPLD. -We struggled with lots of problems of addressing data with different formats until we created CIDs. -You can read the history of this format here: https://github.com/ipfs/specs/issues/130 +IPFS originally used base58btc-encoded multihashes, but the need to support multiple data formats via IPLD revealed limitations of bare multihashes as identifiers. +CIDs were created to provide a self-describing, versioned, typed content address. +The history of this format is documented at: https://github.com/ipfs/specs/issues/130 > **Q. Is the use of multicodec similar to file extensions?** -Yes, kind of! like a file extension, the multicodec identifier establishes the format of the data. -Unlike file extensions, these are in the middle of the identifier and not meant to be changed by users. -There is also a short table of supported formats. +Yes. Like a file extension, the multicodec in a CID tells consumers how to interpret the bytes. +And just like file extensions, most users will never change it, but it is technically possible to swap the codec to change how the same bytes behind a CID are parsed. > **Q. What formats (multicodec codes) does CID support?** -We are figuring this out at this time. -It will likely be a subset of [multicodecs](https://github.com/multiformats/multicodec/blob/master/table.csv) for secure distributed systems. 
-So far, we want to address IPFS's UnixFS and raw blocks ([`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/), [`raw`](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw)), IPNS's [`libp2p-key`](https://github.com/libp2p/specs/blob/master/RFC/0001-text-peerid-cid.md), and IPLD's [`dag-json`](https://ipld.io/specs/codecs/dag-json/spec/)/[`dag-cbor`](https://ipld.io/specs/codecs/dag-cbor/spec/) formats. +CID can reference content of any type registered in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv). +In practice, IPFS primarily uses [`dag-pb`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-pb/spec/) (`0x70`), [`raw`](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw) (`0x55`), [`dag-cbor`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-cbor/spec/) (`0x71`), [`dag-json`](https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-json/spec/) (`0x0129`), and [`libp2p-key`](https://github.com/libp2p/specs/blob/4e2c796bc77a2639136b277224468b7c48b9fff1/RFC/0001-text-peerid-cid.md) (`0x72`). > **Q. What is the process for updating CID specification (e.g., adding a new version)?** @@ -210,4 +216,4 @@ Due to this, changes to CID specification MUST be submitted as an improvement pr ## Historical Design Decisions -You can read an [in-depth discussion on why this format was needed in IPFS](https://github.com/ipfs/specs/issues/130). +You can read an [in-depth discussion on why this format was needed in IPFS](https://github.com/ipfs/specs/issues/130) and the [original CIDv1 proposal](https://github.com/multiformats/cid/blob/f638ca68390758f0d4c7f90ac843091d3973cd02/original-rfc.md). 
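
Reviewer note, not part of the patch: step 2 of the decoding algorithm, as revised in PATCH 05 above, can be sketched as follows. This is a hedged illustration rather than a conformant decoder: `read_varint` and `decode_binary_cid` are hypothetical names, the 9-byte varint cap and the explicit 34-byte length check for CIDv0 are assumptions added for safety, and string handling (step 1) is left out.

```python
def read_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Read one unsigned varint from buf; return (value, next_offset)."""
    value = 0
    shift = 0
    for i in range(offset, len(buf)):
        byte = buf[i]
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, i + 1
        shift += 7
        if shift >= 63:
            # Assumed cap: refuse varints longer than 9 bytes.
            raise ValueError("varint too long")
    raise ValueError("truncated varint")


def decode_binary_cid(cid: bytes) -> dict:
    """Split a binary CID into version, content-type code, and multihash."""
    # CIDv0: a bare sha2-256 multihash, recognised by its two-byte prefix.
    if len(cid) >= 2 and cid[0] == 0x12 and cid[1] == 0x20:
        if len(cid) != 34:  # assumption: 2-byte prefix + 32-byte digest
            raise ValueError("CIDv0 must be exactly 34 bytes")
        return {"version": 0, "codec": 0x70, "multihash": cid}  # dag-pb, implicit

    version, pos = read_varint(cid)
    if version != 0x01:
        raise ValueError("malformed CID: unsupported version")
    codec, pos = read_varint(cid, pos)
    return {"version": 1, "codec": codec, "multihash": cid[pos:]}


# Usage: a CIDv0 and a CIDv1 wrapping the same (all-zero, illustrative) digest.
multihash = bytes([0x12, 0x20]) + bytes(32)
v0 = decode_binary_cid(multihash)
v1 = decode_binary_cid(bytes([0x01, 0x55]) + multihash)
print(v0["version"], hex(v0["codec"]), v1["version"], hex(v1["codec"]))
```

Checking the `[0x12, 0x20]` prefix before reading any varint mirrors the order in the revised text: a CIDv0 has no version field, so it has to be ruled out first.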
From 93197461b6db6f092d9fac96c97932e789313ec1 Mon Sep 17 00:00:00 2001
From: Marcin Rataj
Date: Mon, 16 Mar 2026 22:18:03 +0100
Subject: [PATCH 07/10] fix: typos, grammar, and URL consistency

- fix "multihash to hash content addressed" grammar
- remove double spaces (lines 45, 80)
- standardize ipfs.io to ipfs.tech
- rename stringified form production to to avoid shadowing the binary definition
---
 src/data-formats/cid.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md
index 2fc5fb24f..5c1349974 100644
--- a/src/data-formats/cid.md
+++ b/src/data-formats/cid.md
@@ -26,11 +26,11 @@ tags: ['data-formats']
 order: 1
 ---
 
-**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.io).
+**CID** is a format for referencing content in distributed information systems, like [IPFS](https://ipfs.tech).
 It leverages [content addressing](https://en.wikipedia.org/wiki/Content-addressable_storage),
 [cryptographic hashing](https://simple.wikipedia.org/wiki/Cryptographic_hash_function), and
 [self-describing formats](https://github.com/multiformats/multiformats).
-It is the core identifier used by [IPFS](https://ipfs.io) and [IPLD](https://ipld.io).
+It is the core identifier used by [IPFS](https://ipfs.tech) and [IPLD](https://ipld.io).
 It uses a [multicodec](https://github.com/multiformats/multicodec) to indicate its version, making it fully self describing.
 
 ## What is it?
@@ -39,10 +39,10 @@ A CID is a self-describing content-addressed identifier.
 It uses cryptographic hashes to achieve content addressing. It uses several
 [multiformats](https://github.com/multiformats/multiformats) to achieve flexible self-description, namely:
 
-1. [multihash](https://github.com/multiformats/multihash) to hash content addressed, and
+1. [multihash](https://github.com/multiformats/multihash) for content-addressed hashing, and
 2. [multicodec](https://github.com/multiformats/multicodec) to type that addressed content,
 to form a binary self-contained identifier, and optionally also
-3. [multibase](https://github.com/multiformats/multibase) to encode that binary CID as a string.
+3. [multibase](https://github.com/multiformats/multibase) to encode that binary CID as a string.
 
 Concretely, it's a *typed* content address: a tuple of `(content-type, content-address)`.
 
@@ -72,12 +72,12 @@ In such applications, CIDs are often expressed as a Unicode *string* rather than
 In these contexts, then, the full string form is:
 
 ```text
- ::= )>
+ ::= )>
 ```
 
 Where
 
-- `` is a [multibase](https://github.com/multiformats/multibase) prefix (1 Unicode code point in length) that renders the base-encoded unicode string following it self-describing for simpler conversion back to binary.
+- `` is a [multibase](https://github.com/multiformats/multibase) prefix (1 Unicode code point in length) that renders the base-encoded unicode string following it self-describing for simpler conversion back to binary.
 
 A CID is fundamentally a binary value.
 The multibase prefix identifies the string encoding but is not part of the CID itself -- the same CID may be encoded in different bases for different contexts.
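
The closing context lines of this patch say that the multibase prefix only identifies the string encoding and is not part of the CID. A minimal sketch of that point, assuming only `base16` (prefix `f`, lowercase hex) and `base32` (prefix `b`, lowercase RFC 4648 without padding): the helper names `to_multibase` and `from_multibase` are hypothetical, and the sample CID is a `raw` (`0x55`) CIDv1 with a `sha2-256` (`0x12`) multihash built by hand.

```python
import base64
import hashlib


def to_multibase(cid: bytes, base: str) -> str:
    """Render a binary CID as a multibase string (only two bases sketched here)."""
    if base == "base16":
        return "f" + cid.hex()
    if base == "base32":
        encoded = base64.b32encode(cid).decode("ascii")
        return "b" + encoded.lower().rstrip("=")
    raise ValueError(f"base not covered by this sketch: {base}")


def from_multibase(text: str) -> bytes:
    """Read the one-character prefix, then decode the rest back to binary."""
    prefix, body = text[0], text[1:]
    if prefix == "f":
        return bytes.fromhex(body)
    if prefix == "b":
        padding = "=" * (-len(body) % 8)  # restore the padding multibase omits
        return base64.b32decode(body.upper() + padding)
    raise ValueError(f"prefix not covered by this sketch: {prefix}")


# CIDv1 (0x01), raw codec (0x55), sha2-256 (0x12), 32-byte digest (0x20).
binary_cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"hello").digest()
as_base16 = to_multibase(binary_cid, "base16")
as_base32 = to_multibase(binary_cid, "base32")
```

Both strings decode to the same 36 bytes, so comparing CIDs for equality is best done on the binary form rather than on the strings.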
From 71286a697ee6e91c90a310ab61945bea60c2f533 Mon Sep 17 00:00:00 2001
From: Marcin Rataj
Date: Mon, 16 Mar 2026 22:19:38 +0100
Subject: [PATCH 08/10] fix: credit multiformats/cid contributors

add thanks entries for people who made substantive spec content changes in the original multiformats/cid repo
---
 src/data-formats/cid.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md
index 5c1349974..a88003f3e 100644
--- a/src/data-formats/cid.md
+++ b/src/data-formats/cid.md
@@ -21,6 +21,19 @@ editors:
 former_editors:
   - name: Juan Benet
     github: jbenet
+thanks:
+  - name: Steven Allen
+    github: Stebalien
+  - name: Rod Vagg
+    github: rvagg
+  - name: bumblefudge
+    github: bumblefudge
+  - name: Volker Mische
+    github: vmx
+  - name: Joel Thorstensson
+    github: oed
+  - name: Oli Evans
+    github: olizilla
 
 tags: ['data-formats']
 order: 1
From a41631eb1ff1673abeceb89c6365cb74647d1974 Mon Sep 17 00:00:00 2001
From: Marcin Rataj
Date: Mon, 16 Mar 2026 22:40:43 +0100
Subject: [PATCH 09/10] fix: rewrite stringified form section for clarity

- integrate multibase-is-not-part-of-CID note into section flow
- remove redundant "CID is fundamentally binary" restatement
- rename to in grammar
- link multibase.csv table for prefix reference
- add explicit prefix characters for SHOULD-support bases
- note base choice depends on string length and case-sensitivity
---
 src/data-formats/cid.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md
index a88003f3e..ec150635b 100644
--- a/src/data-formats/cid.md
+++ b/src/data-formats/cid.md
@@ -80,21 +80,20 @@ Where
 
 ## Variant - Stringified Form
 
-Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded multiple ways for different consumers or for different transports.
-In such applications, CIDs are often expressed as a Unicode *string* rather than a bytestring, which adds a single code-point prefix.
-In these contexts, then, the full string form is:
+Since CIDs have many applications outside of binary-only contexts, a given CID may need to be base-encoded for different consumers or transports.
+In such applications, CIDs are expressed as a Unicode *string* with a [multibase](https://github.com/multiformats/multibase) prefix.
+The multibase prefix identifies the string encoding but is not part of the CID itself -- the same binary CID may be represented in different bases depending on context and needs such as string length and case-sensitivity.
+The full string form is:
 
 ```text
- ::= )>
+ ::= )>
 ```
 
 Where
 
-- `` is a [multibase](https://github.com/multiformats/multibase) prefix (1 Unicode code point in length) that renders the base-encoded unicode string following it self-describing for simpler conversion back to binary.
+- `` is a [multibase prefix](https://github.com/multiformats/multibase/blob/master/multibase.csv) (1 Unicode code point) that makes the string self-describing for conversion back to binary.
 
-A CID is fundamentally a binary value.
-The multibase prefix identifies the string encoding but is not part of the CID itself -- the same CID may be encoded in different bases for different contexts.
-IPFS implementations SHOULD support at minimum `base58btc`, `base32`, `base16`, and `base36` (the latter for ed25519 keys in [IPNS Records](https://specs.ipfs.tech/ipns/ipns-record/)).
+IPFS implementations SHOULD support at minimum `base58btc` (`z`), `base32` (`b`), `base16` (`f`), and `base36` (`k`, for ed25519 keys in [IPNS Records](https://specs.ipfs.tech/ipns/ipns-record/)).
 
 ## Variant - Human-Readable Form
 
From 538bc6f56c6749ca0c702441443abed9fdf631a4 Mon Sep 17 00:00:00 2001
From: Marcin Rataj
Date: Mon, 16 Mar 2026 22:49:48 +0100
Subject: [PATCH 10/10] fix: add numeric values for codecs and fix sha256 naming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- sha256 → sha2-256 to match multicodec table
- add numeric values: dag-pb (0x70), sha2-256 (0x12), cidv0 (0)
- clarify base58btc prefix z is not present in CIDv0 strings
- fix "follow the following" stutter
---
 src/data-formats/cid.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/data-formats/cid.md b/src/data-formats/cid.md
index ec150635b..ace3081cb 100644
--- a/src/data-formats/cid.md
+++ b/src/data-formats/cid.md
@@ -135,10 +135,10 @@ CIDs design takes into account many difficult tradeoffs encountered while buildi
 ### CIDv0
 
 CIDv0 is a backwards-compatible version, where:
-- the `multibase` of the string representation is always `base58btc` and implicit (not written)
-- the `multicodec` is always `dag-pb` and implicit (not written)
-- the `cid-version` is always `cidv0` and implicit (not written)
-- the `multihash` is written as is but is always a full (length 32) sha256 hash.
+- the `multibase` of the string representation is always `base58btc` and implicit (prefix `z` not present)
+- the `multicodec` is always `dag-pb` (`0x70`) and implicit (not written)
+- the `cid-version` is always `cidv0` (`0`) and implicit (not written)
+- the `multihash` is written as is but is always a full (length 32) `sha2-256` (`0x12`) hash.
 
 ```text
 cidv0 ::=
@@ -154,7 +154,7 @@ See the section: [How does it work?](#how-does-it-work)
 
 ## Decoding Algorithm
 
-To decode a CID, follow the following algorithm:
+To decode a CID, follow this algorithm:
 
 1. If it's a string (ASCII/UTF-8):
   * If it is 46 characters long and starts with `Qm`, it's a CIDv0. Decode it as base58btc and continue to step 2.