Conversation

@roberth (Member) left a comment:

A bit rough around the edges (I know it's draft!), but seems like a good starting point. Do you plan to validate this?

Comment on lines 76 to 81
Nix store paths follow the format:
```
/nix/store/<hash>-<name>
```

Where `<hash>` is a 32-character string using Nix's custom base32 alphabet (`0123456789abcdfghijklmnpqrsvwxyz` — notably excluding `e`, `o`, `u`, `t`). This encodes 160 bits of a truncated SHA-256 digest.

Member:

Can keep this brief.

Suggested change
Nix store paths follow the format:
```
/nix/store/<hash>-<name>
```
Where `<hash>` is a 32-character string using Nix's custom base32 alphabet (`0123456789abcdfghijklmnpqrsvwxyz` — notably excluding `e`, `o`, `u`, `t`). This encodes 160 bits of a truncated SHA-256 digest.
[Nix store paths](https://nix.dev/manual/nix/latest/store/store-path.html) generally follow the format `/nix/store/<hash>-<name>`, where the `<hash>` part is a sufficient identifier for the whole store object.

Member Author:

The absence of e, o, u and t is not mentioned in the documentation you linked to. I think it's important for the implementers of this RFC to be aware of this, to avoid trying to compute the Golomb-Rice-encoded hashes with an alphabet that includes these chars.

Member Author:

Good suggestion to keep it brief. I've shortened the general description but kept the alphabet specification since it's not in the linked documentation and is critical for correct implementation—using the standard base32 alphabet would produce incorrect index lookups.
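
For illustration, here is a minimal sketch (Go; not part of the RFC or of ncps) of looking up a digit's value in Nix's alphabet. The alphabet string is the one quoted above; the function name and everything else is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// Nix's custom base32 alphabet (note: no 'e', 'o', 'u', 't').
// Using the RFC 4648 base32 alphabet here would map characters to the
// wrong 5-bit values and break index lookups.
const nixBase32Alphabet = "0123456789abcdfghijklmnpqrsvwxyz"

// charValue returns the 5-bit value of a single Nix base32 digit.
func charValue(c byte) (int, error) {
	i := strings.IndexByte(nixBase32Alphabet, c)
	if i < 0 {
		return 0, fmt.Errorf("invalid nix base32 digit: %q", c)
	}
	return i, nil
}

func main() {
	v, err := charValue('b')
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // 11
}
```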

Member:

It would fail to parse.
Also note the possibly unexpected byte order.


The manifest is a JSON file at a well-known path that describes the index topology:

**Path**: `/nix-cache-index/manifest.json`

Member:

We could fetch this info with no extra requests by making it part of nix-cache-info, and consider separately to move to a nicer JSON-only format.

Alternatively, we could consider this Path to be an example, and have a field in nix-cache-info to indicate its presence. This also solves the forward compatibility problem of supporting a new cache index format in the future, and/or multiple simultaneously for other purposes (if that even makes sense beyond migrations).

@roberth (Member), Jan 15, 2026:

It does make sense. @Mic92 gives another circumstance that could motivate having multiple.

> 1.5G sounds okay for downstream caches, but too large for all clients, especially clients that only have a short lifetime, i.e. GitHub Actions. But I think we can reduce the size significantly by limiting how many years we include in this index for cache.nixos.org, i.e. 2-3 years instead of everything since 2015.

#195 (comment)

Member Author:

The problem with nix-cache-info is that it requires a custom parser in all languages. JSON is a well-defined format, and most (if not all) languages have standard libraries to parse it. One could argue that we already do this with the current nix-cache-info (case in point: ncps), but I just feel that it makes the whole thing fragile.

Additionally, this RFC assumes that the manifest ETag is the way clients can tell whether the cache they have is up to date or not, something that existing caches might not implement for the existing nix-cache-info and that may be an issue for them depending on how they serve it (static vs generated).

Finally, I think standardizing on a well-known file path is reasonable here; I'm not sure what the benefit would be of having nix-cache-info convey where it is, but I am happy to adopt this if you think it opens the door for a future change. I sort of designed for that change with the version field in the manifest, to allow for future changes and updates.

Member Author:

You're right regarding not hardcoding the path, now that I've had a moment to think this through...

This raises some design questions I'd like input on. For the nix-cache-info field, we have a few options:

  1. Single index-url: Simple, one index per cache. The CI use case (smaller/recent-only index) would be handled by using a different substituter pointing to a separate cache.

  2. Multiple index-urls: Supports multiple indices per cache, but raises semantic questions:

    • Are they mirrors/alternatives (client picks one)?
    • Are they different scopes to be merged (client downloads all)?
    • How does a client know which index to use for its use case?

  3. Single index-url now, defer multi-index to future work: Get real-world experience first, then design multi-index semantics based on actual needs.

The temporal coverage approach (index advertises 'only covers items from last N years') has issues since store path hashes don't encode creation time, and old paths can be re-added.

What are your thoughts on the right level of complexity for v1 of this protocol?

Contributor:

I also agree that, while it would have been better for nix-cache-info to be in a more standard format, I think it is still better to stick with it rather than adding yet another manifest.

As @roberth said it is possible that support for a version of the entire manifest in another format is still on the table. But it is best to do that separately and then these fields will go along for the ride rather than having skew likely forever.

Member Author:

That's a fair point — I'll inline the manifest fields directly into nix-cache-info rather than having a separate manifest.json. This eliminates an HTTP request and avoids adding yet another file format.

I'll use an Index prefix for all fields to namespace them clearly (e.g., IndexVersion, IndexFormat, IndexShardsBase, etc.). The nested structures in the current manifest (like sharding.depth) would flatten to IndexShardingDepth.

Does this approach work? Any preferences on the field naming convention?

Example nix-cache-info:

```
StoreDir: /nix/store
WantMassQuery: 1
Priority: 40
IndexVersion: 1
IndexFormat: hlssi
IndexItemCount: 1200000000
IndexShardingDepth: 2
IndexShardingAlphabet: 0123456789abcdfghijklmnpqrsvwxyz
IndexEncodingType: golomb-rice
IndexEncodingParameter: 8
IndexHashBits: 160
IndexPrefixBits: 10
IndexJournalBase: https://cache.example.com/nix-cache-index/journal/
IndexShardsBase: https://cache.example.com/nix-cache-index/shards/
IndexDeltasBase: https://cache.example.com/nix-cache-index/deltas/
IndexJournalCurrentSegment: 1705147200
IndexJournalRetentionCount: 12
IndexEpochCurrent: 42
IndexEpochPrevious: 41
IndexDeltasEnabled: true
IndexDeltasOldestBase: 35
IndexDeltasCompression: zstd
```
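
To illustrate the point about parsing, a rough sketch (mine, not from the RFC) of reading such key-value fields in Go; the field names come from the example above, everything else is assumed:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseCacheInfo splits "Key: Value" lines into a map; unknown keys are kept,
// which is what lets new Index* fields ride along in nix-cache-info.
func parseCacheInfo(body string) map[string]string {
	fields := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		if k, v, ok := strings.Cut(sc.Text(), ":"); ok {
			fields[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return fields
}

func main() {
	info := parseCacheInfo("StoreDir: /nix/store\nIndexVersion: 1\nIndexShardingDepth: 2\n")
	fmt.Println(info["IndexShardingDepth"]) // "2"
}
```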

Contributor:

That looks great to me.

Member Author:

Great! I'll wait for consensus on #195 (comment) and update the RFC with this change as well.


## 4. Layer 1: Journal (Hot Layer)

The journal captures recent mutations with minimal latency.

Member:

Not sure about minimal.

This seems to be determined by segment_duration_seconds, and it requires many individual requests to catch up with the log.

Perhaps with HTTP range requests the number of requests could be reduced, turning this into a small number of bulk downloads.

Long polling could be an implementation strategy to make this even more realtime, without the added complexity of a push protocol.
When doing range requests instead of relying on split files, you'd still want a time interval parameter, but instead of journal.segment_duration_seconds it would be journal.segment_query_interval. Set to 0 for long polling.
If it's a dumb bucket, set a high value to reduce unnecessary / inefficient traffic.
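
For what it's worth, a sketch of what range-request catch-up could look like on the client side (Go; the URL handling and offset bookkeeping are hypothetical, this is not part of the proposed spec):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchJournalTail resumes an append-only journal from a saved byte offset
// using an HTTP Range request, returning only the bytes not yet seen.
func fetchJournalTail(url string, offset int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusPartialContent:
		return io.ReadAll(resp.Body) // new journal bytes since offset
	case http.StatusRequestedRangeNotSatisfiable:
		return nil, nil // nothing new yet
	default:
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}
```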

Member Author:

You're right that 'minimal' oversells it. I'll soften the language. Regarding optimizations: range requests for catch-up and long polling for real-time are good implementation strategies, but I'm inclined to keep them out of the spec itself since they're optimizations that servers/clients can adopt independently without protocol changes. The protocol just needs to not preclude them. Would a note in the implementation considerations section acknowledging these optimization opportunities be sufficient?

Member:

I meant to incorporate those HTTP features in order to simplify the protocol.
By appending to a large journal file and relying on this feature, you may both reduce the spec complexity and improve performance, in terms of latency and number of requests.

Unless we have an overriding reason to provide this inefficient multi-file scheme, I think we'd be better off treating an append only log as an append only log at the HTTP level.

Member Author:

I've been thinking about how to implement a single-file journal over dumb storage (S3 + CDN), and I don't see a clean path. S3 objects are immutable, so "appending" requires re-uploading the entire file. The multi-segment approach lets writers upload small files without touching existing data.

For smart servers (with actual append support or a proxy layer), a single-file journal with range requests would indeed be simpler and more efficient. But I want the baseline protocol to work with just static file hosting.

I see two options:

Option A: Define both modes in the spec

  • Add a field to indicate journal mode (segments vs single)
  • Clients implement both code paths
  • More flexible, but adds complexity to every client implementation

Option B: Segments as the only mode

  • Keep segments as the baseline (works with dumb storage)
  • Smart servers like ncps could still optimize internally but serve the segment format for compatibility
  • Simpler client implementations

Given that cache.nixos.org (the largest cache) runs on dumb S3, I'm leaning toward Option B. But I'm open to Option A if you think the range-request efficiency is worth the added client complexity.

What's your preference?

```
Total: 64 bytes
```

**Implementation Note**: The header is designed to avoid struct padding issues. All multi-byte integers are little-endian. Implementations in C/Rust should use explicit byte-level serialization or `#pragma pack(1)` / `#[repr(packed)]` to ensure correct layout.

Member:

Did you not specify big-endian above?

Member:

Unless there's an overriding reason, pick little endian. Big endian is just not how computers work anymore.

The big-endian interpretation above had an overriding reason, which is the correspondence between lexicographic sorting and numeric ordering. But that's part of the domain, whereas this here is just a trivial implementation-level detail.
For comparison, we wouldn't pick big/little endian because, e.g., accounting applications use Arabic numerals, which are big-endian.

Member Author:

To clarify what might be confusing: the RFC uses both, intentionally:

  • Section 2.1 (hash interpretation): Big-endian so that lexicographic string ordering equals numeric ordering - required for prefix-based sharding to work correctly.
  • Section 5.1 (header integers): Little-endian for uint64 fields like item count and offsets - just binary serialization matching modern CPUs.

These aren't contradictory; they serve different purposes. I'll add a note to Section 5.1 clarifying why header integers use little-endian while hash interpretation uses big-endian.
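
A tiny sketch (Go; the values and field placement are made up, this is not the normative layout) of how the two conventions coexist:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	// Header integers: little-endian on the wire (hypothetical field value).
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], 1_200_000_000)
	fmt.Println(binary.LittleEndian.Uint64(buf[:]))

	// Hash ordering: interpreting the 20 hash bytes as a big-endian 160-bit
	// integer means numeric order is exactly byte-wise lexicographic order,
	// so plain bytes.Compare is enough for sorting, sharding and deltas.
	a := bytes.Repeat([]byte{0x01}, 20)
	b := bytes.Repeat([]byte{0x02}, 20)
	fmt.Println(bytes.Compare(a, b)) // -1
}
```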

Member:

Interestingly, Nix32 encoding follows the lexicographic sort of the reversed byte string.
So prefix-based sharding for Nix32 paths is - behind the scenes - suffix-based sharding as it relates to native hash bytes and the base-16 encoding.
See the docs PR NixOS/nix#15004 referenced earlier.

I feel uneasy to perpetuate this syntactic quirk.
If I understand correctly, it causes the byte sequences in this spec to be reverse of the native hash bytes. That is very very ugly.

Member Author:

You're right that this is an unfortunate consequence of Nix's non-standard base32 encoding. The big-endian interpretation in Section 2.1 is required for the protocol to work correctly—it ensures lexicographic string ordering equals numeric ordering, which is essential for prefix-based sharding and delta encoding.
The alternative would be to have the protocol convert Nix32 → native bytes → use native byte order, but this would:

  1. Add a conversion step on every operation
  2. Break the correspondence between string prefixes and shard assignment
  3. Add complexity without functional benefit

I can add a note acknowledging that the 160-bit integers in the index are byte-reversed relative to the native hash representation, but I don't see a way to avoid this without significantly complicating the protocol. Is there a specific problem you foresee this causing?

I considered reversing the string so that shard prefixes correspond to native hash byte prefixes, but this would break the intuitive correspondence between a hash's visible prefix (b6gv...) and its shard location (b6/). Operators debugging cache issues would need to mentally reverse hashes to find the right shard.

The current design's 'ugliness' is confined to the internal byte representation, which most implementers won't encounter directly (they'll use libraries like go-nix). The reverse approach would surface the ugliness to every user interaction.

I'm open to other suggestions, but I think preserving hash_prefix == shard_name is worth the internal byte-order quirk.

### 10.1 The Bandwidth Problem

For cache.nixos.org with ~1.06 billion items (as of January 2026):
- Full index size: ~1.5 GB

Member:

1.5G sounds okay for downstream caches, but too large for all clients, especially clients that only have a short lifetime, i.e. GitHub Actions. But I think we can reduce the size significantly by limiting how many years we include in this index for cache.nixos.org, i.e. 2-3 years instead of everything since 2015.

Member Author:

Good point about ephemeral CI. A few thoughts:

  1. The sharded design already helps here - a CI runner only needs to fetch shards for prefixes in its closure, not all 1024 shards. For a typical closure of ~2000 paths, this might be 200-400 shards (~300-600MB) rather than the full 1.5GB.
  2. CI runners could also skip the index entirely and use HTTP probing - the index is additive, not required.
  3. The temporal filtering idea is tricky since store path hashes don't encode creation time.

Should I add guidance in the RFC about partial shard fetching strategies for resource-constrained clients?
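
As a sketch of the partial-fetch idea from point 1 above (Go; the hash values and helper name are made up, sharding depth 2 matches the example manifest earlier in the thread):

```go
package main

import "fmt"

// neededShards collects the distinct shard prefixes for the hashes in a
// closure, so a client only fetches those shards instead of the full index.
func neededShards(hashes []string, depth int) map[string]struct{} {
	shards := make(map[string]struct{})
	for _, h := range hashes {
		if len(h) >= depth {
			shards[h[:depth]] = struct{}{}
		}
	}
	return shards
}

func main() {
	closure := []string{
		"b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z", // hypothetical Nix32 hashes
		"b6abcd0123456789abcdfghijklmnpqr",
		"xs7j1whxkkjlkqzdz5ccpcyf8cb2y5gs",
	}
	fmt.Println(len(neededShards(closure, 2))) // 2 distinct shards: "b6", "xs"
}
```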

Member:

We probably need to measure first.

Member:

Resource constrained clients should probably just use a nearby proxy, such as the improved cache.nixos.org CDN, volunteer mirrors, a mirror/"proxy" in the local network, etc.

Member Author:

Agreed — let's wait for benchmark data before adding more prescriptive guidance. The RFC already notes that resource-constrained clients can fetch only the shards they need, and the index is purely additive (clients can always fall back to HTTP probing).

The reference implementation in ncps is in progress (kalbasit/ncps#552 and subsequent PRs). Once it's ready, I'd appreciate help gathering numbers for various scenarios including CI-like workloads (ephemeral client, partial closure lookups). If you'd like to participate in benchmarking, please reach out!

Comment on lines 180 to 181
- `journal.current_segment`: Unix timestamp of the active journal segment
- `journal.segment_duration_seconds`: How often segments rotate (e.g., 300 = 5 minutes)

Contributor:

Using time in a distributed system always makes me uneasy. I think the protocol mostly doesn't actually depend on the time (as the client uses journal.current_segment) but I wonder if it is better to just make these abstract integers to make it clear that time isn't relevant.

I think the main complication is to know when to avoid replaying the journal and just start fresh, but IMHO just a 404 on the next journal segment that you would need to read to catch up will effectively do that.

Member Author:

I agree that this particular field, journal.segment_duration_seconds, is not helpful. I was debating with myself whether to include it or not; the main reason I included it was that I thought it would help other caches decide how long to cache the manifest before making a call to check whether the ETag has changed. Happy to remove it.

With that said, I still think an epoch is better here than a number. It's not really for the "time" of it; it just makes things easier on the server side: instead of "read current value, increment, write back" it would just be "stick in the current epoch". However, if you believe that the time may end up confusing implementers, I'm not opposed to reverting to an integer here, since the added complexity of what I described is quite minimal compared to the rest.

Member Author:

I removed journal.segment_duration_seconds.

Contributor:

> With that said, I still think an epoch is better here than a number. It's not really for the "time" of it; it just makes things easier on the server side: instead of "read current value, increment, write back" it would just be "stick in the current epoch".

It seems like the server should still be doing something like this to ensure that it doesn't end up reusing a timestamp in the face of time jumps or other issues.

It sounds to me that the protocol can say that this is an opaque monotonically increasing number. If the implementation wants to use a timestamp that would be fine (as long as it makes sure it is increasing).
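
A sketch of what that could look like server-side (Go; using Unix time as the starting point is my assumption, the only requirement being monotonicity):

```go
package main

import (
	"fmt"
	"time"
)

// nextSegmentID returns an opaque, monotonically increasing identifier.
// A timestamp is a convenient choice, as long as the value never repeats
// or decreases (e.g. across clock jumps).
func nextSegmentID(last uint64) uint64 {
	now := uint64(time.Now().Unix())
	if now <= last {
		return last + 1 // clock went backwards or did not advance; stay monotonic
	}
	return now
}

func main() {
	last := uint64(1705147200) // value from the example nix-cache-info above
	fmt.Println(nextSegmentID(last))
}
```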

Member Author:

Good point — I'll update the spec to describe current_segment (now IndexJournalCurrentSegment) as an opaque monotonically increasing identifier rather than explicitly a Unix timestamp. Implementations can still use timestamps if convenient, but the protocol only requires monotonic ordering.

I'll update the field description accordingly.

@kalbasit (Member Author):

Thank you @kevincox, @Mic92, @roberth for your quick review and suggestions. I have resolved some of your comments and I will attend to the rest in a few hours after work. Thank you again 🙏🏼

Refine binary cache index protocol with manifest URL discovery, structured base URLs, zstd compression for shards, and clarified format details.

@kalbasit (Member Author) left a comment:

Thanks again for the review. I have addressed all your comments. I wasn't sure about the etiquette regarding resolving threads—should I leave them open for you to resolve if you are satisfied, or should I resolve them? I did go ahead and resolve the obvious code changes since I adopted those directly.

```
18  8   Sparse index offset from start of file (uint64, little-endian)
26  8   Sparse index entry count (uint64, little-endian)
34  8   XXH64 checksum of encoded data section (uint64, little-endian)
42  22  Reserved for future use (must be zeros)
```

Member Author:

Good question. The intent is lenient: clients SHOULD ignore the reserved bytes to allow minor, backward-compatible additions without breaking old clients. Breaking changes would bump the magic number (e.g., NIXIDX02) or the manifest version field. I'll clarify this in the spec - something like: 'Clients MUST ignore non-zero values in reserved bytes to allow backward-compatible extensions. Incompatible format changes will use a new magic number.'
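
A sketch of that lenient check (Go; assuming, for illustration only, an 8-byte magic "NIXIDX01" at the start of the 64-byte header; the reserved range 42..63 matches the table quoted above):

```go
package main

import (
	"bytes"
	"fmt"
)

// checkHeader rejects unknown magic numbers but deliberately does not
// reject non-zero reserved bytes, so future backward-compatible revisions
// can use them without breaking old clients.
func checkHeader(h []byte) error {
	if len(h) < 64 {
		return fmt.Errorf("short header: %d bytes", len(h))
	}
	if !bytes.Equal(h[0:8], []byte("NIXIDX01")) {
		return fmt.Errorf("unknown magic %q", h[0:8])
	}
	// Reserved bytes h[42:64] are intentionally ignored here.
	return nil
}
```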

**Caching**: Servers SHOULD use the `Cache-Control` HTTP header to specify the caching duration of the manifest. Clients SHOULD respect this header to allow the server to control how long the manifest is cached. Revalidation using `If-Modified-Since` or `ETag` SHOULD also be used.

**Integrity Verification**: Clients SHOULD verify manifest integrity using HTTP-level mechanisms (`ETag`, `Content-MD5`). Cryptographic signing of index files is deferred to future work (see Future Work: Index Signing and Trust).

Contributor:

ETag provides no verification capabilities, as it is an opaque string.

Mentioning Content-MD5 is fine but seems not worth it to me:

  1. It is super rare.
  2. TLS provides some form of verification anyways.

But I guess it doesn't hurt.

