Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement new package hash format: $name-$semver-$hash #22994

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

andrewrk
Copy link
Member

@andrewrk andrewrk commented Feb 24, 2025

Legacy format is also supported.

closes #20178

  • Tested building ffmpeg with a cache of old packages
  • Tested with a wiped cache
  • Tested updating only one dependency to new hash format
  • Tested deleting hash field and getting the "expected hash" error
  • Tested name using string literal valid and invalid, enum literal with and without @
  • Tested zig fetch on legacy package

Although this change allows the previous hash format, it is breaking because it institutes the new package naming rules outlined in the linked issue. In practice, I expect this to affect only those who put "-" in the package name.

Followup Issues

  • investigate case sensitivity on Windows. I think this is justified to be a follow-up issue because even with case insensitivity, it only means 118 bits rather than 200 (200 * ((64-26) / 64)). Issue is fixable with an additional check at worst, marking the directory case sensitive at best.
  • allow modifying name, version, paths for naked package URLs (this will affect the hash)
  • allow omitting paths field, making makes the package unfetchable. Then disallow .paths = .{""}. Add basic globbing support.
  • allow omitting name, nonce, and version fields when paths field is omitted (unfetchable package). Do this for zig's own build.zig.zon

@andrewrk andrewrk added breaking Implementing this issue could cause existing code to no longer compile or have different behavior. release notes This PR should be mentioned in the release notes. labels Feb 24, 2025
@alexrp alexrp added this to the 0.14.0 milestone Feb 24, 2025
if (p.allow_name_string and node_tags[node] == .string_literal) {
const name = try parseString(p, node);
if (!std.zig.isValidId(name))
return fail(p, main_token, "name must be a valid bare zig identifier (hint: switch from string to enum literal)", .{});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something here, but why? We have @"" specifically for these usecases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, fair enough!

@andrewrk andrewrk changed the title new package hash format implement new package hash format: $name-$semver-$hash Feb 24, 2025
@xdBronch
Copy link
Contributor

Legacy format is also supported.

indefinitely? maybe soft deprecate with a tool to migrate zon files automatically?

@andrewrk
Copy link
Member Author

I will integrate it with #22898 when that patchset lands.

Copy link
Member

@ifreund ifreund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that using - as the separator here is a mistake. Since the semver spec also uses - as a separator, splitting the $name-$semver-$hash string on - is a bug while it wouldn't have to be if we chose a different separator that is neither allowed in bare zig identifiers nor semver versions.

One possible separator is the tilde character, ~ which is safe for use in file names on Linux, MacOS, and Windows at the very least (as long as a tilde is not the first character, which is not the case here).

Other options that come to mind are the comma ,, the octothorpe #, the percent sign %, and the carat ^.

Aesthetically, I personally find either the tilde ~ or the comma , most pleasing, though I would prefer any of the options I listed over - from a technical standpoint.

@cryptocode
Copy link
Contributor

Since the semver spec also uses - as a separator, splitting the $name-$semver-$hash string on - is a bug while it wouldn't have to be if we chose a different separator that is neither allowed in bare zig identifiers nor semver versions.

The base64 hash part may also contain - and _

But what's the use-case for parsing this, especially since components are truncated? The ambiguity in the format might be a feature so people don't start depending on a specific format.

src/Package.zig Outdated
/// * sizedhash is the following 9-byte array, base64 encoded using -_ to make
/// it filesystem safe:
/// - (4 bytes) LE u32 total decompressed size in bytes
/// - (5 bytes) truncated SHA-256 of hashed files of the package
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is truncating the sha256 hash to 40 bits really ok? My understanding is that while truncating the result of a "good" hash function like sha256 does not reduce the "strength" of the hash (i.e. difficulty of finding the pre-image for a given hash), it does significantly increase the likelihood of collisions.

I am by no means an expert, but it seems like truncating to 40 bits puts a Birthday attack in the realm of possibility.

Either way, I think it is important to have someone who has a much smaller set of cryptography "unknown unknowns" than you or I think about this.

Regardless of whether or not it is ok to use only 40 bits, we should really document the reason of why it is ok or why it is not ok in this comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Even when considering name, semver and decompressed size as part of the hash, this doesn't make it much harder to find collisions.

Suppose someone generates 2^20 good versions of a package, e.g. by modifying code comments, changing variable names or whatever, but keeping name, semver and total size the same. With 40 bit hashes, there will be few collisions, so there will be about 2^20 distinct hashes. Now they make a malicious change to the package again while keeping name, semver, decompressed size the same. They generate 2^20 malicious versions and now the chance of a good version and a malicious version sharing the same hash is about 1-(1-2^(-20))^(2^20), which is around 63%. They could ship the good version first and switch to the malicious version after any code reviews are done.

I doesn't take that long to generate and compute the hashes of two million minor modifications of a package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find these arguments convincing, and, thinking about it, now believe that we probably shouldn't be truncating the hash at all. It's an added risk for at best dubious benefit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that Windows is case-insensitive, so the likelihood of running into bugs from conflicting base64url-encoded hashes is even higher.

Copy link
Contributor

@rohlem rohlem Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@castholm Case-insensitivity applies to access, but case information is still stored on all supported types of file systems, right?
So zig can specifically check the case-sensitive name of the directory in the path after opening the directory handle, safely retrieve/depend on that information, and error on case mismatch, right?
This does require extra logic (and potentially an extra syscall) though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about sha-256 truncated to 192 bits?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about sha-256 truncated to 192 bits?

I think 192 bits should be sufficient to make the birthday attack described above infeasible. This changes the order of magnitude of the variants that would need to be brute-force checked from 2^20 to 2^96.

A few million is a small number for a computer, a few octillion is definitely not small. To put than in perspective, if a 5GHz CPU core checked one variant per cycle it would take 500 billion years to check 2^96 variants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point, but if you still end up using Base32, then truncating to a multiple of 5 bits makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a multiple of 5 bits

For instance, 200 bits are a whole number of bytes and a multiple of 5.

@castholm
Copy link
Contributor

I still think that using - as the separator here is a mistake. Since the semver spec also uses - as a separator, splitting the $name-$semver-$hash string on - is a bug while it wouldn't have to be if we chose a different separator that is neither allowed in bare zig identifiers nor semver versions.

This is only a problem if being able to split on a single character is an explicit design goal (e.g. using std.mem.splitScalar(u8, hash, '-')). Both hash formats introduced in this PR can be unambiguously told apart and broken down into their respective components:

  • If the hash contains ., it's a named hash, otherwise, it's naked.
  • For named hashes, everything up until the first - is the name, the last 9 bytes is the sizedhash, and hash[(name.len + 1)..(hash.len - 10)] is the semver.
  • For naked hashes, the last 9 bytes is the sizedhash, and hash[0..(hash.len - 12)] is the hashiname.

src/Package.zig Outdated
Comment on lines 66 to 77
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must
/// be a valid zig identifier
/// * semver is the version field from build.zig.zon, truncated at 32 bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why truncate the package name and version instead of hard-limiting them to 32 bytes and checking the lengths when validating the manifest?

@andrewrk
Copy link
Member Author

Where were all these great points during the issue discussion in June! 🙂

@nektro
Copy link
Contributor

nektro commented Feb 24, 2025

#20178 (comment)
#20178 (comment)

@andrewrk
Copy link
Member Author

ehh the disagreement about - is not important. This is a hash - a one way function, by definition - it is not meant to be parsed, as pointed out by @cryptocode. Even so, if it makes @ifreund happy then I don't see any reason not to use ~. That's a trivial change that does not make me do any extra work, really.

I think the main challenge here is the file system case insensitivity on Windows. I can't believe I didn't consider that until now. Yeesh.

@cbilz
Copy link
Contributor

cbilz commented Feb 24, 2025

I think the main challenge here is the file system case insensitivity on Windows.

Base 32 encoding is just 20% longer, but I'm not sure about what other constraints there are.

@andrewrk
Copy link
Member Author

I feel like I tried this once but unfortunately I don't remember the results of my investigation. It looks like you can omit OBJ_CASE_INSENSITIVE when dealing with files on Windows.

@andrewrk
Copy link
Member Author

andrewrk commented Feb 24, 2025

>>> 4 * 'm' + 64 * 'h' # status quo length
'mmmmhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil((32 * 8) / 4) * 'h' # new max length with hex-encoded hash, no size
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil((32 * 8) / 5) * 'h' # new max length with base32-encoded hash, no size
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil((32 * 8) / 6) * 'h' # new max length with base64-encoded hash, no size
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil(((32+4) * 8) / 4) * 'h' # new max length with hex-encoded hash, size included
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil(((32+4) * 8) / 5) * 'h' # new max length with base32-encoded hash, size included
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> 32 * 'n' + '-' + 32 * 'v' + '-' + ceil(((32+4) * 8) / 6) * 'h' # new max length with base64-encoded hash, size included
'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn-vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil(((32+4) * 8) / 4) * 'h' # real world example, hex-encoded hash, size included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil(((32+4) * 8) / 5) * 'h' # real world example, base32-encoded hash, size included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil(((32+4) * 8) / 6) * 'h' # real world example, base64-encoded hash, size included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil((192 + 32 + 16) / 4) * 'h' # real world example, hex-encoded hash, 192bit truncate, size and id included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil((192 + 32 + 16) / 5) * 'h' # real world example, base32-encoded hash, 192bit truncate, size and id included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'
>>> "nasm-2.16.1-3-" + ceil((192 + 32 + 16) / 6) * 'h' # real world example, base64-encoded hash, 192bit truncate, size and id included
'nasm-2.16.1-3-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh'

Anything you can't see above without scrolling makes me real uncomfortable, even though technically all of these will easily fit in the 255 limit for path components.

@andrewrk
Copy link
Member Author

I'm going to push back against the anti-dash argument.

  1. these are hashes, they do not need to be parsed
  2. even so, they can always be correctly parsed by parsing the name first, then the hash which has a fixed size, then the semver last.
  3. tilde is a pain in the ass to type on some keyboards. dash is always easy for everybody.
  4. $name-$version is a commonly seen combination in software development. users will experience cognitive dissonance if this pattern is broken.

@castholm
Copy link
Contributor

I feel like I tried this once but unfortunately I don't remember the results of my investigation. It looks like you can omit OBJ_CASE_INSENSITIVE when dealing with files on Windows.

On Windows 10 version 1803 and later (April 2018) you can use NtSetInformationFile to specify FileCaseSensitiveInformation and mark a directory as case-sensitive. I have no idea how well this would work in practice, and seeing as the difference between base64 and base32 is only ~10 characters it's probably not worth the hassle. There's also the caveat that there are still some LTSC versions from before 1803 that are officially supported through 2026.

@andrewrk
Copy link
Member Author

andrewrk commented Feb 25, 2025

If you're participating in this discussion, please read the commit messages of the newly pushed commits, particularly the one that introduces id.

Here are all the discussion points thus far:

### `id`

Together with name, this represents a globally unique package identifier. This
field should be initialized with a 16-bit random number when the package is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "should" is too vague when other fields have a definitive "required" or "optional".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere, the word "must" and "required" indicate that an error will occur. Here, I think the word "should" is appropriate because there is no error for not doing the thing, instead there are social consequences.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewrk
Copy link
Member Author

OBJ_CASE_INSENSITIVE

Does nothing. On Windows 10.0.19045, I tested creating and opening files via NtCreateFile without OBJ_CASE_INSENSITIVE enabled. In both cases, the file system operated in a case-insensitive manner. I observed the NtCreateFile call in procmon.exe and noted that the string used inside the application (i.e. "hello" vs "HELLO") was not reflected in the path reported by procmon.exe. procmon.exe always reported a path string matching what was already on disk, not what was sent to the NtCreateFile call, unless there was no file on disk, in which case the string would match the data from the application.

I did not experiment with FileCaseSensitiveInformation yet.

Comment on lines 649 to 651
.allow_missing_paths_field = f.allow_missing_paths_field,
.allow_missing_id = f.allow_missing_paths_field,
.allow_name_string = f.allow_missing_paths_field,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried building the branch and running zig fetch to re-fetch and re-hash packages and it failed with an "expected enum literal" error.

cmdFetch sets .allow_missing_paths_field = false, so if the intention is for users to be able to fetch legacy packages until they are wholly deprecated, you might want to break this out into a separate field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find; fixed in a commit to be pushed shortly.

src/main.zig Outdated
else => '_',
});
for (bytes[1..]) |byte| switch (byte) {
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte),
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte),
'-', '.' => '_',

- and . are somewhat common in repo/project names, so substituting them with _ instead of ignoring them probably gets you closer to what the user actually wants.

src/Package.zig Outdated
Comment on lines 75 to 77
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must
/// be a valid zig identifier
/// * semver is the version field from build.zig.zon, truncated at 32 bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must
/// be a valid zig identifier
/// * semver is the version field from build.zig.zon, truncated at 32 bytes
/// * name is the name field from build.zig.zon, limited to 32 bytes and must
/// be a valid zig identifier
/// * semver is the version field from build.zig.zon, limited to 32 bytes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm changing it to use the words "assert" and "assume" in accordance with the doc comment guidance

@castholm
Copy link
Contributor

Regarding id, it might be worth revisiting @mlugg's comments in #20183 (comment) about unintentional id duplication.

Inevitably, inexperienced users are going to copy/paste ids from references or tutorials, or have LLMs generate them, or hand-pick "cute" values like 0x0420 or 0xcafe, so there's a non-zero chance that conflicting package names and ids will happen in the wild even without any malicious intent from the authors. Baking a checksum into the id as suggested in the discussion in the linked issue helps a bit, e.g. a user following a tutorial will likely change the name field, which will trigger an error and force them to refresh the id.

@ifreund
Copy link
Member

ifreund commented Feb 25, 2025

I'm going to push back against the anti-dash argument.

I've been thinking about this more and I'm OK with dashes now, I hereby retract the anti-dash argument, let's focus on the more important things.

One thing I think is important that I haven't seen brought up yet is forwards-compatibility/extensibility. I think we can spare a single character somewhere in the format to act as a "format version number." I'd argue there's a non-zero chance that we end up wanting to change something about this format in the future, and I'm sure we would be very happy to have a version number if/when that happens.

@andrewrk
Copy link
Member Author

andrewrk commented Feb 26, 2025

Regarding ID checksum: it took me a while to understand the workflow @mlugg is trying to protect against, since changing the name does create a new logical id (logical id is name + id).

The problematic workflow is:

  • Someone creates a template with build.zig.zon, id field included (note that zig init does not create this problem since it generates fresh id every time it runs).
  • User A uses the template, changing package name to "example" but not id field.
  • User B uses the same template, changing package name also to "example", also not changing the id field.

Now both packages have unintentional conflicting logical ids.

The checksum idea is reasonable, if I understand it correctly. It would only possibly cause a manifest validation error, while the actual ID bytes would still be a u16.

Given that this field of the manifest is technically not an id, but in fact:

  • one component of the tuple that composes the id
  • includes a checksum mixed in

I think it could potentially be named better. Maybe "nonce" actually works pretty well?

Example error message:

error: invalid nonce: 0xcafebabe
note: if this is a new or forked package, use this nonce: 0xda8b2180

@andrewrk
Copy link
Member Author

andrewrk commented Feb 26, 2025

forwards-compatibility/extensibility. I think we can spare a single character somewhere in the format to act as a "format version number."

I thought about this, and concluded that the future format can add such character with equal utility. For instance, it could put a character outside the base64 charset at package_hash[package_hash.len - 43].

@ifreund
Copy link
Member

ifreund commented Feb 26, 2025

I hate to open another bikeshed, but I don't think "nonce" is an accurate name. It has a very specific meaning in cryptography (wikipedia). The defining characteristic of a nonce is that it is ephemeral and can only be used once.

Since this package identifier is persistent and used repeatedly to identify the package, it is not a nonce.

I personally don't see any issue with using "id" here, it's definitely more accurate than nonce. If you want to convey the fact that it is a unique id in the name, perhaps "uid" or similar would make sense.

Note that the existing and widely used UUID format is named universally unique identifier not universally unique nonce.

@andrewrk
Copy link
Member Author

andrewrk commented Feb 26, 2025

I agree it's important to consider seriously these names that would require breakage to change.

I agree that a cryptographic nonce has a very specific meaning, and that it has to do with being ephemeral and only used once. Regardless of any arguments I make here, I recognize that even just the fact that it seemed wrong to you is a fault against the name. However, I think there is actually a solid argument to make that the word is appropriate.

a nonce is an arbitrary number that can be used just once in a cryptographic communication

If you consider the "communication" to be the entire lifetime of the package, then it's a good fit. But I get that "communication" is typically measured in seconds, minutes, or days at most. Furthermore, it's partially composed of a checksum of the name. I'm afraid this is a new concept that does not have a premade nice name for us.

In this case it might be justified to create one. Cryptography had to create the word "nonce", "salt", "hash", etc.

The reason I don't want to use "id" - although I'm not strictly vetoing the name - is that there is already the concept of a package id. It's whatever this thing is combined with name. It's what I called a "logical id" earlier. However, the Zig programmer does not specify package id directly; they specify the name component, and then rely on tooling to autogenerate the random bits component. It would be nice when talking about package ids to not have to disambiguate between id field and logical package id.

In conclusion, I'm fully open to name counter-proposals, however, I also think that, while not perfect, "nonce" is defensible.

Edit: some names that my graphics card came up with:

  • token
  • entropy
  • fingerprint
  • salt (I mentioned this above; not technically in agreement with wikipedia one-sentence summary of "random data fed as an additional input to a one-way function that hashes data, a password or passphrase")

Final edit:

I still standby nonce as a viable name. However, I offer this "new concept" name: lift

It's short for "liftoff" which is short for "takeoff every zig", and also matches @kristoff-it's book title ("Zig Liftoff").

The name indicates that it is appropriate when "launching". I.e. new project, or a fork.

tests should use the API, not only verify compilation succeeds.
legacy format is also supported.

closes #20178
This branch regressed from master by switching to binary rather than hex
digest, allowing null bytes to end up in identifiers in the zig file.

This commit fixes it by changing the "hash" to be literally equal to the
sub_path (with a prefix '/' to indicate "global") if it can fit. If it
is too long then it is actually hashed, and that value used instead.
Introduces the `id` field to `build.zig.zon`.

Together with name, this represents a globally unique package
identifier. This field should be initialized with a 16-bit random number
when the package is first created, and then *never change*. This allows
Zig to unambiguously detect when one package is an updated version of
another.

When forking a Zig project, this id should be regenerated with a new
random number if the upstream project is still maintained. Otherwise,
the fork is *hostile*, attempting to take control over the original
project's identity.

`0x0000` is invalid because it obviously means a random number wasn't
used.

`0xffff` is reserved to represent "naked" packages.

Tracking issue #14288

Additionally:

* Fix bad path in error messages regarding build.zig.zon file.
* Manifest validates that `name` and `version` field of build.zig.zon
  are maximum 32 bytes.
* Introduce error for root package to not switch to enum literal for
  name.
* Introduce error for root package to omit `id`.
* Update init template to generate `id`
* Update init template to populate `minimum_zig_version`.
* New package hash format changes:
  - name and version limited to 32 bytes via error rather than truncation
  - truncate sha256 to 192 bits rather than 40 bits
  - include the package id

This means that, given only the package hashes for a complete dependency
tree, it is possible to perform version selection and know the final
size on disk, without doing any fetching whatsoever. This prevents
wasted bandwidth since package versions not selected do not need to be
fetched.
Adhere to the new rules: 32 byte limit + must be a valid bare zig
identifier
mainly this addresses the following use case:

1. Someone creates a template with build.zig.zon, id field included
   (note that zig init does not create this problem since it generates
   fresh id every time it runs).
2. User A uses the template, changing package name to "example" but not
   id field.
3. User B uses the same template, changing package name also to
   "example", also not changing the id field.

Here, both packages have unintentional conflicting logical ids.

By making the field a combination of name checksum + random id, this
accident is avoided. "nonce" is an OK name for this.

Also relaxes errors on remote packages when using `zig fetch`.
and to make the base64 round even, bump sha256 to 200 bits (up from 192)
@rohlem
Copy link
Contributor

rohlem commented Feb 26, 2025

I quite like salt, entropy, and fingerprint.
I personally think lift is the option that is the least self-explanatory, and will require the most people / everyone to look up and remember its meaning, which in my eyes is a slight negative.
There's also the option of a combination, like lift-entropy, but if id = name + x already, then "package's id salt" already seems descriptive enough.

@ifreund
Copy link
Member

ifreund commented Feb 26, 2025

I'm quite happy with fingerprint, thanks for tolerating my bikeshed with an open mind :)

@jedisct1
Copy link
Contributor

I like fingerprint as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Implementing this issue could cause existing code to no longer compile or have different behavior. release notes This PR should be mentioned in the release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new package hash format: $name-$semver-$hash