-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement new package hash format: $name-$semver-$hash
#22994
base: master
Are you sure you want to change the base?
Conversation
if (p.allow_name_string and node_tags[node] == .string_literal) { | ||
const name = try parseString(p, node); | ||
if (!std.zig.isValidId(name)) | ||
return fail(p, main_token, "name must be a valid bare zig identifier (hint: switch from string to enum literal)", .{}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something here, but why? We have @""
specifically for these usecases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, fair enough!
$name-$semver-$hash
indefinitely? maybe soft deprecate with a tool to migrate zon files automatically? |
I will integrate it with #22898 when that patchset lands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that using -
as the separator here is a mistake. Since the semver spec also uses -
as a separator, splitting the $name-$semver-$hash
string on -
is a bug while it wouldn't have to be if we chose a different separator that is neither allowed in bare zig identifiers nor semver versions.
One possible separator is the tilde character, ~
which is safe for use in file names on Linux, MacOS, and Windows at the very least (as long as a tilde is not the first character, which is not the case here).
Other options that come to mind are the comma ,
, the octothorpe #
, the percent sign %
, and the carat ^
.
Aesthetically, I personally find either the tilde ~
or the comma ,
most pleasing, though I would prefer any of the options I listed over -
from a technical standpoint.
The base64 hash part may also contain But what's the use-case for parsing this, especially since components are truncated? The ambiguity in the format might be a feature so people don't start depending on a specific format. |
src/Package.zig
Outdated
/// * sizedhash is the following 9-byte array, base64 encoded using -_ to make | ||
/// it filesystem safe: | ||
/// - (4 bytes) LE u32 total decompressed size in bytes | ||
/// - (5 bytes) truncated SHA-256 of hashed files of the package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is truncating the sha256 hash to 40 bits really ok? My understanding is that while truncating the result of a "good" hash function like sha256 does not reduce the "strength" of the hash (i.e. difficulty of finding the pre-image for a given hash), it does significantly increase the likelihood of collisions.
I am by no means an expert, but it seems like truncating to 40 bits puts a Birthday attack in the realm of possibility.
Either way, I think it is important to have someone who has a much smaller set of cryptography "unknown unknowns" than you or I think about this.
Regardless of whether or not it is ok to use only 40 bits, we should really document the reason of why it is ok or why it is not ok in this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this. Even when considering name, semver and decompressed size as part of the hash, this doesn't make it much harder to find collisions.
Suppose someone generates 2^20 good versions of a package, e.g. by modifying code comments, changing variable names or whatever, but keeping name, semver and total size the same. With 40 bit hashes, there will be few collisions, so there will be about 2^20 distinct hashes. Now they make a malicious change to the package again while keeping name, semver, decompressed size the same. They generate 2^20 malicious versions and now the chance of a good version and a malicious version sharing the same hash is about 1-(1-2^(-20))^(2^20), which is around 63%. They could ship the good version first and switch to the malicious version after any code reviews are done.
I doesn't take that long to generate and compute the hashes of two million minor modifications of a package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find these arguments convincing, and, thinking about it, now believe that we probably shouldn't be truncating the hash at all. It's an added risk for at best dubious benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that Windows is case-insensitive, so the likelihood of running into bugs from conflicting base64url-encoded hashes is even higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@castholm Case-insensitivity applies to access, but case information is still stored on all supported types of file systems, right?
So zig can specifically check the case-sensitive name of the directory in the path after opening the directory handle, safely retrieve/depend on that information, and error on case mismatch, right?
This does require extra logic (and potentially an extra syscall) though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about sha-256 truncated to 192 bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about sha-256 truncated to 192 bits?
I think 192 bits should be sufficient to make the birthday attack described above infeasible. This changes the order of magnitude of the variants that would need to be brute-force checked from 2^20 to 2^96.
A few million is a small number for a computer, a few octillion is definitely not small. To put than in perspective, if a 5GHz CPU core checked one variant per cycle it would take 500 billion years to check 2^96 variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor point, but if you still end up using Base32, then truncating to a multiple of 5 bits makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a multiple of 5 bits
For instance, 200 bits are a whole number of bytes and a multiple of 5.
This is only a problem if being able to split on a single character is an explicit design goal (e.g. using
|
src/Package.zig
Outdated
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must | ||
/// be a valid zig identifier | ||
/// * semver is the version field from build.zig.zon, truncated at 32 bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why truncate the package name and version instead of hard-limiting them to 32 bytes and checking the lengths when validating the manifest?
Where were all these great points during the issue discussion in June! 🙂 |
ehh the disagreement about I think the main challenge here is the file system case insensitivity on Windows. I can't believe I didn't consider that until now. Yeesh. |
Base 32 encoding is just 20% longer, but I'm not sure about what other constraints there are. |
I feel like I tried this once but unfortunately I don't remember the results of my investigation. It looks like you can omit |
Anything you can't see above without scrolling makes me real uncomfortable, even though technically all of these will easily fit in the 255 limit for path components. |
I'm going to push back against the anti-dash argument.
|
On Windows 10 version 1803 and later (April 2018) you can use |
If you're participating in this discussion, please read the commit messages of the newly pushed commits, particularly the one that introduces Here are all the discussion points thus far:
|
doc/build.zig.zon.md
Outdated
### `id` | ||
|
||
Together with name, this represents a globally unique package identifier. This | ||
field should be initialized with a 16-bit random number when the package is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "should" is too vague when other fields have a definitive "required" or "optional".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere, the word "must" and "required" indicate that an error will occur. Here, I think the word "should" is appropriate because there is no error for not doing the thing, instead there are social consequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does nothing. On Windows 10.0.19045, I tested creating and opening files via I did not experiment with |
src/Package/Fetch.zig
Outdated
.allow_missing_paths_field = f.allow_missing_paths_field, | ||
.allow_missing_id = f.allow_missing_paths_field, | ||
.allow_name_string = f.allow_missing_paths_field, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried building the branch and running zig fetch
to re-fetch and re-hash packages and it failed with an "expected enum literal" error.
cmdFetch
sets .allow_missing_paths_field = false
, so if the intention is for users to be able to fetch legacy packages until they are wholly deprecated, you might want to break this out into a separate field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find; fixed in a commit to be pushed shortly.
src/main.zig
Outdated
else => '_', | ||
}); | ||
for (bytes[1..]) |byte| switch (byte) { | ||
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte), | |
'_', 'a'...'z', 'A'...'Z', '0'...'9' => try result.append(arena, byte), | |
'-', '.' => '_', |
-
and .
are somewhat common in repo/project names, so substituting them with _
instead of ignoring them probably gets you closer to what the user actually wants.
src/Package.zig
Outdated
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must | ||
/// be a valid zig identifier | ||
/// * semver is the version field from build.zig.zon, truncated at 32 bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// * name is the name field from build.zig.zon, truncated at 32 bytes and must | |
/// be a valid zig identifier | |
/// * semver is the version field from build.zig.zon, truncated at 32 bytes | |
/// * name is the name field from build.zig.zon, limited to 32 bytes and must | |
/// be a valid zig identifier | |
/// * semver is the version field from build.zig.zon, limited to 32 bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'm changing it to use the words "assert" and "assume" in accordance with the doc comment guidance
Regarding Inevitably, inexperienced users are going to copy/paste ids from references or tutorials, or have LLMs generate them, or hand-pick "cute" values like |
I've been thinking about this more and I'm OK with dashes now, I hereby retract the anti-dash argument, let's focus on the more important things. One thing I think is important that I haven't seen brought up yet is forwards-compatibility/extensibility. I think we can spare a single character somewhere in the format to act as a "format version number." I'd argue there's a non-zero chance that we end up wanting to change something about this format in the future, and I'm sure we would be very happy to have a version number if/when that happens. |
Regarding ID checksum: it took me a while to understand the workflow @mlugg is trying to protect against, since changing the name does create a new logical id (logical id is name + id). The problematic workflow is:
Now both packages have unintentional conflicting logical ids. The checksum idea is reasonable, if I understand it correctly. It would only possibly cause a manifest validation error, while the actual ID bytes would still be a u16. Given that this field of the manifest is technically not an id, but in fact:
I think it could potentially be named better. Maybe "nonce" actually works pretty well? Example error message:
|
I thought about this, and concluded that the future format can add such character with equal utility. For instance, it could put a character outside the base64 charset at |
I hate to open another bikeshed, but I don't think "nonce" is an accurate name. It has a very specific meaning in cryptography (wikipedia). The defining characteristic of a nonce is that it is ephemeral and can only be used once. Since this package identifier is persistent and used repeatedly to identify the package, it is not a nonce. I personally don't see any issue with using "id" here, it's definitely more accurate than nonce. If you want to convey the fact that it is a unique id in the name, perhaps "uid" or similar would make sense. Note that the existing and widely used UUID format is named universally unique identifier not universally unique nonce. |
I agree it's important to consider seriously these names that would require breakage to change. I agree that a cryptographic nonce has a very specific meaning, and that it has to do with being ephemeral and only used once. Regardless of any arguments I make here, I recognize that even just the fact that it seemed wrong to you is a fault against the name. However, I think there is actually a solid argument to make that the word is appropriate.
If you consider the "communication" to be the entire lifetime of the package, then it's a good fit. But I get that "communication" is typically measured in seconds, minutes, or days at most. Furthermore, it's partially composed of a checksum of the name. I'm afraid this is a new concept that does not have a premade nice name for us. In this case it might be justified to create one. Cryptography had to create the word "nonce", "salt", "hash", etc. The reason I don't want to use "id" - although I'm not strictly vetoing the name - is that there is already the concept of a package id. It's whatever this thing is combined with name. It's what I called a "logical id" earlier. However, the Zig programmer does not specify package id directly; they specify the name component, and then rely on tooling to autogenerate the random bits component. It would be nice when talking about package ids to not have to disambiguate between id field and logical package id. In conclusion, I'm fully open to name counter-proposals, however, I also think that, while not perfect, "nonce" is defensible. Edit: some names that my graphics card came up with:
Final edit: I still standby It's short for "liftoff" which is short for "takeoff every zig", and also matches @kristoff-it's book title ("Zig Liftoff"). The name indicates that it is appropriate when "launching". I.e. new project, or a fork. |
tests should use the API, not only verify compilation succeeds.
legacy format is also supported. closes #20178
This branch regressed from master by switching to binary rather than hex digest, allowing null bytes to end up in identifiers in the zig file. This commit fixes it by changing the "hash" to be literally equal to the sub_path (with a prefix '/' to indicate "global") if it can fit. If it is too long then it is actually hashed, and that value used instead.
Introduces the `id` field to `build.zig.zon`. Together with name, this represents a globally unique package identifier. This field should be initialized with a 16-bit random number when the package is first created, and then *never change*. This allows Zig to unambiguously detect when one package is an updated version of another. When forking a Zig project, this id should be regenerated with a new random number if the upstream project is still maintained. Otherwise, the fork is *hostile*, attempting to take control over the original project's identity. `0x0000` is invalid because it obviously means a random number wasn't used. `0xffff` is reserved to represent "naked" packages. Tracking issue #14288 Additionally: * Fix bad path in error messages regarding build.zig.zon file. * Manifest validates that `name` and `version` field of build.zig.zon are maximum 32 bytes. * Introduce error for root package to not switch to enum literal for name. * Introduce error for root package to omit `id`. * Update init template to generate `id` * Update init template to populate `minimum_zig_version`. * New package hash format changes: - name and version limited to 32 bytes via error rather than truncation - truncate sha256 to 192 bits rather than 40 bits - include the package id This means that, given only the package hashes for a complete dependency tree, it is possible to perform version selection and know the final size on disk, without doing any fetching whatsoever. This prevents wasted bandwidth since package versions not selected do not need to be fetched.
Adhere to the new rules: 32 byte limit + must be a valid bare zig identifier
mainly this addresses the following use case: 1. Someone creates a template with build.zig.zon, id field included (note that zig init does not create this problem since it generates fresh id every time it runs). 2. User A uses the template, changing package name to "example" but not id field. 3. User B uses the same template, changing package name also to "example", also not changing the id field. Here, both packages have unintentional conflicting logical ids. By making the field a combination of name checksum + random id, this accident is avoided. "nonce" is an OK name for this. Also relaxes errors on remote packages when using `zig fetch`.
and to make the base64 round even, bump sha256 to 200 bits (up from 192)
I quite like |
I'm quite happy with fingerprint, thanks for tolerating my bikeshed with an open mind :) |
I like |
Legacy format is also supported.
closes #20178
@
zig fetch
on legacy packageAlthough this change allows the previous hash format, it is breaking because it institutes the new package naming rules outlined in the linked issue. In practice, I expect this to affect only those who put "-" in the package name.
Followup Issues
200 * ((64-26) / 64)
). Issue is fixable with an additional check at worst, marking the directory case sensitive at best.paths
field, making makes the package unfetchable. Then disallow.paths = .{""}
. Add basic globbing support.name
,nonce
, andversion
fields whenpaths
field is omitted (unfetchable package). Do this for zig's own build.zig.zon