Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that artifact cache files will not have long file names #406

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dmikusa
Copy link
Contributor

@dmikusa dmikusa commented Mar 22, 2025

Summary

Resolves #287

The previous fix here used a sha256 hash of the full URI as a way to ensure that the resulting file name for the artifact in the dependency cache won't exceed filesystem file name limits (i.e. it won't generate a file name that is too long).

This causes some issues, mostly that you can't easily see what's in the dependency cache. This can be painful when building offline buildpacks. This would also cause a lot of test breakage as it would change the format of the dependency cache on disk and we have a lot of buildpacks that mock this format for testing purposes, if we change that, it would require updating a lot of tests.

I think we can get away without doing that though. The real problem is that you might have a URL with query parameters on it, where the query parameters can be very long. What I believe we can do is to simply not include the query params in the name, which I believe is the actual intent of our implementation anyway.

This fixes the issue with offline downloads because you still have a human-readable name as the file name in the cache.

This is OK with our test cases because test cases don't as far as I know mock any URIs with query parameters. If this does happen, it is much less frequent and shouldn't be a big deal to resolve.

My concern with truncating the query params is that you could end up with a collision, where the URIs are different but libpak restores a different file from the cache.

This should be safe because the path to the artifact file in the cache consists of the sha256 of the file being downloaded and then the last part of the URL's path. For example, https://example.com/foo/bar/file.tgz where the sha256 hash of file.tgz is sha256-foo will generate a path on disk of cache-path/sha256-foo/file.tgz. Since the dependency configured in buildpack.toml consists of the url & the sha256 of the file, it would be very odd to have a second dependency configured in buildpack.toml with a different URL that points to the exact same file (I can't think of a reason anyway). In fact, if you had that what libpak would do is to download the first file and then reuse the first file from the cache, because the sha256 hashes of the file match (i.e. it would never download the second URL).

I debated on dropping the file name altogether and just using the dependency's sha256 hash, but for the offline scenario, it is nice to have.

This solves the reported problem as well because it removes the arbitrarily long parts (i.e. the query params) that were causing the reported issue.

The only case that comes to mind where I thought this might cause an issue is the Dynatrace buildpack, which generates the URL for the dependency on demand and it does include some sets of query parameters. This should be Ok as well because it does not set a sha256 sum for the binary, which means that libpak will not cache it. It will always download it, so we don't really care what the file name is on the disk. If the query parameters change, it will download the file again anyway.

Resolves #287

The previous fix here used a sha256 hash of the full URI as a way to ensure that the resulting file name for the artifact in the dependency cache won't exceed filesystem file name limits (i.e. it won't generate a file name that is too long).

This causes some issues, mostly that you can't easily see what's in the dependency cache. This can be painful when building offline buildpacks. This would also cause a lot of test breakage as it would change the format of the dependency cache on disk and we have a lot of buildpacks that mock this format for testing purposes, if we change that, it would require updating a lot of tests.

I think we can get away without doing that though. The real problem is that you might have a URL with query parameters on it, where the query params can be very long. What I believe we can do is to simply not include the query params in the name, which I believe is the actual intent of our implementation anyway.

This fixes the issue with offline downloads because you still have a human readable name as the file name in the cache.

This is OK with our test cases because test cases don't as far as I know mock any URIs with query parameters. If this does happen, it is much less frequent and shouldn't be a big deal to resolve.

This should be safe because the path to the artifact file in the cache consists of the sha256 of the file being downloaded and then the last part of the URL's path. For example, `https://example.com/foo/bar/file.tgz` where the sha256 hash of `file.tgz` is `sha256-foo` will generate a path on disk of `cache-path/sha256-foo/file.tgz`. Since the dependency configured in `buildpack.toml` consists of the url & the sha256 of the file, it would be very odd to have a second dependency configured in `buildpack.toml` with a different URL that points to the exact same file (I can't think of a reason anyway). In fact, if you had that what libpak would do is to download the first file and then reuse the first file from the cache, because the sha256 hashes of the file match (i.e. it would never download the second URL).

I debated on dropping the file name altogether and just using the dependency's sha256 hash, but for the offline scenario, it is nice to have.

This solves the reported problem as well because it removes the arbitrarily long parts (i.e. the query params) which were causing the reported issue.

Signed-off-by: Daniel Mikusa <[email protected]>
@dmikusa dmikusa added type:bug A general bug semver:patch A change requiring a patch version bump labels Mar 22, 2025
@dmikusa dmikusa requested a review from a team as a code owner March 22, 2025 20:13
@modulo11
Copy link
Contributor

LGTM 👍

@nicolasbender
Copy link

Are the labels correct? At least v2 seems to be missing. This looks to me like it would introduce a breaking change again since the last time the file-naming was changed, it caused a rollback. Or is this change only considered a patch because it belongs to the desired functionality of libpak v2?

@dmikusa
Copy link
Contributor Author

dmikusa commented Mar 25, 2025

We recently changed the v2 branch to main, and v1 is now on its own branch. So this should merge into the v2 branch.

The previous PR was a problem because it changed the format of the cache file names. This had some unanticipated impact.

The change here doesn't change the structure for any of our Paketo Buildpacks. That's why I marked it as patch, but this is all rolled up into v2 which is a major version anyway. I do think we'll need to mark this as something to include in the migration guide. It could result in a change in behavior if you have query parameters in your download URLs.

@dmikusa dmikusa requested a review from anthonydahanne March 28, 2025 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver:patch A change requiring a patch version bump type:bug A general bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove http query params from destination file name release-2.x branch
4 participants