-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure that artifact cache files will not have long file names #406
base: main
Are you sure you want to change the base?
Conversation
Resolves #287 The previous fix here used a sha256 hash of the full URI as a way to ensure that the resulting file name for the artifact in the dependency cache won't exceed filesystem file name limits (i.e. it won't generate a file name that is too long). This causes some issues, mostly that you can't easily see what's in the dependency cache. This can be painful when building offline buildpacks. This would also cause a lot of test breakage as it would change the format of the dependency cache on disk and we have a lot of buildpacks that mock this format for testing purposes, if we change that, it would require updating a lot of tests. I think we can get away without doing that though. The real problem is that you might have a URL with query parameters on it, where the query params can be very long. What I believe we can do is to simply not include the query params in the name, which I believe is the actual intent of our implementation anyway. This fixes the issue with offline downloads because you still have a human readable name as the file name in the cache. This is OK with our test cases because test cases don't as far as I know mock any URIs with query parameters. If this does happen, it is much less frequent and shouldn't be a big deal to resolve. This should be safe because the path to the artifact file in the cache consists of the sha256 of the file being downloaded and then the last part of the URL's path. For example, `https://example.com/foo/bar/file.tgz` where the sha256 hash of `file.tgz` is `sha256-foo` will generate a path on disk of `cache-path/sha256-foo/file.tgz`. Since the dependency configured in `buildpack.toml` consists of the url & the sha256 of the file, it would be very odd to have a second dependency configured in `buildpack.toml` with a different URL that points to the exact same file (I can't think of a reason anyway). In fact, if you had that what libpak would do is to download the first file and then reuse the first file from the cache, because the sha256 hashes of the file match (i.e. it would never download the second URL). I debated on dropping the file name altogether and just using the dependency's sha256 hash, but for the offline scenario, it is nice to have. This solves the reported problem as well because it removes the arbitrarily long parts (i.e. the query params) which were causing the reported issue. Signed-off-by: Daniel Mikusa <[email protected]>
Signed-off-by: Daniel Mikusa <[email protected]>
LGTM 👍 |
Are the labels correct? At least |
We recently changed the v2 branch to main, and v1 is now on its own branch. So this should merge into the v2 branch. The previous PR was a problem because it changed the format of the cache file names. This had some unanticipated impact. The change here doesn't change the structure for any of our Paketo Buildpacks. That's why I marked it as patch, but this is all rolled up into v2 which is a major version anyway. I do think we'll need to mark this as something to include in the migration guide. It could result in a change in behavior if you have query parameters in your download URLs. |
Summary
Resolves #287
The previous fix here used a sha256 hash of the full URI as a way to ensure that the resulting file name for the artifact in the dependency cache won't exceed filesystem file name limits (i.e. it won't generate a file name that is too long).
This causes some issues, mostly that you can't easily see what's in the dependency cache. This can be painful when building offline buildpacks. This would also cause a lot of test breakage as it would change the format of the dependency cache on disk and we have a lot of buildpacks that mock this format for testing purposes, if we change that, it would require updating a lot of tests.
I think we can get away without doing that though. The real problem is that you might have a URL with query parameters on it, where the query parameters can be very long. What I believe we can do is to simply not include the query params in the name, which I believe is the actual intent of our implementation anyway.
This fixes the issue with offline downloads because you still have a human-readable name as the file name in the cache.
This is OK with our test cases because test cases don't as far as I know mock any URIs with query parameters. If this does happen, it is much less frequent and shouldn't be a big deal to resolve.
My concern with truncating the query params is that you could end up with a collision, where the URIs are different but libpak restores a different file from the cache.
This should be safe because the path to the artifact file in the cache consists of the sha256 of the file being downloaded and then the last part of the URL's path. For example,
https://example.com/foo/bar/file.tgz
where the sha256 hash offile.tgz
issha256-foo
will generate a path on disk ofcache-path/sha256-foo/file.tgz
. Since the dependency configured inbuildpack.toml
consists of the url & the sha256 of the file, it would be very odd to have a second dependency configured inbuildpack.toml
with a different URL that points to the exact same file (I can't think of a reason anyway). In fact, if you had that what libpak would do is to download the first file and then reuse the first file from the cache, because the sha256 hashes of the file match (i.e. it would never download the second URL).I debated on dropping the file name altogether and just using the dependency's sha256 hash, but for the offline scenario, it is nice to have.
This solves the reported problem as well because it removes the arbitrarily long parts (i.e. the query params) that were causing the reported issue.
The only case that comes to mind where I thought this might cause an issue is the Dynatrace buildpack, which generates the URL for the dependency on demand and it does include some sets of query parameters. This should be Ok as well because it does not set a sha256 sum for the binary, which means that libpak will not cache it. It will always download it, so we don't really care what the file name is on the disk. If the query parameters change, it will download the file again anyway.