fix(bundler): support git gems#1629
Conversation
a88cb57 to
5e77cc1
Compare
There was a problem hiding this comment.
Code Review
This pull request refactors how git dependencies are handled during hermetic builds in the Bundler package manager. Instead of using BUNDLE_LOCAL.* configurations, it redirects git remote URLs to pre-fetched local clones using git's insteadOf rewrites in a global git config. Feedback on these changes includes: 1) Handling a potential issue where multiple git dependencies point to the same repository URL but with different revisions, which would cause git's insteadOf to only respect the last entry. It is suggested to detect this and raise an UnsupportedFeature error. 2) Aligning the new unit test with the repository style guide by avoiding substring matching (in) and instead asserting exact equality against a dedented expected string.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
de167b4 to
62456d8
Compare
I feel like the blast radius of We may be safer with using dedicated key-value pairs: https://git-scm.com/docs/git-config#Documentation/git-config.txt-GITCONFIGCOUNT although that will be more fiddly to correctly generate and still won't be 100% bulletproof in the end, but definitely a lot less likely to cause issues during container build, if this git override works correctly that is (I haven't verified your claim). |
62456d8 to
dda4cfc
Compare
Thanks for the suggestion! |
4233efd to
984fa62
Compare
| To solve this without coupling to bundler's internal cache format | ||
| (which has [changed across versions](https://github.com/ruby/rubygems/commit/7d6b6316)), we use git's | ||
| [`url.insteadOf`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-urlltbasegtinsteadOf) | ||
| via `GIT_CONFIG_COUNT`/`KEY`/`VALUE` environment variables to redirect each remote URL to the |
There was a problem hiding this comment.
It'd be best to have a brief example on this so it's easier for imagination.
3e85c3f to
48e4b66
Compare
eskultety
left a comment
There was a problem hiding this comment.
@romanblanco https://github.com/hermetoproject/hermeto/blob/main/CONTRIBUTING.md#pull-request-guidelines
- Run linters & unit tests before opening a PR
- Ensure every single commit passes CI
|
@romanblanco you will need to squash the following to the 2nd commit: diff --git tests/integration/test_data/bundler_e2e/.build-config.yaml tests/integration/test_data/bundler_e2e/.build-config.yaml
index b23f60ab..34ab6c93 100644
--- tests/integration/test_data/bundler_e2e/.build-config.yaml
+++ tests/integration/test_data/bundler_e2e/.build-config.yaml
@@ -2,13 +2,13 @@ environment_variables:
- name: BUNDLE_APP_CONFIG
value: ${output_dir}/bundler/config_override
- name: GIT_CONFIG_COUNT
- value: "2"
+ value: '2'
- name: GIT_CONFIG_KEY_0
value: url.file://${output_dir}/deps/bundler/json-schema-26487618a684/.insteadOf
-- name: GIT_CONFIG_VALUE_0
- value: https://github.com/3scale/json-schema
- name: GIT_CONFIG_KEY_1
value: protocol.file.allow
+- name: GIT_CONFIG_VALUE_0
+ value: https://github.com/3scale/json-schema
- name: GIT_CONFIG_VALUE_1
value: always
project_files:
diff --git tests/integration/test_data/bundler_e2e_missing_gemspec/.build-config.yaml tests/integration/test_data/bundler_e2e_missing_gemspec/.build-config.yaml
index b23f60ab..34ab6c93 100644
--- tests/integration/test_data/bundler_e2e_missing_gemspec/.build-config.yaml
+++ tests/integration/test_data/bundler_e2e_missing_gemspec/.build-config.yaml
@@ -2,13 +2,13 @@ environment_variables:
- name: BUNDLE_APP_CONFIG
value: ${output_dir}/bundler/config_override
- name: GIT_CONFIG_COUNT
- value: "2"
+ value: '2'
- name: GIT_CONFIG_KEY_0
value: url.file://${output_dir}/deps/bundler/json-schema-26487618a684/.insteadOf
-- name: GIT_CONFIG_VALUE_0
- value: https://github.com/3scale/json-schema
- name: GIT_CONFIG_KEY_1
value: protocol.file.allow
+- name: GIT_CONFIG_VALUE_0
+ value: https://github.com/3scale/json-schema
- name: GIT_CONFIG_VALUE_1
value: always
project_files: |
eskultety
left a comment
There was a problem hiding this comment.
@romanblanco I did some further experiments on our side in context of this request of mine: https://github.com/hermetoproject/hermeto/pull/1629/changes#r3490366370
diff --git tests/integration/test_bundler.py tests/integration/test_bundler.py
index 55438345..35966c93 100644
--- tests/integration/test_bundler.py
+++ tests/integration/test_bundler.py
@@ -68,9 +68,19 @@ def test_bundler_packages(
packages=({"path": ".", "type": "bundler", "binary": {}},),
check_output=True,
),
- [], # No additional commands are run to verify the build
[],
- id="bundler_e2e",
+ [],
+ id="bundler_e2e_ruby33",
+ ),
+ pytest.param(
+ utils.TestParameters(
+ branch="bundler/e2e",
+ packages=({"path": ".", "type": "bundler", "binary": {}},),
+ check_output=True,
+ ),
+ [],
+ [],
+ id="bundler_e2e_ruby40",
),
pytest.param(
utils.TestParameters(Then I migrated/introduced test case/s to cover both old and new bundler with multi-stage builds, e.g.:
diff --git tests/integration/test_data/bundler_e2e_ruby40/container/Containerfile tests/integration/test_data/bundler_e2e_ruby40/container/Containerfile
new file mode 100644
index 00000000..0507284e
--- /dev/null
+++ tests/integration/test_data/bundler_e2e_ruby40/container/Containerfile
@@ -0,0 +1,15 @@
+FROM mirror.gcr.io/ruby:4.0 AS builder
+
+# Test disabled network access
+RUN if getent hosts www.google.com; then echo "Has network access!"; exit 1; fi
+
+WORKDIR /src
+
+ENV BUNDLE_PATH=/gems
+RUN . /tmp/hermeto.env && bundle install
+
+FROM mirror.gcr.io/ruby:4.0
+
+COPY --from=builder /gems /gems
+RUN test -d /gems/ruby/*/bundler/gems/json-schema-*and it worked. I'm not encouraging to copy this verbatim, this is something I botched quickly locally. However, I think the PR could go like this:
- Rename the existing integration test
bundler_e2etest data tobundler_e2e_ruby33 - Convert to cloning bare repos
- Introducing the GIT_CONFIG configuration
- Changing the renamed e2e_ruby33 test to a multi-stage docker build and introducing a new ruby 4.0 test pretty much identical to the e2e_ruby33 one so that we have coverage for both old and new bundler going forward.
How does ^that sound?
btw. This is not a feature, it is a bugfix ;) . |
Signed-off-by: Roman Blanco <rblanco@redhat.com>
Use bare git clones (`git clone --bare`) instead of full clones for git dependencies. Bare clones contain only git objects without a working tree, which is all that bundler needs to resolve the source. Drop the `branch` field from the lockfile parser output — bundler's git source only requires `remote` and `revision` to identify and install a git dependency [1][2]. The branch metadata was parsed but never consumed during the offline install. Remove BUNDLE_LOCAL configuration from the hermetic build setup. BUNDLE_LOCAL requires a working tree with a checked-out revision, which bare clones do not provide. [1] https://github.com/ruby/rubygems/blob/v4.0.15/bundler/lib/bundler/source/git.rb#L53-L62 [2] https://github.com/ruby/rubygems/blob/e7167b9a42e9804a52c04a2c16aab31949471d69/bundler/lib/bundler/source/git.rb#L38 Signed-off-by: Roman Blanco <rblanco@redhat.com>
…teadOf BUNDLE_LOCAL causes bundler to use git gems in-place from the clone directory rather than installing them into BUNDLE_PATH. In multi-stage container builds, the runtime stage copies only BUNDLE_PATH — the in-place clones are lost, producing LoadError at startup. Replace BUNDLE_LOCAL with git's url.<base>.insteadOf mechanism [1], configured through GIT_CONFIG_COUNT/KEY/VALUE environment variables. This transparently redirects each git dependency's remote URL to a local file:// path pointing at the prefetched bare clone. Bundler treats the dependency as a normal git source and installs it into BUNDLE_PATH, surviving stage transitions. Also add protocol.file.allow=always to the git configuration, as git may restrict file:// protocol access by default [2]. Reject multiple git gems pointing at the same remote URL, since url.insteadOf maps one URL to one local path. [1] https://git-scm.com/docs/git-config#Documentation/git-config.txt-urlltbasegtinsteadOf [2] https://git-scm.com/docs/git-config#ENVIRONMENT Closes hermetoproject#1630 Signed-off-by: Roman Blanco <rblanco@redhat.com>
…ge builds Multi-stage Containerfiles prove that url.insteadOf-based git dependency resolution survives COPY --from=builder stage transitions. The second stage verifies the git-sourced gem (json-schema) lands in the expected bundler/gems directory without network access. Add Ruby 4.0 test variant (bundler 4.x) alongside existing Ruby 3.3 (bundler 2.x) to cover both current and next-generation bundler behavior. Signed-off-by: Roman Blanco <rblanco@redhat.com>
EnvironmentVariable instances are never mutated after construction. resolve_value() is a pure function returning a new string. Making the model frozen enforces this invariant and enables hashing, which allows using EnvironmentVariable in sets and as dict keys. Reference: hermetoproject#1629 (comment) Signed-off-by: Erik Skultety <eskultet@redhat.com>
48e4b66 to
89de0a7
Compare
eskultety
left a comment
There was a problem hiding this comment.
This also needs to be documented in the user-facing docs, i.e. docs/bundler.md to tell users about the build-time env variables we expose. This is because it's still not a bulletproof solution, IOW the GIT_CONFIG_N numbered variables we expose might potentially conflict with other identically named variables populated from outside the container image during build in which case our variables would overwrite those. That said, what we've come up with seems like the only way for now and is definitely a smaller blast radius than setting global git config. Please make sure to warn users about this, so they can account for that in their builds.
Other than that the logic is IMO sound, I verified each commit in this revision with integration tests to make sure we don't magically break in between commits (it's important here) and everything is green.
Minor comments left since there will have to be 1 more revision and we should be good to go.
| tzinfo-2.0.6.gem | ||
| ``` | ||
|
|
||
| Notice that all the `.gem` dependencies are kept in their original format, and Git dependencies are just plain clones |
There was a problem hiding this comment.
Commit message:
Use bare git clones (
git clone --bare) instead of full clones for
git dependencies. Bare clones contain only git objects without a
working tree, which is all that bundler needs to resolve the source.
This is all true, undisputed, but I'd like to see more context in the message for bookkeeping reasons, i.e. this commit deletes the BUNDLE_LOCAL_* variables and converts to bare git clones but doesn't discuss why that is okay to do, especially since newer bundler would likely complain. Therefore, I'd like to see context expansion saying something along the lines of the following:
- the design was originally based on newer bundler than we tested for which differs in the git gem cache format
- adjust the design to favour the older bundler approach (bare git clones) as tested, the next patch will expand this with proper support for newer bundler and provide version agnostic support
| Bundler seems to follow a different approach when it comes to git dependencies since in its default | ||
| configuration it always tries to fetch the source from the remote to ensure the application is built | ||
| against the correct branch/revision. This argument is indirectly supported by the | ||
| [docs](https://bundler.io/guides/deploying.html#deploying-your-application), more specifically: | ||
|
|
||
| > If you have run bundle pack, checked in the vendor/cache directory, and do not have any git gems, | ||
| Bundler will not contact the internet while installing your bundle. | ||
|
|
||
| This is a problem for hermetic builds and as such setting `BUNDLE_DEPLOYMENT` alone doesn't help | ||
| and we need more settings. In order to overcome this behavioral trait, we need to follow the | ||
| recommendation in the [config](https://bundler.io/v2.5/man/bundle-config.1.html#LOCAL-GIT-REPOS) | ||
| docs and override each git dependency with the location on the disk we fetched the git dependency | ||
| to and tell bundler about it with the `BUNDLE_LOCAL__<GEM_NAME>` configuration key. | ||
| However, this still isn't enough for Bundler to honour offline installs with git dependencies, | ||
| because then it's trying to enforce further checks as outlined in the | ||
| [docs](https://bundler.io/v2.5/man/bundle-config.1.html#LOCAL-GIT-REPOS): | ||
|
|
||
| >Bundler does many checks to ensure a developer won't work with invalid references. Particularly, | ||
| >we force a developer to specify a branch in the Gemfile in order to use this feature. If the | ||
| >branch specified in the Gemfile and the current branch in the local git repository do not | ||
| >match, Bundler will abort. | ||
|
|
||
| Therefore, we additionally need to enforce both `BUNDLE_DISABLE_LOCAL_BRANCH_CHECK` and | ||
| `BUNDLE_DISABLE_LOCAL_REVISION_CHECK`. | ||
| Bundler always tries to fetch git dependencies from the remote, so `BUNDLE_DEPLOYMENT` alone | ||
| doesn't prevent network access. |
There was a problem hiding this comment.
With this hunk here I'd like to be more thorough though - it should go to the earlier commit switching to bare git clones. It's fine to keep the section in that commit brief, we'll add the rest (below) in this commit.
| To solve this without coupling to bundler's internal cache format | ||
| (which has [changed across versions](https://github.com/ruby/rubygems/commit/7d6b6316)), we use git's | ||
| [`url.insteadOf`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-urlltbasegtinsteadOf) | ||
| via `GIT_CONFIG_COUNT`/`KEY`/`VALUE` environment variables to redirect each remote URL to the | ||
| pre-fetched local bare clone. This operates at the git transport layer — bundler runs its standard | ||
| flow but git silently fetches from disk instead of the network. | ||
|
|
||
| For example, given a git dependency on `https://github.com/3scale/json-schema`, hermeto sets: | ||
|
|
||
| ``` | ||
| GIT_CONFIG_COUNT=2 | ||
| GIT_CONFIG_KEY_0=url.file:///output/deps/bundler/json-schema-26487618a684/.insteadOf | ||
| GIT_CONFIG_VALUE_0=https://github.com/3scale/json-schema | ||
| GIT_CONFIG_KEY_1=protocol.file.allow | ||
| GIT_CONFIG_VALUE_1=always | ||
| ``` | ||
|
|
||
| The `protocol.file.allow=always` entry ensures git permits the `file://` protocol after the URL | ||
| rewrite. See the [git documentation](https://git-scm.com/docs/git-config#Documentation/git-config.txt-protocolallow) | ||
| for details. |
There was a problem hiding this comment.
Yes, this hunk discussing the actual solution we've implemented should remain here as is. :)
| if len(dirs) > 1: | ||
| raise UnsupportedFeature( | ||
| f"Multiple git dependencies point to the same repository ({url}) " | ||
| f"but use different revisions: {', '.join(sorted(dirs))}. " |
There was a problem hiding this comment.
Use populate revisions in the error message with dirs which are filesystem dir names (e.g. name-abc123) rather than bare refs.
a-ovchinnikov
left a comment
There was a problem hiding this comment.
A couple of minor nitpicks, LGTM otherwise.
| """Raise if multiple git deps share the same URL with different revisions.""" | ||
| url_to_dirs: dict[str, set[str]] = {} | ||
| for _packname, dirname, url in git_paths: | ||
| url_to_dirs.setdefault(url, set()).add(dirname) |
There was a problem hiding this comment.
Non-blocking nitpicks: _packname is not used in this function thus _ should be enough (that would also tell the reader that the element is not used); since you are not using the return value of setdefault it might be better to use defaultdict:
from collections improt defaultdict
...
...
...
urls_to_dirs = defaultfict(set)
for _, dirname, url in git_paths:
urls_to_dirs[url].add(dirname)It would also be nice to find all offending url:dirs pairs in one go (there could be more than one and users would get really upset if they were to fish for them one at a time):
offenders = [(url, dirs) for url, dirs in urls_to_dirs.items() if len(dirs) > 1]
if offenders:
raise ...| EnvironmentVariable(name="BUNDLE_APP_CONFIG", value="${output_dir}/" + CONFIG_OVERRIDE), | ||
| ] | ||
| if git_paths: | ||
| # Redirect git remote URLs to pre-fetched local clones via git's |
There was a problem hiding this comment.
A non-blocking request: could you please move lines 206 to 221 into a separate helper, maybe _produce_git_redirection_variables(git_paths: list[tuple[DepName, FSDepName, DepURL]] | None) -> list[EnvironmentVariable] to get something like this:
env_vars = [
# Contains path to a directory where a new config could be found.
EnvironmentVariable(name="BUNDLE_APP_CONFIG", value="${output_dir}/" + CONFIG_OVERRIDE),
]
env_vars.extend(_produce_git_redirection_variables(git_paths))
return env_vars
Git gems (declared via
github:orgit:in Gemfile) are absent from runtime images in hermetic builds that use multi-stage Dockerfiles.Root cause
BUNDLE_LOCAL.<gem>triggers bundler's local override path ingit.rb:Bundler uses the gem in-place from the clone and never copies it into
BUNDLE_PATH.The runtime stage copies only
.bundle/and the clone is lost.Fix
Replace
BUNDLE_LOCAL.*withGIT_CONFIG_GLOBALpointing to a generated git config withurl.insteadOfredirects for each git dependency.Bundler clones from the local
file://path,local?staysfalseand checkout runs which should get the git gem inBUNDLE_PATH.Fixes: #1630
Documentation:
https://git-scm.com/docs/git-config#Documentation/git-config.txt-urlbaseinsteadOf
https://git-scm.com/docs/git-config#Documentation/git-config.txt-GITCONFIGCOUNT