Skip to content

feat(scanner): Add submodule fetch strategy for nested repositories #2679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

wkl3nk
Copy link
Contributor

@wkl3nk wkl3nk commented May 8, 2025

Introduce submoduleFetchStrategy config to control how the Scanner fetches Git submodules. When set to TOP_LEVEL_ONLY, only top-level submodules are fetched, avoiding timeouts on deeply nested repos.

This mirrors the behavior already available in the Analyzer and allows to resolve nested provenances even in this kind of repositories with a vast amount of nested submodules.

If activated, in the logs you will no longer see the --recursive flag in the gib submodule update command then:

Running 'git submodule update --init --depth 50' in '/tmp/ort-DefaultWorkingTreeCache13286267791034354700'..."

If particular VCS plugin configurations are active during a scan (like submoduleFetchStrategy=TOP_LEVEL_ONLY),
ensure that VCS plugin configurations are stored alongside nested provenance data. This prevents reuse of cache entries across scans with differing VCS plugin settings, ensuring correctness and reliability of scan results.

For storing the VCS plugin configuration, a canonical string like Git/updateNestedSubmodules/false is created and stored alogside with the nested provenance data.

@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from cc97a2e to f068fce Compare May 8, 2025 12:07
@wkl3nk wkl3nk marked this pull request as ready for review May 8, 2025 12:33
@@ -85,7 +87,20 @@ class ScannerRunner(
?: listOf(SourceCodeOrigin.ARTIFACT, SourceCodeOrigin.VCS)
)

val workingTreeCache = DefaultWorkingTreeCache()
// If the submodule fetch strategy is set to TOP_LEVEL_ONLY, for git use a plugin config that prevents that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to test this? Maybe with a constructor mock and a verification that the Git-specific plugin options have actually been set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oheger-bosch This additional functionality was implemented as part of DefaultWorkingTreeCache, which is a class of ORT and not ORT Server. If it is to be tested, the test should be in ORT and not in ORT Server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about testing that the correct options are passed when doing the checkout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oheger-bosch I tried it out to add a test, but the functionality to add the right parameters to the git submodule update command is buried deep into the VCS implementation (in this case: Git), and also just querying the plugin configuration of the Git (VCS) instance is not possible, as the configuration is a private member. And all of this is happing in ORT, and not ORT Server. Whatever, the functionality in the Git (VCS) is the same as it is already used by the Analyzer parameter submoduleFetchStrategy=TOP_LEVEL_ONLY so it is quite sure that the appropriate git submodule update command is used when the repository is checked out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not go so far to test the VCS logic in ORT. What I mean is, there is the line

val workingTreeCache = DefaultWorkingTreeCache().addVcsPluginConfigs(vcsPluginConfigs)

in ScannerRunner. When using a constructor mock for DefaultWorkingTreeCache, it should be possible to verify that addVcsPluginConfigs was called on the object and to inspect the passed configuration.

@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from f068fce to bacaadb Compare May 15, 2025 14:29
@wkl3nk wkl3nk requested a review from mnonnenmacher May 15, 2025 14:37
@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from bacaadb to 8023d6d Compare May 16, 2025 05:59
@wkl3nk wkl3nk requested a review from oheger-bosch May 16, 2025 13:23
@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from 8023d6d to 321ee16 Compare May 19, 2025 10:50
@wkl3nk
Copy link
Contributor Author

wkl3nk commented May 19, 2025

@mnonnenmacher I extended the commit message including a statement about the remaining small risk of this solution.

/**
* Specifies how submodules are fetched when resolving provenances. Currently supported only for Git repositories.
* If set to [SubmoduleFetchStrategy.FULLY_RECURSIVE] (default), all submodules are fetched recursively. If set
* to [SubmoduleFetchStrategy.TOP_LEVEL_ONLY], only the top-level submodules are fetched.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: docs are missing SubmoduleFetchStrategy.DISABLED. But to me different options don't need to be explained here at all as they are already documented in the enum.

} else {
emptyMap()
}
val canonicalVcsPluginConfigs = createCanonicalVcsPluginConfigs(vcsPluginConfigs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line belongs to the next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. But IntelliJ only allowed me to check the code block as a whole.

@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from 321ee16 to 2a74519 Compare May 20, 2025 10:22
wkl3nk added 2 commits May 20, 2025 12:25
Introduce `submoduleFetchStrategy` config to control how the Scanner
fetches Git submodules. When set to `TOP_LEVEL_ONLY`, only top-level
submodules are fetched, avoiding timeouts on deeply nested repos.

This mirrors the behavior already available in the Analyzer and
allows to resolve nested provenances even in this kind of repositories
with a vast amount of nested submodules.

Signed-off-by: Wolfgang Klenk <[email protected]>
If particular VCS plugin configurations are active during a
scan (like `submoduleFetchStrategy=TOP_LEVEL_ONLY`),
ensure that VCS plugin configurations are stored alongside nested
provenance data. This prevents reuse of cache entries across scans
with differing VCS plugin settings, ensuring correctness and
reliability of scan results.

However, there remains a small risk that other dependencies are
not fully resolved in such a scenario where the WorkingTreeCache
is limited to the first level of submodules. But this is rather
small, because other open source dependencies typically don't
use nested submodules at all.

Signed-off-by: Wolfgang Klenk <[email protected]>
@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from 2a74519 to 33f456a Compare May 20, 2025 10:26
@wkl3nk wkl3nk requested a review from mnonnenmacher May 20, 2025 10:40
@mnonnenmacher mnonnenmacher added this pull request to the merge queue May 20, 2025
Merged via the queue into eclipse-apoapsis:main with commit 6201be7 May 20, 2025
30 checks passed
@mnonnenmacher mnonnenmacher deleted the wkl3nk/scanner-add-submodule-fetch-strategy branch May 20, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants