fix: dedupe Packages with same Application type but different filepath #10789

Open
mvanhorn wants to merge 2 commits into aquasecurity:main from mvanhorn:fix/8993-sbom-duplicate-packages-filepath

Conversation

@mvanhorn mvanhorn commented Jun 4, 2026


Summary

Coalesces the same logical application that arrives under two different file paths (SBOM vs on-disk scan) so its packages are no longer duplicated in the merged result.

Why

The language-package merge in pkg/fanal/applier/docker.go keyed on app.FilePath + "/type:" + app.Type, so the same application arriving twice with two different file paths got two distinct keys and was retained twice. Issue #8993 reports that this duplicates the application's packages in the merged scan output (for example when the same app is seen once from an SBOM and once from an on-disk scan), inflating results.
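To make the failure mode concrete, here is a minimal sketch of the old keying, assuming simplified stand-ins for the `Application` and `Package` types. The SBOM path `opt/app/.spdx-app.spdx` and the package `example.com/lib` are hypothetical, and `mergeByLegacyKey` only imitates the keying described above; it is not the code in pkg/fanal/applier/docker.go.

```go
package main

import "fmt"

// Package and Application are simplified stand-ins for Trivy's types;
// only the fields discussed in this PR are modeled.
type Package struct {
	Name, Version, FilePath string
}

type Application struct {
	Type, FilePath string
	Packages       []Package
}

// legacyKey reproduces the key quoted above: file path plus application type.
func legacyKey(app Application) string {
	return app.FilePath + "/type:" + app.Type
}

// mergeByLegacyKey keeps one application per key. A later entry replaces an
// earlier one only when both the path and the type are identical.
func mergeByLegacyKey(apps []Application) map[string]Application {
	merged := make(map[string]Application, len(apps))
	for _, app := range apps {
		merged[legacyKey(app)] = app
	}
	return merged
}

// countPackage counts how often name@version appears across all applications.
func countPackage(merged map[string]Application, name, version string) int {
	n := 0
	for _, app := range merged {
		for _, pkg := range app.Packages {
			if pkg.Name == name && pkg.Version == version {
				n++
			}
		}
	}
	return n
}

// sampleApps returns the same logical Go binary reported twice:
// once by an embedded SBOM (hypothetical path) and once by an on-disk scan.
func sampleApps() []Application {
	return []Application{
		{
			Type:     "gobinary",
			FilePath: "opt/app/.spdx-app.spdx",
			Packages: []Package{{Name: "example.com/lib", Version: "v1.0.0", FilePath: "opt/app/bin/app"}},
		},
		{
			Type:     "gobinary",
			FilePath: "opt/app/bin/app",
			Packages: []Package{{Name: "example.com/lib", Version: "v1.0.0"}},
		},
	}
}

func main() {
	merged := mergeByLegacyKey(sampleApps())
	fmt.Println("applications kept:", len(merged))
	fmt.Println("copies of example.com/lib:", countPackage(merged, "example.com/lib", "v1.0.0"))
}
```

Because the two file paths differ, both entries survive the merge and the package is counted twice, which is the inflation reported in #8993.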

Description

The language-package merge in pkg/fanal/applier/docker.go keyed on app.FilePath + "/type:" + app.Type, so the same logical application arriving twice with two different file paths (for example once from an SBOM and once from an on-disk scan) got two distinct keys and was retained twice, duplicating its packages in the merged result. ApplyLayers now routes each application through setApplication, which detects the same-Type/different-FilePath SBOM-vs-scanned case (shouldMergeApplications) and coalesces them (mergeApplications), preferring the scanned source and dropping byte-for-byte duplicate packages. Genuinely distinct applications keep their per-filepath behavior, so non-duplicate cases are unchanged, and OS-package and misconfiguration merging in the same loop are not touched.

Related issues

Related PRs

  • None.

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

CLAassistant commented Jun 4, 2026


CLA assistant check
All committers have signed the CLA.


@DmitriyLewen

Contributor

Hello @mvanhorn
Thanks for your work!

Before going deeper into the current implementation, I'd like to suggest reconsidering the matching approach, because I think we already have a more reliable signal than filename heuristics.

Root cause recap (from #8993 / discussion #8863): the duplication happens because the embedded SBOM (e.g. Bitnami's SPDX) reports an application whose APPLICATION package has no type, so Trivy keys it by the .spdx file path, while the on-disk gobinary analyzer reports the same package under the real binary path. Two different FilePaths → the filepath + type dedup key keeps both.

We already solve the structurally identical problem for JARs in pkg/fanal/analyzer/analyzer.go (PostAnalyze): before running an analyzer, we skip files that the SBOM already covers, using the package's own FilePath:

for _, app := range result.Applications {
    skippedFiles = append(skippedFiles, app.FilePath)
    for _, pkg := range app.Packages {
        // The files of those packages don't have to be analyzed.
        if pkg.FilePath != "" {
            skippedFiles = append(skippedFiles, pkg.FilePath)
        }
    }
}

This works for JARs because the JAR analyzer is a PostAnalyzer, and it doesn't work for Go binaries only because gobinary is a regular analyzer — not because we lack the information to match them.

The key point: the SBOM package already carries the real binary path in pkg.FilePath. In fact, your own test demonstrates this: the SBOM package has FilePath: "opt/app/bin/app", which is exactly the file path of the scanned application. So we have an exact, authoritative link between the SBOM-sourced app and the on-disk one.

Could we lean on that exact pkg.FilePath match as the primary (ideally the only) signal for coalescing, rather than the filename-token heuristics (pathToken / substring matching / the "no path evidence ⇒ merge" fallback)? Anchoring on the exact path would:

  • reuse the same signal the JAR path already relies on, keeping the two code paths conceptually consistent;
  • avoid false merges of genuinely distinct apps (the current substring matching can merge e.g. an SBOM token myappserver with a binary app, and the no-evidence fallback can merge unrelated same-type apps that share a single common dependency like Go stdlib);
  • keep the merge deterministic and avoid the per-application full-map walk.
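
As a rough illustration of the difference, the sketch below contrasts an exact pkg.FilePath check with a substring check. The types are simplified stand-ins, `linkedByPath` is a hypothetical name, and `substringMatch` is only a loose imitation of a filename-token heuristic written for contrast; it is not the code under review. The `myappserver` paths are made up.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Simplified stand-ins for Trivy's types.
type Package struct {
	Name, Version, FilePath string
}

type Application struct {
	Type, FilePath string
	Packages       []Package
}

// linkedByPath reports whether the SBOM-sourced application has a package
// whose FilePath is exactly the on-disk application's FilePath.
func linkedByPath(sbomApp, diskApp Application) bool {
	if sbomApp.Type != diskApp.Type || diskApp.FilePath == "" {
		return false
	}
	for _, pkg := range sbomApp.Packages {
		if pkg.FilePath == diskApp.FilePath {
			return true
		}
	}
	return false
}

// substringMatch imitates a token heuristic: it is true when the SBOM token
// contains the base name of the binary path.
func substringMatch(sbomToken, binaryPath string) bool {
	return strings.Contains(sbomToken, path.Base(binaryPath))
}

func main() {
	disk := Application{Type: "gobinary", FilePath: "opt/app/bin/app"}
	sameApp := Application{
		Type:     "gobinary",
		FilePath: "opt/app/.spdx-app.spdx",
		Packages: []Package{{Name: "example.com/lib", Version: "v1.0.0", FilePath: "opt/app/bin/app"}},
	}
	otherApp := Application{
		Type:     "gobinary",
		FilePath: "opt/myappserver/.spdx-myappserver.spdx",
		Packages: []Package{{Name: "example.com/lib", Version: "v1.0.0", FilePath: "opt/myappserver/bin/myappserver"}},
	}

	fmt.Println("exact link, same app:", linkedByPath(sameApp, disk))
	fmt.Println("exact link, other app:", linkedByPath(otherApp, disk))
	fmt.Println("substring, other app:", substringMatch("myappserver", disk.FilePath))
}
```

The exact check accepts only the application whose package records the scanned path, while the substring check also accepts `myappserver` for the binary `app`.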

Replace the filename-token heuristics with the exact pkg.FilePath link
between SBOM-sourced and on-disk applications, indexed by file path so
coalescing no longer walks the full application map. Same-type apps
that merely share a dependency now stay separate.
@mvanhorn

mvanhorn commented Jun 5, 2026

Author

Reworked as you suggested in cc338a7. The filename heuristics (pathToken, substring matching, and the no-path-evidence fallback) are gone entirely. Coalescing now keys solely on the exact pkg.FilePath link: an SBOM-sourced application merges into an on-disk one only when one of its packages carries FilePath equal to the application's path, in either layer arrival order. The paths are indexed (FilePath -> application key), so the per-application full-map walk is gone too.

Also added the negative case you raised: two same-type apps that share a common dependency (Go stdlib) but have no path link now stay separate, with a test asserting it. Applier tests pass locally.
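
For readers following along, the behavior described in the two paragraphs above could be sketched roughly as follows. This is not the code in cc338a7: `coalescer`, `key`, `absorb`, and `coalesce` are hypothetical names, the types are simplified stand-ins, the SBOM path and package names are made up, and comparing packages by name and version is an assumption of this sketch.

```go
package main

import "fmt"

type Package struct {
	Name, Version, FilePath string
}

type Application struct {
	Type, FilePath string
	Packages       []Package
}

// coalescer is a hypothetical helper for this sketch, not Trivy's applier.
type coalescer struct {
	apps      map[string]*Application // application key -> retained application
	claimedBy map[string]string       // key of a claimed binary path -> key of the SBOM app claiming it
}

// key includes the type, so a path link only ever joins same-type applications.
func key(filePath, appType string) string {
	return filePath + "/type:" + appType
}

// absorb appends the packages of src that dst lacks, compared by name and version.
func absorb(dst *Application, src Application) {
	seen := make(map[string]bool, len(dst.Packages))
	for _, p := range dst.Packages {
		seen[p.Name+"@"+p.Version] = true
	}
	for _, p := range src.Packages {
		if id := p.Name + "@" + p.Version; !seen[id] {
			seen[id] = true
			dst.Packages = append(dst.Packages, p)
		}
	}
}

// add stores app, coalescing it with a path-linked application if one exists.
func (c *coalescer) add(app Application) {
	app.Packages = append([]Package(nil), app.Packages...) // avoid aliasing the caller's slice
	self := key(app.FilePath, app.Type)
	// SBOM arrives second: one of its packages names a stored on-disk application.
	for _, p := range app.Packages {
		target := key(p.FilePath, app.Type)
		if disk, ok := c.apps[target]; ok && p.FilePath != "" && target != self {
			absorb(disk, app)
			return
		}
	}
	// SBOM arrived first: an earlier application claimed this path, so the
	// scanned application is kept and takes over the SBOM's extra packages.
	if sbomKey, ok := c.claimedBy[self]; ok {
		if sbom, ok := c.apps[sbomKey]; ok {
			absorb(&app, *sbom)
			delete(c.apps, sbomKey)
		}
	}
	c.apps[self] = &app
	// Index the paths this application's packages point at, for later arrivals.
	for _, p := range app.Packages {
		if target := key(p.FilePath, app.Type); p.FilePath != "" && target != self {
			c.claimedBy[target] = self
		}
	}
}

// coalesce feeds the applications through a fresh coalescer in the given order.
func coalesce(apps ...Application) map[string]*Application {
	c := &coalescer{apps: map[string]*Application{}, claimedBy: map[string]string{}}
	for _, a := range apps {
		c.add(a)
	}
	return c.apps
}

func main() {
	lib := Package{Name: "example.com/lib", Version: "v1.0.0"}
	sbomLib := Package{Name: lib.Name, Version: lib.Version, FilePath: "opt/app/bin/app"}
	sbom := Application{Type: "gobinary", FilePath: "opt/app/.spdx-app.spdx", Packages: []Package{sbomLib}}
	disk := Application{Type: "gobinary", FilePath: "opt/app/bin/app", Packages: []Package{lib}}

	stdlib := Package{Name: "stdlib", Version: "v1.22.0"}
	a := Application{Type: "gobinary", FilePath: "opt/a/bin/a", Packages: []Package{stdlib}}
	b := Application{Type: "gobinary", FilePath: "opt/b/bin/b", Packages: []Package{stdlib}}

	fmt.Println("disk then SBOM:", len(coalesce(disk, sbom)))
	fmt.Println("SBOM then disk:", len(coalesce(sbom, disk)))
	fmt.Println("shared stdlib only:", len(coalesce(a, b)))
}
```

In this sketch both arrival orders leave a single application at the scanned path, and the two applications that share only `stdlib` stay separate because neither has a package pointing at the other's path. Each lookup is a map access, so no pass over all stored applications is needed.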


Development

Successfully merging this pull request may close these issues.

bug: Trivy doesnt remove duplicate Packages received from SBOM + from the Analyzer interface
