feat(output): use new internal data structure #1609

G-Rath · 2025-02-12T00:49:59Z

This is an initial attempt to convert the table output to use the new output.Result and related types that are being used for container scanning.

Overall, I think I'm actually pretty close, but I'm going to put it down for now to focus on slog integration.

G-Rath · 2025-02-12T00:53:26Z

internal/output/__snapshots__/table_test.snap

@@ -1024,12 +1024,12 @@
 ╭───────────────────────┬──────┬───────────┬───────────────┬─────────┬──────── ≈
 │ OSV URL               │ CVSS │ ECOSYSTEM │ PACKAGE       │ VERSION │ SOURCE  ≈


I think this has revealed that our test cases are structured in a way the new data format does not expect - we've got at least one case where a models.PackageSource has packages from different ecosystems, and those packages end up associated with a single ecosystem because output.BuildResults assumes all packages in a source belong to the same ecosystem.

@hogo6002 would you mind reviewing the "multiple sources with a mixed count of packages across ecosystems, and multiple vulnerabilities" case (among others) to confirm if it's valid?

I assumed that if all packages originate from a single source file, they belong to the same ecosystem. For example, a go.mod file can only contain Go packages. In the test here, when we mock the data we assign the same source path to packages from different ecosystems, so I think it's the tests that are incorrect. (Is it possible to define NuGet and Packagist packages in one lockfile?)

In the extracting code, we determine the package ecosystem based on the extractor used. So I don't think we will encounter a scenario where a single lockfile contains packages from different ecosystems. But I can also easily modify the outputResult code to handle this if we think this type of case is possible.

@another-rex what do you think, Rex?

Personally I think we should be prepared to support it unless it's extremely hard, because it allows support for arbitrary "files" - though that handling could be done here or at the parser level.

First though, my example case: maybe one day we support an arbitrary CSV or JSON file, meaning I can say to the scanner "here is a single file that has multiple packages from different ecosystems".

On the one hand, that's a single file so technically it's from the same source and thus could end up here as a single source - on the other hand, we could decide to handle that at the extractor level i.e. we return a source for each ecosystem that all point to the same file.
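To make the two options concrete, here is a minimal sketch of grouping by (source, ecosystem) rather than by source alone. The `pkg` and `groupKey` types, the `groupBySourceAndEcosystem` helper, and the `inventory.json` file name are all hypothetical and only stand in for the real models; this is not how output.BuildResults is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// pkg is a hypothetical, simplified stand-in for a scanned package.
type pkg struct {
	Name      string
	Ecosystem string
	Source    string // path of the file the package was found in
}

// groupKey identifies one (source, ecosystem) pair.
type groupKey struct {
	Source    string
	Ecosystem string
}

// groupBySourceAndEcosystem buckets packages so that a single file holding
// several ecosystems yields one bucket per ecosystem, instead of every
// package being attributed to one ecosystem.
func groupBySourceAndEcosystem(pkgs []pkg) map[groupKey][]pkg {
	groups := make(map[groupKey][]pkg)
	for _, p := range pkgs {
		k := groupKey{Source: p.Source, Ecosystem: p.Ecosystem}
		groups[k] = append(groups[k], p)
	}
	return groups
}

func main() {
	groups := groupBySourceAndEcosystem([]pkg{
		{Name: "lodash", Ecosystem: "npm", Source: "inventory.json"},
		{Name: "requests", Ecosystem: "PyPI", Source: "inventory.json"},
		{Name: "express", Ecosystem: "npm", Source: "inventory.json"},
	})

	// Sort the keys so the printed order is stable across runs.
	keys := make([]groupKey, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Source != keys[j].Source {
			return keys[i].Source < keys[j].Source
		}
		return keys[i].Ecosystem < keys[j].Ecosystem
	})
	for _, k := range keys {
		fmt.Printf("%s [%s]: %d package(s)\n", k.Source, k.Ecosystem, len(groups[k]))
	}
}
```

The same keying would work whether the split happens in the output code or at the extractor level; the only difference is where the per-ecosystem sources get created.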

While we've talked enough in the past about CSV support to know that that specifically is unlikely, I still think it's an easy example of how in future some kind of file might come along that we do want to support.

Related: is this not actually a situation that's already possible with SBOMs? Can they not include packages across multiple ecosystems?

I can also easily modify the outputResult code to handle this if we think this type of case is possible.

Again, I think if it's very easy then it's probably worth doing just so we don't have to ever think about it again 🤷

I actually don't know how we get ecosystems for packages in SBOMs. I can just modify the OutputResult code to add support for this case; it's just a few lines of code.

@hogo6002 fwiw, I've been working on some stuff for Drupal advisories and it sounds like one possible approach we might go with would be having a dedicated Drupal ecosystem, meaning extracting composer.lock files could give a mix of Packagist and Drupal packages, which would have been relevant here.

G-Rath · 2025-02-12T00:55:23Z

(I think we might also have a sort bug or two, but first I want to confirm if the test cases are actually valid since that'll be a huge source of difference either way)

hogo6002 · 2025-02-12T02:01:59Z

internal/output/output_result.go

@@ -53,6 +53,8 @@ type PackageResult struct {
 	HiddenVulns      []VulnResult
 	LayerDetail      PackageContainerInfo
 	VulnCount        VulnCount
+	Path             string   `json:"-"`


Is the package path the same as SourceResult.Name?

yeah looks like it probably is - I was just trying to get things working enough that the snapshots weren't completely bogus before I started refining things.

fwiw, I do think the fields could do with some documentation to make it easier to find things - for example, what's the difference between RegularVulns and HiddenVulns? And why is OSPackageNames a thing - when should I use that instead of Name? etc.

I'll add more comments to clarify.
RegularVulns refers to the called vulnerabilities, and HiddenVulns are those we've filtered out for users. These hidden vulnerabilities might be unimportant (for OS packages), or they could be uncalled (source packages). The OSPackageName field exists because OS packages have both a source name and a package name (the binary file name). For example, the source package krb5 on Debian might result in binary package names like krb5-locales and libkrb5-3. For project scanning, I think the package.name is enough.
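As a rough illustration of what that documentation could look like, here is a simplified sketch with the explanation above folded into field comments. The `packageResult` and `vulnResult` types and the `counts` helper are hypothetical; the field types are assumptions, the vulnerability IDs are placeholders, and this is not the real output.PackageResult definition.

```go
package main

import "fmt"

// vulnResult is a hypothetical, minimal vulnerability entry.
type vulnResult struct {
	ID string
}

// packageResult is a simplified sketch of a documented result struct.
type packageResult struct {
	// Name is the package name. For OS packages this is the source package
	// name; for project scanning it is usually all that is needed.
	Name string
	// OSPackageNames lists the binary package names built from an OS source
	// package, e.g. source "krb5" -> "krb5-locales", "libkrb5-3".
	OSPackageNames []string
	// RegularVulns are the vulnerabilities shown to the user (called ones).
	RegularVulns []vulnResult
	// HiddenVulns are filtered out of the default view: either unimportant
	// (OS packages) or uncalled (source packages).
	HiddenVulns []vulnResult
}

// counts reports how many vulnerabilities are shown and how many are hidden.
func (p packageResult) counts() (shown, hidden int) {
	return len(p.RegularVulns), len(p.HiddenVulns)
}

func main() {
	p := packageResult{
		Name:           "krb5",
		OSPackageNames: []string{"krb5-locales", "libkrb5-3"},
		RegularVulns:   []vulnResult{{ID: "EXAMPLE-0001"}},
		HiddenVulns:    []vulnResult{{ID: "EXAMPLE-0002"}, {ID: "EXAMPLE-0003"}},
	}
	shown, hidden := p.counts()
	fmt.Printf("%s %v: %d shown, %d hidden\n", p.Name, p.OSPackageNames, shown, hidden)
}
```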

Added an issue to track this.

I've resolved this issue and also added license scanning results to the output.Result structure. I think we can now fully replace the table output with this new structure.

hogo6002 · 2025-02-12T02:37:14Z

internal/output/__snapshots__/table_test.snap

@@ -1024,12 +1024,12 @@
 ╭───────────────────────┬──────┬───────────┬───────────────┬─────────┬──────── ≈
 │ OSV URL               │ CVSS │ ECOSYSTEM │ PACKAGE       │ VERSION │ SOURCE  ≈


I assumed that if all packages originate from a single source file, they belong to the same ecosystem. For example, a go.mod file can only contain Go packages. In the test here, when we mock the data we assign the same source path to packages from different ecosystems, so I think it's the tests that are incorrect. (Is it possible to define NuGet and Packagist packages in one lockfile?)

In the extracting code, we determine the package ecosystem based on the extractor used. So I don't think we will encounter a scenario where a single lockfile contains packages from different ecosystems. But I can also easily modify the outputResult code to handle this if we think this type of case is possible.

G-Rath · 2025-03-13T19:01:58Z

@hogo6002 I'm looking at picking this back up since it sounds like something we want overall?

I think there's two main points:

1. I've switched to using source.Name; the only caveat is that it has the "type" included with a colon, which will be present even if there is no type. For now, to maintain the existing output, I'm trimming that colon prefix, but that might be something we want to change in future.
2. It looks like the old logic was sorting first by package source/path, whereas now we're sorting by ecosystem; personally I think it's better to be sorted by source first, but wanted to make sure we're in agreement on that before I look into changing it.

(I'm not sure about the other sorting changes, but they look at least harmless enough to ignore them, though I imagine I'll probably find out the reason for them too if I dig into 2.)

G-Rath · 2025-03-13T19:03:35Z

internal/output/table.go

-					outputRow = append(outputRow, "GIT", pkgCommitStr, pkgCommitStr)
-					shouldMerge = true
-				} else {
-					name := pkg.Package.Name
 					// TODO(#1646): Migrate this earlier to the result struct directly


I'm not sure if I've actioned this as a happy accident, or if it's about moving the whole condition (and thus the need for my new DepGroups field) out earlier - if it is the latter though, is it meaning we append the dev marker elsewhere? that feels a bit weird as we'd be changing the "name" of the package and what if we need that for other logic...?

I think we can just modify package.name earlier rather than adding DepGroups. This Result struct is only for displaying the result and is being used for other formats as well. I think it would be better that package.name matches the final output, and it also avoids extra handling in other formats

hogo6002 · 2025-03-13T23:29:17Z

I've switched to using source.Name, only caveat is that has the "type" included with a colon which'll be present even if there is no type; for now to maintain the existing output I'm trimming that colon prefix, but that might be something we want to change in future

sounds good to me. We can just trim the colon prefix first.
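For reference, a minimal sketch of what trimming the colon prefix does. The `displaySource` helper and the "type:path" shape of the sample names are assumptions for illustration; only the `strings.TrimPrefix(source.Name, ":")` call itself comes from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// displaySource drops a leading colon, which is what is left at the front of
// the name when the source has no type. Names that do carry a type are
// returned unchanged.
func displaySource(name string) string {
	return strings.TrimPrefix(name, ":")
}

func main() {
	// No type: the bare colon is removed.
	fmt.Println(displaySource(":path/to/package-lock.json"))
	// With a (hypothetical) type: nothing is trimmed.
	fmt.Println(displaySource("lockfile:path/to/package-lock.json"))
}
```

Because only a leading colon is removed, a typed source would still print with its type, which is the part that might want revisiting later.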

it looks like the old logic was sorting first by package source/path, whereas now we're sorting by ecosystem; personally I think it's better to be sorted by source first, but wanted to make sure we're in agreement on that before I look into changing it

Within each ecosystem, it also sorts by source name. It groups all sources from one ecosystem together, which benefits the display of container scanning results. For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view. Then, we can just remove the ecosystem column from the result table. If we don't want to make a big change, I think it's fine to show vulnerabilities from the same ecosystem together.

G-Rath · 2025-03-13T23:45:40Z

For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view
...

I think you might have misunderstood - I'm asking if we want to be sorting first by source, or by ecosystem, as previously we were doing it by source but now we're doing it by ecosystem, and I personally think it would be better to sort by source first; if it's already been decided to not sort by source first intentionally, then so be it - but I'm not sure if that is the case, or if we've just defaulted to sorting by ecosystem (e.g. as a result of this new structure having been built for container scanning first, where the preference is by ecosystem).

The reason why I think "source first" would be better is that's one of the first things people think about as part of triaging and actioning - consider, you're scanning multiple codebases each with their own package-lock.json; you'd see say there's two entries for one of the lockfiles, and so you know that one has a low number of vulnerabilities, and for another lockfile there's a much larger list. Also, when starting to action those, you need to go to each package-lock.json file in their respective locations to run commands for actioning, and only then do I care about what those commands are (which are provided by the "ecosystem", in part).

For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view

I think this is a good idea in either case, but it doesn't answer if we do the output by ecosystem or by source.

This could be something that's actually worth a flag e.g. --group-by=<source|ecosystem>
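A rough sketch of how the two orderings differ, and how a group-by option could switch between them. The `row` type and `sortRows` function are hypothetical, and the `--group-by` flag is only a proposal in this thread, not an existing option.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical, simplified table row.
type row struct {
	Ecosystem string
	Source    string
	Package   string
}

// sortRows orders rows by source first when groupBy is "source", and by
// ecosystem first otherwise. The other field is the secondary key, and the
// package name breaks any remaining ties.
func sortRows(rows []row, groupBy string) {
	sort.SliceStable(rows, func(i, j int) bool {
		a, b := rows[i], rows[j]
		if groupBy == "source" {
			if a.Source != b.Source {
				return a.Source < b.Source
			}
			if a.Ecosystem != b.Ecosystem {
				return a.Ecosystem < b.Ecosystem
			}
		} else {
			if a.Ecosystem != b.Ecosystem {
				return a.Ecosystem < b.Ecosystem
			}
			if a.Source != b.Source {
				return a.Source < b.Source
			}
		}
		return a.Package < b.Package
	})
}

func main() {
	rows := []row{
		{Ecosystem: "npm", Source: "c/package-lock.json", Package: "left-pad"},
		{Ecosystem: "Go", Source: "b/go.mod", Package: "example.com/m"},
		{Ecosystem: "npm", Source: "a/package-lock.json", Package: "lodash"},
	}
	for _, mode := range []string{"source", "ecosystem"} {
		sortRows(rows, mode)
		fmt.Println("group by", mode)
		for _, r := range rows {
			fmt.Printf("  %-10s %-22s %s\n", r.Ecosystem, r.Source, r.Package)
		}
	}
}
```

With source-first ordering the two npm lockfiles are separated by the go.mod entry; with ecosystem-first ordering they sit together after it.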

hogo6002 · 2025-03-13T23:46:39Z

internal/output/table.go

-					outputRow = append(outputRow, pkg.Package.Ecosystem, name, pkg.Package.Version)
+					outputRow = append(outputRow, name)
+					outputRow = append(outputRow, pkg.InstalledVersion)
+					outputRow = append(outputRow, strings.TrimPrefix(source.Name, ":"))


The version here can be a git commit; I think we can just use getInstalledVersionOrCommit(pkg).
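As an illustration of the fallback being suggested, here is a minimal sketch. The `versionOrCommit` function, the `packageResult` type, and its `Commit` field are hypothetical, and the commit value is a placeholder; this is an assumption about the intent, not the actual body of getInstalledVersionOrCommit.

```go
package main

import "fmt"

// packageResult is a hypothetical, simplified package entry.
type packageResult struct {
	Name             string
	InstalledVersion string
	Commit           string // assumed field holding a git commit hash
}

// versionOrCommit returns the installed version when there is one and falls
// back to the commit otherwise, so git-sourced packages still get a value in
// the VERSION column.
func versionOrCommit(p packageResult) string {
	if p.InstalledVersion != "" {
		return p.InstalledVersion
	}
	return p.Commit
}

func main() {
	fmt.Println(versionOrCommit(packageResult{Name: "mine1", InstalledVersion: "1.2.3"}))
	fmt.Println(versionOrCommit(packageResult{Name: "mine2", Commit: "abc1234"}))
}
```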

G-Rath · 2025-03-13T23:48:39Z

After writing the above out, I dug back into the codebase and I think the answer is it's already been decided to sort by ecosystem first so feel free not to spend time rebutting if that's the case 🙂

I personally do still feel I'd prefer the order by source than ecosystem because it feels less surprising to me, and that a flag for that could be useful, but among other things I think I'm a little biased by our test suite using "one, two, three" type file names which make it easy to mentally sort so in the wild it might be less surprising 🤷

hogo6002 · 2025-03-14T00:57:32Z

I personally do still feel I'd prefer the order by source than ecosystem because it feels less surprising to me, and that a flag for that could be useful, but among other things I think I'm a little biased by our test suite using "one, two, three" type file names which make it easy to mentally sort so in the wild it might be less surprising 🤷

Yes, for cases where users scan multiple projects, sorting by source only might make it easier to view the source with the highest number of vulnerabilities. (But also, in real-life scans, a source usually only has vulns from one ecosystem, so all vulns from a source should already be together.) The best approach, I think, is to follow the HTML output method: group all sources from one ecosystem together but separate each source into individual tables. That way, it's quite straightforward to review the number of vulnerabilities for each source. We've implemented this in both the vertical and HTML outputs. For this PR, I feel we can try to make fewer changes, which means keeping everything in one large table, as this is less surprising to users. If we receive complaints from users about this sorting, we can then separate them into multiple tables or add extra flags. For now, I think it's fine. (Personally, I wish users would use the vertical and HTML outputs more than the table output.)

G-Rath · 2025-03-14T01:04:48Z

The best approach, I think, is to follow the HTML output method: group all sources from one ecosystem together but separate each source into individual tables.

Personally I still think the other way around, especially for the HTML output - I tend to think about codebases/projects first, then ecosystems, but I think we can put a pin in this for now 🙃

I might see if I can get the old implementation sorting by ecosystem first to reduce the diff here, but otherwise I think this is resolved for now.

(Personally, I wish users would use the vertical and HTML outputs more than the table output.)

That's actually a really good reminder - weren't we going to make the vertical output the default?

G-Rath · 2025-03-14T01:05:44Z

Yeah we were - is that something we still want to do? cc @another-rex

fwiw, I don't think that's actually technically a breaking change but we might as well do it now if we want before v2 goes out

github-actions · 2025-05-13T01:28:08Z

This pull request has not had any activity for 60 days and will be automatically closed in two weeks

G-Rath changed the title ~~Output/use new struct~~ feat(output): use new internal data structure Feb 12, 2025

G-Rath commented Feb 12, 2025

G-Rath mentioned this pull request Feb 12, 2025

fix(output): ensure that vulnerabilities are sorted by ID across groups in table output #1598

Merged

hogo6002 reviewed Feb 12, 2025

G-Rath force-pushed the output/use-new-struct branch 2 times, most recently from ab9e54d to dd58791 Compare February 24, 2025 23:06

hogo6002 mentioned this pull request Feb 25, 2025

Scan status files used by Ubuntu #1293

Open

G-Rath force-pushed the output/use-new-struct branch 2 times, most recently from 74f781f to b28c6ea Compare February 27, 2025 19:32

G-Rath force-pushed the output/use-new-struct branch 2 times, most recently from 665c584 to 8f882c3 Compare March 13, 2025 18:57

G-Rath commented Mar 13, 2025

hogo6002 reviewed Mar 13, 2025

github-actions bot added the stale The issue or PR is stale and pending automated closure label May 13, 2025

G-Rath force-pushed the output/use-new-struct branch from 8f882c3 to 429a840 Compare May 20, 2025 21:40

github-actions bot removed the stale The issue or PR is stale and pending automated closure label May 20, 2025

G-Rath mentioned this pull request May 21, 2025

test(output): add cases of packages with commits #1872

Merged

another-rex pushed a commit that referenced this pull request May 23, 2025

test(output): add cases of packages with commits (#1872)

c8656d2

I realized when revisiting #1609 that we don't have this covered

G-Rath force-pushed the output/use-new-struct branch from 429a840 to 25a938c Compare May 24, 2025 21:33

G-Rath force-pushed the output/use-new-struct branch from 25a938c to 6992530 Compare June 4, 2025 01:21

feat(output): use new internal data structure

dbe2525

G-Rath added 2 commits June 10, 2025 16:17

test: update snapshots

de161ef

refactor: use source.Name instead of new Path field

e84d738

G-Rath force-pushed the output/use-new-struct branch from 6992530 to e84d738 Compare June 10, 2025 04:17

test: update snapshots

75a8491
