Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(output): use new internal data structure #1609

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

G-Rath
Copy link
Collaborator

@G-Rath G-Rath commented Feb 12, 2025

This is an initial attempt to convert the table output to use the new Output.Result and co that is being used for container scanning.

Overall, I think I'm actually pretty close, but I'm going to put it down for now to focus on slog integration

@G-Rath G-Rath changed the title Output/use new struct feat(output): use new internal data structure Feb 12, 2025
@@ -1024,12 +1024,12 @@
╭───────────────────────┬──────┬───────────┬───────────────┬─────────┬──────── ≈
│ OSV URL │ CVSS │ ECOSYSTEM │ PACKAGE │ VERSION │ SOURCE ≈
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has revealed that our test cases are structured in a way that is not expected by the new data format - we've got at least one case where a models.PackageSource has packages from different ecosystems, which results in those packages getting associated with a single ecosystem due to output.BuildResults assuming all packages in a source will belong to the same ecosystem

@hogo6002 would you mind reviewing the "multiple sources with a mixed count of packages across ecosystems, and multiple vulnerabilities" case (among others) to confirm if its valid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that if all packages originate from a single source file, they belong to the same ecosystem. For example, a go.mod file can only contain Go packages. For the test here, when we mock data, and we assign the same source path to packages from different ecosystems. I think the tests are more incorrect here. (is it possible define Nuget and Packagist packages in one lockfile?)

In the extracting code, we determine the package ecosystem based on the extractor used. So I don't think we will encounter a scenario where a single lockfile contains packages from different ecosystems. But I can also easily modify the outputResult code to handle this if we think this type of case is possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@another-rex what do you think Rex

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think we should be prepared to support it unless it's extremely hard, because it allows support for arbitrary "files" - though that handling could be done here or at the parser level.

First though, my example case: maybe one day we support an arbitrary CSV or JSON file, meaning I can say to the scanner "here is a single file that has multiple packages from different ecosystems".

On the one hand, that's a single file so technically it's from the same source and thus could end up here as a single source - on the other hand, we could decide to handle that at the extractor level i.e. we return a source for each ecosystem that all point to the same file.

While we've talked enough in the past about CSV support to know that that specifically is unlikely, but I still think its an easy example of how in future maybe some kind of file will come along that we do want to support.

Related, is this not actually a situation that's already possible with SBOM? can that not include packages across multiple ecosystems?

I can also easily modify the outputResult code to handle this if we think this type of case is possible.

Again, I think if it's very easy then it's probably worth doing just so we don't have to every think about it again 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't know about how we get ecosystems for packages in SBOM. I can just modify the OutputResult code to add support for this case, it's just a few lines code change.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Feb 12, 2025

(I think we might also have a sort bug or two, but first I want to confirm if the test cases are actually valid since that'll be a huge source of difference either way)

@@ -53,6 +53,8 @@ type PackageResult struct {
HiddenVulns []VulnResult
LayerDetail PackageContainerInfo
VulnCount VulnCount
Path string `json:"-"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the package path same as SourceReuslt.Name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah looks like it probably is - I was just trying to get things working enough that the snapshots weren't completely bogus before I started refining things.

fwiw, I do think the fields could do with some documentation to make it easier to find things - for example, whats the difference between RegularVulns and HiddenVulns? and why is OSPackageNames a thing - when should I use that instead of Name? etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more comments to clarify.
RegularVulns refers to the called vulnerabilities, and HiddenVulns are those we've filtered out for users. These hidden vulnerabilities might be unimportant (for OS packages), or they could be uncalled (source packages). The OSPackageName field exists because OS packages have both source name and package name (the binary file name). For example, the source package krb5 on Debian might result in binary package names like krb5-localesand libkrb5-3. For project scanning, I think the package.name is enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an issue to track this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've resolved this issue and also added license scanning results to the output.Result structure. I think we can now fully replace the table output with this new structure.

@@ -1024,12 +1024,12 @@
╭───────────────────────┬──────┬───────────┬───────────────┬─────────┬──────── ≈
│ OSV URL │ CVSS │ ECOSYSTEM │ PACKAGE │ VERSION │ SOURCE ≈
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that if all packages originate from a single source file, they belong to the same ecosystem. For example, a go.mod file can only contain Go packages. For the test here, when we mock data, and we assign the same source path to packages from different ecosystems. I think the tests are more incorrect here. (is it possible define Nuget and Packagist packages in one lockfile?)

In the extracting code, we determine the package ecosystem based on the extractor used. So I don't think we will encounter a scenario where a single lockfile contains packages from different ecosystems. But I can also easily modify the outputResult code to handle this if we think this type of case is possible.

@G-Rath G-Rath force-pushed the output/use-new-struct branch 2 times, most recently from ab9e54d to dd58791 Compare February 24, 2025 23:06
@G-Rath G-Rath force-pushed the output/use-new-struct branch 2 times, most recently from 74f781f to b28c6ea Compare February 27, 2025 19:32
@G-Rath G-Rath force-pushed the output/use-new-struct branch from b28c6ea to 665c584 Compare March 13, 2025 18:50
@G-Rath G-Rath force-pushed the output/use-new-struct branch from 665c584 to 8f882c3 Compare March 13, 2025 18:57
@G-Rath
Copy link
Collaborator Author

G-Rath commented Mar 13, 2025

@hogo6002 I'm looking at picking this back up since it sounds like something we want overall?

I think there's two main points:

  1. I've switched to using source.Name, only caveat is that has the "type" included with a colon which'll be present even if there is no type; for now to maintain the existing output I'm trimming that colon prefix, but that might be something we want to change in future
  2. it looks like the old logic was sorting first by package source/path, whereas now we're sorting by ecosystem; personally I think it's better to be sorted by source first, but wanted to make sure we're in agreement on that before I look into changing it

(I'm not sure about the other sorting changes, but they look at least harmless enough to ignore them, though I imagine I'll probably find out the reason for them too if I dig into 2.)

outputRow = append(outputRow, "GIT", pkgCommitStr, pkgCommitStr)
shouldMerge = true
} else {
name := pkg.Package.Name
// TODO(#1646): Migrate this earlier to the result struct directly
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I've actioned this as a happy accident, or if it's about moving the whole condition (and thus the need for my new DepGroups field) out earlier - if it is the latter though, is it meaning we append the dev marker elsewhere? that feels a bit weird as we'd be changing the "name" of the package and what if we need that for other logic...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just modify package.name earlier rather than adding DepGroups. This Result struct is only for displaying the result and is being used for other formats as well. I think it would be better that package.name matches the final output, and it also avoids extra handling in other formats

@hogo6002
Copy link
Contributor

hogo6002 commented Mar 13, 2025

  1. I've switched to using source.Name, only caveat is that has the "type" included with a colon which'll be present even if there is no type; for now to maintain the existing output I'm trimming that colon prefix, but that might be something we want to change in future

sounds good to me. We can just trim the colon prefix first.

  1. it looks like the old logic was sorting first by package source/path, whereas now we're sorting by ecosystem; personally I think it's better to be sorted by source first, but wanted to make sure we're in agreement on that before I look into changing it

Within each ecosystem, it also sorts by source name. It groups all sources from one ecosystem together, which benefits the display of container scanning results. For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view. Then, we can just remove the ecosystem column from the result table. If we don't want to make a big change, I think it's fine to show vulnerabilities from the same ecosystem together.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Mar 13, 2025

For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view
...

I think you might have misunderstood - I'm asking if we want to be sorting first by source, or by ecosystem, as previously we were doing it by source but now we're doing it by ecosystem, and I personally think it would be better to sort by source first; if it's already been decided to not sort by source first intentionally, then so be it - but I'm not sure if that is the case, or if we've just defaulted to sorting by ecosystem (e.g. as a result of this new structure having been built for container scanning first, where the preference is by ecosystem).

The reason why I think "source first" would be better is that's one of the first things people think about as part of triaging and actioning - consider, you're scanning multiple codebases each with their own package-lock.json; you'd see say there's two entries for one of the lockfiles, and so you know that one has a low number of vulnerabilities, and for another lockfile there's a much larger list. Also, when starting to action those, you need to go to each package-lock.json file in their respective locations to run commands for actioning, and only then do I care about what those commands are (which are provided by the "ecosystem", in part).

For project scanning, I think we could also separate the table into multiple tables by ecosystem, similar to the package table view

I think this is a good idea in either case, but it doesn't answer if we do the output by ecosystem or by source.


This could be something that's actually worth a flag e.g. --group-by=<source|ecosystem>

outputRow = append(outputRow, pkg.Package.Ecosystem, name, pkg.Package.Version)
outputRow = append(outputRow, name)
outputRow = append(outputRow, pkg.InstalledVersion)
outputRow = append(outputRow, strings.TrimPrefix(source.Name, ":"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version here can be git commit, I think we can just use getInstalledVersionOrCommit(pkg)

@G-Rath
Copy link
Collaborator Author

G-Rath commented Mar 13, 2025

After writing the above out, I dug back into the codebase and I think the answer is it's already been decided to sort by ecosystem first so feel free not to spend time rebutting if that's the case 🙂

I personally do still feel I'd prefer the order by source than ecosystem because it feels less surprising to me, and that a flag for that could be useful, but among other things I think I'm a little biased by our test suite using "one, two, three" type file names which make it easy to mentally sort so in the wild it might be less surprising 🤷

@hogo6002
Copy link
Contributor

hogo6002 commented Mar 14, 2025

I personally do still feel I'd prefer the order by source than ecosystem because it feels less surprising to me, and that a flag for that could be useful, but among other things I think I'm a little biased by our test suite using "one, two, three" type file names which make it easy to mentally sort so in the wild it might be less surprising 🤷

Yes, for cases where users scan multiple projects, sorting by source only might make it easier to view the source with the highest number of vulnerabilities. (But also, in real-life scans, a source usually only has vulns from one ecosystem, so all vulns from a source should already be together.) The best approach, I think, is to follow the HTML output method: group all sources from one ecosystem together but separate each source into individual tables. That way, it's quite straightforward to review the number of vulnerabilities for each source. We've implemented this in both the vertical and HTML outputs. For this PR, I feel we can try to make less changes, which means keeping everything in one large table, as this is less surprising to users. If we receive complaints from users about this sorting, we can then separate them into multiple tables or add extra flags. For now, I think it's fine. (Personally, I wish users would use the vertical and HTML outputs more than the table output.)

@G-Rath
Copy link
Collaborator Author

G-Rath commented Mar 14, 2025

The best approach, I think, is to follow the HTML output method: group all sources from one ecosystem together but separate each source into individual tables.

Personally I still think the other way around, especially for the HTML output - I tend to think about codebases/projects first, then ecosystems, but I think we can put a pin in this for now 🙃

I might see if I can get the old implementation sorting by ecosystem first to reduce the diff here, but otherwise I think this is resolved for now.

(Personally, I wish users would use the vertical and HTML outputs more than the table output.)

That's actually a really good reminder - weren't we going to make the vertical output the default?

@G-Rath
Copy link
Collaborator Author

G-Rath commented Mar 14, 2025

Yeah we were - is that something we still want to do? cc @another-rex

fwiw, I don't think that's actually technically a breaking change but we might as well do it now if we want before v2 goes out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants