High memory from syft parsing java manifest #4626

@HairyMike

The pprof report shows high memory usage in the following areas:

(pprof) top10
Showing nodes accounting for 853MB, 77.86% of 1095.56MB total
Dropped 452 nodes (cum <= 5.48MB)
Showing top 10 nodes out of 143
      flat  flat%   sum%        cum   cum%
  281.63MB 25.71% 25.71%   349.70MB 31.92%  github.com/anchore/syft/internal/task.finalizePkgCatalogerResults
  223.01MB 20.36% 46.06%   281.51MB 25.70%  github.com/anchore/syft/syft/pkg/cataloger/java.parseJavaManifest
  104.57MB  9.54% 55.61%   104.57MB  9.54%  github.com/anchore/syft/syft/file.(*LocationSet).Add
   66.50MB  6.07% 61.68%    66.50MB  6.07%  bufio.(*Scanner).Text (inline)
   55.55MB  5.07% 66.75%    63.06MB  5.76%  github.com/anchore/syft/syft/pkg.(*LicenseSet).Add
   38.51MB  3.52% 70.26%    38.51MB  3.52%  github.com/anchore/syft/syft/file.Location.WithAnnotation
   30.21MB  2.76% 73.02%    30.21MB  2.76%  github.com/google/licensecheck/internal/match.(*dfaBuilder).add
   23.50MB  2.15% 75.17%       24MB  2.19%  fmt.Sprintf
   17.01MB  1.55% 76.72%    36.11MB  3.30%  github.com/anchore/syft/syft/pkg.(*Collection).addToIndex
   12.50MB  1.14% 77.86%    12.50MB  1.14%  github.com/klauspost/compress/zip.readDirectoryHeader

Looking at github.com/anchore/syft/syft/pkg/cataloger/java.parseJavaManifest, the continuation-line concatenation into sections (line 55) allocates upwards of 150MB:

(pprof) list github.com/anchore/syft/syft/pkg/cataloger/java.parseJavaManifest
Total: 1.07GB
ROUTINE ======================== github.com/anchore/syft/syft/pkg/cataloger/java.parseJavaManifest in pkg/cataloger/java/parse_java_manifest.go
  223.01MB   281.51MB (flat, cum) 25.70% of Total
         .          .     20:func parseJavaManifest(path string, reader io.Reader) (*pkg.JavaManifest, error) {
    1.50MB     1.50MB     21:	var manifest pkg.JavaManifest
         .          .     22:	sections := make([]pkg.KeyValues, 0)
         .          .     23:
         .          .     24:	currentSection := func() int {
         .          .     25:		return len(sections) - 1
         .          .     26:	}
         .          .     27:
         .          .     28:	var lastKey string
         .          .     29:	scanner := bufio.NewScanner(reader)
         .          .     30:
         .          .     31:	for scanner.Scan() {
         .    58.50MB     32:		line := scanner.Text()
         .          .     33:
         .          .     34:		// empty lines denote section separators
         .          .     35:		if line == "" {
         .          .     36:			// we don't want to allocate a new section map that won't necessarily be used, do that once there is
         .          .     37:			// a non-empty line to process
         .          .     38:
         .          .     39:			// do not process line continuations after this
         .          .     40:			lastKey = ""
         .          .     41:
         .          .     42:			continue
         .          .     43:		}
         .          .     44:
         .          .     45:		if line[0] == ' ' {
         .          .     46:			// this is a continuation
         .          .     47:
         .          .     48:			if lastKey == "" {
         .          .     49:				log.Debugf("java manifest %q: found continuation with no previous key: %q", path, line)
         .          .     50:				continue
         .          .     51:			}
         .          .     52:
         .          .     53:			lastSection := sections[currentSection()]
         .          .     54:
  155.38MB   155.38MB     55:			sections[currentSection()][len(lastSection)-1].Value += strings.TrimSpace(line)
         .          .     56:
         .          .     57:			continue
         .          .     58:		}
         .          .     59:
         .          .     60:		// this is a new key-value pair
         .          .     61:		idx := strings.Index(line, ":")
         .          .     62:		if idx == -1 {
         .          .     63:			log.Debugf("java manifest %q: unable to split java manifest key-value pairs: %q", path, line)
         .          .     64:			continue
         .          .     65:		}
         .          .     66:
         .          .     67:		key := strings.TrimSpace(line[0:idx])
         .          .     68:		value := strings.TrimSpace(line[idx+1:])
         .          .     69:
         .          .     70:		if key == "" {
         .          .     71:			// don't attempt to add new keys or sections unless there is a non-empty key
         .          .     72:			continue
         .          .     73:		}
         .          .     74:
         .          .     75:		if lastKey == "" {
         .          .     76:			// we're entering a new section
    4.58MB     4.58MB     77:			sections = append(sections, make(pkg.KeyValues, 0))
         .          .     78:		}
         .          .     79:
   61.55MB    61.55MB     80:		sections[currentSection()] = append(sections[currentSection()], pkg.KeyValue{
         .          .     81:			Key:   key,
         .          .     82:			Value: value,
         .          .     83:		})
         .          .     84:
         .          .     85:		// keep track of key for potential future continuations
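
For context on why line 55 dominates: Go strings are immutable, so each `+=` on a continuation line allocates a new string and copies everything accumulated so far. A value spread over many continuation lines therefore copies a quadratic number of bytes. The following is a minimal sketch of one way to avoid that, accumulating each value in a strings.Builder and materializing it once. It is not necessarily what the linked PR does. `keyValue` and `parsePairs` are hypothetical names standing in for pkg.KeyValue and the real parser, and sections are flattened into a single slice for brevity.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// keyValue is a hypothetical stand-in for syft's pkg.KeyValue.
type keyValue struct {
	Key   string
	Value string
}

// parsePairs reads manifest key-value pairs, folding continuation lines
// (lines starting with a space) into the preceding value. The value is
// built up in a strings.Builder and converted to a string once per key,
// instead of being reallocated with += for every continuation line.
func parsePairs(input string) []keyValue {
	var (
		pairs   []keyValue
		key     string
		value   strings.Builder
		haveKey bool
	)

	// flush stores the pending pair, if any, and resets the builder.
	flush := func() {
		if haveKey {
			pairs = append(pairs, keyValue{Key: key, Value: value.String()})
		}
		value.Reset()
		haveKey = false
	}

	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case line == "":
			// section separator: continuations after this are ignored
			flush()
		case line[0] == ' ':
			// continuation of the previous value
			if haveKey {
				value.WriteString(strings.TrimSpace(line))
			}
		default:
			idx := strings.Index(line, ":")
			if idx == -1 {
				continue // not a key-value line
			}
			k := strings.TrimSpace(line[:idx])
			if k == "" {
				continue
			}
			flush()
			key = k
			value.WriteString(strings.TrimSpace(line[idx+1:]))
			haveKey = true
		}
	}
	flush()
	return pairs
}

func main() {
	manifest := "Manifest-Version: 1.0\nImport-Package: com.example.a,\n com.example.b,\n com.example.c\n"
	for _, kv := range parsePairs(manifest) {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```

The trade-off is that a pair is only appended once its value is complete, so the parser has to flush on blank lines, on each new key, and at end of input.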

In github.com/anchore/syft/internal/task.finalizePkgCatalogerResults, about 281MB is attributed to appending generated CPEs to the p.CPEs slice (line 89):

(pprof) list github.com/anchore/syft/internal/task.finalizePkgCatalogerResults
Total: 1.07GB
ROUTINE ======================== github.com/anchore/syft/internal/task.finalizePkgCatalogerResults in task/package_task_factory.go
  281.63MB   349.70MB (flat, cum) 31.92% of Total
         .          .     75:func finalizePkgCatalogerResults(cfg CatalogingFactoryConfig, resolver file.PathResolver, catalogerName string, pkgs []pkg.Package, relationships []artifact.Relationship) ([]pkg.Package, []artifact.Relationship) {
         .          .     76:	for i, p := range pkgs {
         .          .     77:		if p.FoundBy == "" {
         .          .     78:			p.FoundBy = catalogerName
         .          .     79:		}
         .          .     80:
         .          .     81:		if cfg.DataGenerationConfig.GenerateCPEs && !hasAuthoritativeCPE(p.CPEs) {
         .          .     82:			// generate CPEs (note: this is excluded from package ID, so is safe to mutate)
         .          .     83:			// we might have binary classified CPE already with the package so we want to append here
         .     3.51MB     84:			dictionaryCPEs, ok := cpeutils.DictionaryFind(p)
         .          .     85:			if ok {
         .          .     86:				log.Tracef("used CPE dictionary to find CPEs for %s package %q: %s", p.Type, p.Name, dictionaryCPEs)
         .          .     87:				p.CPEs = append(p.CPEs, dictionaryCPEs...)
         .          .     88:			} else {
  281.63MB   289.63MB     89:				p.CPEs = append(p.CPEs, cpeutils.Generate(p)...)
         .          .     90:			}
         .          .     91:		}
         .          .     92:
         .          .     93:		// if we were not able to identify the language we have an opportunity
         .          .     94:		// to try and get this value from the PURL. Worst case we assert that
         .          .     95:		// we could not identify the language at either stage and set UnknownLanguage
         .          .     96:		if p.Language == "" {
         .          .     97:			p.Language = pkg.LanguageFromPURL(p.PURL)
         .          .     98:		}
         .          .     99:
         .          .    100:		if cfg.RelationshipsConfig.PackageFileOwnership {
         .          .    101:			// create file-to-package relationships for files owned by the package
         .          .    102:			owningRelationships, err := packageFileOwnershipRelationships(p, resolver)
         .          .    103:			if err != nil {
         .          .    104:				log.Debugf("unable to create any package-file relationships for package name=%q type=%q: %v", p.Name, p.Type, err)
         .          .    105:			} else {
         .          .    106:				relationships = append(relationships, owningRelationships...)
         .          .    107:			}
         .          .    108:		}
         .          .    109:
         .          .    110:		// we want to know if the user wants to preserve license content or not in the final SBOM
         .          .    111:		// note: this looks incorrect, but pkg.License.Content is NOT used to compute the Package ID
         .          .    112:		// this does NOT change the reproducibility of the Package ID
         .    56.55MB    113:		applyLicenseContentRules(&p, cfg.LicenseConfig)
         .          .    114:
         .          .    115:		pkgs[i] = p
         .          .    116:	}
         .          .    117:	return pkgs, relationships
         .          .    118:}

PR #4624 reduces the number of string allocations in github.com/anchore/syft/syft/pkg/cataloger/java.parseJavaManifest.
