@Churro Churro commented Jan 11, 2026

Problem

osv-offline-db relies on @seald-io/nedb to load the OSV vulnerability databases (in NDJSON format) of all supported ecosystems into memory upon initialization. loadDatabaseAsync() reads the entire file, parses every line into JS objects, builds in-memory indexes and keeps all documents resident in the heap. Since npm.nedb has grown to 316 MB, pypi.nedb to 38 MB, and so on, this requires more than 1 GB of heap memory at peak, and loading takes ~16 s on a Ryzen 5 5600X 6-core CPU.

Output of the current implementation when running Renovate with osvVulnerabilityAlerts: true on https://github.com/renovate-demo/gh-vulntest/:

DEBUG: fetchVulnerabilities() - osvVulnerabilityAlerts=true (repository=renovate-demo/gh-vulntest)
  osv-offline:download Downloading databases ... +0ms
  osv-offline:download Downloading databases done. +1s
  osv-offline:download Extracting databases ... +0ms
  osv-offline:download Extracting databases done. +2s	
  osv-offline:download Initializing databases ... +0ms
  osv-offline:download Initializing databases done. +16s
  osv-offline:download 
  osv-offline:download Memory Profile: initialize +0ms
  osv-offline:download   Peak Heap: 1264.83 MB +0ms
  osv-offline:download   Peak RSS: 1889.38 MB +0ms
  osv-offline:download   Delta Heap: 656.05 MB +0ms
Details per ecosystem
  osv-offline-db:db Initializing ecosystem crates.io. +0ms
  osv-offline-db:db crates.io loaded in 209ms +209ms
  osv-offline-db:db Heap used: 9.27 MB +0ms
  osv-offline-db:db RSS: 20.25 MB +0ms
  osv-offline-db:db Initializing ecosystem Go. +0ms
  osv-offline-db:db Go loaded in 505ms +505ms
  osv-offline-db:db Heap used: 35.09 MB +0ms
  osv-offline-db:db RSS: 49.16 MB +0ms
  osv-offline-db:db Initializing ecosystem Hackage. +0ms
  osv-offline-db:db Hackage loaded in 14ms +14ms
  osv-offline-db:db Heap used: 1.18 MB +0ms
  osv-offline-db:db RSS: 0.98 MB +0ms
  osv-offline-db:db Initializing ecosystem Hex. +0ms
  osv-offline-db:db Hex loaded in 25ms +25ms
  osv-offline-db:db Heap used: 2.79 MB +0ms
  osv-offline-db:db RSS: 0.31 MB +0ms
  osv-offline-db:db Initializing ecosystem Maven. +0ms
  osv-offline-db:db Maven loaded in 1262ms +1s
  osv-offline-db:db Heap used: 57.67 MB +0ms
  osv-offline-db:db RSS: 83.16 MB +0ms
  osv-offline-db:db Initializing ecosystem npm. +0ms
  osv-offline-db:db npm loaded in 11259ms +11s
  osv-offline-db:db Heap used: 745.21 MB +0ms
  osv-offline-db:db RSS: 1122.33 MB +0ms
  osv-offline-db:db Initializing ecosystem NuGet. +0ms
  osv-offline-db:db NuGet loaded in 245ms +245ms
  osv-offline-db:db Heap used: 7.03 MB +0ms
  osv-offline-db:db RSS: 14.11 MB +0ms
  osv-offline-db:db Initializing ecosystem Packagist. +0ms
  osv-offline-db:db Packagist loaded in 837ms +837ms
  osv-offline-db:db Heap used: 62.04 MB +0ms
  osv-offline-db:db RSS: 66.29 MB +0ms
  osv-offline-db:db Initializing ecosystem Pub. +0ms
  osv-offline-db:db Pub loaded in 11ms +11ms
  osv-offline-db:db Heap used: 0.88 MB +0ms
  osv-offline-db:db RSS: 0.07 MB +0ms
  osv-offline-db:db Initializing ecosystem PyPI. +0ms
  osv-offline-db:db PyPI loaded in 1889ms +2s
  osv-offline-db:db Heap used: -279.89 MB +0ms
  osv-offline-db:db RSS: 157.12 MB +0ms
  osv-offline-db:db Initializing ecosystem RubyGems. +0ms
  osv-offline-db:db RubyGems loaded in 227ms +227ms
  osv-offline-db:db Heap used: 14.60 MB +0ms
  osv-offline-db:db RSS: -386.29 MB +0ms
  osv-offline-db:db
  osv-offline-db:db All ecosystems loaded in 16483ms +0ms

Proposed Solution

  • Stream NDJSON line-by-line
  • For every ecosystem, build an index that maps each package name to the byte offsets of its vulnerability records
  • Query:
    • Lookup offsets for a package
    • Read only matching records

This keeps memory usage low by "remembering" only where to find matching vulnerability records instead of reading all of them into memory. Parallel queries work safely because each positional read takes an explicit offset and does not depend on a shared file position, and locating a record via its stored offset is direct random access in O(1) time, independent of file size.
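
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `indexLine`, `queryRecords`, and `PackageIndex` are hypothetical names, the `buildIndex` signature here is simplified, lines are split on raw LF bytes, and records are assumed to carry package names under `affected[].package.name`.

```typescript
import { createReadStream } from 'node:fs';
import { open } from 'node:fs/promises';

interface RecordPointer {
  offset: number; // byte offset of the record's line in the file
  length: number; // byte length of the line, without the trailing LF
}

type PackageIndex = Map<string, RecordPointer[]>;

// Record one line in the index under every package name it affects.
function indexLine(index: PackageIndex, line: Buffer, offset: number): void {
  try {
    const record = JSON.parse(line.toString('utf8'));
    const seen = new Set<string>();
    for (const affected of record.affected ?? []) {
      const name: string | undefined = affected.package?.name;
      if (name && !seen.has(name)) {
        seen.add(name);
        const pointers = index.get(name) ?? [];
        pointers.push({ offset, length: line.length });
        index.set(name, pointers);
      }
    }
  } catch {
    // Skip malformed lines
  }
}

// Stream the file once; only pointers are kept, never the parsed records.
async function buildIndex(filePath: string): Promise<PackageIndex> {
  const index: PackageIndex = new Map();
  let pending: Buffer = Buffer.alloc(0); // partial line carried between chunks
  let pendingOffset = 0; // absolute file offset of pending[0]

  for await (const chunk of createReadStream(filePath)) {
    const data = Buffer.concat([pending, chunk as Buffer]);
    let start = 0;
    let newline = data.indexOf(0x0a, start);
    while (newline !== -1) {
      indexLine(index, data.subarray(start, newline), pendingOffset + start);
      start = newline + 1; // skip the LF byte
      newline = data.indexOf(0x0a, start);
    }
    pending = data.subarray(start);
    pendingOffset += start;
  }
  if (pending.length > 0) {
    indexLine(index, pending, pendingOffset); // last line without trailing LF
  }
  return index;
}

// Read only the records whose pointers are stored for the given package.
async function queryRecords(
  filePath: string,
  index: PackageIndex,
  packageName: string,
): Promise<unknown[]> {
  const pointers = index.get(packageName) ?? [];
  if (pointers.length === 0) {
    return [];
  }
  const handle = await open(filePath, 'r');
  try {
    return await Promise.all(
      pointers.map(async ({ offset, length }) => {
        const buffer = Buffer.alloc(length);
        await handle.read(buffer, 0, length, offset); // positional read
        return JSON.parse(buffer.toString('utf8'));
      }),
    );
  } finally {
    await handle.close();
  }
}
```

In this sketch the index holds two numbers per (package, record) pair, so its size grows with the number of records rather than with their content.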

Output with this change when running Renovate with osvVulnerabilityAlerts: true on https://github.com/renovate-demo/gh-vulntest/:

DEBUG: fetchVulnerabilities() - osvVulnerabilityAlerts=true (repository=renovate-demo/gh-vulntest)
  osv-offline:download Downloading databases ... +0ms
  osv-offline:download Downloading databases done. +1s
  osv-offline:download Extracting databases ... +0ms
  osv-offline:download Extracting databases done. +2s	
  osv-offline:download Initializing databases ... +0ms
  osv-offline:download Initializing databases done. +3s
  osv-offline:download 
  osv-offline:download Memory Profile: initialize +0ms
  osv-offline:download   Peak Heap: 333.45 MB +0ms
  osv-offline:download   Peak RSS: 469.11 MB +0ms
  osv-offline:download   Delta Heap: 108.29 MB +0ms
Details per ecosystem
  osv-offline-db:db Initializing ecosystem crates.io. +0ms
  osv-offline-db:db crates.io loaded in 46ms +46ms
  osv-offline-db:db Heap used: 0.75 MB +0ms
  osv-offline-db:db RSS: -82.52 MB +0ms
  osv-offline-db:db Initializing ecosystem Go. +0ms
  osv-offline-db:db Go loaded in 92ms +92ms
  osv-offline-db:db Heap used: 5.18 MB +0ms
  osv-offline-db:db RSS: 3.11 MB +0ms
  osv-offline-db:db Initializing ecosystem Hackage. +0ms
  osv-offline-db:db Hackage loaded in 2ms +2ms
  osv-offline-db:db Heap used: 0.46 MB +0ms
  osv-offline-db:db RSS: 0.55 MB +0ms
  osv-offline-db:db Initializing ecosystem Hex. +0ms
  osv-offline-db:db Hex loaded in 2ms +2ms
  osv-offline-db:db Heap used: 1.20 MB +0ms
  osv-offline-db:db RSS: 0.40 MB +0ms
  osv-offline-db:db Initializing ecosystem Maven. +0ms
  osv-offline-db:db Maven loaded in 145ms +145ms
  osv-offline-db:db Heap used: 3.58 MB +0ms
  osv-offline-db:db RSS: 3.25 MB +0ms
  osv-offline-db:db Initializing ecosystem npm. +0ms
  osv-offline-db:db npm loaded in 1810ms +2s
  osv-offline-db:db Heap used: 82.43 MB +0ms
  osv-offline-db:db RSS: -33.60 MB +1ms
  osv-offline-db:db Initializing ecosystem NuGet. +0ms
  osv-offline-db:db NuGet loaded in 35ms +35ms
  osv-offline-db:db Heap used: 12.35 MB +0ms
  osv-offline-db:db RSS: 5.10 MB +0ms
  osv-offline-db:db Initializing ecosystem Packagist. +0ms
  osv-offline-db:db Packagist loaded in 126ms +127ms
  osv-offline-db:db Heap used: -11.30 MB +0ms
  osv-offline-db:db RSS: -2.52 MB +0ms
  osv-offline-db:db Initializing ecosystem Pub. +0ms
  osv-offline-db:db Pub loaded in 1ms +1ms
  osv-offline-db:db Heap used: 0.28 MB +0ms
  osv-offline-db:db RSS: 0.05 MB +0ms
  osv-offline-db:db Initializing ecosystem PyPI. +0ms
  osv-offline-db:db PyPI loaded in 275ms +275ms
  osv-offline-db:db Heap used: 11.98 MB +0ms
  osv-offline-db:db RSS: 5.77 MB +1ms
  osv-offline-db:db Initializing ecosystem RubyGems. +0ms
  osv-offline-db:db RubyGems loaded in 29ms +30ms
  osv-offline-db:db Heap used: 2.05 MB +0ms
  osv-offline-db:db RSS: 1.13 MB +0ms
  osv-offline-db:db
  osv-offline-db:db All ecosystems loaded in 2567ms +0ms

As can be seen, this reduces the DB loading time from 16 s to 3 s and peak heap memory usage from 1264.83 MB to 333.45 MB. I've repeated these benchmarks a few times, and in some other runs GC ran even more efficiently.

Also note the difference for npm, whose database file is 316 MB:

# before:
  osv-offline-db:db npm loaded in 11259ms +11s
  osv-offline-db:db Heap used: 745.21 MB +0ms
# now:
  osv-offline-db:db npm loaded in 1810ms +2s
  osv-offline-db:db Heap used: 82.43 MB +0ms

When run against https://github.com/renovate-demo/gh-vulntest/, the current and the new implementation each find 84 vulnerabilities.

codecov-commenter commented Jan 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.65%. Comparing base (70f26dc) to head (285f929).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1465      +/-   ##
==========================================
+ Coverage   92.75%   95.65%   +2.89%     
==========================================
  Files           6        6              
  Lines          69      115      +46     
  Branches        7       15       +8     
==========================================
+ Hits           64      110      +46     
  Misses          5        5              
Flag          Coverage Δ
node22-Linux  95.65% <100.00%> (+2.89%) ⬆️
node24-Linux  95.65% <100.00%> (+2.89%) ⬆️


import type { Osv } from '..';
import { packageToPurl } from './purl-helper';

interface RecordPointer {

@Churro Churro Jan 11, 2026

It's possible to make this even more efficient by dropping RecordPointer and by using just a flat array of numbers. So instead of { offset, length } simply store [offset, length, ...].

Querying would look like this:

    for (let i = 0; i < pointers.length; i += 2) {
      const offset = pointers[i];
      const length = pointers[i + 1];
      ...
    }

I didn't go this route for now but it could be a further optimization.
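
A rough sketch of what that flat layout could look like, for illustration only. `FlatPointers`, `addPointer`, and `forEachPointer` are hypothetical names that do not appear in this PR; the only thing taken from the comment above is the `[offset, length, ...]` pair layout.

```typescript
// Hypothetical flat layout: [offset0, length0, offset1, length1, ...]
type FlatPointers = number[];

// Append one (offset, length) pair for a package.
function addPointer(
  index: Map<string, FlatPointers>,
  packageName: string,
  offset: number,
  length: number,
): void {
  let pointers = index.get(packageName);
  if (!pointers) {
    pointers = [];
    index.set(packageName, pointers);
  }
  pointers.push(offset, length);
}

// Walk the pairs two numbers at a time, as in the loop shown above.
function forEachPointer(
  pointers: FlatPointers,
  visit: (offset: number, length: number) => void,
): void {
  for (let i = 0; i < pointers.length; i += 2) {
    visit(pointers[i], pointers[i + 1]);
  }
}
```

The trade-off is that a flat array avoids allocating one small object per record, at the cost of callers having to know the pair layout.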

await this.buildIndex(this.indices[ecosystem], filePath);
}

process.on('exit', () => {

I don't like this, but it was the only way I could make it work. I looked into await using with [Symbol.asyncDispose]: async () => { ... } instead, but since we use singletons all over the place here and in Renovate for Vulnerabilities, we'd probably need reference counting inside the singleton(s), which is quite complex.

I also considered calling close() manually, but one would need to add handling to every singleton, including the Vulnerabilities class in Renovate. Since Vulnerabilities.create() is called twice, the DB initialization would then probably also run twice.
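
To illustrate the trade-off, here is a minimal sketch of a singleton that closes its files from an 'exit' listener. It assumes the database holds numeric file descriptors, which this thread does not confirm; `OffsetDb`, `openFile`, and `openCount` are hypothetical names. Node.js runs 'exit' listeners synchronously, so the cleanup has to use `closeSync`.

```typescript
import { closeSync, openSync } from 'node:fs';

class OffsetDb {
  private static instance: OffsetDb | undefined;
  private readonly fds = new Map<string, number>();

  private constructor() {
    // 'exit' listeners may only do synchronous work; async work
    // scheduled here would never run.
    process.on('exit', () => this.close());
  }

  // Every caller shares one instance, so no caller owns the cleanup.
  static create(): OffsetDb {
    return (OffsetDb.instance ??= new OffsetDb());
  }

  // Open a file once per ecosystem and reuse the descriptor afterwards.
  openFile(ecosystem: string, filePath: string): number {
    let fd = this.fds.get(ecosystem);
    if (fd === undefined) {
      fd = openSync(filePath, 'r');
      this.fds.set(ecosystem, fd);
    }
    return fd;
  }

  get openCount(): number {
    return this.fds.size;
  }

  // Safe to call more than once: the map is cleared after closing.
  close(): void {
    for (const fd of this.fds.values()) {
      closeSync(fd);
    }
    this.fds.clear();
  }
}
```

Because `close()` is idempotent in this sketch, a manual call followed by the 'exit' listener does not close a descriptor twice.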

}

private async initialize(): Promise<void> {
  for (const ecosystem of ecosystems) {

In theory, this loop could also be parallelized. It didn't run faster on my machine, though, probably because parsing every JSON line of 11 files in parallel makes the CPU the bottleneck.
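
For reference, a parallel variant could look roughly like the sketch below. `initializeAll` and the `load` callback are hypothetical stand-ins, not names from this PR. Since JSON parsing runs on the single JavaScript thread, `Promise.all` can only overlap the file I/O, which is consistent with the observation above.

```typescript
// Start loading every ecosystem at once and wait for all of them.
async function initializeAll<T>(
  ecosystems: readonly string[],
  load: (ecosystem: string) => Promise<T>, // hypothetical per-ecosystem loader
): Promise<Map<string, T>> {
  const entries = await Promise.all(
    ecosystems.map(
      async (ecosystem): Promise<[string, T]> => [ecosystem, await load(ecosystem)],
    ),
  );
  return new Map(entries);
}
```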

for (const affected of record.affected) {
  const packageName = affected.package?.name;

  if (packageName && !affectedPackageNames.has(packageName)) {

!affectedPackageNames.has(packageName) is needed because there can be multiple affected entries that reference the same packageName with different versions or ranges. During querying we want to find the record only once, so we deduplicate here.
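
A small sketch of the effect, using simplified types. `OsvRecord`, `Affected`, and `addRecord` are hypothetical, and the index stores bare offsets for brevity: a record that lists the same package twice still yields a single index entry.

```typescript
interface Affected {
  package?: { name?: string };
}

interface OsvRecord {
  id: string;
  affected?: Affected[];
}

// Add one record's offset to the index, at most once per package name.
function addRecord(
  index: Map<string, number[]>,
  record: OsvRecord,
  offset: number,
): void {
  const affectedPackageNames = new Set<string>();
  for (const affected of record.affected ?? []) {
    const packageName = affected.package?.name;
    if (packageName && !affectedPackageNames.has(packageName)) {
      affectedPackageNames.add(packageName);
      const offsets = index.get(packageName) ?? [];
      offsets.push(offset);
      index.set(packageName, offsets);
    }
  }
}
```

Without the `has()` check, the offset would be pushed once per `affected` entry, and a query for that package would return the same record multiple times.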

} catch {
  // Skip malformed lines
}
currentOffset += lineByteLength + 1;

All .nedb files have LF line endings, so adding 1 byte per line for the terminator should be fine.
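
The offset bookkeeping can be illustrated with the sketch below. `lineOffsets` is a hypothetical helper that works on an in-memory string instead of a stream. Offsets have to be counted in bytes rather than characters, because multi-byte UTF-8 characters would otherwise shift every later record.

```typescript
// Compute byte offset and byte length for each non-empty LF-terminated line.
function lineOffsets(content: string): Array<{ offset: number; length: number }> {
  const result: Array<{ offset: number; length: number }> = [];
  let currentOffset = 0;
  for (const line of content.split('\n')) {
    const lineByteLength = Buffer.byteLength(line, 'utf8');
    if (lineByteLength > 0) {
      result.push({ offset: currentOffset, length: lineByteLength });
    }
    currentOffset += lineByteLength + 1; // +1 for the single LF byte
  }
  return result;
}
```

With a line reader that strips a two-byte `\r\n` terminator, the `+ 1` would undercount by one byte per line, which is why the LF-only property matters.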


const candidates = await Promise.all(
  pointers.map(async ({ offset, length }) => {
    const buffer = Buffer.allocUnsafe(length);

The buffer is written on the next line, so not clearing it is not an issue, and this is ~5% faster than Buffer.alloc().
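
A self-contained sketch of such a positional read follows. `readRecord` is a hypothetical helper and, unlike the code under review, opens and closes the file on every call. The `bytesRead` check is an addition for illustration: `Buffer.allocUnsafe()` returns uninitialized memory, so a short read would otherwise leave stale bytes at the end of the buffer.

```typescript
import { open } from 'node:fs/promises';

// Read exactly `length` bytes starting at `offset` and decode them as UTF-8.
async function readRecord(
  filePath: string,
  offset: number,
  length: number,
): Promise<string> {
  const handle = await open(filePath, 'r');
  try {
    const buffer = Buffer.allocUnsafe(length); // not zero-filled
    const { bytesRead } = await handle.read(buffer, 0, length, offset);
    if (bytesRead !== length) {
      throw new Error(`Short read: expected ${length} bytes, got ${bytesRead}`);
    }
    return buffer.toString('utf8');
  } finally {
    await handle.close();
  }
}
```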

await fs.rm(rootDir, { recursive: true, force: true });
});

async function createDbWithContent(fileName: string, content: string) {

This helps with testing: each test can create a database with exactly the content it needs instead of loading the same database with sampleVuln for every test.

@Churro Churro changed the title from "refactor: reduce heap mem usage for db queries with pre-built index" to "refactor(osv-offline-db): reduce initialization time & heap mem usage for database loading with pre-built index" on Jan 11, 2026