refactor(osv-offline-db): reduce initialization time & heap mem usage for database loading with pre-built index #1465
base: main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1465      +/-   ##
==========================================
+ Coverage   92.75%   95.65%   +2.89%
==========================================
  Files           6        6
  Lines          69      115      +46
  Branches        7       15       +8
==========================================
+ Hits           64      110      +46
  Misses          5        5
```

☔ View full report in Codecov by Sentry.
```ts
import type { Osv } from '..';
import { packageToPurl } from './purl-helper';

interface RecordPointer {
```
It's possible to make this even more efficient by dropping `RecordPointer` and using just a flat array of numbers: instead of `{ offset, length }`, simply store `[offset, length, ...]`. Querying would look like this:

```ts
for (let i = 0; i < pointers.length; i += 2) {
  const offset = pointers[i];
  const length = pointers[i + 1];
  // ...
}
```

I didn't go this route for now, but it could be a further optimization.
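For completeness, the building side of that layout could look like this (a sketch; names are illustrative):

```ts
// Flat-array layout: pointer i occupies slots [2 * i] (offset) and
// [2 * i + 1] (length), avoiding one small object allocation per record.
const pointers: number[] = [];

function addPointer(offset: number, length: number): void {
  pointers.push(offset, length);
}
```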
```ts
  await this.buildIndex(this.indices[ecosystem], filePath);
}

process.on('exit', () => {
```
I don't like this, but it was the only way I could make it work. I looked into `await using` with `[Symbol.asyncDispose]: async () => { ... }` instead, but since we use singletons all over here and in renovate's `Vulnerabilities`, we'd probably need reference counting inside the singleton(s), which is quite complex (see the sketch below).

I also considered calling `close()` manually, but then handling would need to be added to every singleton, including the `Vulnerabilities` class in renovate. And since `Vulnerabilities.create()` is called twice, the DB initialization would then probably also run twice.
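For reference, a minimal sketch of the `await using` alternative (class and member names are illustrative, not the actual osv-offline API; requires TS 5.2+ and Node's explicit resource management support):

```ts
import type { FileHandle } from 'node:fs/promises';

class Db {
  private constructor(private readonly handles: FileHandle[]) {}

  static async create(): Promise<Db> {
    // open one FileHandle per ecosystem .nedb file here (omitted)
    return new Db([]);
  }

  async [Symbol.asyncDispose](): Promise<void> {
    // close all open file handles when the owning scope exits
    await Promise.all(this.handles.map((h) => h.close()));
  }
}

async function main(): Promise<void> {
  await using db = await Db.create();
  // ... query db ...
} // db[Symbol.asyncDispose]() runs here
```

With singletons, though, the first disposal would close handles that other consumers may still rely on, hence the reference-counting complexity mentioned above.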
```ts
}

private async initialize(): Promise<void> {
  for (const ecosystem of ecosystems) {
```
In theory, this loop could also be parallelized, as in the sketch below. It didn't run faster on my machine, though, probably because parsing every JSON line of 11 files in parallel makes the CPU the bottleneck.
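The parallel variant would look something like this (`buildIndex` and `indices` mirror the diff; `filePathFor` is an assumed helper):

```ts
// Build all per-ecosystem indices concurrently instead of sequentially.
private async initialize(): Promise<void> {
  await Promise.all(
    ecosystems.map((ecosystem) =>
      this.buildIndex(this.indices[ecosystem], filePathFor(ecosystem)),
    ),
  );
}
```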
```ts
for (const affected of record.affected) {
  const packageName = affected.package?.name;

  if (packageName && !affectedPackageNames.has(packageName)) {
```
The `!affectedPackageNames.has(packageName)` check is needed because a record can contain multiple `affected` entries that reference the same `packageName` with different versions or ranges. During querying we want to find the record only once, so we deduplicate here (see the example below).
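For illustration (a simplified, hypothetical record):

```ts
// A single OSV record can list the same package twice with different
// version ranges; without the Set check it would be indexed twice and
// then returned twice when querying.
const record = {
  affected: [
    { package: { name: 'lodash' }, ranges: [/* < 4.17.12 */] },
    { package: { name: 'lodash' }, ranges: [/* >= 5.0.0, < 5.0.1 */] },
  ],
};
```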
```ts
} catch {
  // Skip malformed lines
}
currentOffset += lineByteLength + 1;
```
All `.nedb` files have LF line endings, so the fixed `+1` for the line terminator is fine; no CRLF handling is needed (see the sketch below).
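For clarity, the bookkeeping this refers to (a sketch; `line` is the current NDJSON line as a string):

```ts
// Offsets are counted in bytes, not UTF-16 code units, so use
// Buffer.byteLength rather than line.length; the +1 skips the LF byte.
const lineByteLength = Buffer.byteLength(line, 'utf8');
currentOffset += lineByteLength + 1;
```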
```ts
const candidates = await Promise.all(
  pointers.map(async ({ offset, length }) => {
    const buffer = Buffer.allocUnsafe(length);
```
The buffer is fully overwritten by the read on the next line, so not clearing it is no issue, and `Buffer.allocUnsafe()` is ~5% faster than `Buffer.alloc()`.
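The surrounding pattern, for context (a sketch; `fileHandle` is assumed to be an open `fs.promises.FileHandle` on the `.nedb` file):

```ts
// allocUnsafe skips zero-filling; safe because read() overwrites all
// `length` bytes before the buffer is ever inspected.
const buffer = Buffer.allocUnsafe(length);
await fileHandle.read(buffer, 0, length, offset); // positional read
const record = JSON.parse(buffer.toString('utf8')) as Osv.Vulnerability;
```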
```ts
  await fs.rm(rootDir, { recursive: true, force: true });
});

async function createDbWithContent(fileName: string, content: string) {
```
This helps with testing: each test can create a database with exactly the content it needs, instead of loading the same database seeded with `sampleVuln` for every test.
Problem
`osv-offline-db` relies on `@seald-io/nedb` to load the OSV vulnerability databases (in NDJSON format) of all supported ecosystems into memory upon initialization. `loadDatabaseAsync()` reads the entire file, parses every line into JS objects, builds in-memory indexes, and keeps all documents resident on the heap. As `npm.nedb` has grown to 316 MB, `pypi.nedb` to 38 MB, etc., this now requires > 1 GB of peak heap memory, and loading takes ~16 s on a Ryzen 5 5600X 6-core CPU.

When running renovate with `osvVulnerabilityAlerts: true` on https://github.com/renovate-demo/gh-vulntest/:

Details per ecosystem
Proposed Solution
The new implementation builds a small per-ecosystem index up front, mapping each package name to the byte offset and length of its records in the `.nedb` file, and reads matching records on demand via positional reads (see the sketch below). It keeps memory usage low by "remembering" just where to find matching vulnerability records instead of reading all of them into memory. Parallel queries work safely since positional reads are thread-safe, and direct random access via offset works in O(1) time.
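Conceptually, the index looks like this (`RecordPointer` matches the diff; the map alias is illustrative):

```ts
// One index per ecosystem: package name -> byte positions of all NDJSON
// lines (records) that mention the package in the .nedb file.
interface RecordPointer {
  offset: number; // byte offset of the line within the file
  length: number; // byte length of the line, excluding the trailing LF
}

type PackageIndex = Map<string, RecordPointer[]>;
```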
When running renovate with `osvVulnerabilityAlerts: true` on https://github.com/renovate-demo/gh-vulntest/:

Details per ecosystem
As can be seen, this reduces the DB loading time from ~16 s to ~3 s and the peak heap memory usage from 1264.83 MB to 333.45 MB. I've repeated these benchmarks a few times, and in some other runs GC was even more efficient.
Also note the difference for `npm` with its 316 MB database.

When applied to https://github.com/renovate-demo/gh-vulntest/, the current and the new implementation each find 84 vulnerabilities.