perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups#1065
Open
jamestexas wants to merge 1 commit into anchore:main
Conversation
perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups
Two related hot-loop bottlenecks in the NVD provider, both caused by
per-item I/O inside a tight loop over ~250k CVEs.
Bottleneck 1: NVDOverrides.cve() — per-CVE file reads
cve() maintained a filepath index (CVE ID → path) but opened, read, and
JSON-parsed the file on every call. A TODO comment already flagged the
problem.
Fix: _build_data_by_cve() globs and parses all CVE-*.json files once
into a dict on first access. All subsequent cve() calls are O(1) dict
lookups with zero I/O. The duplicated lazy-init guard is extracted into
_ensure_loaded().
Bottleneck 2: GrypeDBStore.get() — per-CPE SQLite queries
get() executed an individual SELECT for every (vuln_id, cpe_or_package)
pair. Each CVE can have 5–20 CPE matches:
250,000 CVEs × 5–20 CPE matches ≈ 1.25M–5M SQLite queries per sync
At 0.1 ms per query, that is 125–500 seconds (~2–8 minutes) of pure SQLite overhead.
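The arithmetic behind that estimate:

```python
cves = 250_000
queries_low, queries_high = cves * 5, cves * 20  # 1.25M and 5M queries
per_query_s = 0.0001                             # 0.1 ms per query

low_s = queries_low * per_query_s
high_s = queries_high * per_query_s
print(f"{low_s:.0f}-{high_s:.0f} seconds of SQLite overhead")  # roughly 2-8 minutes
```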
Fix: _build_index() bulk-loads the entire fixdates table once into two
in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package,
ecosystem). get() becomes two dict lookups. The SQLAlchemy connection
infrastructure is retained — still required by get_changed_vuln_ids_since().
No provider filter is applied in _build_index(): each Store downloads
from a provider-scoped OCI image
(ghcr.io/anchore/grype-db-observed-fix-date/{provider}), so the
database only ever contains rows for this provider.
Signed-off-by: James Gardner <james.gardner@chainguard.dev>
jamestexas (Contributor, Author):
@willmurphyscode I saw you have been making some concurrency-oriented fixes in this repo, and wanted to bring this to your attention as it's presumably related. Let me know if the approach is incorrect or anything!
What and why
Two separate code paths were doing per-item I/O inside hot loops, causing several minutes of avoidable overhead per sync.
- GrypeDBStore.get() affects all providers (every provider calls default_finder, which uses GrypeDBStore)
- NVDOverrides.cve() is NVD-specific

Changes

1. GrypeDBStore.get() — 1–5M SQLite queries eliminated (all providers)

Before: one SELECT per (vuln_id, cpe_or_package) pair. Each CVE can have 5–20 CPE matches, so ~250,000 CVEs produce roughly 1.25M–5M queries per sync.

After: _build_index() bulk-loads the entire fixdates table once into two in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package, ecosystem). Subsequent get() calls are two dict lookups via _ensure_index(), which uses threading.Lock() with double-checked locking (both index vars are checked to prevent partial-init reads under concurrency).

2. NVDOverrides.cve() — per-CVE file reads eliminated (NVD only)

Before: a filepath index (CVE ID → path) was built once, but the file was opened, read, and JSON-parsed on every call. A "# TODO: implement in-memory index" comment already flagged this.

After: _build_data_by_cve() globs and parses all CVE-*.json files once on first access via _ensure_loaded(), which also uses threading.Lock() with double-checked locking for safe concurrent access.

Test plan

- test_overrides_enabled — cache populated; repeated calls return the same object (no re-parse)
- test_get_uses_in_memory_index — _build_index() called exactly once regardless of call count
- uv run pytest tests/unit/ (779 passed)
- uv run ruff check src/ (clean)