perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups#1065
Open
jamestexas wants to merge 1 commit into anchore:main
Conversation
perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups
Two related hot-loop bottlenecks in the NVD provider, both caused by
per-item I/O inside a tight loop over ~250k CVEs.
Bottleneck 1: NVDOverrides.cve() — per-CVE file reads
cve() maintained a filepath index (CVE ID → path) but opened, read, and
JSON-parsed the file on every call. A TODO comment already flagged the
problem.
Fix: _build_data_by_cve() globs and parses all CVE-*.json files once
into a dict on first access. All subsequent cve() calls are O(1) dict
lookups with zero I/O. The duplicated lazy-init guard is extracted into
_ensure_loaded().
Bottleneck 2: GrypeDBStore.get() — per-CPE SQLite queries
get() executed an individual SELECT for every (vuln_id, cpe_or_package)
pair. Each CVE can have 5–20 CPE matches:
250,000 CVEs × 5–20 CPE matches ≈ 1.25M–5M SQLite queries per sync
At 0.1 ms per query, that is 125–500 seconds (~2–8 minutes) of pure SQLite overhead.
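The arithmetic behind that estimate:

```python
cves = 250_000
queries_low, queries_high = cves * 5, cves * 20  # 1.25M and 5M queries
per_query_s = 0.0001                             # 0.1 ms per query

low_s = queries_low * per_query_s
high_s = queries_high * per_query_s
print(f"{low_s:.0f}-{high_s:.0f} seconds of SQLite overhead")  # roughly 2-8 minutes
```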
Fix: _build_index() bulk-loads the entire fixdates table once into two
in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package,
ecosystem). get() becomes two dict lookups. The SQLAlchemy connection
infrastructure is retained — still required by get_changed_vuln_ids_since().
No provider filter is applied in _build_index(): each Store downloads
from a provider-scoped OCI image
(ghcr.io/anchore/grype-db-observed-fix-date/{provider}), so the
database only ever contains rows for this provider.
Signed-off-by: James Gardner <james.gardner@chainguard.dev>
jamestexas (Contributor, Author):
@willmurphyscode I saw you have been making some concurrency-oriented fixes in this repo, and wanted to bring this to your attention as it's presumably related. Let me know if the approach is incorrect or anything!
What and why
Two separate code paths were doing per-item I/O inside hot loops, causing several minutes of avoidable overhead per sync.
- GrypeDBStore.get() affects all providers (every provider calls default_finder, which uses GrypeDBStore)
- NVDOverrides.cve() is NVD-specific

Changes

1. GrypeDBStore.get() — 1–5M SQLite queries eliminated (all providers)

Before: one SELECT per (vuln_id, cpe_or_package) pair. Each CVE can have 5–20 CPE matches, so ~250,000 CVEs produce roughly 1.25M–5M queries per sync.

After: _build_index() bulk-loads the entire fixdates table once into two in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package, ecosystem). Subsequent get() calls are two dict lookups via _ensure_index(), which uses threading.Lock() with double-checked locking (both index vars are checked to prevent partial-init reads under concurrency).

2. NVDOverrides.cve() — per-CVE file reads eliminated (NVD only)

Before: a filepath index (CVE ID → path) was built once, but the file was opened, read, and JSON-parsed on every call. A "# TODO: implement in-memory index" comment already flagged this.

After: _build_data_by_cve() globs and parses all CVE-*.json files once on first access via _ensure_loaded(), which also uses threading.Lock() with double-checked locking for safe concurrent access.

Test plan

- test_overrides_enabled — cache populated; repeated calls return the same object (no re-parse)
- test_get_uses_in_memory_index — _build_index() called exactly once regardless of call count
- uv run pytest tests/unit/ (779 passed)
- uv run ruff check src/ (clean)