perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups #1065

Open
jamestexas wants to merge 1 commit into anchore:main from jamestexas:perf/nvd-in-memory-indexes
@jamestexas (Contributor) commented Feb 20, 2026

What and why

Two separate code paths were doing per-item I/O inside hot loops, causing several minutes of avoidable overhead per sync.

  • GrypeDBStore.get() affects all providers (every provider calls default_finder which uses GrypeDBStore)
  • NVDOverrides.cve() is NVD-specific

Changes

1. GrypeDBStore.get() — 1–5M SQLite queries eliminated (all providers)

Before: one SELECT per (vuln_id, cpe_or_package) pair. Each CVE can have 5–20 CPE matches:

250,000 CVEs × 5–20 CPE matches ≈ 1.25M–5M queries per sync

After: _build_index() bulk-loads the entire fixdates table once into two in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package, ecosystem). Subsequent get() calls are two dict lookups via _ensure_index(), which uses threading.Lock() with double-checked locking (both index vars checked to prevent partial-init reads under concurrency).

No WHERE provider = ... filter is applied: each Store downloads from a provider-scoped OCI image (ghcr.io/anchore/grype-db-observed-fix-date/{provider}), so the database only ever contains rows for one provider.
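
The lazy bulk-load described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `FixDateIndex`, the helper `make_demo_connection()`, the `build_count` counter, and the column names of the `fixdates` table are all assumptions made for the example, and it uses the standard-library `sqlite3` module rather than the store's SQLAlchemy connection.

```python
import sqlite3
import threading


def make_demo_connection():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE fixdates "
        "(vuln_id TEXT, cpe TEXT, package TEXT, ecosystem TEXT, fix_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO fixdates VALUES (?, ?, ?, ?, ?)",
        [
            ("CVE-2024-0001", "cpe:2.3:a:example:demo", None, None, "2024-01-15"),
            ("CVE-2024-0002", None, "demo-pkg", "pypi", "2024-02-01"),
        ],
    )
    return conn


class FixDateIndex:
    """Hypothetical stand-in for the store; names are illustrative only."""

    def __init__(self, conn):
        self._conn = conn
        self._lock = threading.Lock()
        self._by_cpe = None      # (vuln_id, cpe) -> fix_date
        self._by_package = None  # (vuln_id, package, ecosystem) -> fix_date
        self.build_count = 0     # demo-only counter to show the single load

    def _build_index(self):
        by_cpe, by_package = {}, {}
        rows = self._conn.execute(
            "SELECT vuln_id, cpe, package, ecosystem, fix_date FROM fixdates"
        )
        for vuln_id, cpe, package, ecosystem, fix_date in rows:
            if cpe:
                by_cpe[(vuln_id, cpe)] = fix_date
            if package:
                by_package[(vuln_id, package, ecosystem)] = fix_date
        self.build_count += 1
        # Publish both dicts only after they are fully populated.
        self._by_package = by_package
        self._by_cpe = by_cpe

    def _ensure_index(self):
        # First check without the lock (fast path), second check under it.
        if self._by_cpe is None or self._by_package is None:
            with self._lock:
                if self._by_cpe is None or self._by_package is None:
                    self._build_index()

    def get(self, vuln_id, cpe_or_package, ecosystem=None):
        self._ensure_index()
        hit = self._by_cpe.get((vuln_id, cpe_or_package))
        if hit is not None:
            return hit
        return self._by_package.get((vuln_id, cpe_or_package, ecosystem))
```

Checking both index variables in the guard means a reader never sees one dict populated and the other still `None`, which is the partial-init case the description calls out.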

2. NVDOverrides.cve() — per-CVE file reads eliminated (NVD only)

Before: a filepath index (CVE ID → path) was built once, but the file was opened, read, and JSON-parsed on every call. A # TODO: implement in-memory index comment already flagged this.

After: _build_data_by_cve() globs and parses all CVE-*.json files once on first access via _ensure_loaded(), which also uses threading.Lock() with double-checked locking for safe concurrent access.
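
A rough sketch of the parse-once pattern, under stated assumptions: the class name `OverridesCache`, the recursive glob pattern, and deriving the CVE ID from the file name are illustrative choices for this example and may not match how `NVDOverrides` lays out or keys its data.

```python
import glob
import json
import os
import threading


class OverridesCache:
    """Hypothetical stand-in for the overrides loader; not the PR's class."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._data_by_cve = None  # CVE ID -> parsed JSON, filled on first use

    def _build_data_by_cve(self):
        data = {}
        # Assumed layout: CVE-*.json files anywhere under the root directory.
        pattern = os.path.join(self._path, "**", "CVE-*.json")
        for filepath in glob.glob(pattern, recursive=True):
            cve_id = os.path.splitext(os.path.basename(filepath))[0].upper()
            with open(filepath, encoding="utf-8") as f:
                data[cve_id] = json.load(f)
        return data

    def _ensure_loaded(self):
        # Double-checked locking: only the first caller pays for the parse.
        if self._data_by_cve is None:
            with self._lock:
                if self._data_by_cve is None:
                    self._data_by_cve = self._build_data_by_cve()

    def cve(self, cve_id):
        self._ensure_loaded()
        return self._data_by_cve.get(cve_id.upper())
```

Because the parsed dict is cached, repeated `cve()` calls for the same ID return the same object, so callers should treat the result as read-only.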


Test plan

  • test_overrides_enabled — cache populated; repeated calls return the same object (no re-parse)
  • test_get_uses_in_memory_index: _build_index() called exactly once regardless of call count
  • Full unit suite: uv run pytest tests/unit/ (779 passed)
  • Linting: uv run ruff check src/ (clean)
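
For readers unfamiliar with the "called exactly once" style of assertion, one way to express it is with a `unittest.mock` spy. The sketch below is illustrative only: `LazyStore` and `test_build_index_called_once` are made-up names and this is not the test added in the PR.

```python
import threading
from unittest import mock


class LazyStore:
    """Tiny hypothetical store with a lazily built in-memory index."""

    def __init__(self, rows):
        self._rows = rows
        self._lock = threading.Lock()
        self._index = None

    def _build_index(self):
        self._index = dict(self._rows)

    def get(self, key):
        if self._index is None:
            with self._lock:
                if self._index is None:
                    self._build_index()
        return self._index.get(key)


def test_build_index_called_once():
    key = ("CVE-2024-0001", "cpe:2.3:a:example:demo")
    store = LazyStore([(key, "2024-01-15")])
    # wraps= keeps the real behaviour while recording every call.
    with mock.patch.object(store, "_build_index", wraps=store._build_index) as spy:
        for _ in range(100):
            assert store.get(key) == "2024-01-15"
        assert spy.call_count == 1
```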

@jamestexas force-pushed the perf/nvd-in-memory-indexes branch 3 times, most recently from af49589 to 048e3ff on February 20, 2026 21:06
@jamestexas changed the title from "perf(nvd): replace per-call I/O with in-memory indexes for overrides and fix-date lookups" to "perf: replace per-call I/O with in-memory indexes for overrides and fix-date lookups" on Feb 21, 2026
@jamestexas marked this pull request as ready for review on February 21, 2026 20:29
…and fix-date lookups

Two related hot-loop bottlenecks in the NVD provider, both caused by
per-item I/O inside a tight loop over ~250k CVEs.

Bottleneck 1: NVDOverrides.cve() — per-CVE file reads

cve() maintained a filepath index (CVE ID → path) but opened, read, and
JSON-parsed the file on every call. A TODO comment already flagged the
problem.

Fix: _build_data_by_cve() globs and parses all CVE-*.json files once
into a dict on first access. All subsequent cve() calls are O(1) dict
lookups with zero I/O. The duplicated lazy-init guard is extracted into
_ensure_loaded().

Bottleneck 2: GrypeDBStore.get() — per-CPE SQLite queries

get() executed an individual SELECT for every (vuln_id, cpe_or_package)
pair. Each CVE can have 5–20 CPE matches:

  250,000 CVEs × 5–20 CPE matches ≈ 1.25M–5M SQLite queries per sync

At 0.1ms per query that is roughly 2–8 minutes of pure SQLite overhead
(~4 minutes at 10 CPE matches per CVE).

Fix: _build_index() bulk-loads the entire fixdates table once into two
in-memory dicts keyed by (vuln_id, cpe) and (vuln_id, package,
ecosystem). get() becomes two dict lookups. The SQLAlchemy connection
infrastructure is retained — still required by get_changed_vuln_ids_since().

No provider filter is applied in _build_index(): each Store downloads
from a provider-scoped OCI image
(ghcr.io/anchore/grype-db-observed-fix-date/{provider}), so the
database only ever contains rows for this provider.

Signed-off-by: James Gardner <james.gardner@chainguard.dev>
@jamestexas force-pushed the perf/nvd-in-memory-indexes branch from 048e3ff to de3bdae on February 25, 2026 19:07

@jamestexas (Contributor, Author) commented:

@willmurphyscode I saw you have been making some concurrency-oriented fixes in this repo, and wanted to bring this to your attention as it's presumably related. Let me know if the approach is incorrect or anything!
