[backport]: perf(purl): batch purl_status and related entity loading in recommend_purls#2318
[backport]: perf(purl): batch purl_status and related entity loading in recommend_purls#2318helio-frota wants to merge 23 commits intoguacsec:release/0.4.zfrom
Conversation
…mmend_purls Replace per-PURL sequential self.purls() calls with two bulk database queries: one for base_purl using exact column matches with OR conditions, and one for versioned_purl using IN clause. This eliminates N+1 ILIKE queries on JSONB fields and unnecessary COUNT(*) pagination queries. Version filtering logic (redhat pattern match, semver comparison) is preserved and applied in Rust over the bulk-fetched results. Implements TC-3886 Assisted-by: Claude Code (cherry picked from commit 01bcad8)
- Replace fragile String-with-delimiter composite key with a typed PurlKey struct (Eq + Hash) for type-safe HashMap lookups - Store PurlKey in InputPurl to avoid recomputing it at each phase - Fix fallback package_str to preserve Option<version> instead of unwrap_or_default which would emit an empty version segment - Remove unused Winner::base field and #[allow(dead_code)] - Trim noisy phase comments Implements TC-3886 Assisted-by: Claude Code (cherry picked from commit 1bb68ed)
Add tests for uncovered branches: - unknown PURL not in database (empty base_purls path) - PURL without namespace (null namespace condition) - invalid version string (unparseable semver, early return) - mixed input combining valid, unknown, and no-version PURLs Implements TC-3886 Assisted-by: Claude Code (cherry picked from commit 72ab045)
Split the 200-line recommend_purls into focused private methods following project conventions (vulnerability, sbom services): - PurlKey: module-level struct with from_purl, from_base_purl, as_condition constructors - InputPurl::try_from_purl: parse and validate input PURLs - fetch_base_purls: bulk base_purl query with deduplication - fetch_versioned_purls_by_base: bulk versioned_purl fetch grouped by base_purl_id - find_highest_redhat_patch: version filtering and comparison - build_recommend_entry: vulnerability detail assembly recommend_purls is now ~50 lines of orchestration. Implements TC-3886 Assisted-by: Claude Code (cherry picked from commit 3c40c37)
Store a Purl in InputPurl instead of separate purl_string and PurlKey fields, reducing the number of concepts to track when reading the code. PurlKey is now derived on-the-fly where needed for map lookups. Addresses review feedback from @Strum355. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit eee5c10)
…_purls Replace per-winner VersionedPurlDetails::from_entity calls with a single batched purl_status query using IN clause + version_matches() filter, and SeaORM LoaderTrait for related entities (vulnerability, advisory, status, remediation). This eliminates ~548 individual purl_status queries per 100-PURL request, reducing them to 1-3 batch queries. Cumulative with the bulk PURL lookup (TC-3886), this yields a 5.1x total improvement (1.195s to 0.232s for 100 PURLs) and reduces peak RSS from 106 MB to 71 MB. Implements TC-3887 Assisted-by: Claude Code (cherry picked from commit 4a34e18)
Split the large recommend_purls method into focused helper functions: - fetch_vulnerability_statuses: batched purl_status query with LoaderTrait - fetch_qualified_purl_map: batch qualified PURL lookup - assemble_recommend_entry: builds a single RecommendEntry from batched data - Move StatusInfo and Winner structs to module level Also fixes version_matches filter to only check winning versioned_purls and enhances test assertions per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 6df8381)
- get_recommendations_fallback_package_str: exercises the unwrap_or_else path when no qualified_purl exists (versioned_purl without qualifiers) - get_recommendations_fixed_status: covers the VexStatus::Fixed match arm by overriding status slug to "fixed" Implements TC-3887 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 9d11dff)
Replace length-only assertions with full value equality assertions in get_recommendations_unknown_purl, get_recommendations_no_namespace, get_recommendations_invalid_version, and get_recommendations_mixed tests for better debugging when tests fail. Implements TC-3970 Assisted-by: Claude Code (cherry picked from commit 815c6b7)
Add one-line documentation comments to PurlKey, InputPurl, StatusInfo, Winner structs and all helper methods in the batched recommend_purls pipeline for improved reviewability. Implements TC-3971 Assisted-by: Claude Code (cherry picked from commit eff1b2f)
…ommend_purls Replace explicit type annotations (Vec<T>, HashMap<K, V>) with inferred types (Vec<_>, HashMap<_, _>) throughout recommend_purls and its helper methods where the types are obvious from the iterator/collect context. Implements TC-3972 Assisted-by: Claude Code (cherry picked from commit 6491cf6)
Use HashMap::with_capacity(purls.len()) instead of HashMap::new() to avoid rehashing as the map grows, since output size equals input size. Implements TC-3973 Assisted-by: Claude Code (cherry picked from commit 8f74b78)
Implements TC-3973 Assisted-by: Claude Code (cherry picked from commit c721e59)
Address @ctron's review comments from PR guacsec#2310: - Remove unused `advisory_models` and `_organizations` variables and the `organization` import that were computed but never referenced - Consume `winners` Vec by value instead of borrowing, avoiding an unnecessary `.clone()` on `purl_string` - Use `.zip()` for parallel iteration instead of `enumerate()` + index access, matching the established codebase pattern and avoiding panics on size mismatches - Add `#[instrument]` tracing to all four DB helper methods (`fetch_base_purls`, `fetch_versioned_purls_by_base`, `fetch_vulnerability_statuses`, `fetch_qualified_purl_map`) - Chunk all IN-clause and OR-chain queries with a 10,000-item batch size to stay within Postgres's 65,535 bind parameter limit - When the same vulnerability appears from multiple advisories with different statuses, keep the status from the most recent advisory (by `modified` or `published` date) instead of an arbitrary first match, so that newer "fixed" assessments take precedence over older "affected" ones - Document the qualified PURL dedup rationale: only one qualified PURL is kept per versioned PURL because vulnerability statuses are tracked at the base PURL level - Extract a `recommend()` test helper to reduce boilerplate across all 10 recommendation endpoint tests Implements TC-3887 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 197d15d)
Merge get_recommendations_unknown_purl and get_recommendations_invalid_version into a single parameterized get_recommendations_no_match test using rstest, reducing duplication while preserving coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit a484018)
…ndations Always return the versioned PURL (without qualifiers) as the recommended package string. Qualifiers like arch, repository_url, and type are context-dependent — the system cannot know which match the caller's environment, so including an arbitrary one is misleading. Remove fetch_qualified_purl_map and the extra DB query it performed, simplifying the recommend flow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 54d3190)
Replace manual QUERY_BATCH_SIZE constant and .chunks() calls with the existing trustify_common::db::chunk::chunked_with utility, which dynamically calculates chunk sizes based on PostgreSQL's bind parameter limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit ee80f85)
…endation tests Apply test documentation standards (§5.9-5.10): every test gets a doc comment explaining what it verifies, and non-trivial tests get given-when-then inline comments to make their structure navigable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 2ca4c38)
Remove explicit type annotations where the Rust compiler can infer the type: Vec<Winner> on winners (push provides type), Vec<_> on SeaORM .all() result, HashMap key type on statuses_by_base, and vulnerabilities Vec<_> (inlined into RecommendEntry struct literal where the field type provides inference). Implements TC-4013 Assisted-by: Claude Code (cherry picked from commit 1b5a0d1)
…clones Replace .iter() with .into_iter() where source collections are no longer needed, eliminating .clone() calls on vuln IDs, remediations, and status slugs. Reorder base_purl_map construction to consume base_purls after fetch_versioned_purls_by_base. Simplify status_slug_map to build directly from status_models. Implements TC-4014 Assisted-by: Claude Code (cherry picked from commit cf3e654)
Wrap fetch_base_purls, fetch_versioned_purls_by_base, and fetch_vulnerability_statuses calls with info_span!() for profiling visibility into recommend_purls pipeline stages. Implements TC-4016 Assisted-by: Claude Code (cherry picked from commit 56667b6)
Change PurlKey fields from owned String to borrowed &str references, eliminating unnecessary string cloning for map lookups. The struct is now Copy since all fields are references. Implements TC-4017 --trailer="Assisted-by: Claude Code" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 2ec5368)
Reviewer's GuideRefactors recommend_purls to batch-load base/versioned PURLs and vulnerability status data for performance, adds supporting helper types and queries, and expands tests to cover deduplication, edge cases, and new response semantics (unqualified recommended PURLs and custom statuses). Sequence diagram for batched recommend_purls processingsequenceDiagram
actor Client
participant PurlService
participant DB as Database
Client->>PurlService: recommend_purls(purls, connection)
PurlService->>PurlService: build InputPurl list via InputPurl.try_from_purl
PurlService-->>Client: empty map when no InputPurl (alt)
PurlService->>DB: fetch_base_purls(input_purls)
DB-->>PurlService: base_purl models
PurlService-->>Client: map with empty recommendations when no base_purls (alt)
PurlService->>DB: fetch_versioned_purls_by_base(base_purls)
DB-->>PurlService: versioned_purls grouped by base_purl_id
PurlService->>PurlService: for each InputPurl
PurlService->>PurlService: PurlKey.from_purl / lookup base
PurlService->>PurlService: find_highest_redhat_patch(pattern, input_version, versioned_by_base[base])
PurlService->>PurlService: build Winner list
PurlService-->>Client: map with empty recommendations when no winners (alt)
PurlService->>DB: fetch_vulnerability_statuses(winner_base_ids, winner_vp_ids)
DB-->>PurlService: statuses_by_base (StatusInfo grouped by base_purl_id)
PurlService->>PurlService: for each Winner
PurlService->>PurlService: assemble_recommend_entry(winner, statuses_by_base)
PurlService->>PurlService: insert entry into recommendations map
PurlService-->>Client: recommendations map
ER diagram for PURL recommendation and vulnerability status entitieserDiagram
base_purl {
Uuid id
string type
string name
string namespace
}
versioned_purl {
Uuid id
Uuid base_purl_id
string version
}
purl_status {
Uuid id
Uuid base_purl_id
Uuid vulnerability_id
Uuid advisory_id
Uuid status_id
Uuid version_range_id
}
vulnerability {
Uuid id
}
advisory {
Uuid id
datetime modified
datetime published
}
status {
Uuid id
string slug
}
version_range {
Uuid id
string low_version
boolean low_inclusive
string high_version
boolean high_inclusive
}
remediation {
Uuid id
}
remediation_purl_status {
Uuid remediation_id
Uuid purl_status_id
}
base_purl ||--o{ versioned_purl : has
base_purl ||--o{ purl_status : has
purl_status }o--|| vulnerability : references
purl_status }o--|| advisory : references
purl_status }o--|| status : references
purl_status }o--|| version_range : constrains
purl_status ||--o{ remediation_purl_status : join
remediation ||--o{ remediation_purl_status : join
versioned_purl ||..o{ purl_status : matched_by_version_range
Class diagram for batched recommend_purls helpers and flowclassDiagram
class PurlService {
+recommend_purls(purls, connection) Result
+fetch_vulnerability_statuses(winner_base_ids, winner_vp_ids, connection) Result
+assemble_recommend_entry(winner, statuses_by_base) RecommendEntry
+fetch_base_purls(input_purls, connection) Result
+fetch_versioned_purls_by_base(base_purls, connection) Result
+find_highest_redhat_patch(pattern, input_version, versioned_purls) versioned_purl_Model
}
class PurlKey {
+&str ty
+Option~&str~ namespace
+&str name
+from_purl(purl) PurlKey
+from_base_purl(bp) PurlKey
+as_condition() Condition
}
class StatusInfo {
+String vuln_id
+String status_slug
+Vec~remediation_Model~ remediations
+Option~OffsetDateTime~ advisory_date
}
class Winner {
+String purl_string
+versioned_purl_Model versioned_purl
+base_purl_Model base
}
class InputPurl {
+Purl purl
+semver_Version input_version
+try_from_purl(purl) Option~InputPurl~
}
class RecommendEntry {
+String package
+Vec~VulnerabilityStatus~ vulnerabilities
}
class VulnerabilityStatus {
+String id
+Option~VexStatus~ status
+Option~String~ justification
+RemediationSummary remediations
}
class RemediationSummary {
+from_entities(remediations) RemediationSummary
}
class Purl {
+String ty
+Option~String~ namespace
+String name
+Option~String~ version
+Qualifiers qualifiers
+to_string() String
}
class base_purl_Model {
+Uuid id
+String type
+Option~String~ namespace
+String name
}
class versioned_purl_Model {
+Uuid id
+Uuid base_purl_id
+String version
}
class remediation_Model {
+Uuid id
}
class Condition
class Qualifiers
class VexStatus
class Uuid
class OffsetDateTime
class semver_Version
PurlService --> PurlKey : uses
PurlService --> StatusInfo : aggregates
PurlService --> Winner : aggregates
PurlService --> InputPurl : uses
PurlService --> RecommendEntry : returns
PurlService --> VulnerabilityStatus : builds
PurlService --> RemediationSummary : uses
PurlKey --> Purl : builds_from
PurlKey --> base_purl_Model : builds_from
Winner --> base_purl_Model : references
Winner --> versioned_purl_Model : references
StatusInfo --> remediation_Model : contains
InputPurl --> Purl : wraps
InputPurl --> semver_Version : parses
VulnerabilityStatus --> VexStatus : contains
VulnerabilityStatus --> RemediationSummary : contains
RecommendEntry --> VulnerabilityStatus : contains
PurlService --> base_purl_Model : queries
PurlService --> versioned_purl_Model : queries
PurlService --> remediation_Model : queries
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The hard-coded chunk sizes passed to
chunked_with(e.g.,1and3infetch_vulnerability_statuses,fetch_base_purls, andfetch_versioned_purls_by_base) are quite small and undocumented; consider extracting them into named constants with a brief comment describing how they relate to Postgres parameter limits so the batching behavior is easier to tune and understand. - In
fetch_vulnerability_statuses, the nested zipping ofvulns,advisories_loaded,all_statuses, andremediationsis fairly dense; consider restructuring this (e.g., with an explicit struct or destructuring in a singleforloop after azipchain) to make the relationships between these collections clearer and less error-prone to maintain. find_highest_redhat_patchcurrently takesOption<&Vec<versioned_purl::Model>>; switching this toOption<&[versioned_purl::Model]>(and storing slices in the map) would reduce coupling to the concrete container type and make the function slightly more flexible.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The hard-coded chunk sizes passed to `chunked_with` (e.g., `1` and `3` in `fetch_vulnerability_statuses`, `fetch_base_purls`, and `fetch_versioned_purls_by_base`) are quite small and undocumented; consider extracting them into named constants with a brief comment describing how they relate to Postgres parameter limits so the batching behavior is easier to tune and understand.
- In `fetch_vulnerability_statuses`, the nested zipping of `vulns`, `advisories_loaded`, `all_statuses`, and `remediations` is fairly dense; consider restructuring this (e.g., with an explicit struct or destructuring in a single `for` loop after a `zip` chain) to make the relationships between these collections clearer and less error-prone to maintain.
- `find_highest_redhat_patch` currently takes `Option<&Vec<versioned_purl::Model>>`; switching this to `Option<&[versioned_purl::Model]>` (and storing slices in the map) would reduce coupling to the concrete container type and make the function slightly more flexible.
## Individual Comments
### Comment 1
<location path="modules/fundamental/src/purl/service/mod.rs" line_range="586-587" />
<code_context>
+
+ // Assemble recommendations from batched data
+ for winner in winners {
+ let entry = Self::assemble_recommend_entry(&winner, &statuses_by_base);
+ recommendations.insert(winner.purl_string, vec![entry]);
+ }
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Status selection is keyed only by base PURL, so different winning versions for the same base will share the same statuses.
`assemble_recommend_entry` looks up statuses only by `winner.base.id` via `statuses_by_base.get(&winner.base.id)`, without considering the specific `versioned_purl`. If multiple input PURLs share the same base but differ in version, they’ll all receive the same vulnerability statuses, ignoring version-specific status ranges. Consider keying by `(base_purl_id, versioned_purl_id)` or otherwise filtering by the winning version so version-range semantics are preserved per winner.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| let entry = Self::assemble_recommend_entry(&winner, &statuses_by_base); | ||
| recommendations.insert(winner.purl_string, vec![entry]); |
There was a problem hiding this comment.
issue (bug_risk): Status selection is keyed only by base PURL, so different winning versions for the same base will share the same statuses.
assemble_recommend_entry looks up statuses only by winner.base.id via statuses_by_base.get(&winner.base.id), without considering the specific versioned_purl. If multiple input PURLs share the same base but differ in version, they’ll all receive the same vulnerability statuses, ignoring version-specific status ranges. Consider keying by (base_purl_id, versioned_purl_id) or otherwise filtering by the winning version so version-range semantics are preserved per winner.
- Remove redundant .instrument() wrappers on self-calls that already have #[instrument] attributes - Use skip_all and err(level=INFO) per log_tracing.md conventions - Accept impl IntoIterator instead of &[Uuid] to avoid intermediate allocations - Add instrumentation spans to SeaORM load_one/load_many_to_many calls - Use HashMap<_, _> type inference for base_purl_map Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 8533cde)
2649c43 to
eb88de8
Compare
Summary by Sourcery
Optimize and harden PURL recommendation generation by batching database lookups, simplifying the recommended package representation, and expanding test coverage for edge cases and status handling.
Bug Fixes:
Enhancements:
Tests: