fix(migration): add missing FK indexes for query performance#2276
fix(migration): add missing FK indexes for query performance#2276imurphy-rh wants to merge 1 commit intoguacsec:mainfrom
Conversation
Several foreign key columns lack indexes, causing sequential scans on cold tables. The most impactful gap is product_version.sbom_id, which is joined on every SBOM graph load in the analysis service (the get_nodes() query in modules/analysis/src/service/load/mod.rs:188). Indexes added: - product_version(sbom_id) — eliminates seq scan on every graph load - product_version(product_id) — completes the product JOIN chain - package_relates_to_package(sbom_id, relationship) — the existing PK (sbom_id, left_node_id, relationship, right_node_id) defeats queries filtering on sbom_id + relationship because left_node_id sits between them in the composite key - purl_status(version_range_id) — used in vulnerability analysis JOINs - cpe(vendor, product, version) — used in generalized CPE tuple lookup within product_advisory_info_sql() - advisory(issuer_id) — used in advisory listing LEFT JOIN to organization Follows the same pattern as m0001200_source_document_fk_indexes and m0001110_sbom_node_checksum_indexes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reviewer's GuideAdds a new migration m0002100 to create six performance-focused indexes on frequently joined foreign key and composite key columns, and wires the migration into the global migrator so it runs in sequence with prior migrations. ER diagram for new FK index coverage on core tableserDiagram
Sbom {
bigint id
}
Product {
bigint id
}
ProductVersion {
bigint sbom_id
bigint product_id
}
PackageRelatesToPackage {
bigint sbom_id
int relationship
}
VersionRange {
bigint id
}
PurlStatus {
bigint version_range_id
}
Cpe {
text vendor
text product
text version
}
Organization {
bigint id
}
Advisory {
bigint issuer_id
}
ProductVersion }o--|| Sbom : sbom_id_fk
ProductVersion }o--|| Product : product_id_fk
PackageRelatesToPackage }o--|| Sbom : sbom_id_fk
PurlStatus }o--|| VersionRange : version_range_id_fk
Advisory }o--|| Organization : issuer_id_fk
Class diagram for migration m0002100_perf_fk_indexesclassDiagram
class Migration {
+up(manager SchemaManager) Result
+down(manager SchemaManager) Result
}
class SchemaManager
class DbErr
class Index
class Indexes {
<<enumeration>>
ProductVersionSbomIdIdx
ProductVersionProductIdIdx
PackageRelatesToPackageSbomRelIdx
PurlStatusVersionRangeIdIdx
CpeVendorProductVersionIdx
AdvisoryIssuerIdIdx
}
class ProductVersion {
<<enumeration>>
Table
SbomId
ProductId
}
class PackageRelatesToPackage {
<<enumeration>>
Table
SbomId
Relationship
}
class PurlStatus {
<<enumeration>>
Table
VersionRangeId
}
class Cpe {
<<enumeration>>
Table
Vendor
Product
Version
}
class Advisory {
<<enumeration>>
Table
IssuerId
}
Migration ..> SchemaManager : uses
Migration ..> DbErr : returns
Migration ..> Index : creates_drops
Migration ..> Indexes : names_indexes
Migration ..> ProductVersion : refs_columns
Migration ..> PackageRelatesToPackage : refs_columns
Migration ..> PurlStatus : refs_columns
Migration ..> Cpe : refs_columns
Migration ..> Advisory : refs_columns
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Using
Indexes::...Idx.to_string()will produce CamelCase index names based on the enum variant; if your schema conventions expect snake_case names, consider implementingDisplayforIndexesor usingwith_namehelpers to control the actual index name strings. - For the larger tables (e.g.
product_version,package_relates_to_package), consider whether you needCREATE INDEX CONCURRENTLYsemantics to avoid long exclusive locks during migration, and if so whether the migration framework supports that pattern.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Using `Indexes::...Idx.to_string()` will produce CamelCase index names based on the enum variant; if your schema conventions expect snake_case names, consider implementing `Display` for `Indexes` or using `with_name` helpers to control the actual index name strings.
- For the larger tables (e.g. `product_version`, `package_relates_to_package`), consider whether you need `CREATE INDEX CONCURRENTLY` semantics to avoid long exclusive locks during migration, and if so whether the migration framework supports that pattern.
## Individual Comments
### Comment 1
<location path="migration/src/m0002100_perf_fk_indexes.rs" line_range="15-23" />
<code_context>
+ // Without this index, every SBOM graph load triggers a sequential scan
+ // of the entire product_version table.
+ manager
+ .create_index(
+ Index::create()
+ .if_not_exists()
+ .table(ProductVersion::Table)
+ .name(Indexes::ProductVersionSbomIdIdx.to_string())
+ .col(ProductVersion::SbomId)
+ .to_owned(),
+ )
+ .await?;
+
+ // product_version.product_id — used in analysis graph loading:
</code_context>
<issue_to_address>
**issue (performance):** Consider the impact of non-concurrent index creation on large, hot tables.
Plain `CREATE INDEX` will take an exclusive lock on the table for the duration of the build, which can be disruptive on large, hot tables like `product_version`, `package_relates_to_package`, or `cpe`. If this runs on a live system, consider using `CREATE INDEX CONCURRENTLY` (via raw SQL or a special migration) or scheduling the migration during a maintenance window to avoid impacting production traffic.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| .create_index( | ||
| Index::create() | ||
| .if_not_exists() | ||
| .table(ProductVersion::Table) | ||
| .name(Indexes::ProductVersionSbomIdIdx.to_string()) | ||
| .col(ProductVersion::SbomId) | ||
| .to_owned(), | ||
| ) | ||
| .await?; |
There was a problem hiding this comment.
issue (performance): Consider the impact of non-concurrent index creation on large, hot tables.
Plain CREATE INDEX will take an exclusive lock on the table for the duration of the build, which can be disruptive on large, hot tables like product_version, package_relates_to_package, or cpe. If this runs on a live system, consider using CREATE INDEX CONCURRENTLY (via raw SQL or a special migration) or scheduling the migration during a maintenance window to avoid impacting production traffic.
|
Re: the Sourcery suggestion about Good point in general, but it doesn't apply here for a few reasons:
If the project wants to adopt concurrent index creation as a pattern, that would be a broader conversation about migration infrastructure, not specific to this PR. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2276 +/- ##
==========================================
+ Coverage 68.03% 68.05% +0.01%
==========================================
Files 423 424 +1
Lines 24828 24833 +5
Branches 24828 24833 +5
==========================================
+ Hits 16892 16900 +8
+ Misses 7017 7007 -10
- Partials 919 926 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
cool - though as always with data at scale and indexes there are caveats: query plans (which choose indexes) are not static across all scales of data (or in isolation to each other)- more info needed eg. Were u able to test these indexes with representative data load and activity ? An improvement at one scale of data might change dramatically at another scale (as query plan decides to do something else). And its also runtime eg. query planner chooses based on available resources ... and might come up with a different choice if 1000 concurrent users are actively spawning queries (versus 1) ... most db query planners are more 'magical black boxes' then llms ;) There are other subtleties - for example - many of trustify queries take advantage of pg parallel workers ... if any of these new indexes use them it may tie up that resource robbing other critical queries of them (pro/cons balance needs to be made). A good way to prove any new index actually improves things is to setup env with right scale data load and runtime activity (ingesting data, etc) and most importantly provide EXPLAIN ANALYZE on any query this improves - to view query plan selected which shows index being used (where it was not before) and clear performance improvement. Another useful thing we do is look at long running env and check index efficiency - with something like: AI analysis is useful for identifying opportunities but often does not have the data or runtime load context to make useful decision... more like a good 'scout' for possible optimisation opportunities. Will let the team chime in on their thoughts. |
Summary
GET /api/v2/analysis/component/{key})Problem
Several FK columns lack indexes. When tables are cold (not in PostgreSQL's buffer cache), queries that JOIN on these columns trigger full sequential scans. The analysis service's
get_nodes()query (modules/analysis/src/service/load/mod.rs:138) is the worst offender — it runs for every SBOM graph load and includes:Indexes Added
product_version_sbom_id_idxproduct_version(sbom_id)get_nodes()callproduct_version_product_id_idxproduct_version(product_id)package_relates_to_package_sbom_rel_idx(sbom_id, relationship)product_advisory_info_sql()purl_status_version_range_id_idxpurl_status(version_range_id)cpe_vendor_product_version_idxcpe(vendor, product, version)advisory_issuer_id_idxadvisory(issuer_id)Why
package_relates_to_packageneeds a separate indexThe PK is
(sbom_id, left_node_id, relationship, right_node_id). Queries filterWHERE sbom_id = $1 AND relationship = 13, butleft_node_idsits betweensbom_idandrelationshipin the composite key, so PostgreSQL can only use the leadingsbom_idcolumn and must scan allleft_node_idvalues to find matching relationships.Test plan
cargo run -p trustify-migration -- up)cargo run -p trustify-migration -- down)EXPLAIN ANALYZEon theget_nodes()query shows Index Scan onproduct_versioninstead of Seq Scancargo test --all-features)🤖 Generated with Claude Code
Summary by Sourcery
Enhancements: