fix(migration): add missing FK indexes for query performance#2276

Open
imurphy-rh wants to merge 1 commit into guacsec:main from imurphy-rh:fix/perf-fk-indexes

Conversation

@imurphy-rh

@imurphy-rh imurphy-rh commented Mar 6, 2026

Summary

  • Adds 6 missing indexes on foreign key columns that cause sequential scans on cold tables
  • Most impactful for the analysis service's SBOM graph loading path (GET /api/v2/analysis/component/{key})
  • Follows the same pattern as m0001200 and m0001110

Problem

Several FK columns lack indexes. When tables are cold (not in PostgreSQL's buffer cache), queries that JOIN on these columns trigger full sequential scans. The analysis service's get_nodes() query (modules/analysis/src/service/load/mod.rs:138) is the worst offender — it runs for every SBOM graph load and includes:

LEFT JOIN product_version ON sbom.sbom_id = product_version.sbom_id  -- no index
LEFT JOIN product ON product_version.product_id = product.id          -- no index
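The effect is visible in the plan for this join (a schematic sketch; `$1` stands in for a bound SBOM id, and on a cold cache the unindexed side falls back to a Seq Scan):

```sql
-- Without an index on product_version.sbom_id, the planner has no way to
-- probe product_version by sbom_id and sequentially scans the whole table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT product_version.product_id
FROM sbom
LEFT JOIN product_version ON sbom.sbom_id = product_version.sbom_id
WHERE sbom.sbom_id = $1;
```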

Indexes Added

Index — Column(s) — Query Path:

  • product_version_sbom_id_idx — product_version(sbom_id) — analysis graph load, every get_nodes() call
  • product_version_product_id_idx — product_version(product_id) — analysis graph load, product JOIN
  • package_relates_to_package_sbom_rel_idx — package_relates_to_package(sbom_id, relationship) — CPE context filter in product_advisory_info_sql()
  • purl_status_version_range_id_idx — purl_status(version_range_id) — vulnerability analysis JOINs
  • cpe_vendor_product_version_idx — cpe(vendor, product, version) — generalized CPE tuple lookup
  • advisory_issuer_id_idx — advisory(issuer_id) — advisory listing with organization JOIN
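In plain DDL, the six indexes correspond to roughly the following (a sketch; the actual migration builds them through SeaORM's `Index::create()`, with names and columns taken from the list above):

```sql
CREATE INDEX IF NOT EXISTS product_version_sbom_id_idx
    ON product_version (sbom_id);
CREATE INDEX IF NOT EXISTS product_version_product_id_idx
    ON product_version (product_id);
CREATE INDEX IF NOT EXISTS package_relates_to_package_sbom_rel_idx
    ON package_relates_to_package (sbom_id, relationship);
CREATE INDEX IF NOT EXISTS purl_status_version_range_id_idx
    ON purl_status (version_range_id);
CREATE INDEX IF NOT EXISTS cpe_vendor_product_version_idx
    ON cpe (vendor, product, version);
CREATE INDEX IF NOT EXISTS advisory_issuer_id_idx
    ON advisory (issuer_id);
```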

Why package_relates_to_package needs a separate index

The PK is (sbom_id, left_node_id, relationship, right_node_id). Queries filter WHERE sbom_id = $1 AND relationship = 13, but left_node_id sits between sbom_id and relationship in the composite key, so PostgreSQL can only use the leading sbom_id column and must scan all left_node_id values to find matching relationships.
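A sketch of the mismatch (schematic; `$1` stands in for a bound SBOM id, and 13 is the relationship value from the query above):

```sql
-- PK column order: (sbom_id, left_node_id, relationship, right_node_id).
-- The filter below skips left_node_id, so the PK index can only seek on
-- sbom_id and must walk every left_node_id entry within that SBOM:
EXPLAIN (ANALYZE, BUFFERS)
SELECT right_node_id
FROM package_relates_to_package
WHERE sbom_id = $1 AND relationship = 13;

-- The new composite index matches the filter's columns exactly:
CREATE INDEX IF NOT EXISTS package_relates_to_package_sbom_rel_idx
    ON package_relates_to_package (sbom_id, relationship);
```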

Test plan

  • Migration applies cleanly (cargo run -p trustify-migration -- up)
  • Migration rolls back cleanly (cargo run -p trustify-migration -- down)
  • EXPLAIN ANALYZE on the get_nodes() query shows Index Scan on product_version instead of Seq Scan
  • Existing tests pass (cargo test --all-features)

🤖 Generated with Claude Code

Summary by Sourcery

Enhancements:

  • Introduce migration m0002100 to add targeted indexes on frequently joined foreign key columns and CPE tuple fields to reduce sequential scans and speed up analysis-related queries.

Several foreign key columns lack indexes, causing sequential scans on
cold tables. The most impactful gap is product_version.sbom_id, which
is joined on every SBOM graph load in the analysis service (the
get_nodes() query in modules/analysis/src/service/load/mod.rs:188).

Indexes added:
- product_version(sbom_id) — eliminates seq scan on every graph load
- product_version(product_id) — completes the product JOIN chain
- package_relates_to_package(sbom_id, relationship) — the existing PK
  (sbom_id, left_node_id, relationship, right_node_id) defeats queries
  filtering on sbom_id + relationship because left_node_id sits between
  them in the composite key
- purl_status(version_range_id) — used in vulnerability analysis JOINs
- cpe(vendor, product, version) — used in generalized CPE tuple lookup
  within product_advisory_info_sql()
- advisory(issuer_id) — used in advisory listing LEFT JOIN to
  organization

Follows the same pattern as m0001200_source_document_fk_indexes and
m0001110_sbom_node_checksum_indexes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sourcery-ai bot commented Mar 6, 2026

Reviewer's Guide

Adds a new migration m0002100 to create six performance-focused indexes on frequently joined foreign key and composite key columns, and wires the migration into the global migrator so it runs in sequence with prior migrations.

ER diagram for new FK index coverage on core tables

erDiagram
    Sbom {
        bigint id
    }

    Product {
        bigint id
    }

    ProductVersion {
        bigint sbom_id
        bigint product_id
    }

    PackageRelatesToPackage {
        bigint sbom_id
        int relationship
    }

    VersionRange {
        bigint id
    }

    PurlStatus {
        bigint version_range_id
    }

    Cpe {
        text vendor
        text product
        text version
    }

    Organization {
        bigint id
    }

    Advisory {
        bigint issuer_id
    }

    ProductVersion }o--|| Sbom : sbom_id_fk
    ProductVersion }o--|| Product : product_id_fk

    PackageRelatesToPackage }o--|| Sbom : sbom_id_fk

    PurlStatus }o--|| VersionRange : version_range_id_fk

    Advisory }o--|| Organization : issuer_id_fk

Class diagram for migration m0002100_perf_fk_indexes

classDiagram
    class Migration {
        +up(manager SchemaManager) Result
        +down(manager SchemaManager) Result
    }

    class SchemaManager
    class DbErr
    class Index

    class Indexes {
        <<enumeration>>
        ProductVersionSbomIdIdx
        ProductVersionProductIdIdx
        PackageRelatesToPackageSbomRelIdx
        PurlStatusVersionRangeIdIdx
        CpeVendorProductVersionIdx
        AdvisoryIssuerIdIdx
    }

    class ProductVersion {
        <<enumeration>>
        Table
        SbomId
        ProductId
    }

    class PackageRelatesToPackage {
        <<enumeration>>
        Table
        SbomId
        Relationship
    }

    class PurlStatus {
        <<enumeration>>
        Table
        VersionRangeId
    }

    class Cpe {
        <<enumeration>>
        Table
        Vendor
        Product
        Version
    }

    class Advisory {
        <<enumeration>>
        Table
        IssuerId
    }

    Migration ..> SchemaManager : uses
    Migration ..> DbErr : returns
    Migration ..> Index : creates_drops
    Migration ..> Indexes : names_indexes
    Migration ..> ProductVersion : refs_columns
    Migration ..> PackageRelatesToPackage : refs_columns
    Migration ..> PurlStatus : refs_columns
    Migration ..> Cpe : refs_columns
    Migration ..> Advisory : refs_columns

File-Level Changes

Change Details Files
Add new migration m0002100_perf_fk_indexes to create performance indexes on key relational tables
  • Introduce Migration struct implementing MigrationTrait with up/down methods for index creation and cleanup
  • Create single-column indexes on product_version.sbom_id and product_version.product_id to support analysis graph joins
  • Create composite index on package_relates_to_package (sbom_id, relationship) to match common WHERE filters that the PK cannot satisfy efficiently
  • Create single-column index on purl_status.version_range_id to speed up joins with version_range
  • Create composite index on cpe (vendor, product, version) for generalized CPE tuple lookups
  • Create single-column index on advisory.issuer_id for joins with organization
  • Add local DeriveIden enums for each affected table and an Indexes enum holding the index identifiers
migration/src/m0002100_perf_fk_indexes.rs
Register the new migration in the migrator pipeline so it is executed with other normal migrations
  • Declare m0002100_perf_fk_indexes module in the migration library
  • Append m0002100_perf_fk_indexes::Migration to the MigratorExt::build_migrations() chain as a normal migration
migration/src/lib.rs


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • Using Indexes::...Idx.to_string() will produce CamelCase index names based on the enum variant; if your schema conventions expect snake_case names, consider implementing Display for Indexes or using with_name helpers to control the actual index name strings.
  • For the larger tables (e.g. product_version, package_relates_to_package), consider whether you need CREATE INDEX CONCURRENTLY semantics to avoid long exclusive locks during migration, and if so whether the migration framework supports that pattern.

Comment on migration/src/m0002100_perf_fk_indexes.rs, lines +15 to +23
    .create_index(
        Index::create()
            .if_not_exists()
            .table(ProductVersion::Table)
            .name(Indexes::ProductVersionSbomIdIdx.to_string())
            .col(ProductVersion::SbomId)
            .to_owned(),
    )
    .await?;


issue (performance): Consider the impact of non-concurrent index creation on large, hot tables.

Plain CREATE INDEX will take an exclusive lock on the table for the duration of the build, which can be disruptive on large, hot tables like product_version, package_relates_to_package, or cpe. If this runs on a live system, consider using CREATE INDEX CONCURRENTLY (via raw SQL or a special migration) or scheduling the migration during a maintenance window to avoid impacting production traffic.

@imurphy-rh
Author

Re: the Sourcery suggestion about CREATE INDEX CONCURRENTLY

Good point in general, but it doesn't apply here for a few reasons:

  1. No existing migration in the project uses CONCURRENTLY — all 53 migrations use standard Index::create() or raw SQL. This PR is consistent with the established pattern.
  2. SeaORM migrations run inside transactions, and CREATE INDEX CONCURRENTLY cannot run inside a transaction block (per the PostgreSQL docs). We'd need to override is_transactional() to return false, which no migration in this project does.
  3. Trustify migrations run at startup (PM mode) or as an explicit trustify-migration up step — not while the service is actively handling traffic.

If the project wants to adopt concurrent index creation as a pattern, that would be a broader conversation about migration infrastructure, not specific to this PR.
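For reference, the concurrent variant would look like this (a sketch; it must run outside a transaction block, which is why it would require a non-transactional migration):

```sql
-- Builds the index without blocking concurrent writes, at the cost of a
-- slower build and a required retry/cleanup path if the build fails.
CREATE INDEX CONCURRENTLY IF NOT EXISTS product_version_sbom_id_idx
    ON product_version (sbom_id);
```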

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.05%. Comparing base (50b9e82) to head (6678c5a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2276      +/-   ##
==========================================
+ Coverage   68.03%   68.05%   +0.01%     
==========================================
  Files         423      424       +1     
  Lines       24828    24833       +5     
  Branches    24828    24833       +5     
==========================================
+ Hits        16892    16900       +8     
+ Misses       7017     7007      -10     
- Partials      919      926       +7     


@JimFuller-RedHat
Contributor

JimFuller-RedHat commented Mar 7, 2026

cool - though as always with data at scale and indexes there are caveats:

query plans (which choose indexes) are not static across all scales of data, or in isolation from each other - more info is needed, e.g.

Were you able to test these indexes with a representative data load and activity? An improvement at one scale of data might change dramatically at another (as the query plan decides to do something else). It's also runtime-dependent, e.g. the query planner chooses based on available resources ... and might come up with a different choice if 1000 concurrent users are actively spawning queries (versus 1) ... most db query planners are more 'magical black boxes' than llms ;)

There are other subtleties - for example - many of trustify's queries take advantage of pg parallel workers ... if any of these new indexes use them it may tie up that resource, robbing other critical queries of it (a pro/con balance needs to be made).

A good way to prove any new index actually improves things is to set up an env with the right scale of data load and runtime activity (ingesting data, etc) and, most importantly, provide EXPLAIN ANALYZE on any query this improves - to view the selected query plan, which shows the index being used (where it was not before) and a clear performance improvement.

Another useful thing we do is look at long running env and check index efficiency - with something like:

SELECT
    schemaname,
    relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS table_size,
    seq_scan AS full_scans,
    idx_scan AS index_scans,
    round(100.0 * idx_scan / (seq_scan + idx_scan + 1), 1) AS efficiency_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
LIMIT 10;
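A complementary check once the indexes have been live under real load for a while (a sketch using the standard pg_stat_user_indexes view) is per-index usage:

```sql
-- New indexes that still show idx_scan = 0 after a representative
-- workload are dead weight: they cost write amplification and disk
-- without ever being chosen by the planner.
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname IN ('product_version', 'package_relates_to_package',
                  'purl_status', 'cpe', 'advisory')
ORDER BY idx_scan ASC;
```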

AI analysis is useful for identifying opportunities but often does not have the data or runtime load context to make a useful decision ... more like a good 'scout' for possible optimisation opportunities.

Will let the team chime in on their thoughts.
