Skip to content

Fix double-counting in parent category crate counts#13051

Open
Aakash-Naidu wants to merge 1 commit intorust-lang:mainfrom
Aakash-Naidu:fix/issue-13032-category-counts
Open

Fix double-counting in parent category crate counts#13051
Aakash-Naidu wants to merge 1 commit intorust-lang:mainfrom
Aakash-Naidu:fix/issue-13032-category-counts

Conversation

@Aakash-Naidu
Copy link
Contributor

This PR resolves issue #13032, where parent categories incorrectly inflated crate counts when a crate belonged to multiple subcategories under the same parent.

The root cause was the use of SUM(c2.crates_cnt) on cached subcategory counters, which caused duplicate counting when the same crate appeared in multiple subcategories.


Implementation Details

The aggregation logic has been updated to count unique crate IDs directly from the crates_categories relationship table.

SQL Changes

The following queries were updated to use:

COUNT(DISTINCT cc.crate_id)

with a join to crates_categories:

toplevel.sql — fixes counts on the main Categories landing page.

subcategories.sql — ensures accurate counts for child categories.

parent_categories.sql — fixes counts in parent breadcrumb navigation.

This ensures each crate contributes at most once to a parent category count.


Performance Considerations

Since the updated logic introduces a join with crates_categories, an index was added to maintain query performance:

Migration: 2026-02-24-171800_add_category_id_index_to_crates_categories

Adds a B-Tree index on crates_categories(category_id)

This allows PostgreSQL to use an index scan instead of a sequential scan when resolving category joins.


Testing & Verification
Reproduction Test

Added category_toplevel_unique_count_reproduction, which:

Creates a crate assigned to multiple subcategories

Verifies the parent category counts it only once

This test fails under the previous implementation and passes with the updated logic.


Regression Updates

Existing tests in models/category.rs were updated to use real crate–category relationships instead of relying on manually set crates_cnt values.

Status

All database tests in crates_io_database pass successfully.

@Aakash-Naidu Aakash-Naidu force-pushed the fix/issue-13032-category-counts branch from 5e669e7 to c766f10 Compare February 25, 2026 18:20
@Turbo87
Copy link
Member

Turbo87 commented Feb 27, 2026

Since the updated logic introduces a join with crates_categories, an index was added to maintain query performance

did you measure the performance impact of this PR on all of the affected queries with a production dataset? I'd be somewhat surprised if this didn't impact performance at all 😅

@eth3lbert any thoughts on this PR?

@Turbo87 Turbo87 requested a review from eth3lbert February 27, 2026 09:49
@eth3lbert
Copy link
Contributor

eth3lbert commented Feb 28, 2026

Since the updated logic introduces a join with crates_categories, an index was added to maintain query performance

did you measure the performance impact of this PR on all of the affected queries with a production dataset? I'd be somewhat surprised if this didn't impact performance at all 😅

@eth3lbert any thoughts on this PR?

I've tested it locally and it seems to have only a small impact! The following shows the most impactful EXPLAIN ANALYZE result.

old
Sort  (cost=84.36..84.37 rows=5 width=144) (actual time=0.031..0.032 rows=1 loops=1)
  Sort Key: c.path
  Sort Method: quicksort  Memory: 25kB
  InitPlan 2
    ->  Index Scan using categories_slug_key on categories  (cost=0.15..8.17 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
          Index Cond: ((slug)::text = 'science::computational-chemistry'::text)
  ->  Bitmap Heap Scan on categories c  (cost=4.18..76.14 rows=5 width=144) (actual time=0.029..0.029 rows=1 loops=1)
        Recheck Cond: (path @> (InitPlan 2).col1)
        Filter: ((slug)::text <> 'science::computational-chemistry'::text)
        Rows Removed by Filter: 1
        Heap Blocks: exact=2
        ->  Bitmap Index Scan on path_gist_categories_idx  (cost=0.00..4.18 rows=5 width=0) (actual time=0.014..0.014 rows=2 loops=1)
              Index Cond: (path @> (InitPlan 2).col1)
        SubPlan 1
          ->  Aggregate  (cost=12.68..12.69 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)
                ->  Bitmap Heap Scan on categories c2  (cost=4.18..12.67 rows=5 width=4) (actual time=0.007..0.008 rows=16 loops=1)
                      Recheck Cond: (path <@ subltree(c.path, 0, 2))
                      Heap Blocks: exact=3
                      ->  Bitmap Index Scan on path_gist_categories_idx  (cost=0.00..4.18 rows=5 width=0) (actual time=0.005..0.005 rows=16 loops=1)
                            Index Cond: (path <@ subltree(c.path, 0, 2))
Planning Time: 0.054 ms
Execution Time: 0.054 ms
new
Sort  (cost=12589.40..12589.41 rows=5 width=144) (actual time=3.022..3.023 rows=1 loops=1)
  Sort Key: c.path
  Sort Method: quicksort  Memory: 25kB
  InitPlan 2
    ->  Index Scan using categories_slug_key on categories  (cost=0.15..8.17 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=1)
          Index Cond: ((slug)::text = 'science::computational-chemistry'::text)
  ->  Bitmap Heap Scan on categories c  (cost=4.18..12581.18 rows=5 width=144) (actual time=3.021..3.021 rows=1 loops=1)
        Recheck Cond: (path @> (InitPlan 2).col1)
        Filter: ((slug)::text <> 'science::computational-chemistry'::text)
        Rows Removed by Filter: 1
        Heap Blocks: exact=2
        ->  Bitmap Index Scan on path_gist_categories_idx  (cost=0.00..4.18 rows=5 width=0) (actual time=0.011..0.011 rows=2 loops=1)
              Index Cond: (path @> (InitPlan 2).col1)
        SubPlan 1
          ->  Aggregate  (cost=2513.69..2513.70 rows=1 width=4) (actual time=3.004..3.004 rows=1 loops=1)
                ->  Sort  (cost=2503.69..2508.69 rows=2000 width=4) (actual time=2.692..2.827 rows=6593 loops=1)
                      Sort Key: cc.crate_id
                      Sort Method: quicksort  Memory: 193kB
                      ->  Nested Loop Left Join  (cost=15.76..2394.03 rows=2000 width=4) (actual time=0.010..2.136 rows=6593 loops=1)
                            ->  Bitmap Heap Scan on categories c2  (cost=4.18..12.67 rows=5 width=4) (actual time=0.007..0.008 rows=16 loops=1)
                                  Recheck Cond: (path <@ subltree(c.path, 0, 2))
                                  Heap Blocks: exact=3
                                  ->  Bitmap Index Scan on path_gist_categories_idx  (cost=0.00..4.18 rows=5 width=0) (actual time=0.005..0.005 rows=16 loops=1)
                                        Index Cond: (path <@ subltree(c.path, 0, 2))
                            ->  Bitmap Heap Scan on crates_categories cc  (cost=11.58..466.87 rows=940 width=8) (actual time=0.017..0.115 rows=412 loops=16)
                                  Recheck Cond: (category_id = c2.id)
                                  Heap Blocks: exact=1537
                                  ->  Bitmap Index Scan on index_crates_categories_category_id  (cost=0.00..11.34 rows=940 width=0) (actual time=0.011..0.011 rows=412 loops=16)
                                        Index Cond: (category_id = c2.id)
Planning Time: 0.106 ms
Execution Time: 3.046 ms

If this impact is also acceptable on the production server, maybe we could consider calculating this within the trigger function to update crates_cnt. However, a downside is that it could add a tiny bit of overhead to the publishing process.

@Turbo87
Copy link
Member

Turbo87 commented Feb 28, 2026

it seems to have only a small impact

0.054 ms vs 3.046 ms seems like a pretty big impact to me 😅

it might be small in absolute terms, but I assume that on the production server this could end up having a bigger impact 🤔

@eth3lbert
Copy link
Contributor

it seems to have only a small impact

0.054 ms vs 3.046 ms seems like a pretty big impact to me 😅

it might be small in absolute terms, but I assume that on the production server this could end up having a bigger impact 🤔

Fair enough! That’s why I brought up whether the impact on the production server is acceptable. If so, we could change it to a trigger so it only runs once during the publish stage.

@Aakash-Naidu
Copy link
Contributor Author

Since the updated logic introduces a join with crates_categories, an index was added to maintain query performance

did you measure the performance impact of this PR on all of the affected queries with a production dataset? I'd be somewhat surprised if this didn't impact performance at all 😅

@eth3lbert any thoughts on this PR?

I don’t have access to a production-sized dataset to benchmark this directly. I validated the logic locally and added a B-Tree index on crates_categories(category_id) to ensure the join can use an index scan rather than a sequential scan.

I’m happy to adjust the approach if there’s a preferred alternative for minimizing runtime impact.

@Turbo87
Copy link
Member

Turbo87 commented Mar 2, 2026

I don’t have access to a production-sized dataset to benchmark this directly

see https://crates.io/data-access#database-dumps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants