Fix double-counting in parent category crate counts#13051
Fix double-counting in parent category crate counts#13051Aakash-Naidu wants to merge 1 commit intorust-lang:mainfrom
Conversation
5e669e7 to
c766f10
Compare
did you measure the performance impact of this PR on all of the affected queries with a production dataset? I'd be somewhat surprised if this didn't impact performance at all 😅 @eth3lbert any thoughts on this PR? |
I've tested it locally and it seems to have only a small impact! The following shows the most impactful EXPLAIN ANALYZE result. oldSort (cost=84.36..84.37 rows=5 width=144) (actual time=0.031..0.032 rows=1 loops=1)
Sort Key: c.path
Sort Method: quicksort Memory: 25kB
InitPlan 2
-> Index Scan using categories_slug_key on categories (cost=0.15..8.17 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: ((slug)::text = 'science::computational-chemistry'::text)
-> Bitmap Heap Scan on categories c (cost=4.18..76.14 rows=5 width=144) (actual time=0.029..0.029 rows=1 loops=1)
Recheck Cond: (path @> (InitPlan 2).col1)
Filter: ((slug)::text <> 'science::computational-chemistry'::text)
Rows Removed by Filter: 1
Heap Blocks: exact=2
-> Bitmap Index Scan on path_gist_categories_idx (cost=0.00..4.18 rows=5 width=0) (actual time=0.014..0.014 rows=2 loops=1)
Index Cond: (path @> (InitPlan 2).col1)
SubPlan 1
-> Aggregate (cost=12.68..12.69 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)
-> Bitmap Heap Scan on categories c2 (cost=4.18..12.67 rows=5 width=4) (actual time=0.007..0.008 rows=16 loops=1)
Recheck Cond: (path <@ subltree(c.path, 0, 2))
Heap Blocks: exact=3
-> Bitmap Index Scan on path_gist_categories_idx (cost=0.00..4.18 rows=5 width=0) (actual time=0.005..0.005 rows=16 loops=1)
Index Cond: (path <@ subltree(c.path, 0, 2))
Planning Time: 0.054 ms
Execution Time: 0.054 msnewSort (cost=12589.40..12589.41 rows=5 width=144) (actual time=3.022..3.023 rows=1 loops=1)
Sort Key: c.path
Sort Method: quicksort Memory: 25kB
InitPlan 2
-> Index Scan using categories_slug_key on categories (cost=0.15..8.17 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ((slug)::text = 'science::computational-chemistry'::text)
-> Bitmap Heap Scan on categories c (cost=4.18..12581.18 rows=5 width=144) (actual time=3.021..3.021 rows=1 loops=1)
Recheck Cond: (path @> (InitPlan 2).col1)
Filter: ((slug)::text <> 'science::computational-chemistry'::text)
Rows Removed by Filter: 1
Heap Blocks: exact=2
-> Bitmap Index Scan on path_gist_categories_idx (cost=0.00..4.18 rows=5 width=0) (actual time=0.011..0.011 rows=2 loops=1)
Index Cond: (path @> (InitPlan 2).col1)
SubPlan 1
-> Aggregate (cost=2513.69..2513.70 rows=1 width=4) (actual time=3.004..3.004 rows=1 loops=1)
-> Sort (cost=2503.69..2508.69 rows=2000 width=4) (actual time=2.692..2.827 rows=6593 loops=1)
Sort Key: cc.crate_id
Sort Method: quicksort Memory: 193kB
-> Nested Loop Left Join (cost=15.76..2394.03 rows=2000 width=4) (actual time=0.010..2.136 rows=6593 loops=1)
-> Bitmap Heap Scan on categories c2 (cost=4.18..12.67 rows=5 width=4) (actual time=0.007..0.008 rows=16 loops=1)
Recheck Cond: (path <@ subltree(c.path, 0, 2))
Heap Blocks: exact=3
-> Bitmap Index Scan on path_gist_categories_idx (cost=0.00..4.18 rows=5 width=0) (actual time=0.005..0.005 rows=16 loops=1)
Index Cond: (path <@ subltree(c.path, 0, 2))
-> Bitmap Heap Scan on crates_categories cc (cost=11.58..466.87 rows=940 width=8) (actual time=0.017..0.115 rows=412 loops=16)
Recheck Cond: (category_id = c2.id)
Heap Blocks: exact=1537
-> Bitmap Index Scan on index_crates_categories_category_id (cost=0.00..11.34 rows=940 width=0) (actual time=0.011..0.011 rows=412 loops=16)
Index Cond: (category_id = c2.id)
Planning Time: 0.106 ms
Execution Time: 3.046 msIf this impact is also acceptable on the production server, maybe we could consider calculating this within the trigger function to update |
0.054 ms vs 3.046 ms seems like a pretty big impact to me 😅 it might be small in absolute terms, but I assume that on the production server this could end up having a bigger impact 🤔 |
Fair enough! That’s why I brought up whether the impact on the production server is acceptable. If so, we could change it to a trigger so it only runs once during the publish stage. |
I don’t have access to a production-sized dataset to benchmark this directly. I validated the logic locally and added a B-Tree index on crates_categories(category_id) to ensure the join can use an index scan rather than a sequential scan. I’m happy to adjust the approach if there’s a preferred alternative for minimizing runtime impact. |
|
This PR resolves issue #13032, where parent categories incorrectly inflated crate counts when a crate belonged to multiple subcategories under the same parent.
The root cause was the use of SUM(c2.crates_cnt) on cached subcategory counters, which caused duplicate counting when the same crate appeared in multiple subcategories.
Implementation Details
The aggregation logic has been updated to count unique crate IDs directly from the crates_categories relationship table.
SQL Changes
The following queries were updated to use:
COUNT(DISTINCT cc.crate_id)with a join to crates_categories:
toplevel.sql — fixes counts on the main Categories landing page.
subcategories.sql — ensures accurate counts for child categories.
parent_categories.sql — fixes counts in parent breadcrumb navigation.
This ensures each crate contributes at most once to a parent category count.
Performance Considerations
Since the updated logic introduces a join with crates_categories, an index was added to maintain query performance:
Migration: 2026-02-24-171800_add_category_id_index_to_crates_categories
Adds a B-Tree index on crates_categories(category_id)
This allows PostgreSQL to use an index scan instead of a sequential scan when resolving category joins.
Testing & Verification
Reproduction Test
Added category_toplevel_unique_count_reproduction, which:
Creates a crate assigned to multiple subcategories
Verifies the parent category counts it only once
This test fails under the previous implementation and passes with the updated logic.
Regression Updates
Existing tests in models/category.rs were updated to use real crate–category relationships instead of relying on manually set crates_cnt values.
Status
All database tests in crates_io_database pass successfully.