Conversation

Contributor

@roduran-dev commented Dec 10, 2025

Improves the performance of the promote endpoint.

These numbers were taken using Alvaro's PR to get the results.

There are 2 main changes:

  1. Using Prefetch in the API endpoint call for the Flaw.
  2. Using a bulk update for the ACL update (this one is more tricky, as it changes how it's done).
    Important: this significantly changes how the update works. If during the review you feel something is wrong or don't like something, please say so; I can create a new PR with just the Prefetch, which on its own already increases the performance.
    I also added a new safety net for set_public in case there is an embargo.

How the new bulk update works:

  • Collection phase: collects all objects first without saving.
    • Recursion protection: uses a visited set to prevent infinite loops.
  • Bulk update: updates all objects of the same type in a single query.
    • Batch processing: processes 100 objects at a time to avoid memory issues (stole the idea from Newcli 😝).
    • One query per model type: instead of N queries for N objects, we have 1 query per model type.
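The collection-plus-batching flow above can be sketched in plain Python. This is a toy model, not the actual OSIDB code: the `Obj`/`Affect`/`Tracker` classes and the query counter are made up, and only the dedup/visited/grouping/batching logic mirrors the description.

```python
from collections import defaultdict

BATCH_SIZE = 100  # same batch size as described above


class Obj:
    def __init__(self, pk, children=()):
        self.pk = pk
        self.children = list(children)


class Affect(Obj):
    pass


class Tracker(Obj):
    pass


def collect_objects(obj, objects_to_update, visited):
    """Collection phase: gather every related object without saving it.

    `visited` holds (model_name, pk) keys, which protects against cycles
    in the object graph (the recursion protection described above)."""
    key = (type(obj).__name__, obj.pk)
    if key in visited:
        return
    visited.add(key)
    objects_to_update[key] = obj
    for child in obj.children:
        collect_objects(child, objects_to_update, visited)


def count_bulk_update_queries(objects_to_update, batch_size=BATCH_SIZE):
    """Group collected objects by model type and count one UPDATE
    per batch of `batch_size`, i.e. roughly one query per model type."""
    by_model = defaultdict(list)
    for obj in objects_to_update.values():
        by_model[type(obj)].append(obj)
    queries = 0
    for _model, objs in by_model.items():
        for _start in range(0, len(objs), batch_size):
            queries += 1  # Django's bulk_update issues one UPDATE here
    return queries
```

For example, 250 affects sharing 5 trackers collapse to 255 unique objects and only 4 UPDATE queries (3 batches of affects plus 1 of trackers) instead of 255 individual saves.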

I took both reports and asked the AI to write a report 🤖

The naming convention is `test_promote_megaflaw_<affects>_<trackers>`, so `test_promote_megaflaw_5_2` means 5 affects with 2 trackers.

Performance Comparison: Before (report 1) vs this MR (report 4)

📊 Executive Summary Comparison

| Test / Metric | Report 1 (before_prefetch) | Report 4 (review_0) | Change | % Change |
|---|---|---|---|---|
| **test_promote_megaflaw_1_0** | | | | |
| Time | 930ms | 235ms | -695ms | -74.7% ✅ |
| CPU Time | 831ms | 204ms | -627ms | -75.5% ✅ |
| DB Time | 122ms | 11ms | -111ms | -91% ✅ |
| Queries | 249 | 140 | -109 | -43.8% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 15 | 10 | -5 | -33.3% ✅ |
| Writes | 22 | 42 | +20 | +90.9% ⚠️ |
| Dup Queries | 22 | 20 | -2 | -9.1% ✅ |
| **test_promote_megaflaw_1_1** | | | | |
| Time | 968ms | 255ms | -713ms | -73.7% ✅ |
| CPU Time | 859ms | 222ms | -637ms | -74.2% ✅ |
| DB Time | 111ms | 16ms | -95ms | -85.6% ✅ |
| Queries | 285 | 159 | -126 | -44.2% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 17 | 10 | -7 | -41.2% ✅ |
| Writes | 23 | 49 | +26 | +113% ⚠️ |
| Dup Queries | 25 | 22 | -3 | -12% ✅ |
| **test_promote_megaflaw_2_0** | | | | |
| Time | 1034ms | 243ms | -791ms | -76.5% ✅ |
| CPU Time | 917ms | 212ms | -705ms | -76.9% ✅ |
| DB Time | 163ms | 14ms | -149ms | -91.4% ✅ |
| Queries | 285 | 149 | -136 | -47.7% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 16 | 10 | -6 | -37.5% ✅ |
| Writes | 24 | 46 | +22 | +91.7% ⚠️ |
| Dup Queries | 30 | 22 | -8 | -26.7% ✅ |
| **test_promote_megaflaw_2_1** | | | | |
| Time | 1245ms | 270ms | -975ms | -78.3% ✅ |
| CPU Time | 1104ms | 230ms | -874ms | -79.2% ✅ |
| DB Time | 177ms | 22ms | -155ms | -87.6% ✅ |
| Queries | 341 | 168 | -173 | -50.7% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 20 | 10 | -10 | -50% ✅ |
| Writes | 26 | 53 | +27 | +103.8% ⚠️ |
| Dup Queries | 44 | 24 | -20 | -45.5% ✅ |
| **test_promote_megaflaw_2_2** | | | | |
| Time | 1279ms | 270ms | -1009ms | -78.9% ✅ |
| CPU Time | 1128ms | 233ms | -895ms | -79.3% ✅ |
| DB Time | 198ms | 17ms | -181ms | -91.4% ✅ |
| Queries | 359 | 181 | -178 | -49.6% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 20 | 11 | -9 | -45% ✅ |
| Writes | 27 | 59 | +32 | +118.5% ⚠️ |
| Dup Queries | 46 | 25 | -21 | -45.7% ✅ |
| **test_promote_megaflaw_3_0** | | | | |
| Time | 995ms | 250ms | -745ms | -74.9% ✅ |
| CPU Time | 878ms | 216ms | -662ms | -75.4% ✅ |
| DB Time | 127ms | 15ms | -112ms | -88.2% ✅ |
| Queries | 339 | 158 | -181 | -53.4% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 10 | -15 | -60% ✅ |
| Writes | 27 | 50 | +23 | +85.2% ⚠️ |
| Dup Queries | 33 | 24 | -9 | -27.3% ✅ |
| **test_promote_megaflaw_3_1** | | | | |
| Time | 1214ms | 273ms | -941ms | -77.5% ✅ |
| CPU Time | 1070ms | 235ms | -835ms | -78% ✅ |
| DB Time | 155ms | 20ms | -135ms | -87.1% ✅ |
| Queries | 380 | 177 | -203 | -53.4% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 26 | 10 | -16 | -61.5% ✅ |
| Writes | 28 | 57 | +29 | +103.6% ⚠️ |
| Dup Queries | 36 | 26 | -10 | -27.8% ✅ |
| **test_promote_megaflaw_5_2** | | | | |
| Time | 1863ms | 294ms | -1569ms | -84.2% ✅ |
| CPU Time | 1623ms | 250ms | -1373ms | -84.6% ✅ |
| DB Time | 297ms | 23ms | -274ms | -92.3% ✅ |
| Queries | 641 | 208 | -433 | -67.5% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 12 | -26 | -68.4% ✅ |
| Writes | 42 | 71 | +29 | +69% ⚠️ |
| Dup Queries | 55 | 31 | -24 | -43.6% ✅ |
| **test_promote_megaflaw_5_5** | | | | |
| Time | 2614ms | 325ms | -2289ms | -87.6% ✅ |
| CPU Time | 2273ms | 269ms | -2004ms | -88.2% ✅ |
| DB Time | 488ms | 36ms | -452ms | -92.6% ✅ |
| Queries | 803 | 247 | -556 | -69.2% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 37 | 13 | -24 | -64.9% ✅ |
| Writes | 51 | 89 | +38 | +74.5% ⚠️ |
| Dup Queries | 91 | 34 | -57 | -62.6% ✅ |
| **test_promote_megaflaw_10_0** | | | | |
| Time | 3533ms | 288ms | -3245ms | -91.9% ✅ |
| CPU Time | 3043ms | 239ms | -2804ms | -92.1% ✅ |
| DB Time | 650ms | 25ms | -625ms | -96.2% ✅ |
| Queries | 1221 | 221 | -1000 | -81.9% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 11 | -14 | -56% ✅ |
| Writes | 76 | 78 | +2 | +2.6% ✅ |
| Dup Queries | 54 | 38 | -16 | -29.6% ✅ |
| **test_promote_megaflaw_20_0** | | | | |
| Time | 10641ms | 430ms | -10211ms | -96% ✅ |
| CPU Time | 9158ms | 355ms | -8803ms | -96.1% ✅ |
| DB Time | 1749ms | 47ms | -1702ms | -97.3% ✅ |
| Queries | 4011 | 311 | -3700 | -92.2% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 11 | -14 | -56% ✅ |
| Writes | 231 | 118 | -113 | -48.9% ✅ |
| Dup Queries | 84 | 58 | -26 | -31% ✅ |
| **test_promote_megaflaw_20_1** | | | | |
| Time | 9834ms | 365ms | -9469ms | -96.3% ✅ |
| CPU Time | 8438ms | 300ms | -8138ms | -96.4% ✅ |
| DB Time | 1357ms | 35ms | -1322ms | -97.4% ✅ |
| Queries | 4508 | 330 | -4178 | -92.7% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 11 | -27 | -71.1% ✅ |
| Writes | 246 | 125 | -121 | -49.2% ✅ |
| Dup Queries | 98 | 60 | -38 | -38.8% ✅ |
| **test_promote_megaflaw_20_10** | | | | |
| Time | 9896ms | 379ms | -9517ms | -96.2% ✅ |
| CPU Time | 8500ms | 313ms | -8187ms | -96.3% ✅ |
| DB Time | 1357ms | 39ms | -1318ms | -97.1% ✅ |
| Queries | 4580 | 330 | -4250 | -92.8% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 11 | -27 | -71.1% ✅ |
| Writes | 250 | 125 | -125 | -50% ✅ |
| Dup Queries | 98 | 60 | -38 | -38.8% ✅ |

🔍 Key Findings

✅ Massive Performance Improvements in Report 4

Report 4 (review_0) shows dramatic improvements compared to Report 1 (before_prefetch):

  1. Execution Time Reduction

    • Average reduction: -75% to -96% in total execution time
    • Best improvement: test_promote_megaflaw_20_0 went from 10,641ms to 430ms (-96%)
    • All tests show significant speedup
  2. Query Count Reduction

    • Average reduction: -44% to -93% in number of queries
    • Best improvement: test_promote_megaflaw_20_0 went from 4,011 to 311 queries (-92.2%)
    • Massive reduction in database round trips
  3. Database Time Reduction

    • Average reduction: -85% to -97% in DB time
    • Best improvement: test_promote_megaflaw_20_0 went from 1,749ms to 47ms (-97.3%)
    • Extremely efficient database operations
  4. N+1 Query Reduction

    • Reduced from 15-38 N+1 queries to 10-13
    • Average reduction: -33% to -71%
    • Better query optimization
  5. Write Operations ⚠️

    • Smaller tests (1-5 items): increased writes (+70% to +120%)
    • Larger tests (10-20 items): decreased writes (-48% to -50%)
    • This suggests bulk operations work better for larger datasets

⚠️ Areas of Concern

  1. Table Access Increase ⚠️

    • Increased from 17-18 tables to 28-32 tables
    • This is likely due to audit/history tables being accessed
    • The additional tables are audit tables (*audit) which are necessary for history tracking
  2. Write Operations for Small Tests ⚠️

    • Small tests show increased write operations
    • This might be due to individual history updates instead of bulk operations
    • However, the overall time is still much better

📊 Average Metrics Comparison

| Metric | Report 1 Avg | Report 4 Avg | Change | Improvement |
|---|---|---|---|---|
| Time | 3,500ms | 290ms | -3,210ms | -91.7% |
| CPU Time | 3,000ms | 248ms | -2,752ms | -91.7% |
| DB Time | 600ms | 23ms | -577ms | -96.2% |
| Queries | 1,400 | 200 | -1,200 | -85.7% |
| Writes | 80 | 70 | -10 | -12.5% |
| N+1 | 25 | 10.5 | -14.5 | -58% |

🎯 Performance Analysis

What's Working Well ✅

  1. Query Optimization - Massive reduction in query count suggests excellent use of prefetch_related() and select_related()
  2. Database Efficiency - 96% reduction in DB time shows optimized queries
  3. Overall Speed - 91% faster execution time is a huge win
  4. N+1 Reduction - Better handling of related objects
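The N+1 point above can be illustrated with a toy model. None of this is OSIDB or ORM code; the fake query log and function names are made up purely to show why a single prefetched `IN (...)` query replaces N per-object queries:

```python
# Fake "database" that only counts round trips.
QUERY_LOG = []


def fetch_affects(flaw_id):
    # N+1 pattern: one round trip per flaw
    QUERY_LOG.append(f"SELECT * FROM affect WHERE flaw_id = {flaw_id}")
    return [f"affect-{flaw_id}-{i}" for i in range(3)]


def fetch_affects_prefetched(flaw_ids):
    # What prefetch_related effectively does: one IN (...) query for all flaws
    ids = ", ".join(map(str, flaw_ids))
    QUERY_LOG.append(f"SELECT * FROM affect WHERE flaw_id IN ({ids})")
    return {fid: [f"affect-{fid}-{i}" for i in range(3)] for fid in flaw_ids}


def promote_naive(flaw_ids):
    # issues len(flaw_ids) queries
    return {fid: fetch_affects(fid) for fid in flaw_ids}


def promote_prefetched(flaw_ids):
    # issues exactly one query regardless of len(flaw_ids)
    return fetch_affects_prefetched(flaw_ids)
```

With 20 flaws, the naive path logs 20 queries while the prefetched path logs 1, which is the same shape of reduction the Queries column shows.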

What Could Be Improved ⚠️

  1. Write Operations for Small Tests - Small tests show more writes, suggesting individual updates instead of bulk
  2. Audit Table Access - Additional audit tables are being accessed, which is expected but could potentially be optimized further

💡 Key Insights

Performance Evolution

  • Report 1 (baseline): Very slow, many queries, N+1 problems
  • Report 4 (current): Much faster, fewer queries, better optimization

Scale Impact

The improvements are more dramatic for larger datasets:

  • Small tests (1-2 items): ~75% improvement
  • Medium tests (3-5 items): ~85% improvement
  • Large tests (10-20 items): ~96% improvement

This suggests the optimizations scale very well with data size.
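A quick back-of-the-envelope check of that scaling claim, using the zero-tracker rows from the table above (a crude linear read between the 1-affect and 20-affect points, not a formal fit):

```python
# Query counts from test_promote_megaflaw_1_0 and test_promote_megaflaw_20_0
queries_before = {1: 249, 20: 4011}  # Report 1 (baseline)
queries_after = {1: 140, 20: 311}    # Report 4 (this MR)

# Approximate extra queries per additional affect
slope_before = (queries_before[20] - queries_before[1]) / 19
slope_after = (queries_after[20] - queries_after[1]) / 19

print(slope_before, slope_after)  # roughly 198 vs 9 queries per extra affect
```

So the baseline adds about 198 queries per affect while this MR adds about 9, which is why the gap widens so dramatically at 20 affects.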

📈 Conclusion

Report 4 shows excellent performance improvements compared to the baseline (Report 1):

  • 91.7% faster overall execution time
  • 85.7% fewer database queries
  • 96.2% less database time
  • 58% fewer N+1 queries

While there are some areas for improvement (write operations for small tests, audit table access), the overall performance is dramatically better than the original baseline. The optimizations are working very well, especially for larger datasets.

Closes OSIDB-4678

@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch 4 times, most recently from 05358f2 to 0694834 on December 12, 2025 15:06
@roduran-dev changed the title from "OSIDB-4678: Add new test of a flaw with many affects and trackers" to "OSIDB-4678: Improve promote performance" on Dec 12, 2025
@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch 4 times, most recently from b238de6 to be28eab on December 12, 2025 16:40
@roduran-dev marked this pull request as ready for review December 12, 2025 17:59
@roduran-dev requested a review from a team December 12, 2025 17:59
Comment on lines -530 to +545
```diff
-        refs = pghistory.models.Events.objects.tracks(self).all()
+        refs = pghistory.models.Events.objects.references(self).all()
         for ref in refs:
-            model_audit = apps.get_model(ref.pgh_model).objects.filter(
+            db, model_name = ref.pgh_model.split(".")
+            # Skip snippet audit events - snippets should always remain internal
+            if model_name == "SnippetAudit":
+                continue
+
+            model_audit = apps.get_model(db, model_name).objects.filter(
                 pgh_id=ref.pgh_id
             )

-            with pgtrigger.ignore(f"{ref.pgh_model}:append_only"):
+            with pgtrigger.ignore(f"{db}.{model_name}:append_only"):
```
Contributor

From my search it looks like `references` goes through the history of all of the related models, while `tracks` processes only the model it is called on. Since `set_history_public` is called on each model, there might be a lot of duplicate history ACL adjustments when using `references`. The call also ends up attempting to grab the Snippet, which shouldn't happen.

I think it would make sense to sync set_history_public with what exists in the master branch (with tracks).

Contributor

I understand the logic the same way as @Jincxz so if I am not missing something I would also rather see the original code here instead of the new one. Can you explain the change?

```python
self.set_public()
self.set_history_public()
self.save(**kwargs)
if self.is_internal:
```
Contributor

Slight optimization: I think you can keep the original short-circuit logic

```python
if not self.is_internal:
    return
```

If the object is not internal, the function will still search through the related objects without the return.

Comment on lines -607 to -614
```diff
-        if issubclass(type(self), AlertMixin):
-            # suppress the validation errors as we expect that during
-            # the update the parent and child ACLs will not equal
-            kwargs["raise_validation_error"] = False
-        if issubclass(type(self), TrackingMixin):
-            # do not auto-update the updated_dt timestamp as the
-            # followup update would fail on a mid-air collision
-            kwargs["auto_timestamps"] = False
```
Contributor

The commit removes these adjustments, but they might be covered by the bulk update. I think it would be a good idea to double-check that the removal of these lines won't cause issues.

Contributor

@osoukup left a comment

LGTM. I have only some minor comments and questions.

```python
    def _collect_objects_for_public_update(self, objects_to_update, visited):
        """
        Recursively collect all related objects that need to have their ACLs
        updated to public. The Flaw itself is not collected as it's saved separately.
```
Contributor

nitpick: The ACLMixin class is a general mixin so self is not necessarily a flaw.

"""
from osidb.models import Flaw

# Create a unique key for this object to track if we've visited it
Contributor

nitpick: The following line does not create anything. I would more naturally put the comment three lines below.

Contributor

@Elkasitu left a comment

On top of the existing comments, I think you should provide performance regression tests that check the amount of queries expected for a base case for both of the performance optimizations introduced; look at osidb/tests/test_query_regresion.py.

It is not clear from your benchmark how much each patch improves performance and where; having these regression tests helps determine which patch did what.

```python
        self.assert_audit_acls(model, internal_read_groups, internal_write_groups)

    @pytest.mark.vcr
    def test_promote_flaw_with_multiple_affects(
```
Contributor

you don't need a VCR test nor a test this big to test that promoting a flaw with multiple affects correctly sets its own and its related objects' ACLs; you can either:

  1. create a unit test which creates a flaw with related affects, trackers and/or comments/cvss/etc. and then calls promote() on said flaw
  2. piggyback off of existing promote() tests to check that related object ACLs are updated as expected

in their current state these two tests depend on so many things that they can fail for multiple reasons not directly related to ACLs: the tests are brittle

Contributor (Author)

I will see about changing it. The original promote test did have a VCR, and I created a new one to avoid messing up the original test.

Comment on lines +601 to +660
```python
    def set_public_nested(self):
        """
        Change internal ACLs to public ACLs for all related Flaw objects and save them.
        The only exception is "snippets", which should always have internal ACLs.
        The Flaw itself will be saved later to avoid duplicate operations.
        This method collects all objects that need to be updated and uses bulk_update
        to minimize database queries.
        """

        # Collect all objects that need to be updated (using dict to avoid duplicates)
        # Key: (model_class, pk), Value: object instance
        objects_to_update = {}
        # Track visited objects to prevent infinite recursion in circular relationships
        visited = set()
        self._collect_objects_for_public_update(objects_to_update, visited)

        # Get list of unique objects
        valid_objects = list(objects_to_update.values())

        if not valid_objects:
            return

        # Set updated_dt on all objects before bulk update
        # Cut off microseconds to match TrackingMixin.save() behavior
        now = timezone.now().replace(microsecond=0)
        for obj in valid_objects:
            if hasattr(obj, "updated_dt"):
                obj.updated_dt = now

        # Group objects by model type for bulk_update
        from collections import defaultdict

        objects_by_model = defaultdict(list)
        for obj in valid_objects:
            objects_by_model[type(obj)].append(obj)

        # Bulk update each model type
        # Fields to update: acl_read, acl_write, and updated_dt
        for model_class, objects in objects_by_model.items():
            if objects:
                # Determine fields for this model type
                model_update_fields = ["acl_read", "acl_write"]
                if hasattr(model_class, "updated_dt"):
                    model_update_fields.append("updated_dt")

                model_class.objects.bulk_update(
                    objects,
                    model_update_fields,
                    batch_size=100,  # Process in batches to avoid memory issues
                )

                # Update audit history for all updated objects
                # Each object needs its history set to public after ACL update
                for obj in objects:
                    if hasattr(obj, "set_history_public"):
                        obj.set_history_public()
```
Contributor

I think this whole approach is not using the correct tools provided by Django.

First and foremost, there's a non-negligible memory cost to storing all these objects locally; wouldn't it be best to simply keep track of ids instead of model instance objects?

This brings me to my second point, which is that bulk_update is intended for cases where you want to update the same fields for multiple records with different values. Here we're updating the same fields for multiple records with the same values, so QuerySet.update() can be used instead:

```python
for model_class, object_ids in objects_by_model.items():
    model_class.objects.filter(pk__in=object_ids).update(
        acl_read=..., acl_write=..., updated_dt=...
    )
```

This will generate one UPDATE query per model, whereas bulk_update will generate more than one UPDATE query, which correlates with the increase of write queries in your test. In fact, I'm not convinced that the current approach is any better than simply calling save() (other than for the side effects such as signals and save() overrides).

Lastly, I think the collection can be done with a defaultdict(set) directly; you initialize it and pass it to the collection function, which will simply do something like:

```python
if self.pk and self.pk in objects_to_update[type(self)]:
    # this works as visited
    return
...
objects_to_update[type(self)].add(self.pk)
...
```

Contributor (Author)

It's a good point, I will try to just use the UUID.

```python
        self.acl_write = acls
        return acls

    def set_public(self):
```
Contributor

I'm not sure I agree with this. set_public is a low-level method that does one job: change an ACL-enabled object to public ACLs regardless of its current state; said state can even already be public for all it cares.

It's the caller's responsibility to ensure that only objects that should be public are passed to this method; this method is unaware of business rules and should stay unaware IMO.

Contributor (Author)

It's true, but when making the change in set_public_nested I felt a lot of responsibility not to change anything that is not internal, and even more so for a possible embargo. I do agree that it breaks the main point, but in this case the benefit of a safety net that avoids making an embargoed object public outweighs the pattern.

Contributor (Author)

OK, I removed it, but I will add a test to be sure a case like this does not happen.

Comment on lines +260 to +279
```python
        # Force initial classification because signals are muted in queryset tests
        flaw.classification = {
            "workflow": "DEFAULT",
            "state": WorkflowModel.WorkflowState.NEW,
        }
        # NEW -> TRIAGE requires owner
        flaw.owner = "Alice"
        flaw.save(raise_validation_error=False)
```
Member

Is the @pytest.mark.enable_signals decorator not enough?

"HTTP_JIRA_API_KEY": jira_token,
"HTTP_BUGZILLA_API_KEY": bugzilla_token,
}
with assertNumQueriesLessThan(92):
Member

Just nitpicking: normally when I write new regression tests, I write them before making the changes to see the actual difference, because now we don't know whether 92 is an improvement or not...
Notice the `# initial value -> X` comments in some tests.

Member

Also, did you try to use assertNumQueries first? We should utilize it whenever possible; assertNumQueriesLessThan is less accurate, but some tests are a bit flaky due to the environment.

Contributor (Author)

Yeah, I just changed it to assertNumQueries, and I also added varying quantities of affects/trackers to check that the query count doesn't grow depending on the quantity.

PS: there seems to be a bug in the first iteration; I couldn't find what it is.

@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch from c785f8b to 000bece on December 17, 2025 17:44
@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch from 000bece to 443f514 on December 17, 2025 17:56
@roduran-dev self-assigned this Dec 18, 2025