OSIDB-4678: Improve promote performance #1156
base: master
Conversation
```diff
-refs = pghistory.models.Events.objects.tracks(self).all()
+refs = pghistory.models.Events.objects.references(self).all()
 for ref in refs:
-    model_audit = apps.get_model(ref.pgh_model).objects.filter(
+    db, model_name = ref.pgh_model.split(".")
+    # Skip snippet audit events - snippets should always remain internal
+    if model_name == "SnippetAudit":
+        continue
+
+    model_audit = apps.get_model(db, model_name).objects.filter(
         pgh_id=ref.pgh_id
     )
-    with pgtrigger.ignore(f"{ref.pgh_model}:append_only"):
+    with pgtrigger.ignore(f"{db}.{model_name}:append_only"):
```
From my search it looks like references goes through the history of all of the related models, while tracks processes only the model it is called on. Since set_history_public is called on each model, there may be a lot of duplicate history ACL adjustments when using references. The call also ends up attempting to grab the Snippet, which shouldn't happen.
I think it would make sense to sync set_history_public with what exists in the master branch (with tracks).
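For reference, a minimal sketch of the tracks()-based shape being asked for, assuming the same surrounding names as in the diff above; the body of the pgtrigger.ignore block is elided since it is not shown here:

```python
refs = pghistory.models.Events.objects.tracks(self).all()
for ref in refs:
    # tracks(self) only walks the audit events of this very instance,
    # so no duplicate ACL adjustments and no Snippet lookups
    model_audit = apps.get_model(ref.pgh_model).objects.filter(pgh_id=ref.pgh_id)
    with pgtrigger.ignore(f"{ref.pgh_model}:append_only"):
        ...  # ACL update of model_audit, as in the original method
```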
I understand the logic the same way as @Jincxz, so unless I am missing something I would also rather see the original code here instead of the new one. Can you explain the change?
```python
self.set_public()
self.set_history_public()
self.save(**kwargs)
if self.is_internal:
```
Slight optimization: I think you can keep the original short-circuit logic

```python
if not self.is_internal:
    return
```

If the object is not internal, the function will still search through the related objects without the early return.
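A placement sketch of that guard, assuming it sits at the top of set_public_nested as in the original code:

```python
def set_public_nested(self):
    if not self.is_internal:
        # short-circuit: nothing to promote, skip collecting related objects
        return
    ...
```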
```python
if issubclass(type(self), AlertMixin):
    # suppress the validation errors as we expect that during
    # the update the parent and child ACLs will not equal
    kwargs["raise_validation_error"] = False
if issubclass(type(self), TrackingMixin):
    # do not auto-update the updated_dt timestamp as the
    # followup update would fail on a mid-air collision
    kwargs["auto_timestamps"] = False
```
The commit removes these adjustments but they might be covered in the bulk update. I think it might be a good idea to double-check that the removal of these lines won't cause issues.
osoukup left a comment:
LGTM. I have only some minor comments and questions.
```python
def _collect_objects_for_public_update(self, objects_to_update, visited):
    """
    Recursively collect all related objects that need to have their ACLs
    updated to public. The Flaw itself is not collected as it's saved separately.
```
nitpick: The ACLMixin class is a general mixin so self is not necessarily a flaw.
| """ | ||
| from osidb.models import Flaw | ||
|
|
||
| # Create a unique key for this object to track if we've visited it |
nitpick: The following line does not create anything. I would more naturally put the comment three lines below.
Elkasitu left a comment:
On top of the existing comments, I think you should provide performance regression tests that check the number of queries expected for a base case for both of the performance optimizations introduced; look at osidb/tests/test_query_regresion.py.
It is not clear from your benchmark how much each patch improves performance and where; having these regression tests helps determine which patch did what.
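A hedged sketch of such a regression test; the fixture name, the helper's import path, and the query budget are illustrative and would need to match the real module:

```python
import pytest

# assumption: assertNumQueriesLessThan is the helper already used in
# osidb/tests/test_query_regresion.py
from osidb.tests.test_query_regresion import assertNumQueriesLessThan


class TestPromoteQueryRegression:
    @pytest.mark.enable_signals
    def test_promote_query_count(self, internal_flaw_with_affects):
        # hypothetical fixture providing an internal flaw with a few affects
        flaw = internal_flaw_with_affects
        # illustrative budget, not a measured value
        with assertNumQueriesLessThan(100):
            flaw.promote()
```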
```python
        self.assert_audit_acls(model, internal_read_groups, internal_write_groups)

    @pytest.mark.vcr
    def test_promote_flaw_with_multiple_affects(
```
You don't need a VCR test nor a test this big to verify that promoting a flaw with multiple affects correctly sets its own and its related objects' ACLs. You can either:
- create a unit test which creates a flaw with related affects, trackers and/or comments/cvss/etc. and then calls promote() on said flaw (see the sketch below)
- piggyback off of existing promote() tests to check that related object ACLs are updated as expected
In their current state these two tests depend on so many things that they can fail for multiple reasons not directly related to ACLs: the tests are brittle.
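A hedged sketch of the first option; the factory names follow the usual osidb test factories, but the exact arguments and ACL setup are illustrative:

```python
from osidb.tests.factories import AffectFactory, FlawFactory  # assumed factory module


def test_promote_sets_related_acls(self, internal_read_groups, internal_write_groups):
    # build an internal flaw with one related affect; how internal ACLs are
    # passed to the factory is an assumption here
    flaw = FlawFactory(acl_read=internal_read_groups, acl_write=internal_write_groups)
    affect = AffectFactory(flaw=flaw)

    flaw.promote()

    affect.refresh_from_db()
    assert affect.acl_read == flaw.acl_read
    assert affect.acl_write == flaw.acl_write
```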
I will see about changing it. The original promote test did have a VCR cassette, and I created a new one to avoid messing up the original test.
```python
    def set_public_nested(self):
        """
        Change internal ACLs to public ACLs for all related Flaw objects and save them.
        The only exception is "snippets", which should always have internal ACLs.
        The Flaw itself will be saved later to avoid duplicate operations.
        This method collects all objects that need to be updated and uses bulk_update
        to minimize database queries.
        """
        # Collect all objects that need to be updated (using dict to avoid duplicates)
        # Key: (model_class, pk), Value: object instance
        objects_to_update = {}
        # Track visited objects to prevent infinite recursion in circular relationships
        visited = set()
        self._collect_objects_for_public_update(objects_to_update, visited)

        # Get list of unique objects
        valid_objects = list(objects_to_update.values())

        if not valid_objects:
            return

        # Set updated_dt on all objects before bulk update
        # Cut off microseconds to match TrackingMixin.save() behavior
        now = timezone.now().replace(microsecond=0)
        for obj in valid_objects:
            if hasattr(obj, "updated_dt"):
                obj.updated_dt = now

        # Group objects by model type for bulk_update
        from collections import defaultdict

        objects_by_model = defaultdict(list)
        for obj in valid_objects:
            objects_by_model[type(obj)].append(obj)

        # Bulk update each model type
        # Fields to update: acl_read, acl_write, and updated_dt
        for model_class, objects in objects_by_model.items():
            if objects:
                # Determine fields for this model type
                model_update_fields = ["acl_read", "acl_write"]
                if hasattr(model_class, "updated_dt"):
                    model_update_fields.append("updated_dt")

                model_class.objects.bulk_update(
                    objects,
                    model_update_fields,
                    batch_size=100,  # Process in batches to avoid memory issues
                )

                # Update audit history for all updated objects
                # Each object needs its history set to public after ACL update
                for obj in objects:
                    if hasattr(obj, "set_history_public"):
                        obj.set_history_public()
```
I think this whole approach is not using the correct tools provided by Django:
First and foremost, there's a non-negligible memory cost to storing all these objects locally -- wouldn't it be best to simply keep track of ids instead of model instance objects?
This brings me to my second point, which is that bulk_update is intended for cases where you want to update the same fields for multiple records with different values; here we're updating the same fields for multiple records with the same values, so QuerySet.update() can be used instead:

```python
for model_class, object_ids in objects_by_model.items():
    model_class.objects.filter(pk__in=object_ids).update(
        acl_read=..., acl_write=..., updated_dt=...
    )
```

This will generate one UPDATE query per model, whereas bulk_update will generate more than one UPDATE query, which correlates with the increase of write queries in your test; in fact I'm not convinced that the current approach is any better than simply calling save() (other than for the side effects such as signals and save() overrides).
Lastly, I think the collection can be done with a defaultdict(set) directly; you initialize it and pass it to the collection function, which will simply do something like:

```python
if self.pk and self.pk in objects_to_update[type(self)]:
    # this works as visited
    return
...
objects_to_update[type(self)].add(self.pk)
...
```
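Putting both suggestions together, a hedged sketch of what the caller could then look like (names are illustrative, and models without updated_dt would need that field dropped from the update):

```python
from collections import defaultdict

from django.utils import timezone

# inside set_public_nested
objects_to_update = defaultdict(set)  # model class -> set of primary keys
self._collect_objects_for_public_update(objects_to_update)

now = timezone.now().replace(microsecond=0)
for model_class, pks in objects_to_update.items():
    # one UPDATE per model instead of several bulk_update batches
    model_class.objects.filter(pk__in=pks).update(
        acl_read=public_read_acls,    # the public ACL lists, however the
        acl_write=public_write_acls,  # real code computes them
        updated_dt=now,
    )
```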
It's a good point, I will try to just use the uuid.
```python
        self.acl_write = acls
        return acls

    def set_public(self):
```
I'm not sure I agree with this -- set_public is a low-level method that does one job: change an ACL-enabled object to public ACLs regardless of its current state; said state can even already be public for all it cares.
It's the caller's responsibility to ensure that only objects that should be public are passed to this method; this method is unaware of business rules and should stay unaware IMO.
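A small sketch of the separation described above (illustrative):

```python
# the business rule stays with the caller
if obj.is_internal:
    obj.set_public()  # low level: unconditionally flips the ACLs to public
```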
It's true, but when making the change in set_public_nested I felt a lot of responsibility not to change anything that is not internal, and even more so anything under a possible embargo. I do agree that it breaks the main point, but in this case the benefit of a safety net against setting an embargoed object public outweighs the pattern.
OK, I will remove it, but I will add a test to make sure a case like this does not happen.
```python
        # Force initial classification because signals are muted in queryset tests
        flaw.classification = {
            "workflow": "DEFAULT",
            "state": WorkflowModel.WorkflowState.NEW,
        }
        # NEW -> TRIAGE requires owner
        flaw.owner = "Alice"
        flaw.save(raise_validation_error=False)
```
Is the @pytest.mark.enable_signals decorator not enough?
osidb/tests/test_query_regresion.py
| "HTTP_JIRA_API_KEY": jira_token, | ||
| "HTTP_BUGZILLA_API_KEY": bugzilla_token, | ||
| } | ||
| with assertNumQueriesLessThan(92): |
Just nitpicking: normally when I write new regression tests, I write them before doing the changes to see the actual change, because now we don't know if 92 is an improvement or not...
Notice the `# initial value -> X` comments in some tests.
Also, did you try to use assertNumQueries first? We should utilize it whenever possible; assertNumQueriesLessThan is less accurate, but some tests are a bit flaky due to the environment.
Yeah, I just changed it to assertNumQueries, and I also added a number of affects/trackers to check that the query count doesn't grow depending on the quantity.
PS: there seems to be a bug in the first iteration, I couldn't find what it is.
Improves the performance of the promote endpoint.
These are numbers taken from Alvaro's PR to get the results.
There are 2 main changes.
Important: this changes a lot of how the update works. If during the review you feel something is wrong or you don't like something, please say so; I can create a new PR with just the prefetch change that improves the performance.
I also added a new safety net for set_public in case there is an embargo.
How the new bulk update works is sketched below.
I took both reports and asked the AI to write a report 🤖
The naming convention is test_promote_megaflaw_<affects>_<trackers>, so test_promote_megaflaw_5_2 means 5 affects with 2 trackers.
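For context, a compact sketch of how the new bulk update works, mirroring the set_public_nested code shown in the review thread (simplified; e.g. the updated_dt handling is omitted):

```python
from collections import defaultdict

# 1. recursively collect the related objects that should become public
objects_to_update, visited = {}, set()
flaw._collect_objects_for_public_update(objects_to_update, visited)

# 2. group them per model and write the already-adjusted ACL fields
#    with one bulk_update per model
objects_by_model = defaultdict(list)
for obj in objects_to_update.values():
    objects_by_model[type(obj)].append(obj)
for model_class, objs in objects_by_model.items():
    model_class.objects.bulk_update(objs, ["acl_read", "acl_write"], batch_size=100)

# 3. set the pghistory audit records of every updated object to public
for obj in objects_to_update.values():
    obj.set_history_public()
```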
Performance Comparison: Before (report 1) vs this MR (report 4)
📊 Executive Summary Comparison
🔍 Key Findings
✅ Massive Performance Improvements in Report 4
Report 4 (review_0) shows dramatic improvements compared to Report 1 (before_prefetch):
Execution Time Reduction ✅
- test_promote_megaflaw_20_0 went from 10,641 ms to 430 ms (-96%)

Query Count Reduction ✅
- test_promote_megaflaw_20_0 went from 4,011 to 311 queries (-92.2%)

Database Time Reduction ✅
- test_promote_megaflaw_20_0 went from 1,749 ms to 47 ms (-97.3%)

N+1 Query Reduction ✅

Write Operations ⚠️

Table Access Increase ⚠️
- the increase is in audit tables (*audit), which are necessary for history tracking

Write Operations for Small Tests ⚠️
📊 Average Metrics Comparison
🎯 Performance Analysis
What's Working Well ✅
- prefetch_related() and select_related()
💡 Key Insights
Performance Evolution
Scale Impact
The improvements are more dramatic for larger datasets, which suggests the optimizations scale very well with data size.
📈 Conclusion
Report 4 shows excellent performance improvements compared to the baseline (Report 1):
✅ 91.7% faster overall execution time
✅ 85.7% fewer database queries
✅ 96.2% less database time
✅ 58% fewer N+1 queries
While there are some areas for improvement (write operations for small tests, audit table access), the overall performance is dramatically better than the original baseline. The optimizations are working very well, especially for larger datasets.
Closes OSIDB-4678