Conversation

Contributor

@roduran-dev commented Dec 10, 2025

Improves the performance of the promote endpoint.

These numbers were taken using Alvaro's PR to get the results.

There are 2 main changes:

  1. Using Prefetch in the API endpoint call for the Flaw.
  2. Using a bulk update for the ACL update (this one is more tricky, as it changes how it's done).
    Important: this significantly changes how the update works. If during the review you feel something is wrong or don't like something, please say so; I can create a new PR with just the Prefetch, which on its own already increases the performance.
    I also added a new safety net for set_public in case there is an embargo.

How the new bulk update works:

  • Collection phase: collects all objects first without saving.
    • Recursion protection: uses a visited set to prevent infinite loops.
  • Bulk update: updates all objects of the same type in a single query.
    • Batch processing: processes 100 objects at a time to avoid memory issues (stole the idea from Newcli 😝).
    • One query per model type: instead of N queries for N objects, we have 1 query per model type.
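The collection-plus-batching flow above can be sketched in plain Python. This is a toy model, not the actual OSIDB code: the `Obj`/`Affect`/`Tracker` classes and the query counter are made up, and only the dedup/visited/grouping/batching logic mirrors the description.

```python
from collections import defaultdict

BATCH_SIZE = 100  # same batch size as described above


class Obj:
    def __init__(self, pk, children=()):
        self.pk = pk
        self.children = list(children)


class Affect(Obj):
    pass


class Tracker(Obj):
    pass


def collect_objects(obj, objects_to_update, visited):
    """Collection phase: gather every related object without saving it.

    `visited` holds (model_name, pk) keys, which protects against cycles
    in the object graph (the recursion protection described above)."""
    key = (type(obj).__name__, obj.pk)
    if key in visited:
        return
    visited.add(key)
    objects_to_update[key] = obj
    for child in obj.children:
        collect_objects(child, objects_to_update, visited)


def count_bulk_update_queries(objects_to_update, batch_size=BATCH_SIZE):
    """Group collected objects by model type and count one UPDATE
    per batch of `batch_size`, i.e. roughly one query per model type."""
    by_model = defaultdict(list)
    for obj in objects_to_update.values():
        by_model[type(obj)].append(obj)
    queries = 0
    for _model, objs in by_model.items():
        for _start in range(0, len(objs), batch_size):
            queries += 1  # Django's bulk_update issues one UPDATE here
    return queries
```

For example, 250 affects sharing 5 trackers collapse to 255 unique objects and only 4 UPDATE queries (3 batches of affects plus 1 of trackers) instead of 255 individual saves.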

I took both reports and asked the AI to write a report 🤖

The naming convention is `test_promote_megaflaw_<affects>_<trackers>`, so `test_promote_megaflaw_5_2` means 5 affects with 2 trackers.

Performance Comparison: Before (report 1) vs this MR (report 4)

📊 Executive Summary Comparison

| Test / Metric | Report 1 (before_prefetch) | Report 4 (review_0) | Change | % Change |
|---|---|---|---|---|
| **test_promote_megaflaw_1_0** | | | | |
| Time | 930ms | 235ms | -695ms | -74.7% ✅ |
| CPU Time | 831ms | 204ms | -627ms | -75.5% ✅ |
| DB Time | 122ms | 11ms | -111ms | -91% ✅ |
| Queries | 249 | 140 | -109 | -43.8% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 15 | 10 | -5 | -33.3% ✅ |
| Writes | 22 | 42 | +20 | +90.9% ⚠️ |
| Dup Queries | 22 | 20 | -2 | -9.1% ✅ |
| **test_promote_megaflaw_1_1** | | | | |
| Time | 968ms | 255ms | -713ms | -73.7% ✅ |
| CPU Time | 859ms | 222ms | -637ms | -74.2% ✅ |
| DB Time | 111ms | 16ms | -95ms | -85.6% ✅ |
| Queries | 285 | 159 | -126 | -44.2% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 17 | 10 | -7 | -41.2% ✅ |
| Writes | 23 | 49 | +26 | +113% ⚠️ |
| Dup Queries | 25 | 22 | -3 | -12% ✅ |
| **test_promote_megaflaw_2_0** | | | | |
| Time | 1034ms | 243ms | -791ms | -76.5% ✅ |
| CPU Time | 917ms | 212ms | -705ms | -76.9% ✅ |
| DB Time | 163ms | 14ms | -149ms | -91.4% ✅ |
| Queries | 285 | 149 | -136 | -47.7% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 16 | 10 | -6 | -37.5% ✅ |
| Writes | 24 | 46 | +22 | +91.7% ⚠️ |
| Dup Queries | 30 | 22 | -8 | -26.7% ✅ |
| **test_promote_megaflaw_2_1** | | | | |
| Time | 1245ms | 270ms | -975ms | -78.3% ✅ |
| CPU Time | 1104ms | 230ms | -874ms | -79.2% ✅ |
| DB Time | 177ms | 22ms | -155ms | -87.6% ✅ |
| Queries | 341 | 168 | -173 | -50.7% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 20 | 10 | -10 | -50% ✅ |
| Writes | 26 | 53 | +27 | +103.8% ⚠️ |
| Dup Queries | 44 | 24 | -20 | -45.5% ✅ |
| **test_promote_megaflaw_2_2** | | | | |
| Time | 1279ms | 270ms | -1009ms | -78.9% ✅ |
| CPU Time | 1128ms | 233ms | -895ms | -79.3% ✅ |
| DB Time | 198ms | 17ms | -181ms | -91.4% ✅ |
| Queries | 359 | 181 | -178 | -49.6% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 20 | 11 | -9 | -45% ✅ |
| Writes | 27 | 59 | +32 | +118.5% ⚠️ |
| Dup Queries | 46 | 25 | -21 | -45.7% ✅ |
| **test_promote_megaflaw_3_0** | | | | |
| Time | 995ms | 250ms | -745ms | -74.9% ✅ |
| CPU Time | 878ms | 216ms | -662ms | -75.4% ✅ |
| DB Time | 127ms | 15ms | -112ms | -88.2% ✅ |
| Queries | 339 | 158 | -181 | -53.4% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 10 | -15 | -60% ✅ |
| Writes | 27 | 50 | +23 | +85.2% ⚠️ |
| Dup Queries | 33 | 24 | -9 | -27.3% ✅ |
| **test_promote_megaflaw_3_1** | | | | |
| Time | 1214ms | 273ms | -941ms | -77.5% ✅ |
| CPU Time | 1070ms | 235ms | -835ms | -78% ✅ |
| DB Time | 155ms | 20ms | -135ms | -87.1% ✅ |
| Queries | 380 | 177 | -203 | -53.4% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 26 | 10 | -16 | -61.5% ✅ |
| Writes | 28 | 57 | +29 | +103.6% ⚠️ |
| Dup Queries | 36 | 26 | -10 | -27.8% ✅ |
| **test_promote_megaflaw_5_2** | | | | |
| Time | 1863ms | 294ms | -1569ms | -84.2% ✅ |
| CPU Time | 1623ms | 250ms | -1373ms | -84.6% ✅ |
| DB Time | 297ms | 23ms | -274ms | -92.3% ✅ |
| Queries | 641 | 208 | -433 | -67.5% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 12 | -26 | -68.4% ✅ |
| Writes | 42 | 71 | +29 | +69% ⚠️ |
| Dup Queries | 55 | 31 | -24 | -43.6% ✅ |
| **test_promote_megaflaw_5_5** | | | | |
| Time | 2614ms | 325ms | -2289ms | -87.6% ✅ |
| CPU Time | 2273ms | 269ms | -2004ms | -88.2% ✅ |
| DB Time | 488ms | 36ms | -452ms | -92.6% ✅ |
| Queries | 803 | 247 | -556 | -69.2% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 37 | 13 | -24 | -64.9% ✅ |
| Writes | 51 | 89 | +38 | +74.5% ⚠️ |
| Dup Queries | 91 | 34 | -57 | -62.6% ✅ |
| **test_promote_megaflaw_10_0** | | | | |
| Time | 3533ms | 288ms | -3245ms | -91.9% ✅ |
| CPU Time | 3043ms | 239ms | -2804ms | -92.1% ✅ |
| DB Time | 650ms | 25ms | -625ms | -96.2% ✅ |
| Queries | 1221 | 221 | -1000 | -81.9% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 11 | -14 | -56% ✅ |
| Writes | 76 | 78 | +2 | +2.6% ✅ |
| Dup Queries | 54 | 38 | -16 | -29.6% ✅ |
| **test_promote_megaflaw_20_0** | | | | |
| Time | 10641ms | 430ms | -10211ms | -96% ✅ |
| CPU Time | 9158ms | 355ms | -8803ms | -96.1% ✅ |
| DB Time | 1749ms | 47ms | -1702ms | -97.3% ✅ |
| Queries | 4011 | 311 | -3700 | -92.2% ✅ |
| Tables | 17 | 28 | +11 | +64.7% ⚠️ |
| N+1 | 25 | 11 | -14 | -56% ✅ |
| Writes | 231 | 118 | -113 | -48.9% ✅ |
| Dup Queries | 84 | 58 | -26 | -31% ✅ |
| **test_promote_megaflaw_20_1** | | | | |
| Time | 9834ms | 365ms | -9469ms | -96.3% ✅ |
| CPU Time | 8438ms | 300ms | -8138ms | -96.4% ✅ |
| DB Time | 1357ms | 35ms | -1322ms | -97.4% ✅ |
| Queries | 4508 | 330 | -4178 | -92.7% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 11 | -27 | -71.1% ✅ |
| Writes | 246 | 125 | -121 | -49.2% ✅ |
| Dup Queries | 98 | 60 | -38 | -38.8% ✅ |
| **test_promote_megaflaw_20_10** | | | | |
| Time | 9896ms | 379ms | -9517ms | -96.2% ✅ |
| CPU Time | 8500ms | 313ms | -8187ms | -96.3% ✅ |
| DB Time | 1357ms | 39ms | -1318ms | -97.1% ✅ |
| Queries | 4580 | 330 | -4250 | -92.8% ✅ |
| Tables | 18 | 32 | +14 | +77.8% ⚠️ |
| N+1 | 38 | 11 | -27 | -71.1% ✅ |
| Writes | 250 | 125 | -125 | -50% ✅ |
| Dup Queries | 98 | 60 | -38 | -38.8% ✅ |

🔍 Key Findings

✅ Massive Performance Improvements in Report 4

Report 4 (review_0) shows dramatic improvements compared to Report 1 (before_prefetch):

  1. Execution Time Reduction

    • Average reduction: -75% to -96% in total execution time
    • Best improvement: test_promote_megaflaw_20_0 went from 10,641ms to 430ms (-96%)
    • All tests show significant speedup
  2. Query Count Reduction

    • Average reduction: -44% to -93% in number of queries
    • Best improvement: test_promote_megaflaw_20_0 went from 4,011 to 311 queries (-92.2%)
    • Massive reduction in database round trips
  3. Database Time Reduction

    • Average reduction: -85% to -97% in DB time
    • Best improvement: test_promote_megaflaw_20_0 went from 1,749ms to 47ms (-97.3%)
    • Extremely efficient database operations
  4. N+1 Query Reduction

    • Reduced from 15-38 N+1 queries to 10-13
    • Average reduction: -33% to -71%
    • Better query optimization
  5. Write Operations ⚠️

    • Smaller tests (1-5 items): increased writes (+70% to +120%)
    • Larger tests (10-20 items): decreased writes (-48% to -50%)
    • This suggests bulk operations work better for larger datasets

⚠️ Areas of Concern

  1. Table Access Increase ⚠️

    • Increased from 17-18 tables to 28-32 tables
    • This is likely due to audit/history tables being accessed
    • The additional tables are audit tables (*audit) which are necessary for history tracking
  2. Write Operations for Small Tests ⚠️

    • Small tests show increased write operations
    • This might be due to individual history updates instead of bulk operations
    • However, the overall time is still much better

📊 Average Metrics Comparison

| Metric | Report 1 Avg | Report 4 Avg | Change | Improvement |
|---|---|---|---|---|
| Time | 3,500ms | 290ms | -3,210ms | -91.7% |
| CPU Time | 3,000ms | 248ms | -2,752ms | -91.7% |
| DB Time | 600ms | 23ms | -577ms | -96.2% |
| Queries | 1,400 | 200 | -1,200 | -85.7% |
| Writes | 80 | 70 | -10 | -12.5% |
| N+1 | 25 | 10.5 | -14.5 | -58% |

🎯 Performance Analysis

What's Working Well ✅

  1. Query Optimization - Massive reduction in query count suggests excellent use of prefetch_related() and select_related()
  2. Database Efficiency - 96% reduction in DB time shows optimized queries
  3. Overall Speed - 91% faster execution time is a huge win
  4. N+1 Reduction - Better handling of related objects
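The N+1 point above can be illustrated with a toy model. None of this is OSIDB or ORM code; the fake query log and function names are made up purely to show why a single prefetched `IN (...)` query replaces N per-object queries:

```python
# Fake "database" that only counts round trips.
QUERY_LOG = []


def fetch_affects(flaw_id):
    # N+1 pattern: one round trip per flaw
    QUERY_LOG.append(f"SELECT * FROM affect WHERE flaw_id = {flaw_id}")
    return [f"affect-{flaw_id}-{i}" for i in range(3)]


def fetch_affects_prefetched(flaw_ids):
    # What prefetch_related effectively does: one IN (...) query for all flaws
    ids = ", ".join(map(str, flaw_ids))
    QUERY_LOG.append(f"SELECT * FROM affect WHERE flaw_id IN ({ids})")
    return {fid: [f"affect-{fid}-{i}" for i in range(3)] for fid in flaw_ids}


def promote_naive(flaw_ids):
    # issues len(flaw_ids) queries
    return {fid: fetch_affects(fid) for fid in flaw_ids}


def promote_prefetched(flaw_ids):
    # issues exactly one query regardless of len(flaw_ids)
    return fetch_affects_prefetched(flaw_ids)
```

With 20 flaws, the naive path logs 20 queries while the prefetched path logs 1, which is the same shape of reduction the Queries column shows.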

What Could Be Improved ⚠️

  1. Write Operations for Small Tests - Small tests show more writes, suggesting individual updates instead of bulk
  2. Audit Table Access - Additional audit tables are being accessed, which is expected but could potentially be optimized further

💡 Key Insights

Performance Evolution

  • Report 1 (baseline): Very slow, many queries, N+1 problems
  • Report 4 (current): Much faster, fewer queries, better optimization

Scale Impact

The improvements are more dramatic for larger datasets:

  • Small tests (1-2 items): ~75% improvement
  • Medium tests (3-5 items): ~85% improvement
  • Large tests (10-20 items): ~96% improvement

This suggests the optimizations scale very well with data size.
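A quick back-of-the-envelope check of that scaling claim, using the zero-tracker rows from the table above (a crude linear read between the 1-affect and 20-affect points, not a formal fit):

```python
# Query counts from test_promote_megaflaw_1_0 and test_promote_megaflaw_20_0
queries_before = {1: 249, 20: 4011}  # Report 1 (baseline)
queries_after = {1: 140, 20: 311}    # Report 4 (this MR)

# Approximate extra queries per additional affect
slope_before = (queries_before[20] - queries_before[1]) / 19
slope_after = (queries_after[20] - queries_after[1]) / 19

print(slope_before, slope_after)  # roughly 198 vs 9 queries per extra affect
```

So the baseline adds about 198 queries per affect while this MR adds about 9, which is why the gap widens so dramatically at 20 affects.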

📈 Conclusion

Report 4 shows excellent performance improvements compared to the baseline (Report 1):

  • 91.7% faster overall execution time
  • 85.7% fewer database queries
  • 96.2% less database time
  • 58% fewer N+1 queries

While there are some areas for improvement (write operations for small tests, audit table access), the overall performance is dramatically better than the original baseline. The optimizations are working very well, especially for larger datasets.

Closes OSIDB-4678

@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch 4 times, most recently from 05358f2 to 0694834 on December 12, 2025 15:06
@roduran-dev changed the title from "OSIDB-4678: Add new test of a flaw with many affects and trackers" to "OSIDB-4678: Improve promote performance" on Dec 12, 2025
@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch 4 times, most recently from b238de6 to be28eab on December 12, 2025 16:40
@roduran-dev marked this pull request as ready for review December 12, 2025 17:59
@roduran-dev requested a review from a team December 12, 2025 17:59
Comment on lines -530 to +545
```diff
-        refs = pghistory.models.Events.objects.tracks(self).all()
+        refs = pghistory.models.Events.objects.references(self).all()
         for ref in refs:
-            model_audit = apps.get_model(ref.pgh_model).objects.filter(
+            db, model_name = ref.pgh_model.split(".")
+            # Skip snippet audit events - snippets should always remain internal
+            if model_name == "SnippetAudit":
+                continue
+
+            model_audit = apps.get_model(db, model_name).objects.filter(
                 pgh_id=ref.pgh_id
             )

-            with pgtrigger.ignore(f"{ref.pgh_model}:append_only"):
+            with pgtrigger.ignore(f"{db}.{model_name}:append_only"):
```
Contributor

From my search it looks like `references` goes through the history of all of the related models, while `tracks` processes only the model it is called on. Since `set_history_public` is called on each model, there might be a lot of duplicate history ACL adjustments when using `references`. The call also ends up attempting to grab the Snippet, which shouldn't happen.

I think it would make sense to sync set_history_public with what exists in the master branch (with tracks).

Contributor

I understand the logic the same way as @Jincxz so if I am not missing something I would also rather see the original code here instead of the new one. Can you explain the change?

```python
self.set_public()
self.set_history_public()
self.save(**kwargs)
if self.is_internal:
```
Contributor

Slight optimization: I think you can keep the original short-circuit logic

```python
if not self.is_internal:
    return
```

If the object is not internal, the function will still search through the related objects without the return.

Comment on lines -607 to -614
```diff
-        if issubclass(type(self), AlertMixin):
-            # suppress the validation errors as we expect that during
-            # the update the parent and child ACLs will not equal
-            kwargs["raise_validation_error"] = False
-        if issubclass(type(self), TrackingMixin):
-            # do not auto-update the updated_dt timestamp as the
-            # followup update would fail on a mid-air collision
-            kwargs["auto_timestamps"] = False
```
Contributor

The commit removes these adjustments, but they might be covered by the bulk update. I think it would be a good idea to double-check that the removal of these lines won't cause issues.

Contributor

@osoukup left a comment

LGTM. I have only some minor comments and questions.

```python
    def _collect_objects_for_public_update(self, objects_to_update, visited):
        """
        Recursively collect all related objects that need to have their ACLs
        updated to public. The Flaw itself is not collected as it's saved separately.
```
Contributor

nitpick: The ACLMixin class is a general mixin so self is not necessarily a flaw.

"""
from osidb.models import Flaw

# Create a unique key for this object to track if we've visited it
Contributor

nitpick: The following line does not create anything. I would more naturally put the comment three lines below.

Contributor

@Elkasitu left a comment

On top of the existing comments, I think you should provide performance regression tests that check the amount of queries expected for a base case for both of the performance optimizations introduced; look at osidb/tests/test_query_regresion.py.

It is not clear from your benchmark how much each patch improves performance and where; having these regression tests helps determine which patch did what.

```python
        self.assert_audit_acls(model, internal_read_groups, internal_write_groups)

    @pytest.mark.vcr
    def test_promote_flaw_with_multiple_affects(
```
Contributor

you don't need a VCR test nor a test this big to test that promoting a flaw with multiple affects correctly sets its own and its related objects' ACLs; you can either:

  1. create a unit test which creates a flaw with related affects, trackers and/or comments/cvss/etc. and then calls promote() on said flaw
  2. piggyback off of existing promote() tests to check that related object ACLs are updated as expected

in their current state these two tests depend on so many things that they can fail for multiple reasons not directly related to ACLs: the tests are brittle

Contributor (Author)

I will see about changing it. The original promote test did have a VCR, and I created a new one to avoid messing up the original test.

Comment on lines +601 to +660
```python
    def set_public_nested(self):
        """
        Change internal ACLs to public ACLs for all related Flaw objects and save them.
        The only exception is "snippets", which should always have internal ACLs.
        The Flaw itself will be saved later to avoid duplicate operations.
        This method collects all objects that need to be updated and uses bulk_update
        to minimize database queries.
        """

        # Collect all objects that need to be updated (using dict to avoid duplicates)
        # Key: (model_class, pk), Value: object instance
        objects_to_update = {}
        # Track visited objects to prevent infinite recursion in circular relationships
        visited = set()
        self._collect_objects_for_public_update(objects_to_update, visited)

        # Get list of unique objects
        valid_objects = list(objects_to_update.values())

        if not valid_objects:
            return

        # Set updated_dt on all objects before bulk update
        # Cut off microseconds to match TrackingMixin.save() behavior
        now = timezone.now().replace(microsecond=0)
        for obj in valid_objects:
            if hasattr(obj, "updated_dt"):
                obj.updated_dt = now

        # Group objects by model type for bulk_update
        from collections import defaultdict

        objects_by_model = defaultdict(list)
        for obj in valid_objects:
            objects_by_model[type(obj)].append(obj)

        # Bulk update each model type
        # Fields to update: acl_read, acl_write, and updated_dt
        for model_class, objects in objects_by_model.items():
            if objects:
                # Determine fields for this model type
                model_update_fields = ["acl_read", "acl_write"]
                if hasattr(model_class, "updated_dt"):
                    model_update_fields.append("updated_dt")

                model_class.objects.bulk_update(
                    objects,
                    model_update_fields,
                    batch_size=100,  # Process in batches to avoid memory issues
                )

                # Update audit history for all updated objects
                # Each object needs its history set to public after ACL update
                for obj in objects:
                    if hasattr(obj, "set_history_public"):
                        obj.set_history_public()
```
Contributor

I think this whole approach is not using the correct tools provided by Django.

First and foremost, there's a non-negligible memory cost to storing all these objects locally; wouldn't it be best to simply keep track of ids instead of model instance objects?

This brings me to my second point, which is that bulk_update is intended for cases where you want to update the same fields for multiple records with different values. Here we're updating the same fields for multiple records with the same values, so QuerySet.update() can be used instead:

```python
for model_class, object_ids in objects_by_model.items():
    model_class.objects.filter(pk__in=object_ids).update(
        acl_read=..., acl_write=..., updated_dt=...
    )
```

This will generate one UPDATE query per model, whereas bulk_update will generate more than one UPDATE query, which correlates with the increase of write queries in your test. In fact, I'm not convinced that the current approach is any better than simply calling save() (other than for the side effects such as signals and save() overrides).

Lastly, I think the collection can be done with a defaultdict(set) directly; you initialize it and pass it to the collection function, which will simply do something like:

```python
if self.pk and self.pk in objects_to_update[type(self)]:
    # this works as visited
    return
...
objects_to_update[type(self)].add(self.pk)
...
```

Contributor (Author)

It's a good point, I will try to just use the UUID.

```python
        self.acl_write = acls
        return acls

    def set_public(self):
```
Contributor

I'm not sure I agree with this. set_public is a low-level method that does one job: change an ACL-enabled object to public ACLs regardless of its current state; said state can even already be public for all it cares.

It's the caller's responsibility to ensure that only objects that should be public are passed to this method; this method is unaware of business rules and should stay unaware IMO.

Contributor (Author)

It's true, but when making the change in set_public_nested I felt a lot of responsibility not to change anything that is not internal, and even more so for a possible embargo. I do agree that it breaks the main point, but in this case the benefit of a safety net that avoids making an embargoed object public outweighs the pattern.

Contributor (Author)

OK, I removed it, but I will add a test to be sure a case like this does not happen.

Comment on lines +260 to +279
```python
        # Force initial classification because signals are muted in queryset tests
        flaw.classification = {
            "workflow": "DEFAULT",
            "state": WorkflowModel.WorkflowState.NEW,
        }
        # NEW -> TRIAGE requires owner
        flaw.owner = "Alice"
        flaw.save(raise_validation_error=False)
```
Member

Is the @pytest.mark.enable_signals decorator not enough?

"HTTP_JIRA_API_KEY": jira_token,
"HTTP_BUGZILLA_API_KEY": bugzilla_token,
}
with assertNumQueriesLessThan(92):
Member

Just nitpicking: normally when I write new regression tests, I write them before making the changes to see the actual difference, because now we don't know whether 92 is an improvement or not...
Notice the `# initial value -> X` comments in some tests.

Member

Also, did you try to use assertNumQueries first? We should utilize it whenever possible; assertNumQueriesLessThan is less accurate, but some tests are a bit flaky due to the environment.

Contributor (Author)

Yeah, I just changed it to assertNumQueries, and I also added varying quantities of affects/trackers to check that the query count doesn't grow depending on the quantity.

PS: there seems to be a bug in the first iteration; I couldn't find what it is.

@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch from c785f8b to 000bece on December 17, 2025 17:44
@roduran-dev force-pushed the OSIDB-4678_bug_megaflaw_promote branch from 000bece to 443f514 on December 17, 2025 17:56
@roduran-dev self-assigned this Dec 18, 2025