3210 batch uploads creation #1006
base: main
Conversation
Codecov Report
✅ All tests successful. No failed tests found.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1006      +/-   ##
==========================================
- Coverage   97.79%   97.79%   -0.01%
==========================================
  Files         447      447
  Lines       36175    36207      +32
==========================================
+ Hits        35376    35407      +31
- Misses        799      800       +1

Flags with carried forward coverage won't be shown.
❌ 1 Tests Failed:
View the top 1 failed tests by shortest run time
To view more test analytics, go to the Test Analytics Dashboard.
@@ -3727,10 +3727,6 @@ def test_create_report_upload(self, dbsession):
         assert res.totals is None
         assert res.upload_extras == {}
         assert res.upload_type == "uploaded"
-        assert len(res.flags) == 1
This no longer gets evaluated by this test/method since we got rid of the override
    report_service: ReportService,
) -> list[UploadArguments]:
    """
    This method possibly batch inserts uploads, flags and user measurements.
We know it "possibly" does something from the name. What are the conditions under which the upload will or will not be inserted?
It's described in the sentence after that, https://github.com/codecov/worker/pull/1006/files#diff-7ea195247ce946074a2441630db2be4bda78feed427fc0a1f1bd170cfc84a7a7R557 😅. It only happens for v4 uploads, since CLI uploads are created in API. Should I rephrase it a bit so it's clearer?
tasks/upload.py (Outdated)
    self,
    db_session: Session,
    repoid: int,
    upload_flag_map: dict[Upload, list | str | None] = {},
Using a mutable structure as a default argument value is not a good idea: the same dict instance will be shared across all function calls. It's better to make the default None, check for that, and then assign the real default of {} if needed:

    upload_flag_map: dict[Upload, list | str | None] | None = None,
    ...
    if upload_flag_map is None:
        upload_flag_map = {}
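To make the pitfall concrete, here is a minimal, standalone demonstration (not code from this PR) of how a mutable default leaks state across calls, together with the None-default fix:

```python
def track(item, seen={}):  # BUG: this dict is created once, at definition time
    seen[item] = True
    return seen


print(track("a"))  # {'a': True}
print(track("b"))  # {'a': True, 'b': True} -- state leaked from the first call


def track_fixed(item, seen=None):
    if seen is None:  # assign a fresh dict on every call instead
        seen = {}
    seen[item] = True
    return seen


print(track_fixed("a"))  # {'a': True}
print(track_fixed("b"))  # {'b': True} -- no shared state
```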
Thanks! Good suggestion, and my bad, you'd given me this protip some time ago as well 🫡
@@ -215,6 +215,13 @@ def _should_debounce_processing(upload_context: UploadContext) -> Optional[float]:
         return None


+class CreateUploadResponse(TypedDict):
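For context on the pattern being added here: TypedDict gives a plain dict a shape that type checkers can verify. A minimal sketch of how a response type like this is declared and used; the fields below are hypothetical, not the ones in the diff:

```python
from typing import TypedDict


class CreateUploadResponse(TypedDict):
    # Hypothetical fields, for illustration only
    upload_id: int
    flags: list[str] | None


def make_response(upload_id: int) -> CreateUploadResponse:
    # At runtime this is just a dict; the shape is only checked statically
    return {"upload_id": upload_id, "flags": None}
```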
nice nice
How's the naming here? I was not amazed by it 😂
Personal preference: I would like us to move the core business logic out of the Celery tasks, to better decouple what we are doing from the specific task-scheduling framework. This should help with whatever "platform revamp" we are aiming towards.
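A rough sketch of the decoupling being suggested, with every name assumed rather than taken from the codebase: the business logic lives in a plain function, and the Celery task is a thin wrapper around it:

```python
from celery import shared_task


def get_db_session():
    # Stand-in for however the app hands out DB sessions
    ...


def create_uploads_batch(db_session, repoid: int, arguments_list: list[dict]) -> list[dict]:
    """Core business logic: no Celery imports, so it is easy to unit test
    and to reuse outside the task-scheduling framework."""
    processed = []
    for arguments in arguments_list:
        # ... create Upload rows, flags and measurements via db_session ...
        processed.append(arguments)
    return processed


@shared_task
def create_uploads_task(repoid: int, arguments_list: list[dict]) -> list[dict]:
    # Thin scheduling shim: acquire resources, then delegate
    return create_uploads_batch(get_db_session(), repoid, arguments_list)
```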
# List + helper mapping to track possible upload + flags to insert later
uploads_list: list[Upload] = []
upload_flag_map: dict[Upload, list | str | None] = {}

for arguments in upload_context.arguments_list():
    arguments = upload_context.normalize_arguments(commit, arguments)
    if "upload_id" not in arguments:
        upload = report_service.create_report_upload(arguments, commit_report)
If I remember correctly, create_report_upload internally creates the Upload model, so I'm not sure the bulk_save_objects based on the uploads_list is effective at all?
Yeah, you are right, I was doing that from a previous implementation.
I'd like to use the bulk_save, but I need the upload's primary key for the "arguments["upload_id"] = upload.id_" line; that's why we're doing the add/flush (and therefore the bulk_save is kinda redundant). I'll get rid of it for now, but can you think of a way to make it work?
I've been giving it some thought, and the closest thing I've come up with is to skip the add/flush, build an arguments-to-upload mapping (without the primary key), do the bulk_save, then re-query the db for the uploads we just created and add the upload_id to the arguments based on the mapping, then continue execution. But that re-querying seems inefficient, and I feel it makes this whole thing a bit more complicated.
I could just get rid of the bulk_save and still benefit from the other changes, though.
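For reference, the trade-off in this thread, shown as a self-contained SQLAlchemy sketch (the model is a minimal stand-in, not worker's real Upload): flush() populates primary keys on pending instances, while bulk_save_objects only writes PKs back when return_defaults=True is passed, which costs back much of its speed advantage:

```python
from sqlalchemy import Integer, String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Upload(Base):  # minimal stand-in for worker's Upload model
    __tablename__ = "uploads"
    id_ = mapped_column("id", Integer, primary_key=True)
    state: Mapped[str] = mapped_column(String, default="started")


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # add + flush: the PK is populated on the instance, which is exactly
    # what arguments["upload_id"] = upload.id_ depends on.
    upload = Upload()
    session.add(upload)
    session.flush()
    assert upload.id_ is not None

    # bulk_save_objects: skips identity-map bookkeeping, but PKs are only
    # written back to the objects with return_defaults=True, which negates
    # much of the speed advantage.
    batch = [Upload() for _ in range(3)]
    session.bulk_save_objects(batch, return_defaults=True)
    assert all(u.id_ is not None for u in batch)
```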
Another idea here would be, since getting the arguments without an upload_id is one of the legacy upload paths: we could just patch that legacy upload path to create the Upload object and trigger a PreProcess to do carry-forwarding, then we can delete the code responsible for this here.
It's not ideal either :-(
Yeah, this is another option. I think the counter to that is that we couldn't batch insert, since it would be done at an upload level; that being said, we already do this with the CLI, so arguably another point of performance there.
To me, though, it seems the responsibility to make an upload shouldn't be in API (although I get why we currently do it this way); instead it should be done by a not-yet-existent "upload service" that would come with the new services. I think worker is the closest thing to that atm, but yeah, I preferred to do it here since most of the code was here.
I agree this is the general direction, I just didn't contemplate this being part of the scope for this ticket. My compromise was that by wrapping the logic in more readable fns, it could help the copy-pasting when the time comes to refactor the worker logic. Where would you suggest this could live, or how were you suggesting this be done otherwise? Curious of your thoughts 👀
I believe this could live in worker/services/processing/types.py (lines 24 to 26 in 5634caa). There is a bunch more code cleanup to be done here, related to some unification of how the uploads are being handled in API.
Yeah, that makes sense. I want to keep the code as close as it was in terms of functionality and/or structure, and I'm also okay with not moving the code for now so as not to over-abstract, since we haven't really defined what the new services architecture will look like, unless you feel strongly about it. I just don't want to bloat the scope of the work here for now. I can leave a comment in the code or something to ensure we capture moving this into a "processing"-type service eventually, if that makes sense.
This is the more complete version of uploads, flags + measurements batch insertion. It consolidates all the efforts + changes related to improving N+1 issues. This change adds a fn that inserts uploads, flags + measurements when this is a v4 upload. All of it should be batched, which should reduce lock contention and improve performance.
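To illustrate the batching idea in the description (a sketch with assumed minimal models, not the PR's actual code): create the uploads with a single add_all/flush to obtain their primary keys, then insert all flag rows in one executemany statement instead of one round-trip per upload. Measurements would follow the same pattern as the flags:

```python
from sqlalchemy import ForeignKey, Integer, String, create_engine, insert
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


# Hypothetical minimal stand-ins for worker's upload/flag models
class Upload(Base):
    __tablename__ = "uploads"
    id_ = mapped_column("id", Integer, primary_key=True)


class UploadFlag(Base):
    __tablename__ = "upload_flags"
    id_ = mapped_column("id", Integer, primary_key=True)
    upload_id: Mapped[int] = mapped_column(ForeignKey("uploads.id"))
    flag_name: Mapped[str] = mapped_column(String)


def batch_create_uploads(session: Session, flags_per_upload: list[list[str]]) -> list[Upload]:
    uploads = [Upload() for _ in flags_per_upload]
    session.add_all(uploads)
    session.flush()  # populates the primary keys the flag rows need
    session.execute(  # one bulk INSERT for every flag of every upload
        insert(UploadFlag),
        [
            {"upload_id": u.id_, "flag_name": name}
            for u, names in zip(uploads, flags_per_upload)
            for name in names
        ],
    )
    return uploads


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    batch_create_uploads(session, [["unit"], ["unit", "integration"]])
    session.commit()
```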
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.