3210 batch uploads creation #1006

Open
wants to merge 9 commits into
base: main
from

Conversation

adrian-codecov
Contributor

This is the more complete version of the batch insertion of uploads, flags + measurements. It consolidates the efforts + changes aimed at fixing N+1 issues. The change adds a fn that creates uploads, flags + measurements when handling a v4 upload. All of these inserts should be batched, which should ease the lock + performance issues.

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.79%. Comparing base (ac302e7) to head (0555c32).

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/upload.py 98.18% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1006      +/-   ##
==========================================
- Coverage   97.79%   97.79%   -0.01%     
==========================================
  Files         447      447              
  Lines       36175    36207      +32     
==========================================
+ Hits        35376    35407      +31     
- Misses        799      800       +1     
Flag Coverage Δ
integration 42.19% <98.18%> (+0.04%) ⬆️
unit 90.39% <18.18%> (-0.08%) ⬇️
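
For reference, the percentages in the report above follow from the hit and line counts. The sketch below reproduces the arithmetic. The 55-line patch size is an inference from "98.18182% with 1 line missing" and is not stated in the report, so treat it as an assumption.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines over coverable lines."""
    return 100 * hits / lines


# Project coverage, using the Hits and Lines rows of the coverage diff.
base = coverage_percent(35376, 36175)  # main
head = coverage_percent(35407, 36207)  # this PR

# Patch coverage. Assumption: 55 coverable lines in the patch, 54 of them hit.
patch = coverage_percent(54, 55)

print(f"base={base:.4f}% head={head:.4f}% patch={patch:.5f}%")
```

Both project figures round to 97.79%, so the reported drop comes from a difference that is smaller than the two displayed decimals.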

Flags with carried forward coverage won't be shown.



codecov-staging bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
tasks/upload.py 98.18% 1 Missing ⚠️



codecov-qa bot commented Jan 13, 2025

❌ 1 Tests Failed:

Tests completed | Failed | Passed | Skipped
1772 | 1 | 1771 | 4
View the top 1 failed tests by shortest run time
services/tests/test_report.py::TestReportService::test_create_report_upload
Stack Traces | 0.036s run time
self = <worker.services.tests.test_report.TestReportService object at 0x7fd85ed76270>
dbsession = <sqlalchemy.orm.session.Session object at 0x7fd85536ea80>

    @pytest.mark.django_db
    def test_create_report_upload(self, dbsession):
        arguments = {
            "branch": "master",
            "build": "646048900",
            "build_url": "http://github..../actions/runs/646048900",
            "cmd_args": "n,F,Q,C",
            "commit": "1280bf4b8d596f41b101ac425758226c021876da",
            "job": "thisjob",
            "flags": ["unittest"],
            "name": "this name contains more than 100 chars 1111111111111111111111111111111111111111111111111111111111111this is more than 100",
            "owner": "greenlantern",
            "package": "github-action-20210309-2b87ace",
            "pr": "33",
            "repo": "reponame",
            "reportid": "6e2b6449-4e60-43f8-80ae-2c03a5c03d92",
            "service": "github-actions",
            "slug": "greenlantern/reponame",
            "url": ".../C00AE6C87E34AF41A6D38D154C609782/1280bf4b8d596f41b101ac425758226c021876da/6e2b6449-4e60-43f8-80ae-2c03a5c03d92.txt",
            "using_global_token": "false",
            "version": "v4",
        }
        commit = CommitFactory.create()
        dbsession.add(commit)
        dbsession.flush()
        current_report_row = CommitReport(commit_id=commit.id_)
        dbsession.add(current_report_row)
        dbsession.flush()
        report_service = ReportService({})
        res = report_service.create_report_upload(arguments, current_report_row)
        dbsession.flush()
        assert res.build_code == "646048900"
        assert (
            res.build_url
            == "http://github..../actions/runs/646048900"
        )
        assert res.env is None
        assert res.job_code == "thisjob"
        assert (
            res.name
            == "this name contains more than 100 chars 1111111111111111111111111111111111111111111111111111111111111"
        )
        assert res.provider == "github-actions"
        assert res.report_id == current_report_row.id_
        assert res.state == "started"
        assert (
            res.storage_path
            == ".../C00AE6C87E34AF41A6D38D154C609782/1280bf4b8d596f41b101ac425758226c021876da/6e2b6449-4e60-43f8-80ae-2c03a5c03d92.txt"
        )
        assert res.order_number is None
        assert res.totals is None
        assert res.upload_extras == {}
        assert res.upload_type == "uploaded"
>       assert len(res.flags) == 1
E       assert 0 == 1
E        +  where 0 = len([])
E        +    where [] = <database.models.reports.Upload object at 0x7fd855b0fd50>.flags

services/tests/test_report.py:3730: AssertionError


codecov-public-qa bot commented Jan 13, 2025

❌ 1 Tests Failed:

Tests completed | Failed | Passed | Skipped
1772 | 1 | 1771 | 4
View the top 1 failed tests by shortest run time
services/tests/test_report.py::TestReportService::test_create_report_upload


✅ All tests successful. No failed tests were found.


@@ -3727,10 +3727,6 @@ def test_create_report_upload(self, dbsession):
assert res.totals is None
assert res.upload_extras == {}
assert res.upload_type == "uploaded"
assert len(res.flags) == 1
Contributor Author

This no longer gets evaluated by this test/method since we got rid of the override

@adrian-codecov adrian-codecov mentioned this pull request Jan 14, 2025
report_service: ReportService,
) -> list[UploadArguments]:
"""
This method possibly batch inserts uploads, flags and user measurements.
Contributor

we know it "possibly" does something from the name. What are the conditions that the upload will or will not be inserted?

Contributor Author

It's described in the sentence after that (https://github.com/codecov/worker/pull/1006/files#diff-7ea195247ce946074a2441630db2be4bda78feed427fc0a1f1bd170cfc84a7a7R557) 😅: this only happens for v4 uploads, since CLI uploads are created in API. Should I rephrase it a bit so it's clearer?
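
A minimal sketch of the condition being discussed, based on the `"upload_id" not in arguments` check that appears in the diff. The function name `split_by_upload_id` and the sample dictionaries are hypothetical and only illustrate which arguments would still need an upload created in worker.

```python
def split_by_upload_id(arguments_list: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate arguments that still need an Upload from those that already have one.

    Assumption: v4 uploads arrive without "upload_id", while CLI uploads
    already carry one because the upload was created in API.
    """
    needs_upload: list[dict] = []
    has_upload: list[dict] = []
    for arguments in arguments_list:
        if "upload_id" in arguments:
            has_upload.append(arguments)
        else:
            needs_upload.append(arguments)
    return needs_upload, has_upload


needs_upload, has_upload = split_by_upload_id(
    [
        {"version": "v4", "reportid": "a"},  # legacy path, no upload yet
        {"upload_id": 7, "reportid": "b"},   # created upstream already
    ]
)
```

Only the first list would go through the batched insert; the second is passed along unchanged.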

tasks/upload.py Outdated
self,
db_session: Session,
repoid: int,
upload_flag_map: dict[Upload, list | str | None] = {},
Contributor

using a mutable structure as default arg value is not a good idea. The same dict instance will be shared across all function calls.

It's better to make the default None and check for that, and then assign the real default of {} if needed.

upload_flag_map: dict[Upload, list | str | None] | None = None,
...
if upload_flag_map is None:
    upload_flag_map = {}

Contributor Author

Thanks! Good suggestion, and my bad, you'd given me this protip some time ago as well 🫡

@@ -215,6 +215,13 @@ def _should_debounce_processing(upload_context: UploadContext) -> Optional[float
return None


class CreateUploadResponse(TypedDict):
Contributor

nice nice

Contributor Author

How's the naming here? I was not amazed by it 😂

Contributor

@Swatinem left a comment

personal preference:
I would like us to move the core business logic out of "celery tasks" to better decouple the thing we are doing from the specific task scheduling framework. This should help with whatever "platform revamp" we are aiming towards.

tasks/upload.py Outdated
# List + helper mapping to track possible upload + flags to insert later
uploads_list: list[Upload] = []
upload_flag_map: dict[Upload, list | str | None] = {}

for arguments in upload_context.arguments_list():
    arguments = upload_context.normalize_arguments(commit, arguments)
    if "upload_id" not in arguments:
        upload = report_service.create_report_upload(arguments, commit_report)
Contributor

if I remember correctly, create_report_upload internally creates the Upload model, so I’m not sure the bulk_save_objects based on the uploads_list is effective at all?

Contributor Author

Yeah you are right, I was doing that from a previous implementation.

I'd like to use the bulk_save, but I need the upload's primary key for the "arguments["upload_id"] = upload.id_" line, which is why we're doing the add/flush (and therefore the bulk_save is kinda redundant). I'll get rid of it for now, but can you think of a way to make it work?

I've been giving it some thought, and the closest thing I've come up with is to skip the add/flush, build an arguments-to-upload mapping (without the primary key), do the bulk_save, then requery the db for the uploads we just created and add the upload_id to the arguments based on the mapping, then continue execution. But that requerying seems inefficient, and I feel it makes this whole thing a bit more complicated.

I could just get rid of the bulk_save and still benefit from the other changes though
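
The "bulk insert, then requery" idea described above could look roughly like this. It is a sketch only: it uses the standard-library sqlite3 module instead of SQLAlchemy, and the `uploads` table, its columns, and the use of `reportid` as a unique lookup key are assumptions made for illustration, not the real schema.

```python
import sqlite3


def bulk_insert_uploads(conn: sqlite3.Connection, arguments_list: list[dict]) -> list[dict]:
    """Insert all uploads in one batch, then requery to recover their primary keys."""
    if not arguments_list:
        return arguments_list

    report_ids = [arguments["reportid"] for arguments in arguments_list]

    # One batched statement instead of an add/flush per upload.
    conn.executemany(
        "INSERT INTO uploads (reportid, build_code) VALUES (?, ?)",
        [(a["reportid"], a.get("build")) for a in arguments_list],
    )

    # The extra round trip: look the new rows up by a key known before insert.
    placeholders = ", ".join("?" for _ in report_ids)
    rows = conn.execute(
        f"SELECT id, reportid FROM uploads WHERE reportid IN ({placeholders})",
        report_ids,
    ).fetchall()
    id_by_reportid = {reportid: upload_id for upload_id, reportid in rows}

    for arguments in arguments_list:
        arguments["upload_id"] = id_by_reportid[arguments["reportid"]]
    return arguments_list


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploads (id INTEGER PRIMARY KEY, reportid TEXT UNIQUE, build_code TEXT)"
)
batch = bulk_insert_uploads(
    conn,
    [{"reportid": "r-1", "build": "100"}, {"reportid": "r-2", "build": "101"}],
)
```

The cost is one extra SELECT per batch rather than one flush per upload, and it only works if some value known before the insert identifies each row uniquely.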

Contributor

another idea here would be, since getting the arguments without an upload_id is one of the legacy upload paths:
we could just patch that legacy upload path to create the Upload object and trigger a PreProcess to do carry-forwarding, then we can delete the code responsible for this here.
it's not ideal either :-(

Contributor Author

Yeah, this is another option. I think the counterargument is that we couldn't batch insert, since creation would happen per upload. That being said, we already do this with the CLI, so arguably that is another performance point to consider there.

To me though, it seems the responsibility for creating an upload shouldn't sit in API (although I get why we currently do it this way); it should instead belong to a not-yet-existing "upload service" that would come w/ the new services. I think worker is the closest thing to that atm, so I preferred to do it here, since most of the code was already here.

tasks/upload.py Outdated
@adrian-codecov
Contributor Author

> I would like us to move the core business logic out of "celery tasks" to better decouple the thing we are doing from the specific task scheduling framework. This should help with whatever "platform revamp" we are aiming towards.

I agree this is the general direction, I just didn't consider it part of the scope for this ticket. My compromise was to wrap the logic in more readable fns, which I was thinking could make the copy-pasting easier when the time comes to refactor the worker logic.

Where would you suggest this could live, or how else were you suggesting this be done? Curious about your thoughts 👀

@Swatinem
Contributor

I believe this could live in services.processing, maybe in a new file called orchestration.py or some such?

There is a bunch more code cleanup to be done here, related to some unification of how the uploads are being handled in API, like the normalize_arguments code can probably be removed completely by now, based on this comment:

# TODO(swatinem): remove these fields completely being passed from API:
# `redis_key` being removed in https://github.com/codecov/codecov-api/pull/960
redis_key: NotRequired[str]
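
One possible shape for that decoupling is sketched below: a plain function holds the logic and a thin task class only delegates to it. Every name here (`ensure_upload_ids`, `UploadTaskSketch`, the `create_upload` callable) is hypothetical, and the class is a stand-in that does not use Celery.

```python
from typing import Callable


def ensure_upload_ids(
    arguments_list: list[dict],
    create_upload: Callable[[dict], int],
) -> list[dict]:
    """Framework-agnostic core: give every arguments dict an upload_id.

    `create_upload` is injected so this function needs neither a task
    framework nor a database session to be tested.
    """
    normalized = []
    for arguments in arguments_list:
        if "upload_id" not in arguments:
            arguments = {**arguments, "upload_id": create_upload(arguments)}
        normalized.append(arguments)
    return normalized


class UploadTaskSketch:
    """Stand-in for a task class: adapts inputs and delegates, nothing more."""

    def __init__(self, create_upload: Callable[[dict], int]) -> None:
        self.create_upload = create_upload

    def run(self, arguments_list: list[dict]) -> list[dict]:
        return ensure_upload_ids(arguments_list, self.create_upload)


# Usage with a fake id generator in place of a real insert.
next_id = iter(range(1, 100))
task = UploadTaskSketch(create_upload=lambda arguments: next(next_id))
result = task.run([{"reportid": "a"}, {"reportid": "b", "upload_id": 42}])
```

With this split, the core function could move to a module such as the suggested `orchestration.py` without touching the task wrapper.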

@adrian-codecov
Contributor Author

Yeah, that makes sense. I want to keep the code as close as possible to what it was in terms of functionality and/or structure, and I'm also okay not moving the code for now to avoid over-abstracting, since we haven't really defined what the new services architecture would look like, unless you feel strongly about it. I just don't want to bloat the scope of the work here for now.

I can leave a comment in the code or something to ensure we capture this moving into a "processing" type service eventually if that makes sense

3 participants