
Conversation

@LecrisUT (Contributor) commented Apr 16, 2025

TODO:

  • Skip the check if /packit command is used
  • Actually check the files changed
  • Write new tests or update the old ones to cover new functionality.
  • Update or write new documentation in packit/packit.dev.

Depends-on: packit/ogr#921

Fixes packit/packit#1997
Fixes #2006

RELEASE NOTES BEGIN

copr_build jobs are triggered only if there are changed files under the paths field

RELEASE NOTES END

@softwarefactory-project-zuul (Contributor)

This change depends on a change that failed to merge.

Change packit/ogr#921 is needed.

Comment on lines +180 to +193
for changed_file in self.get_files_changed():
    # Check if any of the files changed are under the paths that are being tracked
    if any(changed_file.is_relative_to(p) for p in paths):
        return True
return False
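
For context, pathlib's is_relative_to (Python 3.9+) is a purely lexical prefix check, so the excerpt above matches any changed file located at or below one of the configured paths. A small self-contained illustration (the file and path values here are made up):

from pathlib import PurePath

paths = [PurePath("packages/foo"), PurePath("docs")]
changed_files = [PurePath("packages/foo/foo.spec"), PurePath("README.md")]

for changed_file in changed_files:
    # "packages/foo/foo.spec" is relative to "packages/foo", so it matches;
    # "README.md" is not under either configured path, so it does not.
    print(changed_file, any(changed_file.is_relative_to(p) for p in paths))
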
@LecrisUT (Contributor Author)

Actually, what do we do if we have an empty commit, like the Fedora rebuild commits? Should it be a special case to skip this check?

Member

like the Fedora rebuild commits?

those wouldn't be happening in upstream, so I don't think we have to think about that, or am I missing something?

@LecrisUT (Contributor Author)

I mean, if they want to have that behaviour. I occasionally do that 1, but if it's documented whether it is or is not supported, then that's fine as well.

Footnotes

  1. https://github.com/LecrisUT/FedoraRPM-atuin/commit/99555ccec9f6ed94fe474767e3b88f18aa835387


@lbarcziova (Member)

I briefly looked into the code and I am wondering whether it wouldn't be more fitting to have the logic for this check in this method, which would mean it would work generally, not just for Copr build jobs.

@LecrisUT (Contributor Author)

I briefly looked into the code and I am wondering whether it wouldn't be more fitting to have the logic for this check in this method, which would mean it would work generally, not just for Copr build jobs.

Probably, but it's not as extensible as with the Checker, since it needs another place to be aware of the checks being done. I see that there already are checks like pr_labels_match_configuration. What about extending the Checker design (or rather reimplementing a similar/simpler class), or at least aggregating the checks in a list of lambdas?

@lbarcziova moved this from new to in-progress in Packit Kanban Board, May 29, 2025
@lbarcziova (Member)

@LecrisUT can you elaborate more on what you are suggesting? I agree the code I linked could be refactored, but it is something different from the Checker's logic, as that is tied to the handlers, while the checks I linked are meant for actually getting the relevant job configurations for the particular event.

@LecrisUT (Contributor Author) commented May 30, 2025

@LecrisUT can you elaborate more on what you are suggesting? I agree the code I linked could be refactored, but it is something different from the Checker's logic, as that is tied to the handlers, while the checks I linked are meant for actually getting the relevant job configurations for the particular event.

Indeed, not using the same Checker, but mainly not having all the checks implemented in the function, i.e. something like

    def get_jobs_matching_event(self) -> list[JobConfig]:
        jobs_matching_trigger = []
        checkers = JobChecker.get_all_checkers()
        for job in self.event.packages_config.get_job_views():
            if all(c.check(job) for c in checkers):
                jobs_matching_trigger.append(job)
        return jobs_matching_trigger

What I like about the Checker design is that you can just create a class and, boom, a new check is introduced, but I would design it a bit simpler, like

from typing import ClassVar

class JobChecker:
    all_checkers: ClassVar[list[type["JobChecker"]]] = []

    def __init_subclass__(cls):
        # Register each checker subclass when the class is created
        cls.all_checkers.append(cls)

    @classmethod
    def get_all_checkers(cls) -> list["JobChecker"]:
        return [checker_cls() for checker_cls in cls.all_checkers]

    def check(self, job: "JobView") -> bool: ...

(Of course with a better interface, constructor, etc.)
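
Building on the sketch above, a hypothetical checker would then only need to subclass JobChecker to be registered automatically (the AreFilesChanged name and its check body here are just placeholders):

class AreFilesChanged(JobChecker):
    # Appended to JobChecker.all_checkers by __init_subclass__ the moment
    # this class definition is executed.
    def check(self, job: "JobView") -> bool:
        return True  # placeholder logic


# get_jobs_matching_event() would then pick it up through the registry:
assert any(isinstance(c, AreFilesChanged) for c in JobChecker.get_all_checkers())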

@lbarcziova (Member)

@LecrisUT understood, thanks for the explanation! Yes, that sounds like it could improve the readability and extensibility.

@lachmanfrantisek (Member)

Hi @LecrisUT, thanks for all the work so far. Do you think it's realistic to finish this? (Asking for the Copr team, which would hugely benefit from this.) Anything we can help with?

@LecrisUT (Contributor Author)

I can try to make some progress this week.

One question to be answered is #2780 (comment). There are a few ways to approach this:

  • Use ogr and navigate to the PR and get the files changed from there
  • Use the webhook metadata directly
  • Teach ogr to provide compare patches 1 and use the before/after revisions that are already in the webhook payload

I am inclined to go with the 3rd option, just because it will come in handy for the future.

Footnotes

  1. https://github.com/PyGithub/PyGithub/blob/2d4785dd08374147a8eab866eeb704508558336a/github/Repository.py#L1342

@lachmanfrantisek (Member)

Nice, thanks @LecrisUT !

I don't have a strong opinion, but using the webhook might avoid extra API calls (compared to the OGR approach). On the other hand, I'm not sure if this would be the same for other git forges. (With OGR, we might be able to find a way more easily.)

@lbarcziova (Member)

My suggestion would be to try to use webhooks directly as it seems the most straightforward (we have also already been having issues with API rate limits).

@LecrisUT (Contributor Author)

Where do I get the push event? Otherwise, it seems we still need the ogr changes for the pull-request trigger; the webhook does not have equivalent data.

@betulependule left a comment

Nice work. It looks good to me so far. Do you think you'll be able to make progress with this soon?

@betulependule

Where do I get the push event? Otherwise, it seems we still need the ogr changes for the pull-request trigger; the webhook does not have equivalent data.

Could you please elaborate on what you mean?

Comment on lines +171 to +181
# FIXME: This is probably unnecessary
package_config = self.package_config.get_package_config_for(self.job_config)


Why do you think this might be unnecessary?

@LecrisUT (Contributor Author)

I was thinking about the usage of the method get_package_config_for. In some other parts I see that the method is not used and I assume it would already be processed when self.package_config is used.


Hm, it seems that get_package_config_for is used to get the config from a config for multiple packages (usually named packages_config), based on a specific job config. In other existing code, I only see packages_config.get_package_config_for being used, but never package_config.get_package_config_for. I think using self.package_config directly might be the right way to do it, but I'm not exactly sure. Tests should hopefully reveal what is right.

@LecrisUT (Contributor Author)

Indeed, worst case this is just a no-op, but if anyone with more experience could chime in and confirm it quickly, then we can resolve this one way or the other. Let's keep this open for a bit.

@LecrisUT (Contributor Author)

Nice work. It looks good to me so far. Do you think you'll be able to make progress with this soon?

Yes, there's just one blocker to figure out and the relevant tests to add.

Where do I get the push event? Otherwise, it seems we still need the ogr changes for the pull-request trigger; the webhook does not have equivalent data.

Could you please elaborate on what you mean?

Sure, when I was looking at this, I couldn't find an appropriate mixin that can be used with Checker so that we have the Commit class, which would now include the list of commits added in f57d857. We have GetPagurePullRequestMixin, but I couldn't find an equivalent that would work for upstream PRs.

@betulependule

Sure, when I was looking at this, I couldn't find an appropriate mixin that can be used with Checker so that we have the Commit class, which would now include the list of commits added in f57d857. We have GetPagurePullRequestMixin, but I couldn't find an equivalent that would work for upstream PRs.

Yes, you're right that there are no equivalents for upstream PRs (as far as I can tell). I suppose we'll need a new Mixin class for this, then? @lbarcziova, what do you think? I'm still rather new to this project so I don't understand this part too much just yet 😅.
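
For concreteness, a rough sketch of what such a mixin might look like; the class name is made up, and it assumes the checker it is mixed into also exposes an ogr project object as self.project and the PR number as self.data.pr_id:

class GetGithubPullRequestMixin:
    # Hypothetical mixin giving checkers access to the upstream PR object.
    _pull_request = None

    @property
    def pull_request_object(self):
        # self.project (an ogr GitProject) and self.data.pr_id are assumed to be
        # provided by the base class / other mixins this is combined with.
        if self._pull_request is None:
            self._pull_request = self.project.get_pr(self.data.pr_id)
        return self._pull_request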

Comment on lines 159 to 161
elif self.job_config.trigger == JobConfigTriggerType.commit:
    # TODO: Get changed files from webhook push
    raise NotImplementedError()


This should probably be simple to implement. Both GitLab and GitHub webhook push events specify the commits field, where each item contains the added, modified and removed fields describing what files have been changed.

https://docs.github.com/en/webhooks/webhook-events-and-payloads#push
https://docs.gitlab.com/user/project/integrations/webhook_events/#push-events

However, I am a bit worried about this note from GitHub's docs: A maximum of 3000 changed files will be reported per commit. I am concerned that if a user's commit includes more than 3000 changed files, then the webhook will not report all of them and the checker might ignore some file changes, resulting in an obscure bug... Maybe there should be a check whether the number of changes reaches 3000, because if it does, then that means the webhook may have under-reported which files were changed and we should find the rest of the changed files using a different method.


Or for simplicity, if 3000 changes are reported in one of the fields added, modified or removed, return the root directory ./ to make sure the checker doesn't ignore any files not reported by the webhook push event.
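
A minimal sketch of that aggregation plus the fallback, assuming a parsed push payload shaped like the GitHub/GitLab documents linked above; the function name and payload variable are illustrative, and the per-commit total is used as one interpretation of the 3000-file cap:

from pathlib import PurePath

# Per-commit cap quoted from GitHub's docs above.
GITHUB_CHANGED_FILES_CAP = 3000


def changed_paths_from_push(payload: dict) -> list[PurePath]:
    # Illustrative helper: collect changed file paths from a push webhook payload.
    changed: set[str] = set()
    for commit in payload.get("commits", []):
        files = (
            commit.get("added", [])
            + commit.get("modified", [])
            + commit.get("removed", [])
        )
        if len(files) >= GITHUB_CHANGED_FILES_CAP:
            # The webhook may have under-reported this commit; fall back to the
            # repository root as a "treat everything as changed" marker, which
            # the checker would have to handle explicitly.
            return [PurePath(".")]
        changed.update(files)
    return [PurePath(f) for f in sorted(changed)]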

@LecrisUT (Contributor Author)

That comment is the same issue as #2780 (comment). I will try to join you on figuring out how to implement this. If you have any ideas from diving into the code, do let me know.

However, I am a bit worried about this note from GitHub's docs: A maximum of 3000 changed files will be reported per commit.

My gut instinct is that if it ends up with that many changed files, we might have other issues on the side. If they do that big of a change, we might as well either trigger the build for everything or let the users trigger it manually, and we just report " 👀 > 3000 files being changed!? Help me 🙏 "


Oh, you're referring to this?

Where do I get the push event?

I wasn't sure what you meant by the question back then. Sorry about that. I suppose you don't know how to access the push object containing the webhook push event info? I can try to find out how to access it. 👀

@LecrisUT (Contributor Author)

Yep, any hints on that are appreciated, since it also kinda relates to #2860.


Alright, I'll take a look. 👍 Thanks for your work so far.

@LecrisUT (Contributor Author)

Ok, I think I figured it out a bit. It should come from ConfigFromEventMixin, which would have either the commit, PR or equivalent event results. Commits I think are pretty straightforward, but PRs would be tricky since they can also be triggered by comments. Maybe we could ignore all non-push, non-pr-update events (i.e. make the checker always pass then). I updated the PR with a quick sketch of what I have so far, in case you get some ideas around this as well :)


ConfigFromEventMixin looks quite promising. Good work. 👍

Maybe we could ignore all non-push, non-pr-update events (i.e. make the checker always true then).

Yes, I think that makes sense.

I looked into this as well and in case ConfigFromEventMixin doesn't work, we might still have the event data available in the Checker class.

parser = nested_get(
    Parser.MAPPING,
    source,
    event_type,
    default=Parser.parse_event,
)
event_object: Optional[Event] = parser(event)

Above is where the events (GitLab and GitHub push events, PR events and others) are parsed. SteveJobs returns a list of TaskResult, which still has the event variable storing what we need (it contains the parsed event, which would contain the commits property of type CommitInfo).

I believe that the event variable is eventually used to initialize the Checker class, which the class AreFilesChanged inherits from. The property self.data is initialized with some properties of the event variable and the commits variable is lost.

self.data = EventData.from_event_dict(event)

If needed, we could store the entire event variable inside the Checker class (self.event = event) to make it available in AreFilesChanged.

Using ConfigFromEventMixin seems like a cleaner solution, so let's try that first. 👍
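
For reference, the fallback idea of storing the whole event might look roughly like this; the constructor shape is assumed from the self.data line quoted above rather than taken from the actual Checker code, and EventData is the class from the quoted line:

class Checker:
    def __init__(self, package_config, job_config, event: dict):
        self.package_config = package_config
        self.job_config = job_config
        self.data = EventData.from_event_dict(event)
        # Assumed addition: keep the raw parsed event dict around so a checker
        # such as AreFilesChanged could read the commits / changed files from it.
        self.event = event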

Comment on lines 157 to 158
if self.job_config.trigger == JobConfigTriggerType.pull_request:
    changes = self.copr_build_helper.pull_request_object.changes


I double-checked the GitHub and GitLab definitions of the MR / PR webhook events, and you are definitely right that the information about which files were changed is not present.

Do I understand correctly that your current approach is the following?

  1. In parser.py, once a PR / MR webhook event is received, use ogr to get a list of commits (cast as CommitInfo) contained in this PR / MR.
  2. Inside the AreFilesChanged class, take the list of commits and aggregate the three lists of files added, modified and removed.

@LecrisUT (Contributor Author)

In the current implementation, it does not go through ogr; with the changes in parser.py (parse_github_push_event), it materializes the webhook data into a more complete Python object that includes the files changed (but I did not properly test it; the failing CI should give us some reference for where this is connected).

2. Inside the AreFilesChanged class, take the list of commits and aggregate the three lists of files added, modified and removed.

That is what is supposed to happen (replacing self.copr_build_helper.pull_request_object.changes, which still comes from the required ogr change; I will drop it to make this clearer). As soon as we can get the output of parse_github_push_event into this checker class using a mixin, that would unlock everything we were missing here.


if not isinstance(pr_event, (github.pr.Action, gitlab.mr.Action)):
    # TODO: What about comments?
    raise NotImplementedError()
# TODO: How to handle PRs?


How about we create another Mixin class, which will use the existing get_pr_files_diff function in ogr to retrieve the files diff? It should return a dictionary, so we would need to process it into a list of files changed.

# dictionary representing the files diff
files_changed = project.get_pr_files_diff(pr_id=id_from_pr_event)

@LecrisUT (Contributor Author)

It does not seem to be available for github and gitlab? https://github.com/search?q=repo%3Apackit%2Fogr%20get_pr_files_diff&type=code And I think the internal API would not make it efficient to query that format. Maybe we go through packit/ogr#921?


Oh, you're right. In that case using packit/ogr#921 sounds good to me.


Labels: None yet

Projects: Status: in-progress

Development: Successfully merging this pull request may close these issues:

  • [RFE] Filter copr builds on path change
  • When using monorepo, react only on changes done to configured paths

5 participants