[WIP] feat: Filter copr_build jobs based on paths and files changed
#2780
Conversation
Build failed. ❌ pre-commit FAILURE in 1m 48s

Build failed. ✔️ pre-commit SUCCESS in 1m 52s
This change depends on a change that failed to merge. Change packit/ogr#921 is needed.
```python
for changed_file in self.get_files_changed():
    # Check if any of the files changed are under the paths that are being tracked
    if any(changed_file.is_relative_to(p) for p in paths):
        return True
return False
```
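For context, `pathlib.Path.is_relative_to` (available since Python 3.9) is what drives the check above. A tiny self-contained illustration — the `paths` values and the `is_tracked` helper are made up for this example:

```python
from pathlib import Path

# Illustrative tracked paths; is_relative_to returns True when the
# file sits under the given directory.
paths = [Path("src"), Path("packaging")]

def is_tracked(changed_file: Path) -> bool:
    return any(changed_file.is_relative_to(p) for p in paths)

print(is_tracked(Path("src/module/main.py")))  # True
print(is_tracked(Path("docs/index.md")))       # False
```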
Actually, what do we do if we have an empty commit, like the Fedora rebuild commits? Should it be a special case to skip this check?
> like the Fedora rebuild commits?

Those wouldn't be happening in upstream, so I don't think we have to think about that, or am I missing something?
I mean if they want to have that behaviour. I occasionally do that, but if it's documented whether it is or is not supported, then that's fine as well.
Build failed. ✔️ pre-commit SUCCESS in 1m 54s
I briefly looked into the code and I am wondering whether it wouldn't be more fitting to have the logic for checking this in this method, which would mean it would work generally, not just for Copr build jobs.

Probably, but it's not as extensible as with the Checker; it needs another place to be aware that checks are being done. I see that there already are checks like
@LecrisUT can you elaborate more on what you are suggesting? I agree the code I linked could be refactored, but it is something different from the Checker's logic, as that is tied to the handlers, while the checks I linked are meant for actually getting the relevant job configurations for the particular event.
Indeed, not using the same:

```python
def get_jobs_matching_event(self) -> list[JobConfig]:
    jobs_matching_trigger = []
    checkers = JobChecker.get_all_checkers()
    for job in self.event.packages_config.get_job_views():
        if all(c.check(job) for c in checkers):
            jobs_matching_trigger.append(job)
```

The `JobChecker` being something like:

```python
class JobChecker:
    all_checkers: Final[ClassVar[list[type[JobChecker]]]] = []

    def __init_subclass__(cls):
        # Register all checkers when the class is created
        cls.all_checkers.append(cls)

    @classmethod
    def get_all_checkers(cls) -> list[JobChecker]:
        return [checker_cls() for checker_cls in cls.all_checkers]

    def check(job: JobView) -> bool: ...
```

(Of course with a better interface, constructor, etc.)
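For reference, a minimal runnable version of the `__init_subclass__` registration pattern sketched above — the `AlwaysTrue` checker is purely illustrative, not part of the proposal:

```python
from __future__ import annotations
from typing import ClassVar

class JobChecker:
    # Every subclass appends itself here at class-creation time.
    all_checkers: ClassVar[list[type[JobChecker]]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.all_checkers.append(cls)

    @classmethod
    def get_all_checkers(cls) -> list[JobChecker]:
        return [checker_cls() for checker_cls in cls.all_checkers]

    def check(self, job) -> bool:
        raise NotImplementedError

class AlwaysTrue(JobChecker):
    def check(self, job) -> bool:
        return True

checkers = JobChecker.get_all_checkers()
print([type(c).__name__ for c in checkers])  # ['AlwaysTrue']
```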
@LecrisUT understood, thanks for the explanation! Yes, that sounds like it could improve the readability and extensibility.

Hi @LecrisUT, thanks for all the work so far. Do you think it's realistic to finish this? (Asking for the Copr team that would hugely benefit from this.) Anything we can help with?
I can try to make some progress this week. One question to be answered is #2780 (comment). There are a few ways to approach this:

I am inclined to go with the 3rd option, just because it will come in handy for the future.
Nice, thanks @LecrisUT! I don't have a strong opinion, but using the webhook might avoid extra API calls (compared to the OGR approach). On the other hand, I'm not sure this would be the same for other git forges. (With OGR, we might be able to find a way more easily.)
My suggestion would be to try to use webhooks directly as it seems the most straightforward (we have also already been having issues with API rate limits).
Where do I get the push event? Otherwise, it seems we still need the ogr changes for the pull-request trigger; the webhook does not have equivalent data.
Build failed. ✔️ pre-commit SUCCESS in 1m 56s

betulependule left a comment:
Nice work. It looks good to me so far. Do you think you'll be able to make progress with this soon?
Could you please elaborate on what you mean?
```python
# FIXME: This is probably unnecessary
package_config = self.package_config.get_package_config_for(self.job_config)
```
Why do you think this might be unnecessary?
I was thinking about the usage of the method `get_package_config_for`. In some other parts I see that the method is not used, and I assume it would already be processed when `self.package_config` is used.
Hm, it seems that `get_package_config_for` is used to get the config from a config for multiple packages (usually named `packages_config`) based on a specific job config. In other existing code, I only see `packages_config.get_package_config_for` being used, but never `package_config.get_package_config_for`. I think using `self.package_config` directly might be the right way to do it, but I'm not exactly sure. Tests should hopefully reveal what is right.
Indeed, worst case this is just a no-op, but if anyone with more experience could chime in and confirm it quickly, then we can resolve this one way or the other. Let's keep this open for a bit.
Yes, it's just one blocker to figure out, plus adding the relevant tests.

Sure, when I was looking at this, I couldn't find the appropriate mixin that can be used with

Yes, you're right that there are no equivalents for upstream PRs (as far as I can tell). I suppose we'll need a new Mixin class for this, then? @lbarcziova, what do you think? I'm still rather new to this project so I don't understand this part too much just yet 😅.
```python
elif self.job_config.trigger == JobConfigTriggerType.commit:
    # TODO: Get changed files from webhook push
    raise NotImplementedError()
```
This should probably be simple to implement. Both GitLab and GitHub webhook push events specify the `commits` field, where each item contains the `added`, `modified` and `removed` fields describing what files have been changed.

https://docs.github.com/en/webhooks/webhook-events-and-payloads#push
https://docs.gitlab.com/user/project/integrations/webhook_events/#push-events

However, I am a bit worried about this note from GitHub's docs: "A maximum of 3000 changed files will be reported per commit." I am concerned that if a user's commit includes more than 3000 changed files, the webhook will not report all of them and the checker might ignore some file changes, resulting in an obscure bug... Maybe there should be a check for whether the number of changes reaches 3000, because if it does, the webhook may have under-reported which files were changed and we should find the rest of the changed files using a different method.
Or for simplicity, if 3000 changes are reported in one of the fields `added`, `modified` or `removed`, return the root directory `./` to make sure the checker doesn't ignore any files not reported by the webhook push event.
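A sketch of that fallback, assuming the checker receives the raw `commits` array from the push payload (the `added`/`modified`/`removed` field names come from the GitHub/GitLab push event docs linked above; the helper name is hypothetical):

```python
from pathlib import Path

# GitHub documents a cap of 3000 reported files per commit.
MAX_REPORTED_FILES = 3000

def files_changed_from_push(commits: list[dict]) -> set[Path]:
    changed: set[Path] = set()
    for commit in commits:
        fields = (
            commit.get("added", []),
            commit.get("modified", []),
            commit.get("removed", []),
        )
        # If any field hits the cap, the payload may be truncated:
        # fall back to the repo root so no tracked path is skipped.
        if any(len(files) >= MAX_REPORTED_FILES for files in fields):
            return {Path(".")}
        for files in fields:
            changed.update(Path(f) for f in files)
    return changed

print(files_changed_from_push([{"added": ["a.py"], "removed": ["docs/b.md"]}]))
```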
That comment is the same issue as #2780 (comment). I will try to join you in figuring out how to implement this. If you have any ideas from diving into the code, do let me know.
> However, I am a bit worried about this note from GitHub's docs:
> "A maximum of 3000 changed files will be reported per commit."
My gut instinct is that if it ends up with that many changed files, we might be having other issues on the side. If they do that big of a change, then we might as well either trigger the build for everything or let the users trigger it manually, and we just report "👀 > 3000 files being changed!? Help me 🙏".
Oh, you're referring to this?

> Where do I get the push event?
I wasn't sure what you meant by the question back then. Sorry about that. I suppose you don't know how to access the push object containing the webhook push event info? I can try to find out how to access it. 👀
Yep, any hints on that are appreciated, since it also kinda relates to #2860.
Alright, I'll take a look. 👍 Thanks for your work so far.
Ok, I think I figured it out a bit. It should come from `ConfigFromEventMixin` and it would have either commit, PR, or equivalent event results. Commits I think are pretty straightforward, but PRs would be tricky since they can also take in comments. Maybe we could ignore all non-push, non-PR-update events (i.e. make the checker always return true then). I updated the PR with a quick sketch that I have so far, in case you get some ideas around this as well :)
`ConfigFromEventMixin` looks quite promising. Good work. 👍

> Maybe we could ignore all non-push, non-PR-update events (i.e. make the checker always return true then).

Yes, I think that makes sense.
I looked into this as well, and in case `ConfigFromEventMixin` doesn't work, we might still have the event data available in the `Checker` class.
packit-service/packit_service/worker/jobs.py, lines 197 to 203 in 3ce7c70:

```python
parser = nested_get(
    Parser.MAPPING,
    source,
    event_type,
    default=Parser.parse_event,
)
event_object: Optional[Event] = parser(event)
```
Above is where the events (GitLab and GitHub push events, PR events and others) are parsed. `SteveJobs` returns a list of `TaskResult`, which still has the `event` variable storing what we need (it contains the parsed event, which would contain the property `commits` of type `CommitInfo`).

I believe that the `event` variable is eventually used to initialize the `Checker` class, which the class `AreFilesChanged` inherits from. The property `self.data` is initialized with some properties of the `event` variable, and the `commits` variable is lost.
```python
self.data = EventData.from_event_dict(event)
```
If needed, we could store the entire `event` variable inside the `Checker` class (`self.event = event`) to make it available in `AreFilesChanged`.

Using `ConfigFromEventMixin` seems like a cleaner solution, so let's try that first. 👍
```python
if self.job_config.trigger == JobConfigTriggerType.pull_request:
    changes = self.copr_build_helper.pull_request_object.changes
```
I double-checked the GitHub and GitLab definitions of the MR / PR webhook events, and you are definitely right that the information about which files were changed is not present.

Do I understand correctly that your current approach is the following?

1. In `parser.py`, once a PR / MR webhook event is received, use `ogr` to get a list of commits (cast as `CommitInfo`) contained in this PR / MR.
2. Inside the `AreFilesChanged` class, take the list of commits and aggregate the three lists of files `added`, `modified` and `removed`.
In the current implementation, it does not go through ogr; with the changes in `parser.py` (`parse_github_push_event`), it would materialize the webhook data into a more complete Python object that includes the files changed (but I did not properly test it; the failing CI should give us some reference for where this is connected).

> 2. Inside the `AreFilesChanged` class, take the list of commits and aggregate the three lists of files `added`, `modified` and `removed`.

That is what is supposed to happen (replacing `self.copr_build_helper.pull_request_object.changes`, which is still from the required ogr change; I will drop it to make it clearer). As soon as we can get the output of `parse_github_push_event` in this checker class using a mixin, that would unlock everything we were missing here.
Signed-off-by: Cristian Le <[email protected]>
Signed-off-by: Cristian Le <[email protected]>
Build failed. ✔️ pre-commit SUCCESS in 1m 45s
```python
if not isinstance(pr_event, (github.pr.Action, gitlab.mr.Action)):
    # TODO: What about comments?
    raise NotImplementedError()
# TODO: How to handle PRs?
```
How about we create another Mixin class, which will use the existing `get_pr_files_diff` function in ogr to retrieve the files diff? It should return a dictionary, so we would need to process it into a list of files changed.

```python
# dictionary representing the files diff
files_changed = project.get_pr_files_diff(pr_id=id_from_pr_event)
```
It does not seem to be available for GitHub and GitLab? https://github.com/search?q=repo%3Apackit%2Fogr%20get_pr_files_diff&type=code And I think the internal API would not make it efficient to query in that format. Maybe we go through packit/ogr#921?
Oh, you're right. In that case using packit/ogr#921 sounds good to me.
TODO:

- `/packit` command is used
- packit/packit.dev

Depends-on: packit/ogr#921
Fixes packit/packit#1997
Fixes #2006

RELEASE NOTES BEGIN
`copr_build` jobs are triggered only if there are changed files under the `paths` field.
RELEASE NOTES END