Skip to content

Conversation

@LonMcGregor
Copy link
Contributor

@LonMcGregor LonMcGregor commented Dec 10, 2025

Adds a file validator for PR submissions.

The directory files should be changed in is defined in an issue body, in a comment, like so:

<!---
CHANGE_DIR=^Sprint-1/
--->
  • If it is not present, the bot does not do this check

  • If it is present, this regexp is checked against every changed file present in a PR submission

  • If any fail to match, i.e. something in a wrong directory was changed, then the bot issues a warning

  • If no files are submitted at all, the bot issues a warning

  • I propose any metadata that needs checked be included in these style of comments, and this metadata will need to be added for every task submission before the bot will be able to perform checks

  • Currently only supports checking a single directory, but could be extended to have multiple, could also extend to do things like checking max number of committed files or any other metadata visible to PRs

You can see an example of output of this check here: CodeYourFuture/Module-Structuring-and-Testing-Data#873

@LonMcGregor LonMcGregor marked this pull request as ready for review December 10, 2025 16:02
@LonMcGregor
Copy link
Contributor Author

@illicitonion Here is the validator for checking the right files are committed. Let me know what you think, in particular if I've made any faux-pas in rust, re the matching and error handling. I'm still getting used to that.

Copy link
Member

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here! This looks good, but I left a few comments about some general Rust things :)

CouldNotMatch,
BadTitleFormat { reason: String },
UnknownRegion,
WrongFiles { files: String },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure at the call-site whether this was the expected files or the incorrect files:

Suggested change
WrongFiles { files: String },
WrongFiles { expected_files_pattern: String },

Comment on lines +290 to +293
let task_issue_body = match task_issue.body {
Some(body) => body,
None => return Ok(None), // Task is empty, nothing left to check
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but I may consider:

Suggested change
let task_issue_body = match task_issue.body {
Some(body) => body,
None => return Ok(None), // Task is empty, nothing left to check
};
let task_issue_body = task_issue.body.unwrap_or_default();

To end up with an empty string which we can just handle as if the issue body was the empty string.

I'm not sure why issue bodies are Option<String> rather than just String - I can't imagine an issue without a body, but maybe some APIs don't return them or something...

Comment on lines +301 to +314
// Get all of the changed files
let pr_files_pages = octocrab
.pulls(org_name, module_name)
.list_files(pr_number)
.await
.context("Failed to get changed files")?;
if pr_files_pages.items.is_empty() {
return Ok(Some(ValidationResult::NoFiles)); // no files committed
}
let pr_files_all = octocrab
.all_pages(pr_files_pages)
.await
.context("Failed to list all changed files")?;
let pr_files = pr_files_all.into_iter();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a handy util for this in trainee_tracker::octocrab::all_pages - you'll probably need to use it and make it pub instead of pub(crate):

Suggested change
// Get all of the changed files
let pr_files_pages = octocrab
.pulls(org_name, module_name)
.list_files(pr_number)
.await
.context("Failed to get changed files")?;
if pr_files_pages.items.is_empty() {
return Ok(Some(ValidationResult::NoFiles)); // no files committed
}
let pr_files_all = octocrab
.all_pages(pr_files_pages)
.await
.context("Failed to list all changed files")?;
let pr_files = pr_files_all.into_iter();
let pr_files = all_pages("changed files", octocrab, async || {
octocrab
.pulls(org_name, module_name)
.list_files(pr_number)
.await
}).await?;
if pr_files.is_empty() {
return Ok(Some(ValidationResult::NoFiles)); // no files committed
}

Some(body) => body,
None => return Ok(None), // Task is empty, nothing left to check
};
let directory_description = Regex::new("CHANGE_DIR=(.+)\\n").unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally follow a convention that unwrap calls should either:

  1. Have a comment above describing why they can't fail (e.g. // UNWRAP: Statically known good regex), or
  2. Be replaced with an .expect("explanation") (e.g. .expect("Statically known regex failed to parse"))

Same applies below on line 296.

None => return Ok(None), // There is no match defined for this task, don't do any more checks
};
let directory_matcher = Regex::new(directory_description_regex)
.context("Invalid regex for task directory match")?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a link to the GitHub issue to this context, as well, so we know where to look/fix:

Suggested change
.context("Invalid regex for task directory match")?;
.with_context(|| format!("Invalid regex for task directory match in issue {}", task_issue.html__url))?;


// Given a vector of sprints, and a target pr number, for a given person
// return the issue ID for the associated assignment descriptor
pub fn get_descriptor_id_for_pr(sprints: Vec<SprintWithSubmissions>, target_pr_number: u64) -> u64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than defaulting to 0, I'd expect this to either:

  1. Return an Option<u64> so that we can explicitly signal "We couldn't find the descriptor ID"
  2. Return an Option<&Submission>


// Given a vector of sprints, and a target pr number, for a given person
// return the issue ID for the associated assignment descriptor
pub fn get_descriptor_id_for_pr(sprints: Vec<SprintWithSubmissions>, target_pr_number: u64) -> u64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you're just returning a number, you don't need to take ownership of this Vec - and if you do, you shouldn't need to clone the submissions below on line 1030.

In general there are two ways of iterating over things - with .iter() (which doesn't require ownership, but means you may need to copy things if you're returning them), or with .into_iter() (which consumes the value, but avoids the need to copy things).

Here are two different ways to write the code you have at the moment:

  1. Taking ownership - note the use of into_iter() and the lack of clone():
// Given a vector of sprints, and a target pr number, for a given person
// return the issue ID for the associated assignment descriptor
pub fn get_descriptor_id_for_pr(sprints: Vec<SprintWithSubmissions>, target_pr_number: u64) -> u64 {
    match sprints
        .into_iter()
        .flat_map(|sprint_with_subs| sprint_with_subs.submissions)
        .filter_map(|missing_or_submission| match missing_or_submission {
            SubmissionState::Some(s) => Some(s),
            _ => None,
        })
        .find(|submission| match submission {
            Submission::PullRequest { pull_request, .. } => pull_request.number == target_pr_number,
            _ => false,
        }) {
        Some(Submission::PullRequest {
            assignment_descriptor,
            ..
        }) => assignment_descriptor,
        _ => 0,
    }
}
  1. Not taking ownership:
// Given a vector of sprints, and a target pr number, for a given person
// return the issue ID for the associated assignment descriptor
pub fn get_descriptor_id_for_pr(sprints: &Vec<SprintWithSubmissions>, target_pr_number: u64) -> u64 {
    match sprints
        .iter()
        .flat_map(|sprint_with_subs| sprint_with_subs.submissions.iter()) // Note: We explicitly call `.iter()` here to make sure we're borrowing when iterating over this, rather than taking ownership - by default there are "just iterate this for me" style coercions which will implicitly use `.into_iter()` by default, so we need to be explicit here to avoid the clones)
        .filter_map(|missing_or_submission| match missing_or_submission {
            SubmissionState::Some(s) => Some(s),
            _ => None,
        })
        .find(|submission| match submission {
            Submission::PullRequest { pull_request, .. } => pull_request.number == target_pr_number,
            _ => false,
        }) {
        Some(Submission::PullRequest {
            assignment_descriptor,
            ..
        }) => *assignment_descriptor, // Note the * - this is equivalent to calling `.clone()` but because `u64` is `Copy` we can just dereference to copy it. This is cheap. Personally I don't love using `*assignment_descriptor` and would prefer `assignment_descriptor.clone()` but the language community doesn't, so...
        _ => 0,
    }
}

Extending 2 a little bit, we can actually write the signature:

pub fn get_descriptor_id_for_pr(sprints: &[SprintWithSubmissions], target_pr_number: u64) -> u64 {

replacing the &Vec<SprintWithSubmissions> with &[SprintWithSubmissions] because a Vec can be treated like a slice - this is slightly more general, because it allows us to pass more types to this function.

I'd recommend avoiding both taking ownership where not needed, and the extra clones, by doing approach 2 :)

task_issue_number: u64,
) -> Result<Option<ValidationResult>, Error> {
// Get the Sprint Task's description of expected changes
let task_issue = match octocrab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match works here (and throughout), but FYI there's also a concept of guard clases - you could also write this:

    let Ok(task_issue) = octocrab
        .issues(org_name, module_name)
        .get(task_issue_number)
        .await
    else {
        return Ok(Some(ValidationResult::CouldNotMatch)); // Failed to find the right task
    };

I don't have a strong preference either way here - each can be more clear in different contexts, but it's good to know this is possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants