Dataset creation for backout commits #4159

Draft: wants to merge 38 commits into base: master

Conversation

benjaminmah
Contributor

@benjaminmah benjaminmah commented May 2, 2024

Script to generate dataset of bug-inducing commits, backout commits, and the subsequent fix commit.

Intended to include:

  • The hashes of the three commits.
  • Metadata of each commit (pushdate, desc).
  • The diff between the initial commit and the fix commit.
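One possible shape for a single dataset row, based on the three items listed above, is sketched here. The class names, the `node` field name, and the choice to store dates and diffs as plain strings are assumptions made for illustration, not the structure used in this PR.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CommitInfo:
    """Hash plus the metadata fields named in the description (hypothetical layout)."""

    node: str  # commit hash (field name is an assumption)
    pushdate: str  # push date, kept as a string in this sketch
    desc: str  # commit description


@dataclass
class BackoutRecord:
    """One row: the bug-inducing commit, its backout, and the subsequent fix."""

    inducing: CommitInfo
    backout: CommitInfo
    fix: Optional[CommitInfo]  # None when no fix commit was found
    diff: str  # diff between the initial commit and the fix commit


def to_row(record: BackoutRecord) -> dict:
    """Flatten a record into nested dictionaries that json.dump can serialize."""
    return asdict(record)
```

Keeping the fix commit optional lets the same row type represent backouts that were never followed by a fix.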

Member

@suhaibmujahid suhaibmujahid left a comment

Thank you, @benjaminmah! Please see my comments. Also, please fix the linting errors (you may want to consider installing pre-commit [1]).

Footnotes

  1. https://github.com/mozilla/bugbug#auto-formatting

def main():
    download_databases()

    commit_dict, bug_to_commit_dict, bug_dict = preprocess_commits_and_bugs()
Member

We may want to consider the space complexity when iterating over the whole dataset.
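One way to address the space concern is to stream over the commits and keep only the fields that are needed, rather than building full dictionaries up front. The sketch below illustrates the idea. The function name `iter_backed_out` and the field names `node`, `bug_id`, and `backedoutby` are assumptions for illustration and may not match the records used in this PR.

```python
from typing import Iterable, Iterator

# Assumed field names; the real commit records may use different keys.
KEPT_FIELDS = ("node", "bug_id", "backedoutby")


def iter_backed_out(commits: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield a trimmed record for each commit that was backed out.

    Only one commit is held at a time, and every other field is dropped,
    so memory use does not grow with the size of the full dataset.
    """
    for commit in commits:
        if not commit.get("backedoutby"):
            continue  # skip commits that were never backed out
        yield {key: commit.get(key) for key in KEPT_FIELDS}
```

Because the function is a generator, the caller decides whether to materialize the results (for example with `list(...)`) or to process them one at a time.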

Contributor Author

I removed unused keys when constructing the dictionaries and added a cache: the generated dictionaries are saved as JSON files and reused on later runs instead of being rebuilt. Let me know if this needs additional changes or fixes!
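A minimal sketch of the caching approach described in this comment follows. It assumes each dictionary is built by a zero-argument function and cached in its own JSON file. The helper names `load_or_build` and `keep_keys`, and the default key list, are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path
from typing import Callable


def keep_keys(commit: dict, keys: tuple = ("node", "bug_id", "pushdate", "desc")) -> dict:
    """Return a copy of the commit with only the listed keys (assumed names)."""
    return {key: commit[key] for key in keys if key in commit}


def load_or_build(path: Path, build: Callable[[], dict]) -> dict:
    """Load a dictionary from a JSON cache file, or build it and save it."""
    if path.exists():
        with path.open() as f:
            return json.load(f)

    data = build()
    with path.open("w") as f:
        json.dump(data, f)
    return data
```

One caveat with this design is that JSON object keys are always strings, so a dictionary keyed by integer bug IDs comes back with string keys after a cache hit. The cache also never expires on its own, so the file has to be deleted to pick up new data.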

@benjaminmah benjaminmah requested a review from suhaibmujahid May 7, 2024 14:20
… found, and number of commits with multiple non backed out commits following it
…he dataset, separated by filename and split into `added_lines` and `removed_line`.
@benjaminmah
Contributor Author

Example diffs extracted:

Backout Data Collection Validation

@benjaminmah benjaminmah requested a review from suhaibmujahid May 22, 2024 19:18
@benjaminmah benjaminmah requested a review from suhaibmujahid June 3, 2024 15:09