Dataset creation for backout commits #4159

Draft: wants to merge 38 commits into master

Changes from 29 commits (38 commits total)
eabf9bb
Created base script to construct dataset for backout commits
benjaminmah May 2, 2024
aaf8386
Created new directory to store dataset, added comments to script
benjaminmah May 2, 2024
c096468
Cleaned up code, restructured dataset to include the inducing, backou…
benjaminmah May 3, 2024
24046fd
Sample dataset (count_limit = 500)
benjaminmah May 3, 2024
3eb6605
Removed old datasets
benjaminmah May 3, 2024
2db5029
Skip 'fixing commits' that are actually backout commits
benjaminmah May 3, 2024
3516c09
Sample dataset (num_count = 500)
benjaminmah May 3, 2024
0544b27
Deleted dataset
benjaminmah May 6, 2024
49570ac
Added cache for processed dictionaries, removed unused fields, simpli…
benjaminmah May 6, 2024
fc37940
Split up function `filter_commits` to handle saving to directory and …
benjaminmah May 6, 2024
10314dd
Replaced list with generator, stylized code to match standard coding …
benjaminmah May 6, 2024
943eb40
Removed commented out code
benjaminmah May 6, 2024
8ed0784
Added new file to log commits that do not have a fix commit, used `bu…
benjaminmah May 7, 2024
39ab450
Added metric collection for number of fixes found, number of no fixes…
benjaminmah May 8, 2024
fe8114b
Added condition to only append to dataset if the number of non backed…
benjaminmah May 8, 2024
74939f2
Added the diff between the original commit and the fixing commit in t…
benjaminmah May 10, 2024
be10d51
Removed separating by `added_lines` and `removed_lines`, storing raw …
benjaminmah May 10, 2024
3a406ef
Added threshold for number of changes and separated diffs by file.
benjaminmah May 13, 2024
bc23a22
Added support for hglib grafting from `repository.py`
benjaminmah May 14, 2024
6058305
Added grafting support to apply original commit to parent commit of t…
benjaminmah May 14, 2024
e666c2e
Cleaned up code
benjaminmah May 15, 2024
40bbe1b
Removed storing bugs without fixes, limited bugs to be within the las…
benjaminmah May 15, 2024
a4c5bff
Reverted to storing the raw diff as a utf-8 encoded string.
benjaminmah May 15, 2024
f133041
Removed unnecessary fields when populating dataset, extract correct d…
benjaminmah May 21, 2024
d202b0b
Fixed type hinting
benjaminmah May 22, 2024
79152a3
Added `hg merge-tool` for automatically resolving conflicts when graf…
benjaminmah May 22, 2024
4740196
Fixed docstring for function `graft`
benjaminmah May 22, 2024
38d6cf8
Added check to omit any diff containing conflicts
benjaminmah May 23, 2024
9fc018c
Made code more Pythonic
benjaminmah May 27, 2024
846210f
Changed standard collections to generic types
benjaminmah Jun 3, 2024
ae28dcf
Implemented logging error when shelving changes
benjaminmah Jun 3, 2024
c6f6a8f
Implemented logging error when grafting
benjaminmah Jun 3, 2024
37c51b6
Renamed `bug_dict` and `bug_info` to `bug_resolution_map` and `bug_re…
benjaminmah Jun 3, 2024
fad6df6
Removed `commit_dict`
benjaminmah Jun 3, 2024
fb7a17d
Changed `logger.info` to `logger.warning` when error encountered whil…
benjaminmah Jun 4, 2024
bfc77e4
Reverted importing standard collections
benjaminmah Jun 4, 2024
66108ad
Added raise-from when shelving
benjaminmah Jun 4, 2024
0d83fa7
Removed try-except when grafting
benjaminmah Jun 4, 2024
72 changes: 71 additions & 1 deletion bugbug/repository.py
@@ -20,7 +20,14 @@
import threading
from datetime import datetime
from functools import lru_cache
from typing import Collection, Iterable, Iterator, NewType, Set, Union
from typing import (
    Collection,
    Iterable,
    Iterator,
    NewType,
    Set,
    Union,
)

import hglib
import lmdb
@@ -1543,6 +1550,69 @@ def trigger_pull() -> None:
    trigger_pull()


def get_diff(repo_path, original_hash, fix_hash) -> bytes:
    """Graft the original commit onto the parent of the fixing commit and diff the result against the fixing commit.

    Returns the raw diff as bytes, or an empty bytes object if the graft fails.
    """
    client = hglib.open(repo_path)

    current_rev = client.identify(id=True)

    # Shelve any local changes so the working directory is clean before updating.
    try:
        client.rawcommand([b"shelve"])
    except Exception:
        pass

    # Update to the parent of the fixing commit and graft the original commit onto it.
    parents = client.parents(rev=fix_hash)
    parent_of_fix = parents[0][1]
    client.update(rev=parent_of_fix, clean=True)

    graft_result = graft(
        client, revs=[original_hash], no_commit=True, force=True, tool=":merge"
    )

    if not graft_result:
        return b""

    # Diff the grafted working directory against the fixing commit.
    final_diff = client.diff(
        revs=[fix_hash], ignoreallspace=True, ignorespacechange=True, reverse=True
    )

    # Restore the repository to its previous state.
    client.update(rev=current_rev, clean=True)

    return final_diff


def graft(client, revs, no_commit=False, force=False, tool=":merge") -> bool:
    """Graft changesets specified by revs into the current repository state.

    Args:
        client: The hglib client.
        revs: A list of the hashes of the commits to be applied to the current repository state.
        no_commit: If True, does not commit and just applies changes in working directory.
        force: If True, forces the grafts even if the revs are ancestors of the current repository state.
        tool: A string representing a merge tool (see `hg help merge-tools`).

    Returns:
        Boolean of graft operation result (True for success, False for failure).
    """
    args = hglib.util.cmdbuilder(
        str.encode("graft"), r=revs, no_commit=no_commit, f=force, tool=tool
    )

    eh = hglib.util.reterrorhandler(args)

    try:
        client.rawcommand(args, eh=eh, prompt=auto_resolve_conflict_prompt)
    except Exception:
        return False

    return True


def auto_resolve_conflict_prompt(max_bytes, current_output):
    """Prompt callback passed to `hg graft` to answer interactive prompts without user input."""
    if b"was deleted in" in current_output:
        return b"c\n"  # Return 'c' to use the changed version
    return b"\n"  # Default to doing nothing, just proceed


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("repository_dir", help="Path to the repository", action="store")
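
As a rough usage sketch of the helpers added above (not part of this PR's diff): the repository path and revision hashes below are placeholders, assuming a local Mercurial clone at `hg_dir`.

from bugbug import repository

# Placeholder values for illustration only.
repo_dir = "hg_dir"
inducing_commit = "0123456789ab"
fixing_commit = "ba9876543210"

# Graft the inducing commit onto the parent of the fixing commit and diff it against the fix.
diff = repository.get_diff(repo_dir, inducing_commit, fixing_commit)

if diff:
    print(diff.decode("utf-8", errors="replace"))
else:
    print("Graft failed or produced no diff.")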
226 changes: 226 additions & 0 deletions scripts/backout_data_collection.py
@@ -0,0 +1,226 @@
import json
import logging
import os
from collections.abc import Generator
from datetime import datetime, timedelta

from tqdm import tqdm

from bugbug import bugzilla, db, repository

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def download_databases() -> None:
    logger.info("Cloning Mercurial database...")
    repository.clone(repo_dir="hg_dir")

    logger.info("Downloading bugs database...")
    assert db.download(bugzilla.BUGS_DB)

    logger.info("Downloading commits database...")
    assert db.download(repository.COMMITS_DB, support_files_too=True)


def preprocess_commits_and_bugs() -> tuple[dict, dict, dict]:
    logger.info("Preprocessing commits and bugs...")
    commit_dict, bug_dict = {}, {}
    bug_to_commit_dict: dict[int, list] = {}

    for commit in repository.get_commits(
        include_no_bug=True, include_backouts=True, include_ignored=True
    ):
        commit_data = {
            key: commit[key]
            for key in ["node", "bug_id", "pushdate", "backedoutby", "backsout"]
        }
        commit_dict[commit["node"]] = commit_data

        bug_to_commit_dict.setdefault(commit["bug_id"], []).append(commit_data)

    # We only require the bug's resolution (to check if it is 'FIXED').
    bug_dict = {
        bug["id"]: bug["resolution"] for bug in bugzilla.get_bugs(include_invalid=True)
    }

    return commit_dict, bug_to_commit_dict, bug_dict


def has_conflicts(diff: str) -> bool:
    """Return True if the diff contains any conflict markers. Used with merge-tool ':fail'."""
    conflict_markers = ["<<<<<<<", "=======", ">>>>>>>"]
    return any(marker in diff for marker in conflict_markers)


def generate_datapoints(
    commit_limit: int,
    commit_dict: dict,
    bug_to_commit_dict: dict,
    bug_dict: dict,
    repo_dir: str,
) -> Generator[dict, None, None]:
    counter = 0
    commit_limit = min(commit_limit, 709458)

    logger.info("Generating datapoints...")

    for commit in tqdm(
        repository.get_commits(
            include_no_bug=True, include_backouts=True, include_ignored=True
        )
    ):
        counter += 1

        bug_info = bug_dict.get(commit["bug_id"])

        pushdate = datetime.strptime(commit["pushdate"], "%Y-%m-%d %H:%M:%S")

        if (datetime.now() - pushdate) > timedelta(days=730):
            continue

        if not commit["backedoutby"] or bug_info != "FIXED":
            continue

        # We only add the commit if it has been backed out and the bug it is for is FIXED.
        fixing_commit, non_backed_out_commits = find_next_commit(
            commit["bug_id"],
            bug_to_commit_dict,
            commit["node"],
            commit["backedoutby"],
        )

        if not fixing_commit or non_backed_out_commits > 1:
            continue

        commit_diff = repository.get_diff(
            repo_dir, commit["node"], fixing_commit["node"]
        )

        if not commit_diff:
            continue

        commit_diff_encoded = commit_diff.decode("utf-8")

        if has_conflicts(commit_diff_encoded):
            continue

        yield {
            "non_backed_out_commits": non_backed_out_commits,
            "fix_found": True,
            "bug_id": commit["bug_id"],
            "inducing_commit": commit["node"],
            "backout_commit": commit["backedoutby"],
            "fixing_commit": fixing_commit["node"],
            "commit_diff": commit_diff_encoded,
        }

        if counter >= commit_limit:
            break


def find_next_commit(
    bug_id: int, bug_to_commit_dict: dict, inducing_node: str, backout_node: str
) -> tuple[dict, int]:
    backout_commit_found = False
    fixing_commit = None

    non_backed_out_counter = 0

    for commit in bug_to_commit_dict[bug_id]:
        # If the backout commit has been found in the bug's commit history,
        # find the next commit that has not been backed out or backs out other commits.
        if backout_commit_found:
            if (
                not commit["backedoutby"]
                and not fixing_commit
                and not commit["backsout"]
            ):
                fixing_commit = commit
                non_backed_out_counter += 1
            elif not commit["backedoutby"]:
                non_backed_out_counter += 1

        if commit["node"] == backout_node:
            backout_commit_found = True

    if (
        not fixing_commit
        or fixing_commit["node"] == inducing_node
        or fixing_commit["node"] == backout_node
    ):
        return {}, non_backed_out_counter

    return fixing_commit, non_backed_out_counter


def save_datasets(
    directory_path: str, dataset_filename: str, data_generator, batch_size: int = 10
) -> None:
    os.makedirs(directory_path, exist_ok=True)
    logger.info(f"Directory {directory_path} created")

    dataset_filepath = os.path.join(directory_path, dataset_filename)

    fix_found_counter = 0
    fix_batch = []

    with open(dataset_filepath, "w") as file:
        file.write("[\n")
        first = True

        logger.info("Populating dataset...")
        for item in data_generator:
            item.pop("fix_found", None)
            fix_batch.append(item)
            fix_found_counter += 1

            if len(fix_batch) >= batch_size:
                if not first:
                    file.write(",\n")
                else:
                    first = False

                json_data = ",\n".join(json.dumps(i, indent=4) for i in fix_batch)
                file.write(json_data)
                file.flush()
                os.fsync(file.fileno())
                fix_batch = []

        if fix_batch:
            if not first:
                file.write(",\n")
            json_data = ",\n".join(json.dumps(i, indent=4) for i in fix_batch)
            file.write(json_data)
            file.flush()
            os.fsync(file.fileno())

        file.write("\n]")

    logger.info(f"Dataset successfully saved to {dataset_filepath}")
    logger.info(f"Number of commits with fix found saved: {fix_found_counter}")


def main():
    download_databases()

    commit_dict, bug_to_commit_dict, bug_dict = preprocess_commits_and_bugs()
Review comment (Member): We may want to consider the space complexity when iterating over the whole dataset.

Reply (Contributor Author): Removed unused keys when constructing the dictionaries and implemented a cache that reuses the dictionaries generated by previous runs, saving them as JSON files. Let me know if this needs additional changes/fixes!

(A sketch of this caching idea appears after the script below.)

    data_generator = generate_datapoints(
        commit_limit=1000000,
        commit_dict=commit_dict,
        bug_to_commit_dict=bug_to_commit_dict,
        bug_dict=bug_dict,
        repo_dir="hg_dir",
    )

    save_datasets(
        directory_path="dataset",
        dataset_filename="backout_dataset.json",
        data_generator=data_generator,
        batch_size=1,
    )


if __name__ == "__main__":
    main()
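
The review exchange above mentions caching the preprocessed dictionaries between runs by saving them as JSON files. The snippet below is a minimal sketch of that idea, not the PR's actual implementation; the cache file names are assumptions, and it relies on the script's `preprocess_commits_and_bugs` being available.

import json
import os

# Hypothetical cache file names, one per dictionary.
CACHE_FILES = {
    "commit_dict": "commit_dict.json",
    "bug_to_commit_dict": "bug_to_commit_dict.json",
    "bug_dict": "bug_dict.json",
}


def load_or_build_dictionaries():
    # Reuse cached dictionaries from a previous run if they all exist on disk.
    # Note: JSON serialization turns integer bug IDs into string keys, so callers
    # would need to account for that when looking up bugs in the cached dictionaries.
    if all(os.path.exists(path) for path in CACHE_FILES.values()):
        cached = {}
        for name, path in CACHE_FILES.items():
            with open(path) as f:
                cached[name] = json.load(f)
        return cached["commit_dict"], cached["bug_to_commit_dict"], cached["bug_dict"]

    # Otherwise build the dictionaries from scratch and cache them for next time.
    commit_dict, bug_to_commit_dict, bug_dict = preprocess_commits_and_bugs()
    for name, data in [
        ("commit_dict", commit_dict),
        ("bug_to_commit_dict", bug_to_commit_dict),
        ("bug_dict", bug_dict),
    ]:
        with open(CACHE_FILES[name], "w") as f:
            json.dump(data, f)
    return commit_dict, bug_to_commit_dict, bug_dict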