21 changes: 19 additions & 2 deletions format_checker/forked_projects_util.py
@@ -40,6 +40,20 @@ def is_fp_sorted(data):
    return list1 == list2


+def find_duplicate_json_keys(file_path):
+    from collections import Counter
+    import re
+
+    with open(file_path, "r") as f:
Contributor:

Use the json library to read the file; it's presumably already done somewhere in this script.

Contributor Author:

json.load() is the standard way to parse JSON, but it automatically collapses duplicate keys by keeping only the last occurrence, so any duplicate keys in the original file are lost during parsing. For that reason I did not use json.load().
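
For illustration, the collapsing behavior is easy to see directly (a quick demonstration, not part of the PR):

```python
import json

# Duplicate keys are silently collapsed; only the last value survives.
print(json.loads('{"project": "first", "project": "second"}'))
# -> {'project': 'second'}
```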

Contributor:

Thanks for clarifying about json.load(); I wasn't aware of the default behavior. It seems that json.loads(..., object_pairs_hook=...) allows checking for duplicates (and doesn't require you to parse keys yourself, assuming all key-value pairs are on the same line).

+        content = f.read()
+
+    # Extract all keys using regex: "key":
+    pattern = r'"([^"]+)"\s*:'
+    keys = re.findall(pattern, content)
+    counter = Counter(keys)
+    return [k for k, v in counter.items() if v > 1]


def update_fp(data, file_path):
    sorted_data = dict(sorted(data.items()))
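
As a rough sketch of the reviewer's object_pairs_hook suggestion above, the duplicate check could be pushed into the JSON parser itself. This is not part of the PR and the helper name is hypothetical:

```python
import json


def find_duplicate_json_keys_via_hook(file_path):
    """Hypothetical alternative: let the parser report duplicate keys per object."""
    duplicates = set()

    def record_pairs(pairs):
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.add(key)
            seen.add(key)
        # Return a plain dict so json.load still produces normal data.
        return dict(pairs)

    with open(file_path, "r") as f:
        json.load(f, object_pairs_hook=record_pairs)
    return sorted(duplicates)
```

Unlike the regex scan, this would also flag duplicates inside nested objects and does not depend on how the file is formatted.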

@@ -61,8 +75,11 @@ def check_stale_fp(data, file_path, log):
def run_checks_sort_fp(file_path, log):
    with open(file_path, "r") as file:
        data = json.load(file)
-    if not is_fp_sorted(data):
-        log_esp_error(file_path, log, "Entries are not sorted")
+    dups = find_duplicate_json_keys(file_path)
+    if dups:
+        log_esp_error(file_path, log, f"Duplicate keys detected in forked-projects.json: {dups}")
+    elif not is_fp_sorted(data):
Contributor:

Does the sorting check require strictly increasing order or just non-decreasing? Can we make it strictly increasing so that it subsumes the duplicate check?

Contributor Author:

Making the sorting check strictly increasing would catch duplicates in the loaded data, but json.load() collapses duplicates when parsing, so any duplicates in the original file still wouldn't be detected.
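
A small demonstration of why the strictly-increasing idea falls short (not part of the PR): by the time the keys are compared, the parser has already dropped the duplicate.

```python
import json

data = json.loads('{"alpha": 1, "alpha": 2, "beta": 3}')
keys = list(data)
print(keys)                                        # ['alpha', 'beta'] -- duplicate already gone
print(all(a < b for a, b in zip(keys, keys[1:])))  # True, so a strict check would still pass
```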

+        log_esp_error(file_path, log, "Entries in forked-projects.json are not sorted")
    else:
        log_info(file_path, log, "There are no changes to be checked")
