Added duplicate key detection forked_projects_util.py #1982
base: main
Conversation
darko-marinov left a comment
Good starting point, but some changes are needed before this can be accepted into the checker.
```python
from collections import Counter
import re
...
with open(file_path, "r") as f:
```
Use the json library to read the file. It's presumably already done somewhere in this script.
json.load() is the standard way to parse JSON, but it automatically collapses duplicate keys by keeping only the last occurrence. Because of this behavior, any duplicate keys in the original file are lost during parsing. For that reason I did not use json.load().
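For reference, a minimal illustration of that behavior (the key names here are made up):

```python
import json

# The parser keeps only the last value for a repeated key,
# so the first occurrence of "a" below is silently dropped.
raw = '{"a": 1, "a": 2}'
print(json.loads(raw))  # {'a': 2}
```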
Thanks for clarifying about json.load(); I wasn't aware of the default behavior. It seems that json.loads(..., object_pairs_hook=...) allows checking for duplicates (and doesn't require parsing keys manually or assuming that each key-value pair sits on a single line).
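A rough sketch of what that hook-based approach could look like; the helper name and its wiring are illustrative, not code from this PR:

```python
import json
from collections import Counter

def find_duplicate_keys_via_hook(file_path):
    """Illustrative sketch: collect keys that repeat within any JSON object."""
    duplicates = []

    def record_pairs(pairs):
        # Called by the decoder with the (key, value) pairs of each object.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)  # fall back to normal dict construction

    with open(file_path, "r") as f:
        json.load(f, object_pairs_hook=record_pairs)
    return duplicates
```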
```python
dups = find_duplicate_json_keys(file_path)
if dups:
    log_esp_error(file_path, log, f"Duplicate in forked-projects.json keys detected: {dups}")
elif not is_fp_sorted(data):
```
Does the sorting check require strictly increasing order or just non-decreasing? Can we make it strictly increasing so that it subsumes the check for duplicates?
Making the sorting check strictly increasing would catch duplicates in the loaded data. But json.load() collapses duplicates when parsing, so any duplicates in the original file wouldn’t be detected.
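To make the distinction concrete: a strict comparison flags repeats only if the key list still contains them, i.e. if the keys were taken from the raw file rather than from json.load(). A small illustration with made-up keys:

```python
# Keys as they appear in the raw file, duplicates preserved.
keys = ["a", "b", "b", "c"]
print(all(x <= y for x, y in zip(keys, keys[1:])))  # True: non-decreasing check passes
print(all(x < y for x, y in zip(keys, keys[1:])))   # False: strict order rejects the repeated "b"
```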
This PR adds a check for duplicate keys in `forked_projects.json`. I added a new function, `find_duplicate_json_keys`, which scans the raw JSON file to detect duplicate keys before it is parsed. I then updated `run_checks_sort_fp` to report an error if any duplicates are found. This helps catch cases where duplicate keys would otherwise be silently overwritten, making the checks more reliable.

Example error:

```
ERROR: On file format_checker/forked-projects.json: Duplicate in forked-projects.json keys detected: ['https://github.com/apache/incubator-shardingsphere']
```
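The body of `find_duplicate_json_keys` is not shown in this excerpt. Judging only from the imports visible in the diff (`re` and `Counter`), a raw-text scan might look roughly like the sketch below; the regex and the assumption that each key sits on its own line are mine, not necessarily what the PR does:

```python
import re
from collections import Counter

def find_duplicate_json_keys(file_path):
    # Sketch only, not the implementation from this PR.
    # Assumes each key-value pair sits on its own line in the raw file.
    key_pattern = re.compile(r'^\s*"([^"]+)"\s*:')
    with open(file_path, "r") as f:
        matches = (key_pattern.match(line) for line in f)
        keys = [m.group(1) for m in matches if m]
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]
```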