Added duplicate key detection forked_projects_util.py #1982
base: main
Conversation
darko-marinov left a comment
Good starting point, but some changes are needed before this can be accepted into the checker.
```python
from collections import Counter
import re
...
with open(file_path, "r") as f:
```
Use the json library to read the file. It's presumably already done somewhere in this script.
json.load() is the standard way to parse JSON, but it automatically collapses duplicate keys by keeping only the last occurrence. Because of this behavior, any duplicate keys in the original file are lost during parsing. For that reason I did not use json.load().
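For reference, a minimal illustration of that behavior (the key names here are made up):

```python
import json

# The parser keeps only the last value for a repeated key,
# so the first occurrence of "a" below is silently dropped.
raw = '{"a": 1, "a": 2}'
print(json.loads(raw))  # {'a': 2}
```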
Thanks for clarifying about json.load(); I wasn't aware of the default behavior. It seems that json.loads(..., object_pairs_hook=...) allows checking for duplicates (and doesn't require parsing keys manually or assuming that each key-value pair sits on a single line).
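A rough sketch of what that hook-based approach could look like; the helper name and its wiring are illustrative, not code from this PR:

```python
import json
from collections import Counter

def find_duplicate_keys_via_hook(file_path):
    """Illustrative sketch: collect keys that repeat within any JSON object."""
    duplicates = []

    def record_pairs(pairs):
        # Called by the decoder with the (key, value) pairs of each object.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)  # fall back to normal dict construction

    with open(file_path, "r") as f:
        json.load(f, object_pairs_hook=record_pairs)
    return duplicates
```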
```python
dups = find_duplicate_json_keys(file_path)
if dups:
    log_esp_error(file_path, log, f"Duplicate in forked-projects.json keys detected: {dups}")
elif not is_fp_sorted(data):
```
Does the sorting check require strictly increasing order or just non-decreasing? Can we make it strictly increasing so that it subsumes the check for duplicates?
Making the sorting check strictly increasing would catch duplicates in the loaded data. But json.load() collapses duplicates when parsing, so any duplicates in the original file wouldn’t be detected.
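To make the distinction concrete: a strict comparison flags repeats only if the key list still contains them, i.e. if the keys were taken from the raw file rather than from json.load(). A small illustration with made-up keys:

```python
# Keys as they appear in the raw file, duplicates preserved.
keys = ["a", "b", "b", "c"]
print(all(x <= y for x, y in zip(keys, keys[1:])))  # True: non-decreasing check passes
print(all(x < y for x, y in zip(keys, keys[1:])))   # False: strict order rejects the repeated "b"
```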
This PR adds a check for duplicate keys in `forked_projects.json`. I added a new function, `find_duplicate_json_keys`, which scans the raw JSON file to detect duplicate keys before it is parsed. I then updated `run_checks_sort_fp` to report an error if any duplicates are found. This helps catch cases where duplicate keys would otherwise be silently overwritten, making the checks more reliable.

Example error:

```
ERROR: On file format_checker/forked-projects.json: Duplicate in forked-projects.json keys detected: ['https://github.com/apache/incubator-shardingsphere']
```
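The body of `find_duplicate_json_keys` is not shown in this excerpt. Judging only from the imports visible in the diff (`re` and `Counter`), a raw-text scan might look roughly like the sketch below; the regex and the assumption that each key sits on its own line are mine, not necessarily what the PR does:

```python
import re
from collections import Counter

def find_duplicate_json_keys(file_path):
    # Sketch only, not the implementation from this PR.
    # Assumes each key-value pair sits on its own line in the raw file.
    key_pattern = re.compile(r'^\s*"([^"]+)"\s*:')
    with open(file_path, "r") as f:
        matches = (key_pattern.match(line) for line in f)
        keys = [m.group(1) for m in matches if m]
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]
```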