Hi IDoFT maintainers,
Context
- Motivation: I started a programmatic analysis of the dataset, focusing on tests that have a fix, in order to infer the origin of flakiness from the fix commit. While doing so, I ran into inconsistent evidence formats.
- I analyzed only rows marked as “fixed” (DeveloperFixed, Accepted, InspiredAFix, FixedOrder) across pr-data.csv, gr-data.csv, py-data.csv.
- Goal: verify the presence/quality of fix evidence using the existing “PR Link” and “Notes” fields for automation (e.g., to identify the origin of flakiness from the fix commit).
- Reproducibility: the scripts and generated CSVs are available on my fork/branch: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
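To illustrate the filtering step described above, here is a minimal sketch and not the script from the fork. It assumes the status lives in a column named `Status` (an assumption; adjust it to the actual header), and the sample rows are hypothetical.

```python
import csv
import io

# Statuses treated as "fixed", taken from the list above.
FIXED_STATUSES = {"DeveloperFixed", "Accepted", "InspiredAFix", "FixedOrder"}


def fixed_rows(csv_file, status_column="Status"):
    """Yield rows whose status marks the test as fixed.

    `status_column` is an assumed header name, not confirmed by the dataset.
    """
    for row in csv.DictReader(csv_file):
        if (row.get(status_column) or "").strip() in FIXED_STATUSES:
            yield row


# Tiny in-memory sample standing in for pr-data.csv (hypothetical rows).
SAMPLE = io.StringIO(
    "Status,PR Link,Notes\n"
    "Accepted,https://github.com/org/repo/pull/1,\n"
    "Opened,https://github.com/org/repo/pull/2,\n"
    "FixedOrder,,\n"
)
rows = list(fixed_rows(SAMPLE))
```

For the real files, the same function could be called with `open("pr-data.csv", newline="", encoding="utf-8")` in place of `SAMPLE`.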
Artifacts (full, reproducible samples)
Here is an overview of the datasets: edge_cases_report.csv
- Fork/branch with scripts and outputs: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
- data_transformation/output/edge_cases/edge_case_pr_link_and_notes_samples.csv
- data_transformation/output/edge_cases/edge_case_old_dataset_repo_link_samples.csv
- data_transformation/output/edge_cases/edge_case_idoft_repo_link_samples.csv
- data_transformation/output/edge_cases/edge_case_no_evidence_samples.csv
- data_transformation/output/edge_cases/edge_case_bare_commit_sha_samples.csv
- data_transformation/output/edge_cases/edge_case_multiple_links_in_notes_samples.csv
- data_transformation/output/edge_cases/edge_case_pr_commit_url_samples.csv
- data_transformation/output/edge_cases/edge_case_fork_link_samples.csv
- data_transformation/output/edge_cases/edge_case_branch_tree_link_samples.csv
Why this matters
- The Notes column is currently overloaded (issues, PRs, commits, branches, external references). This makes automation brittle and increases manual review.
- A small, consistent structure would make it much easier to automate origin-of-flakiness analysis and keep the dataset uniform.
Minimal change proposal (grouped for clarity)
- Fix commit column & link format unification
  Example of a confusing row here
  - Add a single optional column: `Commit Link` (or `Fix_Commit_Link`). Keep existing `PR Link`.
  - Evidence hierarchy: Commit Link (best) > PR Link (second-best). If both exist, Commit Link is authoritative.
  - Accepted commit evidence formats (one of):
    - Full URL: `https://github.com/<org>/<repo>/commit/<sha>`
    - PR commit view: `https://github.com/<org>/<repo>/pull/<num>/commits/<sha>` (normalize/store as `/commit/<sha>`)
    - Bare 40-char SHA (should be resolved/expanded to full URL where possible)
  - Move any commit links currently embedded in `Notes` into the new `Commit Link` column.
- Untangle idoft-wrapped / self-referential links
  Example of a confusing row here
  - Pattern: Notes contains an idoft issue linking onward to the actual PR or commit.
  - Proposed rule: Store the final PR/commit in `Commit Link`/`PR Link`; keep the idoft issue only as contextual provenance in `Notes`.
  - Remove links to `TestingResearchIllinois/flaky-test-dataset` because it does not exist anymore.
  - Result: Notes becomes lighter
- Cross-repo redirection
  Example of a confusing row here
  - Pattern: Initial PR (original repo) in `PR Link`; final accepted fix in a different repo only in `Notes`.
  - Proposed rule: `PR Link` (or `Commit Link`) should point to the authoritative PR/commit where the fix landed (target repo). Initial exploratory/redirected PR kept in `Notes`.
  - Benefit: Downstream automation can fetch the diff from the correct repository without extra heuristics.
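The accepted commit formats and the evidence hierarchy from the proposal could be sketched roughly as follows. This is an illustration under assumptions, not an agreed implementation: the function names are hypothetical, abbreviated SHAs (7 to 40 hex characters) are accepted inside URLs as an assumption, and a bare SHA is expanded against a caller-supplied repository URL without checking that the commit actually exists there.

```python
import re

# Full commit URL: https://github.com/<org>/<repo>/commit/<sha>
COMMIT_URL = re.compile(
    r"^https://github\.com/(?P<org>[^/\s]+)/(?P<repo>[^/\s]+)"
    r"/commit/(?P<sha>[0-9a-fA-F]{7,40})/?$"
)
# PR commit view: https://github.com/<org>/<repo>/pull/<num>/commits/<sha>
PR_COMMIT_URL = re.compile(
    r"^https://github\.com/(?P<org>[^/\s]+)/(?P<repo>[^/\s]+)"
    r"/pull/\d+/commits/(?P<sha>[0-9a-fA-F]{7,40})/?$"
)
# Bare 40-character SHA with nothing else around it.
BARE_SHA = re.compile(r"^[0-9a-fA-F]{40}$")


def normalize_commit_evidence(value, project_url=None):
    """Return a canonical .../commit/<sha> URL, or None if not commit evidence.

    `project_url` is only used to expand a bare SHA; it is an assumption
    that the row's repository is where the commit lives.
    """
    value = (value or "").strip()
    for pattern in (COMMIT_URL, PR_COMMIT_URL):
        m = pattern.match(value)
        if m:
            sha = m["sha"].lower()
            return f"https://github.com/{m['org']}/{m['repo']}/commit/{sha}"
    if BARE_SHA.match(value) and project_url:
        return f"{project_url.rstrip('/')}/commit/{value.lower()}"
    return None


def authoritative_link(commit_link, pr_link):
    """Apply the proposed hierarchy: Commit Link first, then PR Link."""
    commit_link = (commit_link or "").strip()
    pr_link = (pr_link or "").strip()
    return commit_link or pr_link or None
```

Storing only the normalized `/commit/<sha>` form would mean downstream tools need to handle a single URL shape, whichever of the three formats the row originally used.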
Questions for maintainers
- In fixed-status rows with neither a PR link nor a commit link (currently 47 cases), is data missing or are these valid (e.g., private/internal fixes)? The cases can be found here
- What do you think of my approach to removing the overloading of the Notes column? I am open to other suggestions as well.
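For reference, a check for "no evidence" rows could look roughly like the sketch below. It is only an approximation and may not match the exact criterion behind the count quoted above: it treats a row as having evidence if `PR Link` is non-empty or `Notes` contains a GitHub PR URL, a commit URL, or a bare 40-character SHA. The helper name and the sample rows are hypothetical.

```python
import re

# GitHub PR or commit URL anywhere in free text.
GITHUB_EVIDENCE = re.compile(
    r"https://github\.com/[^/\s]+/[^/\s]+/(?:pull/\d+|commit/[0-9a-fA-F]{7,40})"
)
# Bare 40-character SHA embedded in free text.
BARE_SHA = re.compile(r"\b[0-9a-fA-F]{40}\b")


def has_fix_evidence(row):
    """Return True if the row carries any PR or commit evidence."""
    pr_link = (row.get("PR Link") or "").strip()
    notes = row.get("Notes") or ""
    return bool(
        pr_link or GITHUB_EVIDENCE.search(notes) or BARE_SHA.search(notes)
    )


# Hypothetical rows, already filtered to fixed statuses.
rows = [
    {"PR Link": "", "Notes": ""},
    {"PR Link": "", "Notes": "fixed in https://github.com/org/repo/commit/abcdef1"},
    {"PR Link": "https://github.com/org/repo/pull/3", "Notes": ""},
    {"PR Link": "", "Notes": "see https://github.com/org/repo/issues/9"},
]
missing = [r for r in rows if not has_fix_evidence(r)]
```

Note that an issue link alone does not count as evidence in this sketch, which matches the idea of keeping issues only as provenance in `Notes`.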
Thanks for your work on IDoFT! I’m happy to prepare a PR if this direction sounds good.