Proposal: Stop overloading of notes column #1864

@lukas2510

Hi IDoFT maintainers,

Context

  • Motivation: I started a programmatic analysis of the dataset, focusing on tests that have a fix, in order to infer the origin of flakiness from the fix commit. While doing so, I ran into inconsistent evidence formats.
  • I analyzed only rows marked as “fixed” (DeveloperFixed, Accepted, InspiredAFix, FixedOrder) across pr-data.csv, gr-data.csv, py-data.csv.
  • Goal: verify the presence/quality of fix evidence using the existing “PR Link” and “Notes” fields for automation (e.g., to identify the origin of flakiness from the fix commit).
  • Reproducibility: the scripts and generated CSVs are available on my fork/branch: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
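
A minimal sketch of the row selection described above, not the script from the fork. It assumes the CSV files expose `Status`, `PR Link`, and `Notes` columns under exactly those names; the helper names `fixed_rows` and `missing_evidence` are hypothetical.

```python
import csv
import io

# Statuses treated as "fixed" in this issue.
FIXED_STATUSES = {"DeveloperFixed", "Accepted", "InspiredAFix", "FixedOrder"}


def fixed_rows(csv_text, status_col="Status"):
    """Return the rows whose status marks the test as fixed.

    The column name is an assumption about the CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if (row.get(status_col) or "").strip() in FIXED_STATUSES
    ]


def missing_evidence(rows, pr_col="PR Link", notes_col="Notes"):
    """Rough proxy for 'no evidence': both PR Link and Notes are empty."""
    return [
        row for row in rows
        if not (row.get(pr_col) or "").strip()
        and not (row.get(notes_col) or "").strip()
    ]
```

Usage would look like `fixed_rows(open("pr-data.csv", encoding="utf-8").read())`. Treating an empty Notes cell as "no evidence" is a simplification, since a non-empty Notes cell may still contain no link.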

Artifacts (full, reproducible samples)
Here is an overview of the datasets: edge_cases_report.csv

  • Fork/branch with scripts and outputs: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
  • data_transformation/output/edge_cases/edge_case_pr_link_and_notes_samples.csv
  • data_transformation/output/edge_cases/edge_case_old_dataset_repo_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_idoft_repo_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_no_evidence_samples.csv
  • data_transformation/output/edge_cases/edge_case_bare_commit_sha_samples.csv
  • data_transformation/output/edge_cases/edge_case_multiple_links_in_notes_samples.csv
  • data_transformation/output/edge_cases/edge_case_pr_commit_url_samples.csv
  • data_transformation/output/edge_cases/edge_case_fork_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_branch_tree_link_samples.csv

Why this matters

  • The Notes column is currently overloaded (issues, PRs, commits, branches, external references). This makes automation brittle and increases manual review.
  • A small, consistent structure would make it much easier to automate origin-of-flakiness analysis and keep the dataset uniform.
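
To illustrate the overloading, here is a rough sketch of the kind of classifier a consumer currently needs just to tell apart what a Notes cell contains. The category names and regular expressions are illustrative assumptions, not part of the dataset or of the scripts on the fork.

```python
import re

GH = r"^https://github\.com/[^/\s]+/[^/\s]+"

# Order matters: the PR commit view must be tested before the plain PR pattern.
PATTERNS = [
    ("pr_commit", re.compile(GH + r"/pull/\d+/commits/[0-9a-f]{7,40}/?$")),
    ("commit", re.compile(GH + r"/commit/[0-9a-f]{7,40}/?$")),
    ("pull_request", re.compile(GH + r"/pull/\d+(?:[/#?].*)?$")),
    ("issue", re.compile(GH + r"/issues/\d+(?:[/#?].*)?$")),
    ("branch", re.compile(GH + r"/tree/.+$")),
]

URL_RE = re.compile(r"https?://\S+")


def classify_notes(notes):
    """Return (kind, url) pairs for every URL found in a Notes cell."""
    results = []
    for url in URL_RE.findall(notes):
        url = url.rstrip(".,;)")  # drop trailing punctuation from prose
        kind = next(
            (name for name, pattern in PATTERNS if pattern.match(url)),
            "external",  # anything that is not a recognized GitHub link
        )
        results.append((kind, url))
    return results
```

With dedicated columns, most of these heuristics would become unnecessary for the primary fix evidence.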

Minimal change proposal (grouped for clarity)

  1. Fix commit column & link format unification
    Example of a confusing row here
  • Add a single optional column: Commit Link (or Fix_Commit_Link). Keep existing PR Link.
  • Evidence hierarchy: Commit Link (best) > PR Link (second-best). If both exist, Commit Link is authoritative.
  • Accepted commit evidence formats (one of):
    • Full URL: https://github.com/<org>/<repo>/commit/<sha>
    • PR commit view: https://github.com/<org>/<repo>/pull/<num>/commits/<sha> (normalize/store as /commit/<sha>)
    • Bare 40-char SHA (should be resolved/expanded to full URL where possible)
  • Move any commit links currently embedded in Notes into the new Commit Link column.
  2. Untangle idoft-wrapped / self-referential links
    Example of a confusing row here
  • Pattern: Notes contains an idoft issue linking onward to the actual PR or commit.
  • Proposed rule: Store the final PR/commit in Commit Link / PR Link; keep the idoft issue only as contextual provenance in Notes.
  • Remove links to TestingResearchIllinois/flaky-test-dataset because that repository no longer exists.
  • Result: Notes becomes lighter, holding only contextual provenance rather than the primary fix evidence.
  3. Cross-repo redirection
    Example of a confusing row here
  • Pattern: Initial PR (original repo) in PR Link; final accepted fix in a different repo only in Notes.
  • Proposed rule: PR Link (or Commit Link) should point to the authoritative PR/commit where the fix landed (target repo). Initial exploratory/redirected PR kept in Notes.
  • Benefit: Downstream automation can fetch the diff from the correct repository without extra heuristics.
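
The accepted commit formats and the evidence hierarchy from proposal 1 could be sketched as follows. This is an illustration under stated assumptions, not an existing IDoFT tool: the function names are hypothetical, and resolving a bare SHA against the row's project URL is only a guess that fails when the fix landed in a different repository (the cross-repo case in proposal 3).

```python
import re

SHA40 = re.compile(r"^[0-9a-f]{40}$")

# Matches both /commit/<sha> and the PR commit view /pull/<num>/commits/<sha>.
COMMIT_URL = re.compile(
    r"^https://github\.com/(?P<org>[^/\s]+)/(?P<repo>[^/\s]+)/"
    r"(?:commit|pull/\d+/commits)/(?P<sha>[0-9a-f]{7,40})/?$"
)


def normalize_commit(evidence, project_url=None):
    """Return the canonical /commit/<sha> URL, or None if not recognized.

    project_url is used only to expand a bare 40-char SHA (assumption:
    the commit lives in the same repository as the test).
    """
    evidence = evidence.strip()
    match = COMMIT_URL.match(evidence)
    if match:
        return (
            f"https://github.com/{match['org']}/{match['repo']}"
            f"/commit/{match['sha']}"
        )
    sha = evidence.lower()
    if SHA40.match(sha) and project_url:
        return f"{project_url.rstrip('/')}/commit/{sha}"
    return None


def authoritative_evidence(commit_link, pr_link):
    """Apply the hierarchy: Commit Link first, PR Link as fallback."""
    return commit_link or pr_link or None
```

A bare SHA without a known repository is deliberately left unresolved (`None`) rather than guessed.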

Questions for maintainers

  • In fixed-status rows with neither a PR link nor a commit link (currently 47 cases), is the data missing, or are these rows valid (e.g., private/internal fixes)? The cases can be found here.
  • What do you think of my approach to removing the overloading of the Notes column? I am open to other suggestions as well.

Thanks for your work on IDoFT! I’m happy to prepare a PR if this direction sounds good.
