Skip to content

Add handling for rename-with-modification#37

Merged
cmsetzer merged 10 commits into
mainfrom
handle-move-with-modification
May 28, 2026
Merged

Add handling for rename-with-modification#37
cmsetzer merged 10 commits into
mainfrom
handle-move-with-modification

Conversation

@cmsetzer

@cmsetzer cmsetzer commented May 13, 2026

Copy link
Copy Markdown
Contributor

When a file is both renamed and modified between snapshots, CorrelationDetector is currently unable to associate the two states—it understands them as separate files. This PR implements Git-style handling for rename-with-modification as follows:

  • Binoc now runs a new tree-level transformer, FuzzyCorrelationDetector, after CorrelationDetector. This pairs left-over add/remove leaf nodes by way of Jaccard similarity.
  • DiffNode now includes the temporary field pending_recompare, which instructs FuzzyCorrelationDetector to re-dispatch the pair through comparators and include the resulting content diff alongside the move node.

See docs/adr/rename_modify_detection.md for more information on the design.

@cmsetzer cmsetzer force-pushed the handle-move-with-modification branch from f981396 to 3c15ea0 Compare May 18, 2026 18:19
@cmsetzer cmsetzer marked this pull request as ready for review May 18, 2026 18:31
@cmsetzer cmsetzer requested a review from jcushman May 18, 2026 18:31

@jcushman jcushman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! To say it back: the idea is once we expand the tree, we have a transformer that tries to match up similar files and flags them for re-inspection, and if matched, our downstream content transformers can then report the changes more precisely than "one added, one removed." That sounds great.

I asked chat to look and it flagged a few nits:

  1. Markdown output loses the content detail it just worked to compute.** Both new test vectors render as a single line — **data_v2.csv**: Moved from data.csv (modified) / **meeting-notes-v2.txt**: Moved from notes.txt (modified) — with no mention of the added column or the added lines. The renderer's "group_as_move" path that splices child summaries onto the move line (markdown.rs:154-189) only fires when the move node has children. In practice the CSV comparator stashes its result in details + annotations.tabular_summary (no children), and the text comparator returns a Leaf (no children). So that branch is exercised only by its own unit test, while real pipeline output drops the schema-change/lines-added info that's the whole reason to detect rename+modify in the first place. Suggest either (a) surfacing annotations.tabular_summary and a text equivalent in the move-node bullet, or (b) producing children in the merge so the existing renderer path engages.
  2. Redundant byte reads. score_pairs reads each candidate's bytes inside the inner loop, so a given remove is re-read for every add it's paired with (up to rename_limit = 400 reads). Hoisting reads outside the loop (or memoizing per index) would cut I/O linearly without changing behavior.
  3. extensions_match returns true when both files have no extension (covered by extensions_match_none_on_both). That allows fuzzy-matching unrelated extensionless files (Makefile vs Dockerfile). Probably acceptable given the Jaccard threshold but worth a comment about intent.
  4. Minor: the recursive apply_transformer(c, …, false) call on inflated children is a no-op for fuzzy itself (Root shape + is_root=false), so the comment about "nested correlation still composes" is more aspirational than load-bearing for the only current caller — fine, but the only real value is for later-in-pipeline transformers picking up the inflated subtree on subsequent iterations of the outer transformer loop.

"with no mention of the added column or the added lines" makes sense to fix to me, and the redundant reads. I'd suggest pulling out the no-op apply_transformer since it's confusing -- I went back and forth with it, it sounds like we don't have any use for this currently since we inflate all trees anyway, but I'm not sure.

One other suggestion was to mention that we intentionally don't handle M:N matches with fuzzy -- if you copy a file twice and edit both, it's a move and a new file. That seems fine, maybe worth saying explicitly.

@cmsetzer

Copy link
Copy Markdown
Contributor Author

@jcushman Thanks for the feedback. I've made the following changes:

  1. Markdown output loses the content detail it just worked to compute — we now render separate bullets for a move-with-modify, e.g.:
    - **data_v2.csv**: Moved from data.csv (modified)
    - **data_v2.csv**: Column added: 'email'
    
  2. Redundant byte reads — we now memoize byte reads so that each leaf node is only read at most once.
  3. extensions_match returns true when both files have no extension — no change to behavior, but added a comment to the function for clarity
  4. Minor: the recursive apply_transformer(c, …, false) call on inflated children is a no-op — removed the no-op and left a comment in its place.
  5. Added a note about M:N matching to greedy_assign docstring

Let me know if this looks good for merge. Thanks!

@jcushman jcushman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Note for later -- the memory caching increases peak memory from 2x file size to all file size. Eventually we should attend to being able to run on files larger than available memory, but that'll probably need a whole sweep ... and I'm not even sure it'll be worth supporting.

@cmsetzer cmsetzer force-pushed the handle-move-with-modification branch from 00233cd to 3e6bf06 Compare May 28, 2026 17:07
@cmsetzer cmsetzer force-pushed the handle-move-with-modification branch from 3e6bf06 to 6f378ad Compare May 28, 2026 17:16
@cmsetzer cmsetzer merged commit 902f046 into main May 28, 2026
10 checks passed
@cmsetzer cmsetzer deleted the handle-move-with-modification branch May 28, 2026 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants