Add handling for rename-with-modification #37
f981396 to
3c15ea0
Compare
jcushman
left a comment
Cool! To say it back: the idea is once we expand the tree, we have a transformer that tries to match up similar files and flags them for re-inspection, and if matched, our downstream content transformers can then report the changes more precisely than "one added, one removed." That sounds great.
I asked chat to look and it flagged a few nits:
- **Markdown output loses the content detail it just worked to compute.** Both new test vectors render as a single line, `**data_v2.csv**: Moved from data.csv (modified)` and `**meeting-notes-v2.txt**: Moved from notes.txt (modified)`, with no mention of the added column or the added lines. The renderer's "group_as_move" path that splices child summaries onto the move line (markdown.rs:154-189) only fires when the move node has children. In practice the CSV comparator stashes its result in `details` + `annotations.tabular_summary` (no children), and the text comparator returns a `Leaf` (no children). So that branch is exercised only by its own unit test, while real pipeline output drops the schema-change/lines-added info that's the whole reason to detect rename+modify in the first place. Suggest either (a) surfacing `annotations.tabular_summary` and a text equivalent in the move-node bullet, or (b) producing children in the merge so the existing renderer path engages.
- **Redundant byte reads.** `score_pairs` reads each candidate's bytes inside the inner loop, so a given `remove` is re-read for every `add` it's paired with (up to `rename_limit = 400` reads). Hoisting reads outside the loop (or memoizing per index) would cut I/O linearly without changing behavior.
- `extensions_match` returns true when both files have no extension (covered by `extensions_match_none_on_both`). That allows fuzzy-matching unrelated extensionless files (`Makefile` vs `Dockerfile`). Probably acceptable given the Jaccard threshold but worth a comment about intent.
- Minor: the recursive `apply_transformer(c, …, false)` call on inflated children is a no-op for fuzzy itself (Root shape + `is_root=false`), so the comment about "nested correlation still composes" is more aspirational than load-bearing for the only current caller. That's fine, but the only real value is for later-in-pipeline transformers picking up the inflated subtree on subsequent iterations of the outer transformer loop.
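To make the redundant-reads nit concrete, here is a minimal sketch of the two loop shapes. It is not the PR's code: `CountingSource`, `byte_similarity`, `score_pairs_naive`, and `score_pairs_hoisted` are hypothetical names, and the similarity function is a placeholder rather than whatever `score_pairs` actually computes.

```rust
use std::collections::HashMap;

/// Hypothetical file source that counts reads so the I/O difference is visible.
pub struct CountingSource {
    files: HashMap<String, Vec<u8>>,
    pub reads: usize,
}

impl CountingSource {
    pub fn new(files: HashMap<String, Vec<u8>>) -> Self {
        Self { files, reads: 0 }
    }

    /// Returns the bytes for `path` (empty if unknown) and bumps the counter.
    pub fn read(&mut self, path: &str) -> Vec<u8> {
        self.reads += 1;
        self.files.get(path).cloned().unwrap_or_default()
    }
}

/// Placeholder similarity: Jaccard over the distinct byte values in each buffer.
pub fn byte_similarity(a: &[u8], b: &[u8]) -> f64 {
    let mut in_a = [false; 256];
    let mut in_b = [false; 256];
    for &x in a {
        in_a[x as usize] = true;
    }
    for &x in b {
        in_b[x as usize] = true;
    }
    let inter = (0..256).filter(|&i| in_a[i] && in_b[i]).count();
    let union = (0..256).filter(|&i| in_a[i] || in_b[i]).count();
    if union == 0 {
        0.0
    } else {
        inter as f64 / union as f64
    }
}

/// Reads inside the inner loop: every (remove, add) pair costs two reads.
pub fn score_pairs_naive(
    src: &mut CountingSource,
    removes: &[&str],
    adds: &[&str],
) -> Vec<(usize, usize, f64)> {
    let mut out = Vec::new();
    for (ri, r) in removes.iter().enumerate() {
        for (ai, a) in adds.iter().enumerate() {
            let rb = src.read(r);
            let ab = src.read(a);
            out.push((ri, ai, byte_similarity(&rb, &ab)));
        }
    }
    out
}

/// Reads hoisted out of the loop: each candidate is read exactly once.
pub fn score_pairs_hoisted(
    src: &mut CountingSource,
    removes: &[&str],
    adds: &[&str],
) -> Vec<(usize, usize, f64)> {
    let rbytes: Vec<Vec<u8>> = removes.iter().map(|r| src.read(r)).collect();
    let abytes: Vec<Vec<u8>> = adds.iter().map(|a| src.read(a)).collect();
    let mut out = Vec::new();
    for (ri, rb) in rbytes.iter().enumerate() {
        for (ai, ab) in abytes.iter().enumerate() {
            out.push((ri, ai, byte_similarity(rb, ab)));
        }
    }
    out
}
```

Both functions return the same scores in the same order; only the read count differs. The cost of the hoisted form is that every candidate's bytes are held in memory at the same time.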
"with no mention of the added column or the added lines" makes sense to fix to me, and the redundant reads. I'd suggest pulling out the no-op apply_transformer since it's confusing -- I went back and forth with it, it sounds like we don't have any use for this currently since we inflate all trees anyway, but I'm not sure.
One other suggestion was to mention that we intentionally don't handle M:N matches with fuzzy -- if you copy a file twice and edit both, it's a move and a new file. That seems fine, maybe worth saying explicitly.
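For illustration, a rough sketch of what one-to-one matching could look like, assuming a greedy pass over scored pairs. The function name `pick_one_to_one`, the tuple layout, and the greedy strategy are all assumptions made for the example, not a description of how the PR selects pairs.

```rust
/// A scored candidate pairing: (remove index, add index, similarity).
pub type ScoredPair = (usize, usize, f64);

/// Greedy one-to-one selection (an assumed strategy, for illustration only).
/// Pairs below `threshold` are dropped, the rest are visited best-first, and a
/// pair is kept only if neither side has been claimed yet.
/// Indices in `scored` must be smaller than `n_removes` / `n_adds`.
pub fn pick_one_to_one(
    mut scored: Vec<ScoredPair>,
    n_removes: usize,
    n_adds: usize,
    threshold: f64,
) -> Vec<(usize, usize)> {
    // Also discards NaN scores, since NaN >= threshold is false.
    scored.retain(|p| p.2 >= threshold);
    // Highest similarity first.
    scored.sort_by(|a, b| b.2.total_cmp(&a.2));

    let mut remove_used = vec![false; n_removes];
    let mut add_used = vec![false; n_adds];
    let mut matches = Vec::new();
    for (r, a, _score) in scored {
        if remove_used[r] || add_used[a] {
            continue;
        }
        remove_used[r] = true;
        add_used[a] = true;
        matches.push((r, a));
    }
    matches
}
```

With one removed file and two edited copies of it among the adds, only the better-scoring copy is paired. The other stays a plain add, which is the "a move and a new file" outcome described above.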
@jcushman Thanks for the feedback. I've made the following changes:
Let me know if this looks good for merge. Thanks!
jcushman
left a comment
Looks good to me!
Note for later -- the in-memory caching increases peak memory from roughly 2x the size of a single file to the combined size of all files. Eventually we should attend to being able to run on files larger than available memory, but that'll probably need a whole sweep ... and I'm not even sure it'll be worth supporting.
00233cd to
3e6bf06
Compare
3e6bf06 to
6f378ad
Compare
When a file is both renamed and modified between snapshots, `CorrelationDetector` is currently unable to associate the two states; it treats them as separate files. This PR implements Git-style handling for rename-with-modification as follows:

- A new transformer, `FuzzyCorrelationDetector`, runs after `CorrelationDetector`. It pairs left-over `add`/`remove` leaf nodes by way of Jaccard similarity.
- `DiffNode` now includes the temporary field `pending_recompare`, which instructs `FuzzyCorrelationDetector` to re-dispatch the pair through comparators and include the resulting content diff alongside the move node.

See docs/adr/rename_modify_detection.md for more information on the design.
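As a rough illustration of the similarity measure, the sketch below computes Jaccard similarity over the sets of distinct lines in two files. How the detector actually tokenizes content is not stated here, so the line-based tokenization, the helper names `line_jaccard` and `is_rename_candidate`, and the threshold parameter are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct lines of two texts.
/// Two empty inputs score 0.0 here so that empty files are never paired
/// (a choice made for this sketch).
pub fn line_jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.lines().collect();
    let sb: HashSet<&str> = b.lines().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    let inter = sa.intersection(&sb).count();
    inter as f64 / union as f64
}

/// Hypothetical helper: treat a removed/added pair as a rename candidate when
/// its similarity reaches `threshold`.
pub fn is_rename_candidate(removed: &str, added: &str, threshold: f64) -> bool {
    line_jaccard(removed, added) >= threshold
}
```

For example, a three-line file and a copy of it with one extra line share 3 of 4 distinct lines, giving a similarity of 0.75.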