Skip to content

perf(fmt): speed up file diffing#30644

Open
KnorpelSenf wants to merge 25 commits intodenoland:mainfrom
KnorpelSenf:fast-file-diff
Open

perf(fmt): speed up file diffing#30644
KnorpelSenf wants to merge 25 commits intodenoland:mainfrom
KnorpelSenf:fast-file-diff

Conversation

@KnorpelSenf
Copy link
Contributor

@KnorpelSenf KnorpelSenf commented Sep 8, 2025

This swaps out dissimilar for imara which is substantially faster at diffing strings.

Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf. I'm willing to fix this up if I get an OK about the general direction.

Fixes #30634

This swaps out dissimilar for imara which is substantially faster at diffing strings.

Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf.

Fixes denoland#30634
@KnorpelSenf KnorpelSenf changed the title perf: speed up file diffing perf(fmt): speed up file diffing Sep 10, 2025
dprint-plugin-typescript = "=0.95.11"
env_logger = "=0.11.6"
fancy-regex = "=0.14.0"
imara-diff = "=0.2.0"
Copy link
Member

@dsherret dsherret Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened dtolnay/dissimilar#21 -- it might be more worthwhile to pursue this path than rewrite to imara-diff, which still might not be fast enough with very large files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

I think both dissimilar and imara expose an iterator over the patches, so I would assume that we can just stop iterating and thereby abort the computation of the diff early.

I have yet to check if my assumption is correct, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar? From my superficial understanding, it looks like imara is simply a better (=faster) diffing lib in all respects that are relevant for deno.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar?

We know the diff output of dissimilar is ok, but not sure yet about imara. Generally diffs are only shown in error cases so perf doesn't matter too much, but obviously several minutes is not acceptable 😅. How much faster is imara for this diff? I guess if it's fast enough on this case then maybe that's good enough and we don't need to worry about doing some iterator or max results approach.

Copy link
Contributor Author

@KnorpelSenf KnorpelSenf Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine, deno 2.5.0 needs around 36 minutes (!) to check the formatting of the file and print out the diff. My branch (still in debug build, did not compile with -r yet) cuts it down to 0.4 seconds.

I did not try larger files using Deno 2.5.0 but I tried them with this branch. The results are as follows:

  • 1 MB file: 0.4 seconds
  • 10 MB file: 4 seconds
  • 100 MB file: 40 seconds

(not evaluated this very scientifically, please take it with a grain of salt)

All files had a similar format as shown #30634.

Copy link
Contributor Author

@KnorpelSenf KnorpelSenf Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have started working on bringing back properly formatted diffs. One thing I have noticed is that imara is extremely good at finding line diffs, but it does not have built-in word-diffing (see pascalkuthe/imara-diff#1). I will ask if they accept contributions, but otherwise I'm afraid we will have to add the complexity here. This is something that dissimilar provides out of the box, but they seem to do it by not even tokenizing the input at all, which explains why it is so slow. (Note that this also means that imara might get a lot slower once we run word diffs with it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream work continues: pascalkuthe/imara-diff#33

Copy link
Contributor Author

@KnorpelSenf KnorpelSenf Nov 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my upstream work gets merged, the debug build of imara will be able to diff this same string is less then 1 second including the full word diff of the file (down from 36 minutes on Deno's main branch built in release mode)

Copilot AI review requested due to automatic review settings November 7, 2025 15:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR replaces the dissimilar diff library with imara-diff for computing text diffs in the resolver display module. The change updates the diff algorithm implementation while maintaining the same output format.

Key Changes:

  • Switched from dissimilar to imara-diff library for diff computation
  • Updated DiffBuilder implementation to work with imara-diff's hunk-based API
  • Deferred implementation of context line handling (marked with TODO comments)

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
Cargo.toml Added imara-diff = "=0.2.0" to workspace dependencies
Cargo.lock Updated lock file with imara-diff dependency and its transitive dependencies (hashbrown, memchr)
libs/resolver/Cargo.toml Replaced dissimilar with imara-diff in resolver crate dependencies
libs/resolver/display.rs Rewrote diff computation logic to use imara-diff's InternedInput and hunk-based API; added lifetime parameter to DiffBuilder; commented out unused helper functions fmt_add_text and fmt_rem_text

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@KnorpelSenf KnorpelSenf requested a review from Copilot November 7, 2025 17:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@coderabbitai
Copy link

coderabbitai bot commented Nov 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR migrates the diff implementation from the dissimilar crate to imara_diff. The changes include adding imara-diff as a new dependency in the CLI crate's Cargo.toml, replacing dissimilar with imara-diff as a workspace dependency in libs/resolver, and refactoring display.rs to use imara_diff's API. The refactoring swaps out chunk-based diff handling for a histogram algorithm with hunk-based processing, introduces a lifetime-parameterized DiffBuilder with InternedInput, and restructures the diff rendering pipeline. Additionally, line-ending normalization is enhanced to detect and report CRLF differences as a user-visible note.

Sequence Diagram

(Not applicable — the changes are primarily a dependency and internal implementation swap without new external control flow.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • libs/resolver/display.rs: This file undergoes substantial refactoring with a complete swap of the diff algorithm (dissimilar's approach to imara_diff's Histogram algorithm), new data structures (InternedInput, Diff, hunks vs. Chunks), and restructured rendering logic. The logic density is high and requires careful verification that output behavior remains correct.
  • Cargo.toml files: The dependency changes are straightforward, but the impact of swapping algorithms should be verified.
  • Pay special attention to: Correctness of the histogram diff output vs. the previous algorithm, handling of the new InternedInput and hunk iteration, CRLF detection and formatting in the normalization phase, and the highlight-based rendering paths to ensure visual output is consistent.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: replacing dissimilar with imara to improve diffing performance in file formatting.
Description check ✅ Passed The description explains the swap from dissimilar to imara for faster diffing and references issue #30634, which is directly related to the changeset.
Linked Issues check ✅ Passed The PR directly addresses issue #30634 by replacing the slow dissimilar diffing library with imara, which should resolve the CPU hang on large HTML files during fmt --check.
Out of Scope Changes check ✅ Passed All changes are directly in scope: dependency replacement in Cargo.toml files and migration of diff logic in display.rs to use imara instead of dissimilar.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
libs/resolver/display.rs (1)

67-111: Multiple hunks are merged into one block and line numbers become incorrect

handle_diff appends all hunks’ removed/added lines into self.orig / self.edit and only calls flush_changes() once at the end. That has two concrete effects:

  • Separate change regions (distinct hunks from diff.hunks()) are rendered as a single diff block instead of separate regions.
  • orig_line / edit_line never get updated per hunk, so all lines in the combined block are numbered as if they started at line 1, which is wrong for files with multiple distant changes.

Given imara-diff’s Hunk.before / Hunk.after are 0-based index ranges into input.before / input.after (one token per line), you can restore correct behavior by flushing per hunk and resetting the starting line numbers from those ranges. For example:

-  fn handle_diff(mut self, diff: Diff) -> String {
-    for hunk in diff.hunks() {
-      if !hunk.before.is_empty() || !hunk.after.is_empty() {
-        self.has_changes = true;
-      }
-      // collect all removed lines
-      for (i, del) in hunk.before.enumerate() {
-        let s = self.input.interner[self.input.before[del as usize]];
-        if i > 0 {
-          self.orig.push('\n');
-        }
-        self.orig.push_str(&fmt_rem_text_highlight(s));
-      }
-      // collect all added lines
-      for (i, ins) in hunk.after.enumerate() {
-        let s = self.input.interner[self.input.after[ins as usize]];
-        if i > 0 {
-          self.edit.push('\n');
-        }
-        self.edit.push_str(&fmt_add_text_highlight(s));
-      }
-    }
-
-    self.flush_changes();
-    self.output
-  }
+  fn handle_diff(mut self, diff: Diff) -> String {
+    let mut first = true;
+
+    for hunk in diff.hunks() {
+      if hunk.before.is_empty() && hunk.after.is_empty() {
+        continue;
+      }
+
+      // flush previous hunk’s block before starting a new one
+      if !first {
+        self.flush_changes();
+      } else {
+        first = false;
+      }
+
+      // reset starting line numbers from absolute token indices
+      self.orig_line = hunk.before.start as usize + 1;
+      self.edit_line = hunk.after.start as usize + 1;
+      self.has_changes = true;
+
+      // collect removed lines for this hunk
+      for (i, del) in hunk.before.enumerate() {
+        let s = self.input.interner[self.input.before[del as usize]];
+        if i > 0 {
+          self.orig.push('\n');
+        }
+        self.orig.push_str(&fmt_rem_text_highlight(s));
+      }
+
+      // collect added lines for this hunk
+      for (i, ins) in hunk.after.enumerate() {
+        let s = self.input.interner[self.input.after[ins as usize]];
+        if i > 0 {
+          self.edit.push('\n');
+        }
+        self.edit.push_str(&fmt_add_text_highlight(s));
+      }
+    }
+
+    self.flush_changes();
+    self.output
+  }

This keeps the current no-context behavior but restores one block per hunk with correct absolute line numbers for each change region.

🧹 Nitpick comments (1)
libs/resolver/display.rs (1)

259-315: Add a test covering multiple separated change regions

The current tests validate single-hunk cases and line-ending normalization, but don’t exercise a file with two or more non-adjacent change regions. That’s exactly where the current handle_diff implementation misbehaves (merged hunks and wrong line numbers).

Consider adding a test that changes, say, lines 5 and 20 in a 20+ line file and asserts that the diff output shows two separate blocks with the expected line numbers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 355d899 and 15bfdd9.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • Cargo.toml (1 hunks)
  • libs/resolver/Cargo.toml (1 hunks)
  • libs/resolver/display.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test debug linux-aarch64
  • GitHub Check: test debug macos-x86_64
  • GitHub Check: test debug windows-x86_64
  • GitHub Check: test debug linux-x86_64
  • GitHub Check: test release linux-x86_64
  • GitHub Check: test debug macos-aarch64
  • GitHub Check: lint debug windows-x86_64
  • GitHub Check: lint debug macos-x86_64
  • GitHub Check: build libs
  • GitHub Check: lint debug linux-x86_64
🔇 Additional comments (3)
Cargo.toml (1)

315-315: Adding imara-diff workspace dependency looks consistent

Pinned version and placement match the existing dependency style; no issues from the workspace wiring side.

libs/resolver/Cargo.toml (1)

54-54: Resolver’s switch to imara-diff via workspace dep is coherent

The new imara-diff.workspace = true entry correctly matches the implementation in display.rs and the workspace dependency added in the root Cargo.toml.

libs/resolver/display.rs (1)

9-29: imara-diff integration and CRLF handling look correct

Using InternedInput::new + Diff::compute(Algorithm::Histogram, &input) followed by postprocess_lines(&input) matches imara-diff’s recommended text-diff usage, and the newline normalization + “Text differed by line endings.” fast-path preserves and clarifies the old CRLF-only-diff behavior exercised in test_newlines_differing.

Also applies to: 31-50

@KnorpelSenf KnorpelSenf marked this pull request as draft November 22, 2025 12:30
Simplify DiffBuilder to directly write lines per hunk instead of
buffering. This fixes:
- Missing flush between hunks causing all changes to merge into one block
- Incorrect line numbers when unchanged lines exist between hunks
- Token trailing newlines being duplicated in output

Also fix pre-existing sys_traits dev-dependency missing getrandom feature
and remove dead commented-out code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bartlomieju bartlomieju marked this pull request as ready for review March 12, 2026 11:01
Copy link
Contributor

@kajukitli kajukitli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not approving this as-is

it definitely makes the implementation smaller/faster, but it also regresses the output semantics in a pretty user-visible way.

before, the diff grouped adjacent delete/insert chunks together and kept unchanged context lines in the output shape. with this version you're only iterating diff.hunks() and emitting removed/added lines directly, so the output is much flatter and loses some of the structure that made the formatter diff readable.

you can see it in the adjusted tests:

  • the multiline case no longer keeps the old/new console.log( pairing together in the same way
  • the single-line case drops the trailing blank added line entirely

for a formatter diff, readability matters more than raw throughput. if the goal is "proof of concept", then yeah the direction seems plausible, but i'd want the old output quality preserved before merging.

Address review feedback: show unchanged context lines between diff
hunks so that related changes remain visually connected (e.g.
console.log( appearing as both -/+ between surrounding changes).

Context lines use non-highlighted colors to distinguish them from
actual changes. Only lines immediately preceding a hunk are shown
as context to avoid noise.

Also update Cargo.lock for the getrandom feature addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@kajukitli kajukitli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rechecked after the follow-up commits

this is closer, but i still wouldn't merge it yet. the main semantic regression is still there: trailing newline-only insertions/deletions are getting collapsed away because write_*_line() strips the final \n and only emits one logical line per interned line token.

that's exactly why the test had to change from:

2 | +some line text test
3 | +

to just:

2 | +some line text test

for formatter diffs that last blank line is real output and users do care about it. same problem shows up in the multiline test churn too.

so yeah, the perf direction still seems good, but i think the diff renderer needs to preserve newline-only hunks before this is ready.

bartlomieju and others added 3 commits March 12, 2026 14:48
Interleave deleted and inserted lines within each hunk so that
corresponding old/new line pairs appear adjacent in the output.
This matches the expected format in spec tests (e.g. gitignore)
and preserves readability for formatter diffs.

Also fix rustfmt formatting in test code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove context lines between diff hunks that were causing spec test
failures. Remove unused prev_after_end variable and dead code
(write_context_line, fmt_rem_text, fmt_add_text). Update test
expectations since imara-diff correctly identifies unchanged lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The imara-diff algorithm matches identical lines (like closing braces)
as unchanged context rather than delete/insert pairs, producing more
minimal diffs. Update .out files to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@kajukitli kajukitli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looked again and my take hasn't changed

the perf angle is still good, but the renderer is still stripping structural newline information:

let text = text.strip_suffix('\n').unwrap_or(text);

both write_rem_line() and write_add_line() do that, so newline-only changes still disappear from the rendered diff. that's why the old test case with the explicit blank added line is still gone.

for formatter diffs, losing a trailing blank line change is not cosmetic — it's a real output regression. i'd still keep this in "good direction, not merge-ready" territory until blank-line/newline-only hunks are preserved.

Add "\ No newline at end of file" marker (matching git convention)
when a line lacks a trailing newline, so that trailing-newline
changes are visible in the rendered diff instead of silently
disappearing after strip_suffix('\n').

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@kajukitli kajukitli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the changes. Initial findings didn't hold up on closer inspection. Looks good.

@KnorpelSenf
Copy link
Contributor Author

The upstream crate so far only supports word-diffing. That's a regression from the current implementation of Deno. You need to do one of the following:

  • simply accept the regression and only support word-based diffs
  • wait for feat: character diffs pascalkuthe/imara-diff#38 to be merged
  • re-implement the logic from that PR in Deno
  • acknowledge that the current changes are insufficient and close this PR

Emit a `...` separator line between hunks that have gaps (skipped
unchanged lines), making it clear when diff output jumps across
non-adjacent lines. Also remove unnecessary getrandom feature from
dev-dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Member

@dsherret dsherret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think just word diffing is probably fine?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deno fmt --check gets stuck on large HTML files

5 participants