perf(fmt): speed up file diffing by KnorpelSenf · Pull Request #30644 · denoland/deno

KnorpelSenf · 2025-09-08T21:04:56Z

This swaps out dissimilar for imara which is substantially faster at diffing strings.

Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf. I'm willing to fix this up if I get an OK about the general direction.

Fixes #30634

This swaps out dissimilar for imara which is substantially faster at diffing strings. Note that this is a proof of concept and I did not have enough time to make the output pretty. I just shows that the diff is fast. Applying colors should be doable without changing much about the perf. Fixes denoland#30634

libs/resolver/display.rs

dsherret · 2025-09-15T16:40:31Z

Cargo.toml

 dprint-plugin-typescript = "=0.95.11"
 env_logger = "=0.11.6"
 fancy-regex = "=0.14.0"
+imara-diff = "=0.2.0"


I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

I opened dtolnay/dissimilar#21 -- it might be more worthwhile to pursue this path than rewrite to imara-diff, which still might not be fast enough with very large files.

I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

I think both dissimilar and imara expose an iterator over the patches, so I would assume that we can just stop iterating and thereby abort the computation of the diff early.

I have yet to check if my assumption is correct, though.

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar? From my superficial understanding, it looks like imara is simply a better (=faster) diffing lib in all respects that are relevant for deno.

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar?

We know the diff output of dissimilar is ok, but not sure yet about imara. Generally diffs are only shown in error cases so perf doesn't matter too much, but obviously several minutes is not acceptable 😅. How much faster is imara for this diff? I guess if it's fast enough on this case then maybe that's good enough and we don't need to worry about doing some iterator or max results approach.

On my machine, deno 2.5.0 needs around 36 minutes (!) to check the formatting of the file and print out the diff. My branch (still in debug build, did not compile with -r yet) cuts it down to 0.4 seconds.

I did not try larger files using Deno 2.5.0 but I tried them with this branch. The results are as follows:

1 MB file: 0.4 seconds

10 MB file: 4 seconds

100 MB file: 40 seconds

(not evaluated this very scientifically, please take it with a grain of salt)

All files had a similar format as shown #30634.

I have started working on bringing back properly formatted diffs. One thing I have noticed is that imara is extremely good at finding line diffs, but it does not have built-in word-diffing (see pascalkuthe/imara-diff#1). I will ask if they accept contributions, but otherwise I'm afraid we will have to add the complexity here. This is something that dissimilar provides out of the box, but they seem to do it by not even tokenizing the input at all, which explains why it is so slow. (Note that this also means that imara might get a lot slower once we run word diffs with it.)

Upstream work continues: pascalkuthe/imara-diff#33

If my upstream work gets merged, the debug build of imara will be able to diff this same string is less then 1 second including the full word diff of the file (down from 36 minutes on Deno's main branch built in release mode)

Copilot

Pull Request Overview

This PR replaces the dissimilar diff library with imara-diff for computing text diffs in the resolver display module. The change updates the diff algorithm implementation while maintaining the same output format.

Key Changes:

Switched from dissimilar to imara-diff library for diff computation
Updated DiffBuilder implementation to work with imara-diff's hunk-based API
Deferred implementation of context line handling (marked with TODO comments)

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File	Description
Cargo.toml	Added `imara-diff = "=0.2.0"` to workspace dependencies
Cargo.lock	Updated lock file with imara-diff dependency and its transitive dependencies (hashbrown, memchr)
libs/resolver/Cargo.toml	Replaced `dissimilar` with `imara-diff` in resolver crate dependencies
libs/resolver/display.rs	Rewrote diff computation logic to use imara-diff's `InternedInput` and hunk-based API; added lifetime parameter to `DiffBuilder`; commented out unused helper functions `fmt_add_text` and `fmt_rem_text`

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

libs/resolver/display.rs

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull Request Overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

libs/resolver/display.rs

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

coderabbitai · 2025-11-22T11:42:57Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR migrates the diff implementation from the dissimilar crate to imara_diff. The changes include adding imara-diff as a new dependency in the CLI crate's Cargo.toml, replacing dissimilar with imara-diff as a workspace dependency in libs/resolver, and refactoring display.rs to use imara_diff's API. The refactoring swaps out chunk-based diff handling for a histogram algorithm with hunk-based processing, introduces a lifetime-parameterized DiffBuilder with InternedInput, and restructures the diff rendering pipeline. Additionally, line-ending normalization is enhanced to detect and report CRLF differences as a user-visible note.

Sequence Diagram

(Not applicable — the changes are primarily a dependency and internal implementation swap without new external control flow.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

libs/resolver/display.rs: This file undergoes substantial refactoring with a complete swap of the diff algorithm (dissimilar's approach to imara_diff's Histogram algorithm), new data structures (InternedInput, Diff, hunks vs. Chunks), and restructured rendering logic. The logic density is high and requires careful verification that output behavior remains correct.
Cargo.toml files: The dependency changes are straightforward, but the impact of swapping algorithms should be verified.
Pay special attention to: Correctness of the histogram diff output vs. the previous algorithm, handling of the new InternedInput and hunk iteration, CRLF detection and formatting in the normalization phase, and the highlight-based rendering paths to ensure visual output is consistent.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: replacing dissimilar with imara to improve diffing performance in file formatting.
Description check	✅ Passed	The description explains the swap from dissimilar to imara for faster diffing and references issue #30634, which is directly related to the changeset.
Linked Issues check	✅ Passed	The PR directly addresses issue #30634 by replacing the slow dissimilar diffing library with imara, which should resolve the CPU hang on large HTML files during fmt --check.
Out of Scope Changes check	✅ Passed	All changes are directly in scope: dependency replacement in Cargo.toml files and migration of diff logic in display.rs to use imara instead of dissimilar.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

Provide your own instructions using the high_level_summary_instructions setting.
Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

📝 Description — Summarize the main change in 50–60 words, explaining what was done.

📓 References — List relevant issues, discussions, documentation, or related PRs.

📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.

📊 Contributor Summary — Include a Markdown table showing contributions:
| Contributor | Lines Added | Lines Removed | Files Changed |

✔️ Additional Notes — Add any extra reviewer context.
Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

libs/resolver/display.rs (1)

67-111: Multiple hunks are merged into one block and line numbers become incorrect

handle_diff appends all hunks’ removed/added lines into self.orig / self.edit and only calls flush_changes() once at the end. That has two concrete effects:

Separate change regions (distinct hunks from diff.hunks()) are rendered as a single diff block instead of separate regions.
orig_line / edit_line never get updated per hunk, so all lines in the combined block are numbered as if they started at line 1, which is wrong for files with multiple distant changes.

Given imara-diff’s Hunk.before / Hunk.after are 0-based index ranges into input.before / input.after (one token per line), you can restore correct behavior by flushing per hunk and resetting the starting line numbers from those ranges. For example:

-  fn handle_diff(mut self, diff: Diff) -> String {
-    for hunk in diff.hunks() {
-      if !hunk.before.is_empty() || !hunk.after.is_empty() {
-        self.has_changes = true;
-      }
-      // collect all removed lines
-      for (i, del) in hunk.before.enumerate() {
-        let s = self.input.interner[self.input.before[del as usize]];
-        if i > 0 {
-          self.orig.push('\n');
-        }
-        self.orig.push_str(&fmt_rem_text_highlight(s));
-      }
-      // collect all added lines
-      for (i, ins) in hunk.after.enumerate() {
-        let s = self.input.interner[self.input.after[ins as usize]];
-        if i > 0 {
-          self.edit.push('\n');
-        }
-        self.edit.push_str(&fmt_add_text_highlight(s));
-      }
-    }
-
-    self.flush_changes();
-    self.output
-  }
+  fn handle_diff(mut self, diff: Diff) -> String {
+    let mut first = true;
+
+    for hunk in diff.hunks() {
+      if hunk.before.is_empty() && hunk.after.is_empty() {
+        continue;
+      }
+
+      // flush previous hunk’s block before starting a new one
+      if !first {
+        self.flush_changes();
+      } else {
+        first = false;
+      }
+
+      // reset starting line numbers from absolute token indices
+      self.orig_line = hunk.before.start as usize + 1;
+      self.edit_line = hunk.after.start as usize + 1;
+      self.has_changes = true;
+
+      // collect removed lines for this hunk
+      for (i, del) in hunk.before.enumerate() {
+        let s = self.input.interner[self.input.before[del as usize]];
+        if i > 0 {
+          self.orig.push('\n');
+        }
+        self.orig.push_str(&fmt_rem_text_highlight(s));
+      }
+
+      // collect added lines for this hunk
+      for (i, ins) in hunk.after.enumerate() {
+        let s = self.input.interner[self.input.after[ins as usize]];
+        if i > 0 {
+          self.edit.push('\n');
+        }
+        self.edit.push_str(&fmt_add_text_highlight(s));
+      }
+    }
+
+    self.flush_changes();
+    self.output
+  }

This keeps the current no-context behavior but restores one block per hunk with correct absolute line numbers for each change region.

🧹 Nitpick comments (1)

libs/resolver/display.rs (1)

259-315: Add a test covering multiple separated change regions

The current tests validate single-hunk cases and line-ending normalization, but don’t exercise a file with two or more non-adjacent change regions. That’s exactly where the current handle_diff implementation misbehaves (merged hunks and wrong line numbers).

Consider adding a test that changes, say, lines 5 and 20 in a 20+ line file and asserts that the diff output shows two separate blocks with the expected line numbers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 355d899 and 15bfdd9.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

Cargo.toml (1 hunks)
libs/resolver/Cargo.toml (1 hunks)
libs/resolver/display.rs (6 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: test debug linux-aarch64
GitHub Check: test debug macos-x86_64
GitHub Check: test debug windows-x86_64
GitHub Check: test debug linux-x86_64
GitHub Check: test release linux-x86_64
GitHub Check: test debug macos-aarch64
GitHub Check: lint debug windows-x86_64
GitHub Check: lint debug macos-x86_64
GitHub Check: build libs
GitHub Check: lint debug linux-x86_64

🔇 Additional comments (3)

Cargo.toml (1)

315-315: Adding imara-diff workspace dependency looks consistent

Pinned version and placement match the existing dependency style; no issues from the workspace wiring side.

libs/resolver/Cargo.toml (1)

54-54: Resolver’s switch to imara-diff via workspace dep is coherent

The new imara-diff.workspace = true entry correctly matches the implementation in display.rs and the workspace dependency added in the root Cargo.toml.

libs/resolver/display.rs (1)

9-29: imara-diff integration and CRLF handling look correct

Using InternedInput::new + Diff::compute(Algorithm::Histogram, &input) followed by postprocess_lines(&input) matches imara-diff’s recommended text-diff usage, and the newline normalization + “Text differed by line endings.” fast-path preserves and clarifies the old CRLF-only-diff behavior exercised in test_newlines_differing.

Also applies to: 31-50

Simplify DiffBuilder to directly write lines per hunk instead of buffering. This fixes: - Missing flush between hunks causing all changes to merge into one block - Incorrect line numbers when unchanged lines exist between hunks - Token trailing newlines being duplicated in output Also fix pre-existing sys_traits dev-dependency missing getrandom feature and remove dead commented-out code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kajukitli

not approving this as-is

it definitely makes the implementation smaller/faster, but it also regresses the output semantics in a pretty user-visible way.

before, the diff grouped adjacent delete/insert chunks together and kept unchanged context lines in the output shape. with this version you're only iterating diff.hunks() and emitting removed/added lines directly, so the output is much flatter and loses some of the structure that made the formatter diff readable.

you can see it in the adjusted tests:

the multiline case no longer keeps the old/new console.log( pairing together in the same way
the single-line case drops the trailing blank added line entirely

for a formatter diff, readability matters more than raw throughput. if the goal is "proof of concept", then yeah the direction seems plausible, but i'd want the old output quality preserved before merging.

Address review feedback: show unchanged context lines between diff hunks so that related changes remain visually connected (e.g. console.log( appearing as both -/+ between surrounding changes). Context lines use non-highlighted colors to distinguish them from actual changes. Only lines immediately preceding a hunk are shown as context to avoid noise. Also update Cargo.lock for the getrandom feature addition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kajukitli

rechecked after the follow-up commits

this is closer, but i still wouldn't merge it yet. the main semantic regression is still there: trailing newline-only insertions/deletions are getting collapsed away because write_*_line() strips the final \n and only emits one logical line per interned line token.

that's exactly why the test had to change from:

2 | +some line text test
3 | +

to just:

2 | +some line text test

for formatter diffs that last blank line is real output and users do care about it. same problem shows up in the multiline test churn too.

so yeah, the perf direction still seems good, but i think the diff renderer needs to preserve newline-only hunks before this is ready.

Interleave deleted and inserted lines within each hunk so that corresponding old/new line pairs appear adjacent in the output. This matches the expected format in spec tests (e.g. gitignore) and preserves readability for formatter diffs. Also fix rustfmt formatting in test code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove context lines between diff hunks that were causing spec test failures. Remove unused prev_after_end variable and dead code (write_context_line, fmt_rem_text, fmt_add_text). Update test expectations since imara-diff correctly identifies unchanged lines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The imara-diff algorithm matches identical lines (like closing braces) as unchanged context rather than delete/insert pairs, producing more minimal diffs. Update .out files to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kajukitli

looked again and my take hasn't changed

the perf angle is still good, but the renderer is still stripping structural newline information:

let text = text.strip_suffix('\n').unwrap_or(text);

both write_rem_line() and write_add_line() do that, so newline-only changes still disappear from the rendered diff. that's why the old test case with the explicit blank added line is still gone.

for formatter diffs, losing a trailing blank line change is not cosmetic — it's a real output regression. i'd still keep this in "good direction, not merge-ready" territory until blank-line/newline-only hunks are preserved.

Add "\ No newline at end of file" marker (matching git convention) when a line lacks a trailing newline, so that trailing-newline changes are visible in the rendered diff instead of silently disappearing after strip_suffix('\n'). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kajukitli

Reviewed the changes. Initial findings didn't hold up on closer inspection. Looks good.

KnorpelSenf · 2026-03-12T21:06:17Z

The upstream crate so far only supports word-diffing. That's a regression from the current implementation of Deno. You need to do one of the following:

simply accept the regression and only support word-based diffs
wait for feat: character diffs pascalkuthe/imara-diff#38 to be merged
re-implement the logic from that PR in Deno
acknowledge that the current changes are insufficient and close this PR

Emit a `...` separator line between hunks that have gaps (skipped unchanged lines), making it clear when diff output jumps across non-adjacent lines. Also remove unnecessary getrandom feature from dev-dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dsherret

LGTM! I think just word diffing is probably fine?

KnorpelSenf mentioned this pull request Sep 8, 2025

deno fmt --check gets stuck on large HTML files #30634

Open

KnorpelSenf commented Sep 8, 2025

View reviewed changes

libs/resolver/display.rs Outdated Show resolved Hide resolved

KnorpelSenf added 3 commits September 8, 2025 21:10

style: fix lints

9aab445

Merge branch 'main' into fast-file-diff

0770d7f

build: enabled unified_diff feature for imara

d257d1a

KnorpelSenf changed the title ~~perf: speed up file diffing~~ perf(fmt): speed up file diffing Sep 10, 2025

Merge branch 'main' into fast-file-diff

4bd1ca6

dsherret reviewed Sep 15, 2025

View reviewed changes

KnorpelSenf added 7 commits September 18, 2025 15:23

Merge remote-tracking branch 'fork/fast-file-diff' into fast-file-diff

1b1496c

Merge branch 'main' into fast-file-diff

f5d0f2b

feat: restore first bits of diff formatting

b834694

build: disable unused imara diff feature

7a863b4

test: revert temporary changes

0332a9a

fix: bad rename

85057ca

fix: lints

ed4188a

KnorpelSenf mentioned this pull request Sep 18, 2025

Add convenience feature for word diff pascalkuthe/imara-diff#1

Closed

KnorpelSenf added 2 commits September 27, 2025 14:02

Merge branch 'main' into fast-file-diff

ebc2b9c

Merge branch 'main' into fast-file-diff

b2fb495

Copilot AI review requested due to automatic review settings November 7, 2025 15:25

Copilot AI reviewed Nov 7, 2025

View reviewed changes

libs/resolver/display.rs Show resolved Hide resolved

Copilot AI reviewed Nov 7, 2025

View reviewed changes

KnorpelSenf requested a review from Copilot November 7, 2025 17:41

Copilot AI reviewed Nov 7, 2025

View reviewed changes

libs/resolver/display.rs Show resolved Hide resolved

Copilot AI reviewed Nov 7, 2025

View reviewed changes

Merge branch 'main' into fast-file-diff

15bfdd9

coderabbitai bot reviewed Nov 22, 2025

View reviewed changes

KnorpelSenf marked this pull request as draft November 22, 2025 12:30

Merge branch 'main' into fast-file-diff

60ed8e5

bartlomieju marked this pull request as ready for review March 12, 2026 11:01

kajukitli reviewed Mar 12, 2026

View reviewed changes

bartlomieju and others added 3 commits March 12, 2026 14:48

kajukitli reviewed Mar 12, 2026

View reviewed changes

kajukitli approved these changes Mar 12, 2026

View reviewed changes

dsherret approved these changes Mar 12, 2026

View reviewed changes

bartlomieju added 2 commits March 12, 2026 22:28

update tests

a24eb00

fix the test

1311270

Conversation

KnorpelSenf commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

dsherret Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dsherret Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

dsherret Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Sep 27, 2025

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Sequence Diagram

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

kajukitli left a comment

Choose a reason for hiding this comment

Uh oh!

kajukitli left a comment

Choose a reason for hiding this comment

Uh oh!

kajukitli left a comment

Choose a reason for hiding this comment

Uh oh!

kajukitli left a comment

Choose a reason for hiding this comment

Uh oh!

KnorpelSenf commented Mar 12, 2026

Uh oh!

dsherret left a comment

Choose a reason for hiding this comment

KnorpelSenf commented Sep 8, 2025 •

edited

Loading

dsherret Sep 15, 2025 •

edited

Loading

KnorpelSenf Sep 18, 2025 •

edited

Loading

KnorpelSenf Sep 18, 2025 •

edited

Loading

KnorpelSenf Nov 8, 2025 •

edited

Loading

coderabbitai bot commented Nov 22, 2025 •

edited

Loading