Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 74 additions & 19 deletions rust/crates/cpd-core/src/detect.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

use rayon::prelude::*;
use rustc_hash::{FxHashMap, FxHashSet};
use std::path::Path;
use std::path::{Path, PathBuf};

use crate::{
hash::{base_pow, hash_window, roll, token_hash},
Expand Down Expand Up @@ -69,20 +69,24 @@ impl CloneDedupKey {
/// Files are grouped by format; each format group is processed independently.
/// Rayon is used for outer parallelism (one task per format group).
pub fn detect(files: &[SourceFile], min_tokens: usize) -> Vec<CpdClone> {
detect_with_options(files, min_tokens, false, 0)
detect_with_options(files, min_tokens, false, 0, &[])
}

/// Detect clones with extended options.
///
/// - `skip_local`: skip clone pairs where both fragments share the same parent directory.
/// - `skip_local`: skip clone pairs where both fragments are under the same scan root.
/// - `min_lines`: reject clones whose fragment line span is shorter than this.
/// The line span is `end.line - start.line`; a clone is kept only if this value
/// is >= `min_lines`. This mirrors jscpd's `LinesLengthCloneValidator`.
/// - `scan_roots`: when `skip_local` is true, clone pairs where both fragments
/// are relative to the same scan root are skipped. Mirrors jscpd's
/// `SkipLocalValidator` which checks `path.some(dir => isRelative(fileA, dir) && isRelative(fileB, dir))`.
pub fn detect_with_options(
files: &[SourceFile],
min_tokens: usize,
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
) -> Vec<CpdClone> {
if files.is_empty() || min_tokens == 0 {
return vec![];
Expand Down Expand Up @@ -129,7 +133,7 @@ pub fn detect_with_options(
}
})
.collect();
detect_in_group(&prepared, min_tokens, skip_local, min_lines)
detect_in_group(&prepared, min_tokens, skip_local, min_lines, scan_roots)
})
.collect();

Expand Down Expand Up @@ -189,19 +193,22 @@ impl PreparedSource {
/// Detect clones from pre-prepared sources grouped by format.
///
/// Called by orchestrate.rs after `tokenize_to_detection` — skips re-hashing.
/// - `scan_roots`: when `skip_local` is true, clone pairs where both fragments
/// are relative to the same scan root are skipped.
pub fn detect_prepared(
format_groups: Vec<Vec<PreparedSource>>,
min_tokens: usize,
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
) -> Vec<CpdClone> {
if format_groups.is_empty() || min_tokens == 0 {
return vec![];
}

let mut clones: Vec<CpdClone> = format_groups
.into_par_iter()
.flat_map(|group| detect_in_group(&group, min_tokens, skip_local, min_lines))
.flat_map(|group| detect_in_group(&group, min_tokens, skip_local, min_lines, scan_roots))
.collect();

dedup_exact_clones(&mut clones);
Expand Down Expand Up @@ -233,6 +240,7 @@ fn detect_in_group(
min_tokens: usize,
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
) -> Vec<CpdClone> {
// Precompute window_power once for this format group.
// If per-language min_tokens is introduced, recompute per group (it is already scoped here).
Expand Down Expand Up @@ -318,6 +326,7 @@ fn detect_in_group(
prepared,
skip_local,
min_lines,
scan_roots,
&mut clones,
);
store.insert(window_hash, current);
Expand All @@ -332,6 +341,7 @@ fn detect_in_group(
prepared,
skip_local,
min_lines,
scan_roots,
&mut clones,
);
}
Expand All @@ -342,6 +352,7 @@ fn detect_in_group(
min_tokens,
skip_local,
min_lines,
scan_roots,
&mut clones,
);

Expand All @@ -352,6 +363,45 @@ fn detect_in_group(
// Open clone state machine helpers
// ---------------------------------------------------------------------------

/// Returns true if both files share a common scan root directory.
/// Mirrors jscpd's `SkipLocalValidator.shouldSkipClone`:
/// `path.some(dir => isRelative(fileA, dir) && isRelative(fileB, dir))`
fn should_skip_local(file_a: &str, file_b: &str, scan_roots: &[PathBuf]) -> bool {
scan_roots
.iter()
.any(|root| is_relative_to(file_a, root) && is_relative_to(file_b, root))
}

/// Returns true if `file_path` is contained within `dir`.
/// Mirrors the TypeScript `SkipLocalValidator.isRelative`:
/// `const rel = relative(dir, file); return rel !== '' && !rel.startsWith('..') && !isAbsolute(rel);`
fn is_relative_to(file_path: &str, dir: &PathBuf) -> bool {
let file = Path::new(file_path);
// Fast path: file path starts with the dir prefix
if let Ok(rel) = file.strip_prefix(dir) {
return !rel.as_os_str().is_empty();
}
// Mixed absolute/relative can never match via simple prefix
if file.is_absolute() != dir.is_absolute() {
return false;
}
// Walk up from the file path checking if any ancestor starts with dir
let mut ancestor = file;
loop {
if ancestor == dir.as_path() {
return false;
}
if ancestor.starts_with(dir) {
let rel = ancestor.strip_prefix(dir).unwrap_or(ancestor);
return !rel.as_os_str().is_empty();
}
ancestor = match ancestor.parent() {
Some(p) => p,
None => return false,
};
}
Comment on lines +369 to +402

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize both sides before is_relative_to compares them.

This helper only compares raw path components, so spellings like ./repo/src/a.js vs repo or repo/../repo/lib/b.js return false here even though the TypeScript path.relative() check referenced in the doc comment would still treat both files as under the same root. orchestrate.rs canonicalizes the CLI path, but detect_with_options and detect_prepared are public entry points, so skip_local still depends on caller-specific path spelling unless this helper normalizes lexically or the API explicitly requires normalized IDs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cpd-core/src/detect.rs` around lines 369 - 402, The
is_relative_to helper function compares raw path components without
normalization, causing false negatives when paths use different spellings like
./repo/src/a.js versus repo, or paths with .. components. Normalize both the
file_path and dir parameters before performing any prefix checks and ancestor
walking in the is_relative_to function. This ensures that semantically
equivalent paths with different spellings are correctly identified as being
under the same root, matching the behavior of the TypeScript path.relative()
logic referenced in the documentation comment.

}

struct OpenClone {
stored_occurrence: Occurrence,
current_start: usize,
Expand Down Expand Up @@ -391,6 +441,7 @@ fn flush_clone(
prepared: &[PreparedSource],
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
clones: &mut Vec<CpdClone>,
) {
let oc = match open {
Expand All @@ -409,13 +460,11 @@ fn flush_clone(
let ex_end = ex_start + match_len - 1;
let cur_end = cur_start + match_len - 1;

// skip_local: drop clone pairs where both fragments share the same parent directory.
if skip_local {
let dir_a = Path::new(&existing_file.id).parent();
let dir_b = Path::new(&current_file.id).parent();
if dir_a == dir_b {
return;
}
// skip_local: drop clone pairs where both fragments are under the same scan root.
// Mirrors jscpd's SkipLocalValidator which checks
// path.some(dir => isRelative(fileA, dir) && isRelative(fileB, dir))
if skip_local && should_skip_local(&existing_file.id, &current_file.id, scan_roots) {
return;
}

let fragment_a = match make_fragment(&existing_file.id, &existing_file.spans, ex_start, ex_end)
Expand Down Expand Up @@ -518,6 +567,7 @@ fn add_secondary_clones(
min_tokens: usize,
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
clones: &mut Vec<CpdClone>,
) {
if repeated_windows.is_empty() {
Expand Down Expand Up @@ -619,6 +669,7 @@ fn add_secondary_clones(
prepared,
skip_local,
min_lines,
scan_roots,
clones,
&mut coverage,
);
Expand Down Expand Up @@ -667,6 +718,7 @@ fn add_secondary_clones(
prepared,
skip_local,
min_lines,
scan_roots,
clones,
&mut coverage,
);
Expand All @@ -677,6 +729,7 @@ fn flush_secondary_clone(
prepared: &[PreparedSource],
skip_local: bool,
min_lines: usize,
scan_roots: &[PathBuf],
clones: &mut Vec<CpdClone>,
coverage: &mut LineCoverage,
) {
Expand All @@ -687,13 +740,15 @@ fn flush_secondary_clone(
let range_a = fragment_line_range(&oc.clone.fragment_a);
let range_b = fragment_line_range(&oc.clone.fragment_b);

// skip_local check
if skip_local {
let dir_a = Path::new(&prepared[oc.source_a].id).parent();
let dir_b = Path::new(&prepared[oc.source_b].id).parent();
if dir_a == dir_b {
return;
}
// skip_local: drop clone pairs where both fragments are under the same scan root.
if skip_local
&& should_skip_local(
&prepared[oc.source_a].id,
&prepared[oc.source_b].id,
scan_roots,
)
{
return;
}

// min_lines filter: only check fragment A, mirroring jscpd's LinesLengthCloneValidator.
Expand Down
30 changes: 26 additions & 4 deletions rust/crates/cpd-finder/src/orchestrate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -180,7 +180,12 @@ pub fn run(config: &RunConfig) -> Result<RunResult, FinderError> {
}

let content = str::from_utf8(&map).ok()?;
let id = file.path.to_string_lossy().into_owned();
let id = file
.path
.canonicalize()
.unwrap_or_else(|_| file.path.clone())
.to_string_lossy()
.into_owned();
Comment on lines +183 to +188

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep normalization out of the exported id.

This id now feeds both RunResult.sources[*].id and the clone fragment source_ids produced later in cpd-core, and RunConfig has no flag here to opt callers into absolute/canonical paths. Canonicalizing the emitted identifier changes the public output contract just to make skip_local comparisons work. Preserve the discovered/display path for output and carry a separate normalized path for root comparisons.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cpd-finder/src/orchestrate.rs` around lines 183 - 188, The
current code canonicalizes the file path and uses it as the `id`, which changes
the public output contract unexpectedly. Instead, preserve the original
discovered/display path from file.path for the exported `id` to maintain the
current contract, and create a separate normalized/canonical path variable for
internal comparisons like skip_local checks. Update the logic to use the
original path for the id assignment and the canonicalized path only where root
comparisons are needed.


// Compute code-level ignore ranges from regex matches against source text.
// This matches v4 semantics: regex patterns are matched against source
Expand Down Expand Up @@ -304,9 +309,26 @@ pub fn run(config: &RunConfig) -> Result<RunResult, FinderError> {
// Sort groups by format name for determinism.
format_groups.sort_by(|a, b| a[0].format.cmp(&b[0].format));

// 4. Detect clones — skip_local is now handled inside flush_clone.
let clones =
pool.install(|| detect_prepared(format_groups, min_tokens, skip_local, config.min_lines));
// 4. Detect clones — skip_local uses scan roots to determine same-directory pairs.
// Both scan roots and file IDs must use the same path normalization so
// that prefix comparisons work. Canonicalize scan roots once here (resolves
// symlinks like macOS /var → /private/var), and canonicalize file paths in
// the parallel processing loop above. Fall back to the original path if
// canonicalize fails.
let scan_roots: Vec<std::path::PathBuf> = config
.paths
.iter()
.map(|p| std::fs::canonicalize(p).unwrap_or_else(|_| p.clone()))
.collect();
let clones = pool.install(|| {
detect_prepared(
format_groups,
min_tokens,
skip_local,
config.min_lines,
&scan_roots,
)
});

// 5. Compute statistics.
let statistics = statistics::compute(&source_files, &clones);
Expand Down
40 changes: 39 additions & 1 deletion rust/crates/cpd-finder/tests/skip_local_integration.rs
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,41 @@ fn same_directory_clones_removed_by_skip_local() {
);
}

#[test]
fn same_root_subdirectories_skipped_by_skip_local() {
let dir = setup_temp_dir("same_root_subs");

let dir_a = dir.join("subdir_a");
let dir_b = dir.join("subdir_b");
fs::create_dir_all(&dir_a).unwrap();
fs::create_dir_all(&dir_b).unwrap();

fs::write(dir_a.join("file_a.js"), duplicate_js()).unwrap();
fs::write(dir_b.join("file_b.js"), duplicate_js()).unwrap();

// Single scan root with files in different subdirectories.
// skip_local skips clones where both files share a scan root,
// even if they're in different subdirectories.
let config = RunConfig {
paths: vec![dir.clone()],
min_tokens: 5,
min_lines: 1,
mode: Mode::Mild,
skip_local: true,
..Default::default()
};

let result = run(&config).unwrap();
let _ = fs::remove_dir_all(&dir);

assert_eq!(
result.clones.len(),
0,
"skip_local=true must skip clones within the same scan root, got: {}",
result.clones.len()
);
}

#[test]
fn cross_directory_clones_survive_skip_local() {
let dir = setup_temp_dir("cross");
Expand All @@ -62,8 +97,11 @@ fn cross_directory_clones_survive_skip_local() {
fs::write(dir_a.join("file_a.js"), duplicate_js()).unwrap();
fs::write(dir_b.join("file_b.js"), duplicate_js()).unwrap();

// Use BOTH subdirectories as separate scan roots.
// skip_local skips clones where both files share a scan root.
// Files in different roots should survive.
let config = RunConfig {
paths: vec![dir.clone()],
paths: vec![dir_a.clone(), dir_b.clone()],
min_tokens: 5,
min_lines: 1,
mode: Mode::Mild,
Expand Down
2 changes: 1 addition & 1 deletion rust/crates/cpd/src/cli.rs
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,7 @@ pub struct Cli {
pub list: bool,

/// Skip clones where both fragments are in the same directory
#[arg(long)]
#[arg(long, visible_alias = "skipLocal")]
pub skip_local: bool,
Comment on lines 180 to 182

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the help text to match the new skip_local semantics.

The detector no longer skips only same-directory pairs; it now drops any pair that shares a scan root, including different subdirectories under one scanned path. Leaving the old wording here makes --help describe the pre-change behavior.

Suggested text
-    /// Skip clones where both fragments are in the same directory
+    /// Skip clones where both fragments are under the same scan root
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Skip clones where both fragments are in the same directory
#[arg(long)]
#[arg(long, visible_alias = "skipLocal")]
pub skip_local: bool,
/// Skip clones where both fragments are under the same scan root
#[arg(long, visible_alias = "skipLocal")]
pub skip_local: bool,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cpd/src/cli.rs` around lines 180 - 182, Update the doc comment
for the skip_local field to accurately reflect the current behavior. The field's
help text currently states that it skips clones in the same directory, but the
actual behavior now skips any pair that shares a scan root (including different
subdirectories under one scanned path). Modify the comment for the skip_local
boolean field to describe this broader scope of skipping pairs that share a scan
root rather than just same-directory pairs.


/// Minimum percentage of duplication to report (0-100)
Expand Down
Loading