cross-format duplicate detection for related languages (e.g. JavaScript ↔ TypeScript) #810

@Jimbolino

Problem

During incremental TypeScript migration (strangler-fig, renaming .js → .ts), teams often end up with near-identical file pairs that differ only in minor ways (// @ts-nocheck, a few type annotations, import path tweaks). These pairs are real duplication debt, but jscpd does not report them by default because .js is tokenized as javascript and .ts as typescript, and the two formats live in separate comparison pools (see the detector namespace notes below).

Current workaround

This works, but is non-obvious and replaces default extension mappings:

npx jscpd --formats-exts "javascript:js,ts,mjs,cjs,tsx,jsx" src

Downsides:

  • Requires knowing that formats must be merged, not just enabled via -f javascript,typescript
  • --formats-exts redefines mappings for listed formats; easy to accidentally drop extensions (e.g. .tsx, .mts)
  • No documented preset for the common JS/TS migration case
  • Mental model mismatch: users want “compare related languages,” not “reconfigure grammars”

Note: jscpd v5+ already supports cross-format detection for multi-block files (Vue, Svelte, Astro, Markdown). The JS/TS case is the same class of problem for monorepos mid-migration.

Where formats are defined today

Relevant source and docs for maintainers reviewing this request:

Format registry and extension mappings

| What | Link |
| --- | --- |
| Human-readable format list (generated from source) | FORMATS.md |
| Source of truth: format names, default extensions, parent formats | packages/tokenizer/src/formats.ts |
| Extension → format lookup (EXT_TO_FORMAT, getFormatByFile()) | formats.ts lines 696–724 |
| Per-language Prism grammars (tokenizers) | packages/tokenizer/src/languages/ |

Default extension mappings for the JS/TS ecosystem (from formats.ts):

| Format | Extensions |
| --- | --- |
| javascript | .js, .es, .es6, .mjs, .cjs |
| typescript | .ts, .mts, .cts |
| tsx | .tsx |
| jsx | .jsx |

Why .js and .ts are not compared today

Clone detection stores token hashes in a per-format namespace. For each file, jscpd resolves the format, tokenizes the file under that format, and compares the resulting token hashes only against entries in the same namespace:

| What | Link |
| --- | --- |
| Detector sets store namespace per format | packages/core/src/detector.ts (store.namespace(format) at lines 31 and 36) |
| In-memory store partitioned by namespace | packages/core/src/store/memory.ts |
| File → format resolution at scan time | packages/finder/src/in-files-detector.ts (uses getFormatByFile()) |

This is why -f javascript,typescript scans both but does not cross-compare them.
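To make the partitioning concrete, here is a minimal sketch of a namespaced store. It is a simplified stand-in, not the code in packages/core/src/store/memory.ts: the class name, the add() method, and the "js-ts" namespace are all hypothetical, and it assumes for illustration that both files produce the same token hash.

```typescript
// Hypothetical, simplified namespaced store. Not jscpd's real implementation.
class NamespacedStore {
  private readonly pools = new Map<string, Map<string, string>>();

  // Records `file` under `hash` in the pool for `namespace` and returns the
  // file already stored under that hash in the same pool, if there is one.
  add(namespace: string, hash: string, file: string): string | undefined {
    let pool = this.pools.get(namespace);
    if (pool === undefined) {
      pool = new Map<string, string>();
      this.pools.set(namespace, pool);
    }
    const previous = pool.get(hash);
    if (previous === undefined) {
      pool.set(hash, file);
    }
    return previous;
  }
}

const store = new NamespacedStore();
const sameHash = "a1b2c3"; // assumption: both files yield this token hash

// Per-format namespaces: the .ts lookup never sees the .js entry.
const firstInsert = store.add("javascript", sameHash, "src/foo.js");
const separatePools = store.add("typescript", sameHash, "src/foo.ts");

// Shared (hypothetical) namespace: the second insert finds the first.
store.add("js-ts", sameHash, "src/foo.js");
const sharedPool = store.add("js-ts", sameHash, "src/foo.ts");

console.log(firstInsert, separatePools, sharedPool);
```

In this sketch the only thing that decides whether foo.js and foo.ts can match is the namespace key, which is why the proposal below focuses on that layer rather than on the tokenizers.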

--formats-exts workaround wiring

| What | Link |
| --- | --- |
| CLI docs: Formats Extensions | apps/jscpd/README.md#formats-extensions |
| Site docs: Custom Format Extensions | jscpd.dev (Supported Formats) |
| Site docs: formatsExts config option | jscpd.dev (Configuration) |
| CLI string → config object parser | packages/finder/src/utils/options.ts (parseFormatsExtensions) |
| Options interface | packages/core/src/interfaces/options.interface.ts |
| CLI option wiring | apps/jscpd/src/options.ts |

Note from docs: --formats-exts redefines default mappings for the formats you list — you must include every extension you still want covered.
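A rough sketch of why this is easy to get wrong: the snippet below parses a --formats-exts style string and lists the default extensions that the override silently drops. The parser is an illustration, not jscpd's parseFormatsExtensions; the helper names are hypothetical, the "format:ext,ext;format:ext" syntax is assumed, and the defaults are copied from the table earlier in this issue.

```typescript
type FormatExts = Record<string, string[]>;

// Defaults copied from the JS/TS table in this issue (not read from formats.ts).
const DEFAULTS: FormatExts = {
  javascript: ["js", "es", "es6", "mjs", "cjs"],
  typescript: ["ts", "mts", "cts"],
  tsx: ["tsx"],
  jsx: ["jsx"],
};

// Hypothetical parser for "format:ext,ext;format:ext" (assumed syntax).
function parseFormatsExts(spec: string): FormatExts {
  const result: FormatExts = {};
  for (const part of spec.split(";")) {
    const [format, exts] = part.split(":");
    if (!format || !exts) continue; // skip malformed segments
    result[format.trim()] = exts
      .split(",")
      .map((ext) => ext.trim())
      .filter((ext) => ext.length > 0);
  }
  return result;
}

// For each overridden format, lists default extensions missing from the override.
function droppedExtensions(overrides: FormatExts): FormatExts {
  const dropped: FormatExts = {};
  for (const format of Object.keys(overrides)) {
    const listed = overrides[format] ?? [];
    const defaults = DEFAULTS[format] ?? [];
    const missing = defaults.filter((ext) => listed.indexOf(ext) === -1);
    if (missing.length > 0) dropped[format] = missing;
  }
  return dropped;
}

// The workaround string from this issue.
const overrides = parseFormatsExts("javascript:js,ts,mjs,cjs,tsx,jsx");
const dropped = droppedExtensions(overrides);
console.log(dropped);
```

Under these assumptions the workaround drops .es and .es6 from javascript, and .mts / .cts are never pulled into the merged pool because they were not listed.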

Suggested implementation touchpoints for Options A / B

  • Option A (format groups): map grouped formats to a shared store namespace in detector.ts / memory.ts, while keeping native tokenizers from languages/.
  • Option B (global pool): use a single namespace for all formats (or a configurable default group) in the same detector/store layer.
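As a sketch of what both options could reduce to, the function below maps a file's format to the namespace key its token hashes would be stored under. Everything here is hypothetical: resolveNamespace, the CrossFormatOptions shape, the "*" key for the global pool, and the first-match rule for groups are assumptions for discussion, not existing jscpd code.

```typescript
// Proposed options from this issue; neither exists in jscpd today.
interface CrossFormatOptions {
  crossFormats?: string[][]; // Option A: groups of related formats
  crossFormatsAll?: boolean; // Option B: one pool for everything
}

// Returns the store namespace for a format. Tokenization is untouched;
// only the key used for comparison changes.
function resolveNamespace(format: string, options: CrossFormatOptions): string {
  if (options.crossFormatsAll) {
    return "*"; // hypothetical key for the single global pool
  }
  for (const group of options.crossFormats ?? []) {
    if (group.indexOf(format) !== -1) {
      // Order-independent key so ["a","b"] and ["b","a"] share a pool.
      return group.slice().sort().join("+");
    }
  }
  return format; // default: per-format namespace, as today
}

const groups: CrossFormatOptions = {
  crossFormats: [["javascript", "typescript"]],
};

console.log(resolveNamespace("javascript", groups)); // javascript+typescript
console.log(resolveNamespace("typescript", groups)); // javascript+typescript
console.log(resolveNamespace("python", groups)); // python
console.log(resolveNamespace("python", { crossFormatsAll: true })); // *
```

One design question this surfaces: if a format appears in more than one group, the first-match rule used here puts it in the first group's pool only, so overlapping groups would need an explicit rule (merge them, or reject the config).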

Proposed solution

Add explicit options for cross-format duplicate detection without fully overriding formatsExts.

Option A: format equivalence groups (targeted)

Declare which related formats should share a comparison pool. Files keep their native tokenizer; only clone matching crosses format boundaries within each group.

CLI

# Compare clones across listed formats (union comparison pool)
jscpd --cross-formats "javascript,typescript" src/

# Or named presets for common cases
jscpd --cross-formats-preset js-ts src/

Config (.jscpd.json)

{
  "crossFormats": [
    ["javascript", "typescript"],
    ["javascript", "typescript", "tsx"]
  ]
}

Presets (optional)

{
  "crossFormatsPresets": {
    "js-ts": ["javascript", "typescript"],
    "js-ts-tsx": ["javascript", "typescript", "tsx"]
  }
}

Option B: single global pool (all formats)

For exploratory scans or polyglot codebases, add a flag that puts every detected format into one comparison pool:

CLI

# All scanned formats compared against each other
jscpd --cross-formats-all src/

# Shorthand alias (either name is fine)
jscpd --unified-pool src/

Config (.jscpd.json)

{
  "crossFormatsAll": true
}

This avoids hand-maintaining --formats-exts mappings when the goal is simply “find any duplicate logic, regardless of language.” Reports should still show each file’s detected format so users can judge relevance.

Trade-offs: a global pool may surface low-signal matches across unrelated languages (e.g. similar control-flow in PHP and JavaScript). Option A is preferable for CI; Option B is useful for one-off audits. Consider pairing Option B with --format to limit scope, e.g. jscpd --cross-formats-all --format javascript,typescript,python src/.

Expected behavior

  • Files keep their native format/tokenizer where possible (Option A and Option B)
  • Clone detection runs across grouped formats (Option A) or all formats (Option B) in a shared comparison pool
  • Pairs like foo.js / foo.ts are reported with both paths and formats shown
  • Default behavior unchanged when neither option is set
  • Option B does not replace or break existing per-format tokenization — it only changes which token streams are compared against each other

Use cases

  • Finding leftover .js copies after .ts migration (Option A)
  • CI gates during strangler-fig refactors (“no js/ts file pairs above N lines”) (Option A)
  • Detecting duplicated logic between .ts and .tsx variants (Option A)
  • Broad duplication audits across a polyglot repo without configuring format groups (Option B)
  • Finding copy-paste between backend and frontend when logic was ported across languages (Option B, with --format scope)
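The CI-gate use case could look roughly like the sketch below, which filters clone records down to cross-format pairs above a line threshold. The CloneRecord shape and the crossFormatViolations helper are hypothetical; the field names are illustrative and are not taken from jscpd's report schema.

```typescript
// Hypothetical clone record; field names are illustrative only.
interface CloneRecord {
  lines: number;
  first: { path: string; format: string };
  second: { path: string; format: string };
}

// Keeps clones whose two sides have different formats, both within
// `formats`, and whose length exceeds `maxLines`.
function crossFormatViolations(
  clones: CloneRecord[],
  formats: string[],
  maxLines: number,
): CloneRecord[] {
  return clones.filter(
    (clone) =>
      clone.first.format !== clone.second.format &&
      formats.indexOf(clone.first.format) !== -1 &&
      formats.indexOf(clone.second.format) !== -1 &&
      clone.lines > maxLines,
  );
}

// Made-up sample data for illustration.
const clones: CloneRecord[] = [
  {
    lines: 42,
    first: { path: "src/foo.js", format: "javascript" },
    second: { path: "src/foo.ts", format: "typescript" },
  },
  {
    lines: 8,
    first: { path: "src/a.js", format: "javascript" },
    second: { path: "src/a.ts", format: "typescript" },
  },
  {
    lines: 60,
    first: { path: "src/b.js", format: "javascript" },
    second: { path: "src/c.js", format: "javascript" },
  },
];

const violations = crossFormatViolations(clones, ["javascript", "typescript"], 20);
for (const v of violations) {
  console.log(`${v.first.path} <-> ${v.second.path} (${v.lines} lines)`);
}
```

A gate like this only works if reports carry each file's detected format, which is why the expected behavior above asks for both paths and formats to be shown.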

Alternatives considered

| Approach | Issue |
| --- | --- |
| -f javascript,typescript | Enables both formats but does not cross-compare (separate namespaces) |
| --formats-exts "javascript:js,ts" | Works but overrides mappings; poor discoverability; must list every extension |
| Normalizing TS → JS before scan | External workaround; loses jscpd’s tokenizer benefits |
