Problem
During incremental TypeScript migration (strangler-fig / rename .js → .ts), teams often end up with near-identical file pairs that differ only in minor ways (// @ts-nocheck, a few type annotations, import path tweaks). These pairs are meaningful duplication debt, but jscpd does not report them by default: .js is tokenized as javascript and .ts as typescript, so the two land in separate comparison pools (see detector namespace below).
Current workaround
This works, but is non-obvious and replaces default extension mappings:
npx jscpd --formats-exts "javascript:js,ts,mjs,cjs,tsx,jsx" src
Downsides:
- Requires knowing that formats must be merged, not just enabled via -f javascript,typescript
- --formats-exts redefines mappings for listed formats; easy to accidentally drop extensions (e.g. .tsx, .mts)
- No documented preset for the common JS/TS migration case
- Mental model mismatch: users want “compare related languages,” not “reconfigure grammars”
Note: jscpd v5+ already supports cross-format detection for multi-block files (Vue, Svelte, Astro, Markdown). The JS/TS case is the same class of problem for monorepos mid-migration.
Where formats are defined today
Relevant source and docs for maintainers reviewing this request:
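Format registry and extension mappings
- packages/tokenizer/src/formats.ts (EXT_TO_FORMAT, getFormatByFile())
- formats.ts lines 696–724
- packages/tokenizer/src/languages/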
Default extension mappings for the JS/TS ecosystem (from formats.ts):
| Format | Extensions |
| --- | --- |
| javascript | .js, .es, .es6, .mjs, .cjs |
| typescript | .ts, .mts, .cts |
| tsx | .tsx |
| jsx | .jsx |
Why .js and .ts are not compared today
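Clone detection stores token hashes in a per-format namespace. Each file is detected, tokenized under its format, then compared only against tokens in the same namespace:
- packages/core/src/detector.ts (store.namespace(format) at lines 31 and 36)
- packages/core/src/store/memory.ts
- packages/finder/src/in-files-detector.ts (uses getFormatByFile())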
This is why -f javascript,typescript scans both but does not cross-compare them.
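The namespace behavior described above can be illustrated with a minimal in-memory sketch. This is not jscpd's actual store: `NamespacedStore`, `seen`, the `"js-ts"` namespace, and the hash string are hypothetical names chosen for illustration, and the sketch assumes both files produce the same hash for an identical token window. It only shows why equal hashes never meet when they are filed under different format keys.

```typescript
// Hypothetical sketch, not jscpd's real store: one hash map per namespace.
type Occurrence = { file: string; format: string };

class NamespacedStore {
  private pools = new Map<string, Map<string, Occurrence>>();

  // Record a token-window hash. Returns the earlier occurrence if the same
  // hash was already seen in the SAME namespace, otherwise undefined.
  seen(namespace: string, hash: string, occ: Occurrence): Occurrence | undefined {
    let pool = this.pools.get(namespace);
    if (!pool) {
      pool = new Map<string, Occurrence>();
      this.pools.set(namespace, pool);
    }
    const previous = pool.get(hash);
    if (!previous) pool.set(hash, occ);
    return previous;
  }
}

const store = new NamespacedStore();
const hash = "a1b2c3"; // stand-in for the hash of an identical token window

// Namespace keyed by format: the .ts copy never sees the .js copy.
const perFormatFirst = store.seen("javascript", hash, { file: "foo.js", format: "javascript" });
const perFormatSecond = store.seen("typescript", hash, { file: "foo.ts", format: "typescript" });

// Shared namespace: the second file finds the first.
store.seen("js-ts", hash, { file: "foo.js", format: "javascript" });
const sharedMatch = store.seen("js-ts", hash, { file: "foo.ts", format: "typescript" });

console.log(perFormatFirst, perFormatSecond, sharedMatch?.file);
```

In the per-format case both lookups come back empty, so no clone is reported; in the shared case the .ts lookup returns the .js occurrence.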
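--formats-exts workaround wiring
- formatsExts config option
- packages/finder/src/utils/options.ts (parseFormatsExtensions)
- packages/core/src/interfaces/options.interface.ts
- apps/jscpd/src/options.ts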
Note from docs: --formats-exts redefines default mappings for the formats you list — you must include every extension you still want covered.
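The override hazard can be made concrete with a small check. This is a sketch under assumptions: `parseFormatsExts` and `missingFromMergedPool` are hypothetical helpers, not jscpd's `parseFormatsExtensions`, and the `format:ext,ext;format:ext` syntax is assumed from the workaround command. `DEFAULTS` copies the JS/TS table above. The check lists which default JS/TS extensions are absent from the merged format's list.

```typescript
// Defaults copied from the JS/TS table in this issue (extensions without dots).
const DEFAULTS: Record<string, string[]> = {
  javascript: ["js", "es", "es6", "mjs", "cjs"],
  typescript: ["ts", "mts", "cts"],
  tsx: ["tsx"],
  jsx: ["jsx"],
};

// Hypothetical parser for a spec such as "javascript:js,ts;dart:dt".
function parseFormatsExts(spec: string): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const part of spec.split(";")) {
    const [format, exts] = part.split(":");
    if (format && exts) {
      result[format.trim()] = exts.split(",").map((e) => e.trim());
    }
  }
  return result;
}

// Default JS/TS extensions that the merged pool does not list.
function missingFromMergedPool(spec: string, pool: string): string[] {
  const listed = new Set(parseFormatsExts(spec)[pool] ?? []);
  const defaults: string[] = [];
  for (const format of Object.keys(DEFAULTS)) defaults.push(...DEFAULTS[format]);
  return defaults.filter((ext) => !listed.has(ext));
}

const missing = missingFromMergedPool("javascript:js,ts,mjs,cjs,tsx,jsx", "javascript");
console.log(missing); // [ 'es', 'es6', 'mts', 'cts' ]
```

Even the workaround command shown earlier leaves .es, .es6, .mts, and .cts outside the merged javascript pool, which is the kind of silent gap a dedicated cross-format option would avoid.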
Suggested implementation touchpoints for Options A / B
- Option A (format groups): map grouped formats to a shared store namespace in detector.ts / memory.ts, while keeping native tokenizers from languages/.
- Option B (global pool): use a single namespace for all formats (or a configurable default group) in the same detector/store layer.
Proposed solution
Add explicit options for cross-format duplicate detection without fully overriding formatsExts.
Option A: format equivalence groups (targeted)
Declare which related formats should share a comparison pool. Files keep their native tokenizer; only clone matching crosses format boundaries within each group.
CLI
# Compare clones across listed formats (union comparison pool)
jscpd --cross-formats "javascript,typescript" src/
# Or named presets for common cases
jscpd --cross-formats-preset js-ts src/
Config (.jscpd.json)
{
  "crossFormats": [
    ["javascript", "typescript"],
    ["javascript", "typescript", "tsx"]
  ]
}
Presets (optional)
{
  "crossFormatsPresets": {
    "js-ts": ["javascript", "typescript"],
    "js-ts-tsx": ["javascript", "typescript", "tsx"]
  }
}
Option B: single global pool (all formats)
For exploratory scans or polyglot codebases, add a flag that puts every detected format into one comparison pool:
CLI
# All scanned formats compared against each other
jscpd --cross-formats-all src/
# Shorthand alias (either name is fine)
jscpd --unified-pool src/
Config (.jscpd.json)
{
  "crossFormatsAll": true
}
This avoids hand-maintaining --formats-exts mappings when the goal is simply “find any duplicate logic, regardless of language.” Reports should still show each file’s detected format so users can judge relevance.
Trade-offs: a global pool may surface low-signal matches across unrelated languages (e.g. similar control-flow in PHP and JavaScript). Option A is preferable for CI; Option B is useful for one-off audits. Consider pairing Option B with --format to limit scope, e.g. jscpd --cross-formats-all --format javascript,typescript,python src/.
Expected behavior
- Files keep their native format/tokenizer where possible (Option A and Option B)
- Clone detection runs across grouped formats (Option A) or all formats (Option B) in a shared comparison pool
- Pairs like foo.js / foo.ts are reported with both paths and formats shown
- Default behavior unchanged when neither option is set
- Option B does not replace or break existing per-format tokenization — it only changes which token streams are compared against each other
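The expected behavior above could be sketched as a single namespace-resolution step that sits in front of the store. Everything here is hypothetical: `CrossFormatOptions` mirrors the proposed config keys, while `resolveNamespace` and `GLOBAL_POOL` are illustrative names that do not exist in jscpd. Tokenization is left untouched; only the key used for comparison changes.

```typescript
// Hypothetical option shape mirroring the proposed config keys.
interface CrossFormatOptions {
  crossFormats?: string[][];
  crossFormatsAll?: boolean;
}

const GLOBAL_POOL = "*";

// Decide which comparison pool a file's tokens go into.
function resolveNamespace(format: string, options: CrossFormatOptions = {}): string {
  // Option B: every format shares one pool.
  if (options.crossFormatsAll) return GLOBAL_POOL;

  // Option A: the first group that lists the format wins.
  for (const group of options.crossFormats ?? []) {
    if (group.indexOf(format) !== -1) {
      return group.slice().sort().join("+");
    }
  }

  // Neither option set (or format not grouped): per-format pool as today.
  return format;
}

const grouped: CrossFormatOptions = { crossFormats: [["javascript", "typescript"]] };

console.log(resolveNamespace("javascript", grouped)); // javascript+typescript
console.log(resolveNamespace("typescript", grouped)); // javascript+typescript
console.log(resolveNamespace("python", grouped));     // python
console.log(resolveNamespace("python", { crossFormatsAll: true })); // *
console.log(resolveNamespace("javascript"));          // javascript
```

One design question this surfaces: when groups overlap, as in the config example with ["javascript", "typescript"] and ["javascript", "typescript", "tsx"], the sketch picks the first matching group. A real implementation would need to define whether overlapping groups are merged, rejected, or resolved by order.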
Use cases
- Finding leftover .js copies after .ts migration (Option A)
- CI gates during strangler-fig refactors (“no js/ts file pairs above N lines”) (Option A)
- Detecting duplicated logic between .ts and .tsx variants (Option A)
- Broad duplication audits across a polyglot repo without configuring format groups (Option B)
- Finding copy-paste between backend and frontend when logic was ported across languages (Option B, with --format scope)
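For the CI-gate use case, a post-processing step over the clone report might look like the sketch below. The `CloneRecord` shape is an assumption: it presumes each reported file carries its own detected format, as requested under Expected behavior, and the real report shape may differ. `crossFormatViolations` is a hypothetical helper and the sample data is made up for illustration.

```typescript
// Assumed, simplified clone record; not a confirmed jscpd report schema.
interface CloneRecord {
  lines: number;
  firstFile: { name: string; format: string };
  secondFile: { name: string; format: string };
}

// Keep only clones that cross a format boundary and exceed the line budget.
function crossFormatViolations(clones: CloneRecord[], maxLines: number): CloneRecord[] {
  return clones.filter(
    (c) => c.firstFile.format !== c.secondFile.format && c.lines > maxLines
  );
}

// Illustrative data only.
const sampleClones: CloneRecord[] = [
  {
    lines: 42,
    firstFile: { name: "src/foo.js", format: "javascript" },
    secondFile: { name: "src/foo.ts", format: "typescript" },
  },
  {
    lines: 80,
    firstFile: { name: "src/a.ts", format: "typescript" },
    secondFile: { name: "src/b.ts", format: "typescript" },
  },
  {
    lines: 5,
    firstFile: { name: "src/bar.js", format: "javascript" },
    secondFile: { name: "src/bar.ts", format: "typescript" },
  },
];

const violations = crossFormatViolations(sampleClones, 10);
for (const v of violations) {
  console.log(`${v.firstFile.name} <-> ${v.secondFile.name}: ${v.lines} lines`);
}
```

A CI job could fail when `violations` is non-empty, which gives the "no js/ts file pairs above N lines" gate without affecting same-format duplication thresholds.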
Alternatives considered
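- -f javascript,typescript (scans both formats but does not cross-compare them)
- --formats-exts "javascript:js,ts" (works, but redefines the extension mappings for the listed format)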