
Add similar document detection tool using MinHash shingling#1870

Closed
aecs4u wants to merge 62 commits into qarmin:master from
aecs4u:feature/similar-documents

Conversation

@aecs4u

@aecs4u aecs4u commented Mar 25, 2026

Summary

New similar-docs subcommand that finds text/document files with similar content using MinHash word-level shingling. This is inspired by FDFF's similar document detection capability.

  • MinHash shingling: Computes document signatures using word-level shingles (default: 3-word windows) with 128 xxh3-based hash functions
  • Jaccard similarity: Estimates content similarity from MinHash signatures, with configurable threshold (default: 0.7)
  • Union-find clustering: Groups similar documents efficiently
  • 30+ file types: txt, md, rst, csv, json, xml, yaml, html, py, rs, js, ts, c, cpp, java, go, etc.
  • Bounded memory: Reads only the first 256 KB of each file
  • Parallel: Uses rayon for signature computation
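
The signature and similarity steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `seeded_hash`, `minhash_signature`, and `estimate_jaccard` are hypothetical names, and the standard library's `DefaultHasher` stands in for xxh3 so the sketch needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one word shingle under a given seed.
/// Assumption: DefaultHasher is a stand-in for the xxh3 hashing the PR describes.
fn seeded_hash(shingle: &[&str], seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    shingle.hash(&mut hasher);
    hasher.finish()
}

/// MinHash signature: for each seed, keep the minimum hash over all word shingles.
/// Returns None when the text is too short to form a single shingle.
fn minhash_signature(text: &str, shingle_size: usize, num_hashes: usize) -> Option<Vec<u64>> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if shingle_size == 0 || words.len() < shingle_size {
        return None;
    }
    let mut signature = vec![u64::MAX; num_hashes];
    for shingle in words.windows(shingle_size) {
        for (seed, slot) in signature.iter_mut().enumerate() {
            *slot = (*slot).min(seeded_hash(shingle, seed as u64));
        }
    }
    Some(signature)
}

/// The fraction of signature positions that agree estimates the Jaccard
/// similarity of the two underlying shingle sets.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matching = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matching as f64 / a.len() as f64
}
```

With the defaults named above, two files would be reported as similar when `estimate_jaccard` over their 128-value signatures reaches 0.7. How the real tool treats empty or very short files is not stated here; returning `None` is only one possible choice.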

CLI Usage

czkawka similar-docs -d /path/to/docs --similarity-threshold 0.7
czkawka similar-docs -d /path --num-hashes 256 --shingle-size 5

Files Changed

  • New module: czkawka_core/src/tools/similar_documents/ (mod.rs, core.rs, traits.rs)
  • Added SimilarDocuments to ToolType enum and CurrentStage progress stages
  • Added similar-docs CLI subcommand with parameters

Test plan

  • 5 unit tests for MinHash correctness (identical, similar, different, empty, short texts)
  • cargo check succeeds for both czkawka_core and czkawka_cli
  • All 201 tests pass
  • Manual test with real documents

Closes #1865

🤖 Generated with Claude Code

aecs4u and others added 30 commits March 23, 2026 10:21
This workflow file sets up CodeQL analysis for multiple languages on push and pull request events, as well as on a schedule.
Added a security policy document outlining supported versions and vulnerability reporting.
Add a new PySide6/Qt 6 GUI frontend (czkawka_pyside6) with feature
parity with the Krokiet (Slint) interface. Uses czkawka_cli as its
backend via subprocess with JSON output for results and --json-progress
for real-time progress data.

PySide6 frontend features:
- All 14 scanning tools with per-tool settings
- Two-bar progress (current stage + overall) with entry/byte counts
- Dark theme with Krokiet SVG icons
- Grouped results, selection modes, file actions
- Image preview, directory management, settings persistence
- Auto-detection of czkawka_cli binary

CLI --json-progress flag:
- Outputs ProgressData as JSON lines to stderr
- Added Serialize to ProgressData, CurrentStage, ToolType
- Added connect_progress_json() handler
- Added serde_json dependency

Closes qarmin#1847

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove forced dark palette and hardcoded color stylesheets; inherit
  the system theme (Breeze, Adwaita, etc.) so the app looks native
- Use QIcon.fromTheme() with standard XDG/FreeDesktop icon names
  (system-search, edit-delete, document-save, etc.) with embedded SVG
  fallbacks for systems without an icon theme
- Use QStandardPaths.AppConfigLocation for XDG-compliant config paths
- Add .desktop file (com.github.qarmin.czkawka-pyside6.desktop)
- Add AppStream metainfo.xml for software center integration
- Set desktopFileName and organizationDomain for proper KDE integration
- Replace all hardcoded setStyleSheet color values with system palette
  (setEnabled(False) for muted text, QFont for bold/size, QFrame for
  separators, style().standardIcon() for dialog icons)
- Remove forced QFont size — inherit system font settings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix writeln! formatting to pass cargo fmt --check
- Show stage index in progress bar title (e.g., "[3/7] Calculating prehashes")
- All czkawka_core tests pass (5/5 progress_data tests OK)
- Krokiet compiles successfully
- All CLI subcommands verified working

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results table improvements:
- Columns are resizable (drag header edges) with sensible defaults
- Click column header to sort (ascending/descending toggle)
- Sorting works within groups for grouped tools (duplicates, etc.)
- Numeric columns (Size, Date) sort by actual values, not strings
- Sort indicator arrow shown in header

Group header fix:
- Group headers now span across all columns (merged cell effect)
- setFirstColumnSpanned called after adding item to tree

Load results:
- New "Load" button in action bar to load previously saved JSON results
- Supports both PySide6 save format and raw czkawka_cli JSON output
- Save format now preserves group structure, checked state, and group IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Keyboard shortcuts: Ctrl+S (scan), Escape (stop), Ctrl+A (select all),
  Ctrl+D (delete), Ctrl+M (move), Ctrl+Shift+S (save), Ctrl+L (load),
  Ctrl+I (invert selection), F5 (scan)
- Drag-and-drop directories onto the bottom panel to add include paths
- Right-click column header to toggle column visibility
- Filter/search bar above results to narrow by filename or path
- CSV export support in save dialog
- Task-specific save filenames (e.g., czkawka_duplicate_files_20260323.txt)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- System tray icon with context menu (Show/Hide, Start Scan, Quit)
- Click tray icon to toggle window visibility
- Notification balloon when scan completes while window is hidden
- Uses window icon from XDG theme or project logo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ScanHistory class persists up to 100 scan records as JSON
- Records: timestamp, tool, directories, entries/groups found, duration
- Uses QStandardPaths for XDG-compliant storage location
- Integrated with scan completion in main window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ScanQueue class using deque for FIFO scan scheduling
- Queue multiple tool types and run them sequentially
- Signals: queue_updated, queue_finished, next_scan
- Integrated with scan completion to auto-trigger next queued scan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DiffDialog with side-by-side file panels showing name, path, size, date
- Image preview for supported formats (jpg, png, gif, etc.)
- Difference summary highlighting size/date/directory differences
- Accessible via right-click "Compare Selected" when 2 items are selected

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test coverage across 5 test files (98 tests):
- test_models.py (16 tests): ActiveTab completeness, ResultEntry,
  ScanProgress, ToolSettings, AppSettings, enum values
- test_backend.py (25 tests): CLI command building for all 14 tools,
  common flags, JSON result parsing (flat/grouped/empty/missing),
  format_size and format_date utilities
- test_widgets.py (24 tests): ResultsView (tab switching, grouped/flat
  results, select all/none/invert/biggest/newest, sorting, filtering,
  clear), LeftPanel, ActionButtons (scanning state, tab visibility),
  ProgressWidget (start/stop, progress updates, formatters)
- test_new_features.py (20 tests): ScanHistory (CRUD, max records,
  persistence), ScanQueue (add/dedup/sequential/stop/signals),
  SaveLoad JSON roundtrip, CLI format parsing, DiffDialog
- test_main_window.py (13 tests): Integration tests for window creation,
  all 14 tabs, state, shortcuts, tray, history, queue, drag-drop,
  filter, icon, button visibility, results set/clear

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `czkawka_mcp` workspace member that implements a Model Context
Protocol (MCP) server over stdio, allowing AI agents (Claude Code,
Claude Desktop, etc.) to invoke all 14 czkawka analysis tools
programmatically with JSON parameters and structured JSON results.

All tools are read-only by default (dry_run=true, no deletions).
Uses rmcp crate and links czkawka_core directly (no subprocess).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- PySide6 README: document keyboard shortcuts, drag-drop, column
  visibility, filter bar, CSV export, system tray, scan history,
  scan queue, diff view, window geometry persistence, KDE6 compliance
- Add test suite documentation (98 tests across 5 test files)
- Update architecture diagram with new files (system_tray, scan_history,
  scan_queue, diff_dialog, tests/)
- Root README: note KDE6 compliance for PySide6 frontend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each included directory now has a "Ref" checkbox in the bottom panel.
When checked, the directory is marked as a reference — files in
reference directories are kept and never selected for deletion.

This matches the Krokiet (Slint) UI's reference path feature.

Changes:
- AppSettings: added reference_paths set, persisted in settings JSON
- BottomPanel: replaced QListWidget with QTreeWidget showing [Ref][Path]
  columns; Ref checkbox toggles reference_paths membership
- Backend: passes -r flag with reference directories for tools that
  support it (duplicates, similar images/videos/music)
- Non-grouped tools (empty files, broken files, etc.) ignore -r

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QA fixes across PySide6 frontend and CLI:
- Fix mousePressEvent in left_panel.py (use eventFilter instead of lambda)
- Fix deprecated menu.exec_() -> menu.exec()
- Fix stdout deadlock: PIPE -> DEVNULL for unused stdout
- Fix final progress lines silently discarded after process exit
- Fix JSON injection in progress.rs stage_name serialization
- Fix temp file leak with try/finally cleanup
- Remove .svg from preview (QPixmap can't load SVG)
- Remove hardcoded dev path from icons.py
- Add settings_changed signal to video/music sliders
- Add error handling for subprocess open-file/folder calls
- Add JSON parse error logging in backend
- Remove unused QMessageBox import
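
The "JSON injection" fix above is not described in detail. As a hedged illustration of the general problem: when a JSON line is assembled with string formatting, any quote or backslash in a text field must be escaped, otherwise the field can break out of its string literal. The sketch below uses only the standard library; `escape_json` and `progress_line` are hypothetical names, the field names are invented for illustration, and in practice the `serde_json` dependency mentioned earlier would normally handle this.

```rust
/// Escape a string so it can be embedded inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be written as \u00XX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one progress record as a single JSON line.
/// Assumption: the field names are illustrative, not the real ProgressData layout.
fn progress_line(stage_name: &str, current: usize, total: usize) -> String {
    format!(
        "{{\"stage_name\":\"{}\",\"current\":{},\"total\":{}}}",
        escape_json(stage_name),
        current,
        total
    )
}
```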

New features:
- Dry-run checkbox in Delete and Move dialogs
- Selection size display (selected/total with human-readable sizes)
- Accurate file count: replace stale cached estimate with live
  background os.walk counter during collection phase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collapse nested if-let into a single condition using let-chains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply stable rustfmt formatting and suppress clippy::unnecessary_wraps
on ok_result/err_result helpers (Result return type required by #[tool] macro).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EmanueleCannizzaro and others added 28 commits March 24, 2026 11:24
Phase 1: Enable LTO and codegen-units=1 in release profile for 5-15%
runtime improvement on CPU-bound operations.

Phase 2: Replace BTreeMap<String, T> with HashMap<String, T> across
cache internals, duplicate hashing intermediates, and all tool modules.
These maps are used only for lookups (O(1) vs O(log n)) and ordering
is never needed. Also removes unnecessary String-allocating sort in
dir_traversal (sort by PathBuf directly), and optimizes
diff_loaded_and_prechecked_files with linear scan for small groups
and HashMap for larger ones (resolves TODO).

Phase 3: Move sequential filter into parallel iterator in
similar_images (resolves TODO). Switch CLI progress channel from
unbounded to bounded(256) to prevent theoretical unbounded growth.
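
A bounded channel caps how many pending messages can pile up: once the buffer is full, the sender waits for the receiver to catch up. The commit does not name the channel library, so the sketch below uses the standard library's `sync_channel` as a stand-in, and `run_with_bounded_channel` is a hypothetical helper.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send `n` progress updates through a channel that buffers at most
/// `capacity` items, and return how many the receiver observed.
fn run_with_bounded_channel(n: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(capacity);
    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks while the buffer already holds `capacity` items.
            tx.send(i).expect("receiver should still be alive");
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });
    let received = rx.iter().count();
    producer.join().expect("producer thread panicked");
    received
}
```

The trade-off is that a slow consumer now slows the producer down instead of letting memory grow.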

Adds deterministic sort in same_music fingerprint comparison to
ensure consistent grouping regardless of HashMap iteration order.
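
Because `HashMap` iteration order is unspecified, results built from it need an explicit sort to stay reproducible. A minimal sketch of that pattern, with a hypothetical `group_deterministically` helper and plain string keys standing in for the real fingerprint data:

```rust
use std::collections::HashMap;

/// Group paths by key with a HashMap, then sort so the output does not
/// depend on the map's iteration order. Groups with one member are dropped.
fn group_deterministically(files: &[(&str, &str)]) -> Vec<Vec<String>> {
    let mut by_key: HashMap<&str, Vec<String>> = HashMap::new();
    for &(key, path) in files {
        by_key.entry(key).or_default().push(path.to_string());
    }
    let mut groups: Vec<Vec<String>> = by_key
        .into_values()
        .filter(|group| group.len() > 1)
        .collect();
    for group in &mut groups {
        group.sort(); // stable order inside each group
    }
    groups.sort(); // stable order across groups
    groups
}
```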

All 303 tests pass. Clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	czkawka_cli/src/progress.rs
#	czkawka_pyside6/app/backend.py
#	czkawka_pyside6/app/dialogs/delete_dialog.py
#	czkawka_pyside6/app/dialogs/move_dialog.py
#	czkawka_pyside6/app/icons.py
#	czkawka_pyside6/app/left_panel.py
#	czkawka_pyside6/app/main_window.py
#	czkawka_pyside6/app/preview_panel.py
#	czkawka_pyside6/app/progress_widget.py
#	czkawka_pyside6/app/results_view.py
#	czkawka_pyside6/app/tool_settings.py
# Conflicts:
#	czkawka_cli/src/main.rs
#	czkawka_cli/src/progress.rs
- Apply rustfmt to prehash_save_cache_at_exit signature
- Sort entries within each same_music duplicate group by path to
  ensure deterministic results regardless of HashMap iteration order,
  fixing regression test failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a new FUZZY_NAME search method that uses Jaro-Winkler similarity
to find files with similar (but not identical) names, e.g.
report_final.pdf vs report_final_v2.pdf.

- Add FuzzyName variant to CheckingMethod enum in czkawka_core
- Implement check_files_fuzzy_name() with union-find grouping,
  optimized by pre-grouping files by extension
- Add strsim crate dependency for Jaro-Winkler similarity
- Add --name-similarity-threshold CLI flag (0.0-1.0, default 0.85)
- Update all frontends: krokiet, kalka, czkawka_gui (wildcard-safe)
- Add threshold slider to kalka duplicate settings panel
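
To make the matching step concrete, here is a hedged sketch of textbook Jaro-Winkler similarity combined with pre-grouping by extension. It is not the PR's implementation: the commit uses the strsim crate, whose details may differ, and `similar_name_pairs` is a hypothetical helper that compares whole file names. The union-find grouping that would follow is omitted.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Jaro similarity in [0, 1], following the textbook definition.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut b_used = vec![false; b.len()];
    let mut a_matched = Vec::new();
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_used[j] && b[j] == ca {
                b_used[j] = true;
                a_matched.push(ca);
                break;
            }
        }
    }
    if a_matched.is_empty() {
        return 0.0;
    }
    let b_matched: Vec<char> = (0..b.len()).filter(|&j| b_used[j]).map(|j| b[j]).collect();
    // Matched characters that appear in a different order count as half-transpositions.
    let transpositions = a_matched.iter().zip(&b_matched).filter(|(x, y)| x != y).count() / 2;
    let m = a_matched.len() as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - transpositions as f64) / m) / 3.0
}

/// Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 characters.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    if j <= 0.7 {
        return j; // Winkler's boost threshold
    }
    let prefix = a.chars().zip(b.chars()).take(4).take_while(|(x, y)| x == y).count();
    j + 0.1 * prefix as f64 * (1.0 - j)
}

/// Bucket names by extension, then compare only names within the same bucket.
fn similar_name_pairs<'a>(names: &[&'a str], threshold: f64) -> Vec<(&'a str, &'a str)> {
    let mut by_ext: HashMap<String, Vec<&'a str>> = HashMap::new();
    for &name in names {
        let ext = Path::new(name)
            .extension()
            .and_then(|e| e.to_str())
            .unwrap_or("")
            .to_lowercase();
        by_ext.entry(ext).or_default().push(name);
    }
    let mut pairs = Vec::new();
    for bucket in by_ext.values() {
        for (i, &x) in bucket.iter().enumerate() {
            for &y in &bucket[i + 1..] {
                // Identical names are excluded: the method targets similar, not equal, names.
                if x != y && jaro_winkler(x, y) >= threshold {
                    pairs.push((x, y));
                }
            }
        }
    }
    pairs.sort();
    pairs
}
```

Bucketing by extension keeps the pairwise comparison from running over every pair of files in the scan, which is the optimization the commit mentions.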

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When two items are selected in the results tree (Ctrl+click), the
preview panel switches to comparison mode showing both files side by
side in a resizable splitter. Single selection reverts to the normal
single-image preview.

- Refactor PreviewPanel into stacked single/_ImageSlot and comparison modes
- Add current_items_changed signal to ResultsView (fires on tree selection)
- Wire up in main_window to auto-switch between single/comparison preview

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend kalka's save dialog with proper multi-format export:
- CSV export with column headers from TAB_COLUMNS, group labels,
  and a convenience "Full Path" column
- JSON export (existing, refactored to static method)
- Text export (existing, refactored to static method)

The file type is auto-detected from the chosen extension/filter.
The active_tab parameter (optional) provides column ordering for CSV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend kalka's preview panel beyond images to support:
- Text files: plain text preview with syntax-aware extensions
  (.py, .rs, .json, .xml, .yaml, etc.), truncated at 64KB
- Video files: thumbnail extraction via ffmpeg (frame at the 3 s mark),
  with graceful fallback if ffmpeg is not installed
- PDF files: first-page render via PySide6.QtPdf (QPdfDocument),
  with graceful fallback if QtPdf module is unavailable

The panel auto-detects file type by extension and switches between
image mode (QPixmap) and text mode (QPlainTextEdit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redesign the select dialog with two sections:
- Quick selection: Select All / Unselect All / Invert (direct apply)
- Smart selection: checkboxes for Biggest/Smallest/Newest/Oldest/
  Shortest/Longest path criteria that can be combined with AND/OR
  logic for more powerful file selection within duplicate groups

Emits custom_criteria_selected(modes, combinator) signal when
multiple criteria are combined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	kalka/app/preview_panel.py
Enable dragging folders from the system file manager onto the
included/excluded directory lists in the bottom panel.

- Add _DroppableListWidget subclass with drag/drop support
- Filter drops to only accept local directories
- Deduplicate against existing entries
- Visual drag feedback via Qt's built-in drop indicators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a "Low priority scanning" checkbox in settings that prepends
nice -n 19 and ionice -c 3 to the czkawka_cli process on Linux,
preventing scans from slowing down other applications.

- Add low_priority_scan setting to AppSettings
- Prepend nice/ionice in backend.py when spawning CLI process
- Add checkbox with tooltip in settings panel
- Only activates on Linux (checks sys.platform and shutil.which)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… scan-queue, scan-history, system-tray-geometry, ux-improvements

# Conflicts:
#	README.md
#	czkawka_cli/src/progress.rs
#	czkawka_mcp/src/main.rs
#	czkawka_pyside6/README.md
#	czkawka_pyside6/app/backend.py
#	czkawka_pyside6/app/bottom_panel.py
#	czkawka_pyside6/app/dialogs/__init__.py
#	czkawka_pyside6/app/dialogs/delete_dialog.py
#	czkawka_pyside6/app/dialogs/move_dialog.py
#	czkawka_pyside6/app/dialogs/save_dialog.py
#	czkawka_pyside6/app/icons.py
#	czkawka_pyside6/app/left_panel.py
#	czkawka_pyside6/app/main_window.py
#	czkawka_pyside6/app/models.py
#	czkawka_pyside6/app/preview_panel.py
#	czkawka_pyside6/app/progress_widget.py
#	czkawka_pyside6/app/results_view.py
#	czkawka_pyside6/app/state.py
#	czkawka_pyside6/app/tool_settings.py
- Rename PySide6 frontend from czkawka_pyside6 to kalka with i18n support (20 languages)
- Optimize duplicate finder: use HashMap over BTreeMap, dynamic rayon parallelism
- Optimize similar images: reduce cloning with swap_remove, dynamic chunk sizes
- Optimize dir traversal: scale max_len with available CPU cores

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t codes

- similar_images: check both reference_directories and reference_files
  in is_in_reference_folder(), preventing reference files from leaking
  into normal candidate results

- invalid_symlinks: report broken symlink chains (A -> B -> missing)
  instead of silently dropping them by returning None

- ExcludedItems/Directories: clear existing expressions and device IDs
  before repopulating, preventing stale config across reconfigured scans

- CLI: track file write errors in CliOutput and exit with code 1 on
  persistence failure; log non-UTF-8 output paths instead of silently
  ignoring them

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmin#1860)

Add similarity_score field to MusicEntry (f64, lower = more similar)
that captures the best fingerprint match score from
rusty_chromaprint::match_fingerprints(). Previously, this score was
only used for filtering and discarded.

- Images already expose `difference` in ImagesEntry (no change needed)
- Music now exposes `similarity_score` in JSON output
- Videos use binary matching (no score available from vid_dup_finder_lib)
- Field uses #[serde(default)] for backward-compatible deserialization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Save and load entire scan configurations (active tab, app settings,
tool settings) as named profiles stored in {config_dir}/profiles/.

- Add save_profile(), load_profile(), list_profiles(), delete_profile()
  methods to AppState
- Profiles are JSON files that capture all settings including enum values
- Enum values are serialized as strings and restored on load
- Backward-compatible: new settings missing from old profiles use defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…min#1859)

When enabled, files within the same included directory are not compared
against each other — only files across different included directories
are considered duplicates. Useful for comparing a "known good" archive
against a messy folder without finding duplicates within the archive.

- Add no_self_compare field to DuplicateFinderParameters with builder
- Add included_directory_index() helper to Directories
- Add filter_same_directory_groups() post-grouping filter in core.rs
  that filters all result types (name, size, size+name, hash, fuzzy)
- Add --no-self-compare CLI flag to czkawka_cli duplicates subcommand
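
One way to read the post-grouping filter described above: a result group survives only if its files come from at least two different included directories. The sketch below reuses the helper names from the commit message, but the bodies are assumptions written for illustration, and nested included directories are not handled specially.

```rust
use std::path::{Path, PathBuf};

/// Index of the first included directory that contains `file`, if any.
fn included_directory_index(included: &[PathBuf], file: &Path) -> Option<usize> {
    included.iter().position(|dir| file.starts_with(dir))
}

/// Drop groups whose files all live under the same included directory.
fn filter_same_directory_groups(
    groups: Vec<Vec<PathBuf>>,
    included: &[PathBuf],
) -> Vec<Vec<PathBuf>> {
    groups
        .into_iter()
        .filter(|group| {
            let mut indices = group.iter().map(|f| included_directory_index(included, f));
            match indices.next() {
                // Keep the group only if some file maps to a different directory.
                Some(first) => indices.any(|idx| idx != first),
                None => false,
            }
        })
        .collect()
}
```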

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The kalka frontend replaces czkawka_pyside6 with i18n support,
integrated side-by-side comparison, scan profiles, and other
improvements. Experimental features (scan history, scan queue,
system tray) were intentionally dropped during the migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in#1864)

Add --fuzzy-tag-comparison and --tag-similarity-threshold flags to the
same-music tool. When enabled, music tags are compared using Jaro-Winkler
string similarity with union-find clustering instead of exact hash grouping.

- Add normalize_tag() that strips "The "/"A "/"An " prefixes, lowercases,
  and removes non-alphanumeric characters
- Modify check_music_item() to use pairwise fuzzy comparison when enabled
- Add fuzzy_tag_comparison and tag_similarity_threshold to SameMusicParameters
- Default threshold: 0.85 (catches "Beatles" vs "The Beatles" etc.)
- Uses strsim crate (already in Cargo.toml) for Jaro-Winkler distance
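
The normalization described above might look roughly like this. It is a sketch based only on the commit message: the real `normalize_tag` may order the steps differently, and this version lowercases first so the article prefixes can be matched in one case.

```rust
/// Normalize a music tag for fuzzy comparison: lowercase, drop a leading
/// article ("the ", "a ", "an "), and keep only alphanumeric characters.
fn normalize_tag(tag: &str) -> String {
    let lower = tag.trim().to_lowercase();
    let stripped = ["the ", "a ", "an "]
        .iter()
        .find_map(|prefix| lower.strip_prefix(*prefix))
        .unwrap_or(lower.as_str());
    stripped.chars().filter(|c| c.is_alphanumeric()).collect()
}
```

After this step "Beatles" and "The Beatles" normalize to the same string, so the Jaro-Winkler threshold mainly has to absorb spelling variations rather than articles and punctuation.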

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

New `similar-docs` subcommand that finds text/document files with
similar content using MinHash word-level shingling. This catches
documents with rearranged paragraphs, minor edits, or near-duplicates.

Core implementation (czkawka_core):
- New `similar_documents` tool module with DocumentEntry, traits, core
- MinHash signature computation using xxh3 hashing with configurable
  seed permutations (default: 128 hash functions)
- Word-level shingling (default: 3-word shingles) for content similarity
- Jaccard similarity estimation from MinHash signatures
- Union-find clustering for grouping similar documents
- Supports 30+ text/source file extensions (txt, md, py, rs, etc.)
- Reads first 256KB per file to bound memory/time
- Parallel signature computation via rayon
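
The clustering step in the list above can be sketched with a small disjoint-set structure. This is an illustration under assumptions, not the module's code: `UnionFind` and `cluster` are hypothetical names, and the similarity function is passed in as a closure so the sketch stays independent of how signatures are computed.

```rust
use std::collections::BTreeMap;

/// Minimal disjoint-set (union-find) over indices 0..n.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the representative of `x`, shortening the path as we go.
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent; // path halving
            x = grandparent;
        }
        x
    }

    fn union(&mut self, a: usize, b: usize) {
        let (root_a, root_b) = (self.find(a), self.find(b));
        if root_a != root_b {
            self.parent[root_b] = root_a;
        }
    }
}

/// Merge every pair whose similarity reaches the threshold, then return
/// the resulting groups that contain more than one document index.
fn cluster(n: usize, threshold: f64, similarity: impl Fn(usize, usize) -> f64) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(n);
    for i in 0..n {
        for j in (i + 1)..n {
            if similarity(i, j) >= threshold {
                uf.union(i, j);
            }
        }
    }
    // BTreeMap keeps the group order deterministic.
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..n {
        groups.entry(uf.find(i)).or_default().push(i);
    }
    groups.into_values().filter(|g| g.len() > 1).collect()
}
```

Note that union-find groups transitively: if A matches B and B matches C, all three land in one group even when A and C fall below the threshold on their own.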

CLI integration (czkawka_cli):
- Add `similar-docs` subcommand with --similarity-threshold (0.0-1.0),
  --num-hashes, --shingle-size parameters
- Progress reporting with SimilarDocumentsHashing/Comparing stages
- Full JSON output support

5 unit tests for MinHash correctness (identical, similar, different,
empty, short text cases).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qarmin
Owner

qarmin commented Mar 25, 2026

Again 17k commit...

@qarmin qarmin closed this Mar 25, 2026

Development

Successfully merging this pull request may close these issues.

[czkawka_core] Similar document/archive content detection

3 participants