Similar Images Cleanup for Desktop/Web #8511

korjavin · 2025-12-27T10:42:28Z

Description

Adds "Similar Images" feature to desktop/web, matching existing mobile functionality. Users can find and clean up visually similar photos to free up storage space.

What's New

New menu item: Help → Similar Images (also accessible from sidebar)
HNSW-based similarity search: Uses hnswlib-wasm for efficient vector search on large libraries
Persistent index caching: IDBFS-backed persistence for 120x faster subsequent loads
Category-based filtering: Three tabs (Close/Similar/Related) matching mobile UX
Smart grouping: Automatically groups visually similar images
Photos only: Excludes videos from analysis (only compares images and live photos)
Safe deletion: Preserves files with captions/edits, creates symlinks where needed
Result caching: IndexedDB cache for instant reload
File size display: Shows file size below each thumbnail for better decision making

Performance Characteristics

First Load (~7 minutes for 130k images):

Load embeddings from IndexedDB
Build HNSW index (batched for UI responsiveness)
Save index to IDBFS (Emscripten virtual filesystem → IndexedDB)
Search for similar images
Cache results

Subsequent Loads (~2-5 seconds for 130k images):

Load index from IDBFS (120x faster than rebuilding)
Search for similar images using cached index
Return cached results if file set unchanged

Cache Invalidation: Smart hash-based detection - index rebuilds automatically when:

Files are added/removed from library
Embeddings are reprocessed
User explicitly clears cache (future feature)

Implementation Details

Performance: Uses HNSW (Hierarchical Navigable Small World) approximate nearest neighbor search for efficient similarity detection. Handles libraries from small to 100k+ images with O(n log n) complexity.

Library Choice: Selected hnswlib-wasm after evaluating several options:

usearch - Node.js only, not browser-compatible
client-vector-search - No HNSW support yet
hnswlib-wasm ✅ - Browser-ready, WebAssembly-based, same algorithm family as mobile (USearch), supports IDBFS persistence

Index Persistence: Leverages Emscripten's IDBFS (IndexedDB File System) to persist binary HNSW index data:

First load: Build index + save to virtual filesystem + sync to IndexedDB (~6 min)
Subsequent loads: Sync from IndexedDB + load index (~3 sec)
Metadata stored separately in ML DB for cache validation (file ID hashes, label mappings)
Automatic invalidation when file set changes

Dynamic Sizing: HNSW index automatically sizes itself based on library size (rounds up to nearest 10k), handling libraries from small to 100k+ images.

Architecture: Follows existing patterns from dedup.ts for deletion logic (trash handling, symlink creation, file preservation). Uses reducer pattern for UI state management.

Progress Reporting: Batched vector conversion with setTimeout(0) to keep UI responsive during index building. Progress callbacks report incremental updates every 1% during search operations.

Console Output

Detailed progress logging throughout analysis:

First Load (building index):

[Similar Images] Loaded 126171 CLIP embeddings
[Similar Images] Found 126171 eligible files with embeddings
[Similar Images] Creating HNSW index for 126171 vectors...
[HNSW] Creating new index with capacity: 130000
[HNSW] Adding 126171 vectors to index...
[HNSW] Mapping 126171 labels to file IDs...
[Similar Images] Successfully added 126171 vectors
[HNSW] Saving index to virtual filesystem: clip_hnsw.bin
[HNSW] Index saved to IDBFS
[Similar Images] Searching for similar images...
[HNSW] Searched 12617/126171 vectors (10%)
[HNSW] Searched 25234/126171 vectors (20%)
...
[Similar Images] Created 1234 groups using HNSW

Subsequent Loads (loading cached index):

[Similar Images] Loaded 126171 CLIP embeddings
[Similar Images] Found valid cached index (126171 vectors)
[HNSW] Loading index from IDBFS: clip_hnsw.bin
[HNSW] Index loaded successfully (126171 vectors)
[Similar Images] Searching for similar images...
[HNSW] Searched 12617/126171 vectors (10%)
...
[Similar Images] Created 1234 groups using HNSW

UI Features

Smooth progress bar: Updates throughout analysis (0-100%) with detailed phases:
- Vector loading and preparation (0-58%)
- Index building/loading (58-65%)
- Similarity search with incremental updates (65-80%)
- Grouping and finalization (80-100%)
Category tabs: Three-tab filter (Close/Similar/Related) matching mobile app
- Close: Distance ≤ 0.1% (very similar)
- Similar: 0.1% < Distance ≤ 2% (moderately similar)
- Related: Distance > 2% (somewhat similar)
- Instant switching - no re-analysis needed
Virtualized list: Handles thousands of results efficiently (react-window)
File size display: Shows size below each thumbnail for informed decisions
Flexible selection: Select/deselect individual items or entire groups
Preview mode: Review selections before deletion
One-click cleanup: Safe deletion with trash and symlink handling

Testing

41 unit tests covering core similarity logic
Extensively tested with personal library (120k+ photos)
All TypeScript compilation checks pass

Development Notes

About this PR: This code was developed primarily with an AI agent as the tech stack (TypeScript/React/Electron/WebAssembly) is outside my usual expertise. However, I've thoroughly tested the implementation on my personal library (15k+ photos) to ensure it works correctly.

I'm eager to have this feature merged as I've been missing it in the desktop app. Feedback and suggestions are very welcome!

Files Changed

New Files:

web/packages/new/photos/services/similar-images.ts - Core service with HNSW persistence
web/packages/new/photos/services/similar-images-types.ts - Type definitions including cache metadata
web/packages/new/photos/services/similar-images-delete.ts - Deletion logic
web/packages/new/photos/services/ml/hnsw.ts - HNSW wrapper with saveIndex/loadIndex methods
web/packages/new/photos/pages/similar-images.tsx - UI page
web/packages/new/photos/services/__tests__/similar-images.test.ts - Unit tests

Modified Files:

web/packages/new/photos/services/ml/db.ts - Schema v2→v3, added hnsw-index-metadata store, hash helpers
web/apps/photos/src/components/Sidebar.tsx - Added navigation item
web/packages/base/locales/en-US/translation.json - Added 19 translation keys
desktop/src/main/menu.ts - Added Help menu item
web/packages/new/package.json - Added hnswlib-wasm dependency

Technical Notes

IndexedDB Schema Migration: ML database upgraded from v1 to v3:

New object store: hnsw-index-metadata (stores cache validation data)
Stores file ID hashes for invalidation, label mappings for reconstruction
Separate from IDBFS data (binary index file stored in Emscripten virtual filesystem)

IDBFS Integration: Uses Emscripten's IDBFS to persist WASM-generated binary data:

syncFileSystem('write') - Flush virtual FS to IndexedDB after index build
syncFileSystem('read') - Hydrate virtual FS from IndexedDB before index load
Binary index file (~50-100MB for 130k vectors) stored efficiently in IndexedDB

Cache Invalidation Logic:

Generate hash from sorted file IDs
Compare with cached metadata hash
If match → load index from IDBFS
If mismatch → rebuild index + save new metadata

Future Enhancements

Incremental index updates for new files (avoid full rebuild)
Lazy loading with background refresh (show stale results immediately)
Manual cache invalidation button in settings
Staleness indicators in UI when showing cached results

Tests

CLAassistant · 2025-12-27T10:42:35Z

All committers have signed the CLA.

socket-security · 2025-12-27T10:43:09Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance
	npm/hugeicons-react@0.3.0
	npm/similarity-transformation@0.0.1
	npm/@tsconfig/node22@22.0.2
	npm/react-leaflet@5.0.0
	npm/typescript-eslint@8.35.1
	npm/zxcvbn@4.4.2
	npm/sanitize-filename@1.6.3
	npm/@types/auto-launch@5.0.5
	npm/prettier-plugin-packagejson@2.5.0 ⏵ 2.5.17				⁺²
	npm/hnswlib-wasm@0.8.2
	npm/vitest@3.2.4
	npm/leaflet-defaulticon-compatibility@0.1.2
	npm/i18next-resources-to-backend@1.2.1
	npm/heic-convert@2.1.0
	npm/react-top-loading-bar@3.0.2
	npm/wasm-pack@0.13.1
	npm/memoize-one@6.0.0
	npm/jszip@3.10.1
	npm/jssha@3.3.1
	npm/nanoid@3.3.7 ⏵ 5.1.6		⁺²	^-1
	npm/photoswipe@5.4.4
	npm/localforage@1.10.0
	npm/vite-plugin-wasm@3.5.0
	npm/get-user-locale@3.0.0
	npm/prettier-plugin-organize-imports@3.2.4 ⏵ 4.1.0
	npm/idb@8.0.3
	npm/next@15.5.9
	npm/exifreader@4.32.0
	npm/eslint-plugin-react@7.34.2 ⏵ 7.37.5	⁺²
	npm/react-otp-input@3.1.1
	npm/react@18.3.1 ⏵ 19.2.0
	npm/uuid@13.0.0
	npm/formik@2.4.6
See 23 more rows in the dashboard

View full report

socket-security · 2025-12-27T10:43:10Z

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action	Severity	Alert (click "▶" to expand/collapse)
Warn		HTTP dependency: npm `@electron/rebuild` depends on https://github.com/electron/node-gyp#06b29aafb7708acef8b3669835c8a7857ebc92d2 Dependency: @electron/node-gyp @https://github.com/electron/node-gyp#06b29aafb7708acef8b3669835c8a7857ebc92d2 Location: Package overview From: `?` → `npm/@electron/[email protected]` ℹ Read more on: This package \| This alert \| What are http dependencies? Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `[email protected]`. Suggestion: Publish the HTTP URL dependency to a public or private package repository and consume it from there. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/@electron/[email protected]`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.
Warn		Obfuscated code: npm `libheif-js` is 90.0% likely obfuscated Confidence: 0.90 Location: Package overview From: `?` → `npm/[email protected]` → `npm/[email protected]` ℹ Read more on: This package \| This alert \| What is obfuscated code? Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `[email protected]`. Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/[email protected]`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.
Warn		Obfuscated code: npm `zxcvbn` is 98.0% likely obfuscated Confidence: 0.98 Location: Package overview From: web/packages/accounts/package.json → `npm/[email protected]` ℹ Read more on: This package \| This alert \| What is obfuscated code? Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `[email protected]`. Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/[email protected]`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

web/packages/new/photos/services/similar-images.ts

## Problem Previously, any file change (even deleting 1 photo) triggered a complete HNSW index rebuild, requiring 6+ minutes for large libraries (130k+ photos). This defeated the purpose of index persistence and made the feature unusable for normal workflows. ## Solution Implemented incremental index updates that detect changes and only update what's necessary: ### Key Changes 1. **Added incremental update methods to HNSWIndex class** (`hnsw.ts`): - `addVector()`: Adds single vector using `addItems([item], replaceDeleted=true)` - `removeVector()`: Soft-deletes vector using `markDelete(label)` - Both methods update internal file ID ↔ label mappings 2. **Smart cache loading logic** (`similar-images.ts`): - Detects added/removed files by comparing cached vs current file IDs - Three code paths: - No changes (hash match) → Load cache directly - Small changes (capacity sufficient) → Load + apply incremental updates - Large changes (capacity exceeded) → Full rebuild - Uses Set difference operations for O(n) change detection 3. **Robust error handling**: - If cached index load fails, clears corrupted index AND metadata - Prevents repeated load attempts on corrupted cache by clearing metadata - Graceful fallback to full rebuild when incremental update fails - Ensures system never fails - always falls back to working state - Handles file changes from any source (local deletions, sync from other devices) ### Performance Impact | Scenario | Before | After | Speedup | |----------|--------|-------|---------| | Delete 1 photo | ~6 min | ~2-5 sec | **~100x faster** | | Add 10 photos | ~6 min | ~5-10 sec | **~60x faster** | | Add 1000 photos | ~6 min | ~30-60 sec | **~8x faster** | | No changes | ~3 sec | ~3 sec | Same | ### Technical Details - **Soft Deletion**: `markDelete()` marks vectors as deleted without removing from index structure. Deleted vectors won't appear in search results. - **Label Reuse**: `addItems(items, replaceDeleted=true)` efficiently reuses deleted label slots, maintaining index efficiency. - **Capacity Check**: Validates that loaded index has sufficient capacity before attempting incremental updates. Falls back to full rebuild if needed. - **Error Recovery**: When index load fails, system automatically: 1. Clears the corrupted in-memory index (`clearCLIPHNSWIndex()`) 2. Deletes corrupted metadata from IndexedDB (`clearHNSWIndexMetadata()`) 3. Falls back to full rebuild with fresh index This prevents infinite retry loops on corrupted cache and ensures reliability. - **IDBFS Debugging**: Added debug logging and file existence checks to diagnose persistence issues. Uses `checkFileExists()` to verify files before/after operations. - **Critical Fix ente-io#1**: Don't call `initIndex()` before `readIndex()`. The init() method now accepts `skipInit` parameter to avoid creating an empty index when loading from file. - **Critical Fix ente-io#2**: Prevent concurrent IDBFS syncs. When `skipInit=true`, don't sync in `init()` - let `loadIndex()` handle it. Multiple concurrent syncs cause race conditions and corrupted filesystem state ("2 FS.syncfs operations in flight" warning). ### Console Output Example ``` [Similar Images] Found cached index (84724 vectors) [Similar Images] Loading index from IDBFS for incremental update... [Similar Images] Changes: +2390 files, -102 files [HNSW] Loading index from IDBFS: clip_hnsw.bin [HNSW] Index loaded successfully (84724 vectors) [Similar Images] Incremental update completed [HNSW] Saving updated index to IDBFS: clip_hnsw.bin [Similar Images] Updated index saved ``` ### Files Modified - `web/packages/new/photos/services/ml/hnsw.ts` (+40 lines) - `web/packages/new/photos/services/similar-images.ts` (+120 lines) ### Testing - ✅ TypeScript compilation passes - ✅ Handles capacity edge cases (insufficient capacity → rebuild) - ✅ Handles corrupted index (failed load → clear → rebuild) - ⏳ Manual testing in progress (user verification) Co-authored-by: Claude Sonnet 4.5 <[email protected]>

- Fix layout overlap between groups - Improve selection logic: deselected first image by default - Add visual feedback for selected images (darkened) - scrolls to top on tab change - Add 'Select All / Deselect All' button - Fix bottom bar button sizing alignment

anandbaburajan · 2026-01-07T09:17:44Z

Hi @korjavin. Thanks a lot for the feature! I finally got some time to look into this. Please let me know when/if it's done from your side, and clean-up the unwanted files (mobile changes, commit message, .md files, etc), and I'll try it out.

korjavin · 2026-01-07T09:38:15Z

Hi @anandbaburajan thank you.

I did the clean-up.

Initially I left those md files to simplify the review, but we have them in git history now.

This feature works for me now, I use it on my collection, but as I stated not my tech stack, I am open to feedback.

korjavin added 2 commits December 27, 2025 11:11

Similar Images Cleanup for Desktop/Web initial/naive implementation

6c4e1f2

Add caching between page opening.

e48e083

chatgpt-codex-connector bot reviewed Dec 27, 2025

View reviewed changes

web/packages/new/photos/services/similar-images.ts Show resolved Hide resolved

korjavin and others added 12 commits December 27, 2025 13:40

Fix codex suggestions

78f8c78

Similar Images Cleanup for Desktop/Web initial/naive implementation

bccd7a3

Add caching between page opening.

d6a8dd2

Fix codex suggestions

909d322

chore: add sandbox scripts for safer desktop development

a4b4185

Merge feature/desktop-similar-embeddings into main

3441f20

chore: remove similar images link from help menu

d34cfaf

chore: remove similar images from system help menu

4697968

fix: allow undefined success return from HNSW readIndex

9a63c10

Clean-up

1c283cb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Similar Images Cleanup for Desktop/Web #8511

Similar Images Cleanup for Desktop/Web #8511

Uh oh!

korjavin commented Dec 27, 2025

Uh oh!

CLAassistant commented Dec 27, 2025 •

edited

Loading

Uh oh!

socket-security bot commented Dec 27, 2025 •

edited

Loading

Uh oh!

socket-security bot commented Dec 27, 2025 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

anandbaburajan commented Jan 7, 2026

Uh oh!

korjavin commented Jan 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Similar Images Cleanup for Desktop/Web #8511

Are you sure you want to change the base?

Similar Images Cleanup for Desktop/Web #8511

Uh oh!

Conversation

korjavin commented Dec 27, 2025

Description

What's New

Performance Characteristics

Implementation Details

Console Output

UI Features

Testing

Development Notes

Files Changed

Technical Notes

Future Enhancements

Tests

Uh oh!

CLAassistant commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

socket-security bot commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

socket-security bot commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

anandbaburajan commented Jan 7, 2026

Uh oh!

korjavin commented Jan 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CLAassistant commented Dec 27, 2025 •

edited

Loading

socket-security bot commented Dec 27, 2025 •

edited

Loading

socket-security bot commented Dec 27, 2025 •

edited

Loading