Hi @pauloxnet,

Thank you for the thoughtful questions and for taking the time to understand the architecture!

1. Why ResNet152 instead of reusing Immich's CLIP embeddings?

The key difference lies in what these models capture. For duplicate detection, we need pixel-level visual similarity rather than semantic similarity. Two completely different beach photos might be very "similar" to CLIP (both are beaches), but ResNet152 can distinguish them as different images. CLIP is optimized for queries like "sunset at beach" → finds beach photos. But for finding actual duplicates or near-duplicates, visual feature extraction produces fewer false positives.
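To make this concrete, here is a minimal sketch of the general technique (illustrative only, not MediaKit's exact pipeline; the file names are placeholders):

```python
# Sketch: pixel-level similarity via ResNet152 features.
# Requires torch, torchvision, and Pillow.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet152 with the classifier head removed -> 2048-dim feature vectors
weights = models.ResNet152_Weights.DEFAULT
backbone = models.resnet152(weights=weights)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(x).squeeze(0)

# Cosine similarity close to 1.0 suggests a duplicate or near-duplicate
a, b = embed("beach_1.jpg"), embed("beach_2.jpg")
score = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
print(f"visual similarity: {score:.3f}")
```

With features like these, two unrelated beach photos score low, while a resized or re-encoded copy of the same shot scores close to 1.0.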
2. Why Qdrant instead of pgvector?

A few reasons for this design choice:

3. GPU support

MediaKit already supports GPU acceleration! Check the README under "Choose CPU or GPU Version" for Docker (NVIDIA CUDA) and source installation (macOS MPS, Windows CUDA) options; the small device-selection sketch at the end of this reply shows the usual pattern.

4. Reusing Immich's GPU environment

Integrating into Immich's container would increase architectural complexity and coupling. The standalone approach allows independent updates, easier debugging, and flexibility for different deployment scenarios.

As mentioned in the README, this project started simply as a tool to help organize my family's large photo collection. Of course, anyone interested could take a similar architecture and extend it into a desktop app or other applications - but that's a personal choice. After all, there are always many different approaches to solving the same problem.

Thanks again for the detailed questions!
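P.S. For reference, the usual PyTorch device-selection pattern behind point 3 looks like this (a generic sketch, not necessarily MediaKit's exact code):

```python
import torch

# Prefer NVIDIA CUDA, then Apple Silicon MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"running inference on: {device}")
# model.to(device) and batch.to(device) move the work onto the accelerator
```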
Hi @pauloxnet,

Thanks for the continued interest! MediaKit is simply a tool I built to solve a specific problem — managing duplicate photos in my family's library. The current architecture (Qdrant + SQLite + read-only Immich access) was chosen to keep things lightweight and simple for this purpose. Features like alternative vector backends, additional GPU support, or photo quality scoring are beyond the current scope of the project. However, since this is an open-source project, contributions are always welcome if any of these ideas interest you! Thanks for understanding, and happy organizing!
Hi!
First of all, thank you for this amazing project. I’ve just started using Immich locally and I’ve recently discovered immich mediakit. The idea behind it is brilliant, and I really appreciate the clean Python implementation and the flexibility of running it on CPU or GPUs.
While reading through the documentation and looking at how mediakit processes all assets from scratch with ResNet152 and stores vectors in Qdrant, a few questions came to my mind. These are not criticisms at all; I'm just very interested in how the architecture fits together and whether some parts of Immich could be reused to speed things up.
Here are my thoughts and questions:
From what I understand, Immich already computes embeddings for each asset during import and stores them in PostgreSQL using pgvector. Since mediakit also performs a full re-scan to compute embeddings with its own model, I was wondering:
Would it be technically possible to reuse Immich's existing pgvector embeddings instead of recomputing everything from scratch? (I sketch below what reading them back might look like.)
Is the embedding model used by Immich too different (or not expressive enough) to support the advanced similarity and duplicate detection features that mediakit provides?
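For context, here is roughly what reading those embeddings back could look like (a hypothetical sketch: the table and column names are my guesses for illustration, not Immich's documented schema):

```python
# Hypothetical sketch: pulling Immich's CLIP embeddings out of pgvector.
# "smart_search" and its columns are assumed names, not a confirmed schema.
import psycopg2

conn = psycopg2.connect("dbname=immich user=postgres host=localhost")
with conn.cursor() as cur:
    cur.execute('SELECT "assetId", embedding FROM smart_search LIMIT 5')
    for asset_id, embedding in cur.fetchall():
        # Without registering the pgvector adapter, values arrive as
        # strings like "[0.12,0.03,...]"
        print(asset_id, str(embedding)[:40], "...")
```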
If ResNet152 is required to achieve better similarity or duplicate detection:
Is this because the embeddings generated by Immich (based on its CLIP-like model) are not suitable for the type of comparisons mediakit performs?
Or is it simply a design choice to ensure consistency in the vectors used for Qdrant indexing?
Since Immich already ships with PostgreSQL + pgvector:
Would it be feasible (even theoretically) to store mediakit embeddings directly in Immich's PostgreSQL instance, avoiding the need to set up a separate Qdrant vector database? (A rough sketch follows these questions.)
Are there specific Qdrant features that are necessary for your workflow (e.g., HNSW tuning, shard management, filtering, hybrid scoring) that pgvector cannot provide?
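Just to make my question concrete, the pgvector side of such a setup would presumably look something like this (the mediakit_vectors table is invented for illustration; 2048 matches ResNet152's pooled feature size):

```python
# Hypothetical sketch: storing and querying mediakit vectors in pgvector
# instead of Qdrant. Table and column names are invented for illustration.
import psycopg2

conn = psycopg2.connect("dbname=immich user=postgres host=localhost")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS mediakit_vectors (
            asset_id text PRIMARY KEY,
            embedding vector(2048)
        )
    """)
    # <=> is pgvector's cosine-distance operator (smaller = more similar)
    query_vec = "[" + ",".join(["0.1"] * 2048) + "]"  # placeholder vector
    cur.execute(
        "SELECT asset_id, embedding <=> %s::vector AS dist "
        "FROM mediakit_vectors ORDER BY dist LIMIT 10",
        (query_vec,),
    )
    for asset_id, dist in cur.fetchall():
        print(asset_id, dist)
conn.commit()
```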
Immich has a very convenient GPU-enabled Docker setup (supporting NVIDIA, AMD, Intel, and Apple Silicon).
Would it make sense — or would it even be possible — for mediakit to reuse Immich’s existing GPU-capable environment instead of requiring a separate container or Python environment?
From what I understand, Immich uses CLIP-based models (e.g., ViT-B-32 and other CLIP/SigLIP variants) to generate semantic embeddings stored in PostgreSQL via pgvector/VectorChord, while immich-mediakit uses ResNet152 to extract more “pixel-level” visual features stored in Qdrant.
This makes me wonder whether Immich’s existing embeddings could also be reused for this purpose, or whether the architectural choice of ResNet152 + Qdrant is essential for the quality of mediakit’s advanced duplicate-finding workflow.
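For what it's worth, this is how I picture the semantic side in code (using the open_clip package purely as an illustration; I know Immich's actual inference stack is different):

```python
# Sketch: CLIP scores text <-> image *semantic* similarity. A query like
# "sunset at beach" matches any beach photo, which is great for search but
# useless for telling two different beach photos apart.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

with torch.no_grad():
    image = preprocess(Image.open("beach_1.jpg")).unsqueeze(0)
    text = tokenizer(["sunset at beach"])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    print("semantic match:", (img_feat @ txt_feat.T).item())
```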
Again, thank you for your work. immich mediakit looks extremely promising, especially for users with large photo libraries who want more advanced deduplication workflows. I’m asking these questions just out of curiosity and to better understand the reasoning behind the architecture — I’m really excited about the project and would love to follow its evolution.
Please let me know if any of the above points make sense or if I misunderstood anything.
Thanks again for your time and effort!