Fix face clustering collapse#758

Closed
ritigya03 wants to merge 3 commits into AOSSIE-Org:main from ritigya03:fix-face-clustering-collapse

@ritigya03
Contributor

@ritigya03 ritigya03 commented Dec 13, 2025

Solves issue: BUG: Incorrect clustering for a folder with many images #722

PR Title

Fix face clustering collapsing unrelated faces into a single cluster

Description

Problem

In some cases, AI face clustering grouped a large number of unrelated faces into a single cluster.
This typically occurred on larger datasets when invalid, zero, or unnormalized face embeddings were passed to DBSCAN, causing cosine distance to behave incorrectly and collapse clusters.

Root Cause

  • Face embeddings were used directly without normalization while using cosine distance.
  • Invalid or near-zero embeddings were not filtered out.
  • This led DBSCAN to treat many unrelated faces as identical, producing oversized clusters.

Solution

This PR introduces minimal, targeted fixes to the face clustering utility:

  • Skips invalid (None) and near-zero face embeddings.
  • Normalizes face embeddings before running DBSCAN with cosine distance.
  • Uses safer DBSCAN defaults to reduce accidental over-clustering.
  • Adds a warning log when a high number of duplicate embeddings is detected (debug-only, no behavior change).

These changes ensure that only meaningful embeddings participate in clustering, preventing unrelated faces from collapsing into a single cluster.
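
The first two items in the Solution list (skipping invalid embeddings and normalizing the rest) can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the helper name `prepare_embeddings` and its return shape are assumptions, and the `1e-6` threshold is borrowed from the zero-norm guard quoted later in the review.

```python
import numpy as np


def prepare_embeddings(embeddings, min_norm=1e-6):
    """Drop None / near-zero embeddings and L2-normalize the rest.

    Hypothetical helper for illustration only. Returns a tuple of
    (normalized matrix, indices of the embeddings that were kept).
    """
    kept_vectors = []
    kept_indices = []
    for index, embedding in enumerate(embeddings):
        if embedding is None:
            continue  # invalid embedding: skip it
        vector = np.asarray(embedding, dtype=np.float64)
        norm = np.linalg.norm(vector)
        if not np.isfinite(norm) or norm < min_norm:
            continue  # near-zero (or non-finite) embedding: skip it
        kept_vectors.append(vector / norm)  # unit length
        kept_indices.append(index)
    if not kept_vectors:
        # Early-return path for an empty or all-invalid input
        return np.empty((0, 0)), []
    return np.vstack(kept_vectors), kept_indices
```

The returned matrix is what would then be handed to DBSCAN with the cosine metric; the kept indices let cluster labels be mapped back to the original faces.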

Impact

  • Correct clustering behavior on large datasets.
  • No change in behavior for already-working small datasets.
  • No database schema or API changes.
  • Backward-compatible and safe for existing installations.

Testing

  • Verified on small datasets (unchanged correct behavior).
  • Verified on larger datasets where clustering previously collapsed into a single cluster; now produces multiple meaningful clusters.
  • Manual global reclustering tested successfully.

Checklist

  • No breaking changes
  • Minimal, isolated fix
  • Existing functionality preserved
  • Manual testing completed

Team Name - EtherX

  • Ritigya Gupta
  • Heeral Mandolia
  • Sirjan Singh

Summary by CodeRabbit

  • Improvements

    • Enhanced face clustering: stronger embedding validation, normalization and duplicate handling for more reliable grouping and fewer mis-clusters.
  • Bug Fixes

    • Corrected onboarding step numbering and progress calculation to use zero-based indexing consistently across UI labels and progress bars.
  • Documentation

    • Tightened API schema for image metadata and clarified an input parameter schema title in the OpenAPI docs.

@github-actions github-actions bot added the backend, bug, and medium labels Dec 13, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 13, 2025

Walkthrough

Adjusts face-embedding clustering by adding validation, normalization, duplicate detection, and tightening DBSCAN params; tightens OpenAPI metadata schema and modifies onboarding UI step indicators from 1-based to 0-based indexing.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Backend Clustering Logic: backend/app/utils/face_clusters.py | Added validation/preprocessing to filter None or near-zero embeddings and convert to numpy arrays; normalize embeddings for cosine distance; detect/report duplicate embeddings; hardened cosine distance calc against zero norms; changed DBSCAN defaults: eps 0.35, min_samples 3; added early-return paths for empty/invalid sets. |
| OpenAPI Schema: docs/backend/backend_python/openapi.json | Wrapped the input_type parameter schema in an allOf referencing #/components/schemas/InputType and added title "Input Type"; removed additionalProperties from ImageInCluster.Metadata, tightening allowed metadata fields. |
| Frontend Onboarding Step Numbering: frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx, frontend/src/components/OnboardingSteps/FolderSetupStep.tsx, frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx | Switched displayed step label and progress calculation from 1-based (stepIndex + 1) to 0-based (stepIndex), updating progress percentage and step text. |
| Frontend Onboarding Spacing: frontend/src/components/OnboardingSteps/OnboardingStep.tsx | Minor whitespace adjustments around the switch-case block; no behavioral changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review embedding validation and early-return logic in backend/app/utils/face_clusters.py for edge cases (all-invalid embeddings, mixed types).
  • Verify cosine normalization and zero-norm guards preserve intended distance semantics.
  • Confirm DBSCAN parameter changes (eps/min_samples) align with expected clustering behavior.
  • Check frontend onboarding changes for off-by-one regressions where step indices are consumed elsewhere.
  • Validate OpenAPI schema change doesn't break clients expecting additionalProperties on metadata.

Poem

🐰 Soft hops through normalized space,
I tidy vectors, give them grace.
Steps now start from zero's smile,
Schemas trimmed to sit in style.
Clusters hum — a quiet race. 🎋

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix face clustering collapse' directly corresponds to the main objective of addressing a bug where face clustering incorrectly grouped unrelated faces into a single cluster on larger datasets. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx (1)

54-68: Critical: Revert to 1-based step indexing for user display.

Same issue as in AvatarSelectionStep.tsx - the removal of + 1 breaks user-facing display. Users will see "Step 0 of 3" with 0% progress at the first step.

Apply this diff to restore correct 1-based indexing:

-  const progressPercent = Math.round(((stepIndex ) / totalSteps) * 100);
+  const progressPercent = Math.round(((stepIndex + 1) / totalSteps) * 100);
   return (
     <>
       <Card className="flex max-h-full w-1/2 flex-col border p-4">
         <CardHeader className="p-3">
           <div className="text-muted-foreground mb-1 flex justify-between text-xs">
             <span>
-              Step {stepIndex } of {totalSteps}
+              Step {stepIndex + 1} of {totalSteps}
             </span>
             <span>{progressPercent}%</span>
           </div>
frontend/src/components/OnboardingSteps/FolderSetupStep.tsx (1)

67-83: Critical: Revert to 1-based step indexing for user display.

Same issue as in the other onboarding step components - the removal of + 1 breaks the step indicator and progress bar for users.

Apply this diff to restore correct 1-based indexing:

-  const progressPercent = Math.round(((stepIndex) / totalSteps) * 100);
+  const progressPercent = Math.round(((stepIndex + 1) / totalSteps) * 100);

   return (
     <>
       <Card className="flex max-h-full w-1/2 flex-col border p-4">
         <CardHeader className="p-3">
           <div className="text-muted-foreground mb-1 flex justify-between text-xs">
             <span>
-              Step {stepIndex} of {totalSteps}
+              Step {stepIndex + 1} of {totalSteps}
             </span>
             <span>{progressPercent}%</span>
           </div>
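
The arithmetic behind the two indexing comments above can be checked with a short sketch. It is written in Python purely for illustration (the components themselves are TSX); the function name `step_display` is hypothetical, and the total of 3 steps is taken from the "Step 0 of 3" example in the review comment.

```python
import math


def step_display(step_index, total_steps, one_based=True):
    """Return (label, percent) for an onboarding step indicator.

    Hypothetical helper mirroring the label and progress expressions
    in the diffs above; not code from the repository.
    """
    shown = step_index + 1 if one_based else step_index
    # floor(x + 0.5) mirrors JavaScript's Math.round for non-negative values
    percent = math.floor(shown / total_steps * 100 + 0.5)
    return f"Step {shown} of {total_steps}", percent


# 0-based display (the PR's change): first step shows "Step 0 of 3" at 0%
print(step_display(0, 3, one_based=False))
# 1-based display (the suggested revert): last step reaches 100%
print(step_display(2, 3, one_based=True))
```

With 0-based display the indicator never reaches the final step label or 100%, which is the regression the review flags.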
🧹 Nitpick comments (2)
docs/backend/backend_python/openapi.json (1)

1120-1127: Consider simplifying the parameter schema pattern.

The change wraps a single $ref in an allOf construct and adds metadata (title, default) at the parameter level. While valid in OpenAPI 3.1.0, this pattern is unconventional. Consider whether you should instead:

  1. Enhance the InputType schema directly (at lines 2261–2267) with title and default properties, then use a plain $ref, or
  2. Keep the direct $ref pattern if the title and default are parameter-specific (for documentation clarity alone).

The current approach works but may confuse downstream code generators or API tooling that expect simpler schema references.

If stricter metadata handling is the goal, consider this alternative:

{
  "name": "input_type",
  "in": "query",
  "required": false,
  "schema": {
    "$ref": "#/components/schemas/InputType"
  },
  "description": "Choose input type: 'path' or 'base64'",
  "example": "path"
}

Then enhance the InputType schema in components/schemas to include a default:

"InputType": {
  "type": "string",
  "enum": ["path", "base64"],
  "title": "InputType",
  "default": "path"
}
backend/app/utils/face_clusters.py (1)

351-356: Remove commented-out code.

The old line is dead code and should be removed for clarity.

     # Normalize the face embedding
-    # face_norm = face_embedding / np.linalg.norm(face_embedding)
     norm = np.linalg.norm(face_embedding)
     if norm < 1e-6:
         return np.ones(len(cluster_means))  # max distance
 
     face_norm = face_embedding / norm
-
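
For context, the guarded normalization in the diff above could sit inside a complete distance function along these lines. This is a sketch under assumptions: the function name, the handling of zero-norm cluster means, and the final `1 - dot product` step are not shown in the quoted hunk and are filled in here only for illustration.

```python
import numpy as np


def cosine_distances_to_means(face_embedding, cluster_means, eps=1e-6):
    """Cosine distance from one face embedding to each cluster mean.

    Hypothetical function; only the zero-norm guard on the face
    embedding is taken from the diff above.
    """
    face_embedding = np.asarray(face_embedding, dtype=np.float64)
    cluster_means = np.asarray(cluster_means, dtype=np.float64)

    norm = np.linalg.norm(face_embedding)
    if norm < eps:
        # Same fallback value as in the diff: distance 1 to every cluster
        return np.ones(len(cluster_means))
    face_norm = face_embedding / norm

    # Assumption: guard the cluster means the same way, so a zero-norm
    # mean yields a zero vector (and therefore distance 1) instead of NaN
    mean_norms = np.linalg.norm(cluster_means, axis=1, keepdims=True)
    safe_norms = np.where(mean_norms < eps, 1.0, mean_norms)
    means_normalized = cluster_means / safe_norms

    return 1.0 - means_normalized @ face_norm
```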
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and a34cb74.

⛔ Files ignored due to path filters (1)
  • frontend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (6)
  • backend/app/utils/face_clusters.py (3 hunks)
  • docs/backend/backend_python/openapi.json (1 hunks)
  • frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx (1 hunks)
  • frontend/src/components/OnboardingSteps/FolderSetupStep.tsx (1 hunks)
  • frontend/src/components/OnboardingSteps/OnboardingStep.tsx (1 hunks)
  • frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx (1 hunks)
🔇 Additional comments (2)
backend/app/utils/face_clusters.py (2)

161-162: LGTM!

The adjusted DBSCAN parameters (eps=0.35, min_samples=3) are reasonable for reducing over-clustering with cosine distance on normalized embeddings. This aligns with the PR objective to prevent unrelated faces from collapsing into a single cluster.


210-220: LGTM!

Normalizing embeddings before DBSCAN with cosine metric is the correct approach. The duplicate detection is a useful debug heuristic that won't affect clustering behavior.
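
One way such a duplicate-detection heuristic might look is sketched below. The helper name, the rounding precision, and the idea of reporting a fraction are all assumptions made for illustration; the PR's actual check is not quoted in this thread.

```python
import numpy as np


def duplicate_fraction(embeddings, decimals=6):
    """Fraction of rows that repeat an already-seen embedding.

    Hypothetical debug helper: rows are rounded first so that tiny
    floating-point differences do not hide exact duplicates.
    """
    matrix = np.asarray(embeddings, dtype=np.float64)
    if matrix.ndim != 2 or matrix.shape[0] == 0:
        return 0.0  # nothing to compare
    rounded = np.round(matrix, decimals)
    unique_rows = np.unique(rounded, axis=0)
    return 1.0 - unique_rows.shape[0] / matrix.shape[0]
```

A caller could log a warning when this fraction exceeds some chosen threshold, leaving the clustering input itself untouched, which matches the "debug-only, no behavior change" description in the PR.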

@rahulharpal1603
Contributor

Hi, sorry. Even after the changes, the problem still persists. So I am closing this PR.

@ritigya03 ritigya03 deleted the fix-face-clustering-collapse branch December 28, 2025 11:06