Fix face clustering collapse by ritigya03 · Pull Request #758 · AOSSIE-Org/PictoPy

ritigya03 · 2025-12-13T14:34:20Z

Solves issue: BUG: Incorrect clustering for a folder with many images #722

PR Title

Fix face clustering collapsing unrelated faces into a single cluster

Description

Problem

In some cases, AI face clustering grouped a large number of unrelated faces into a single cluster.
This typically occurred on larger datasets when invalid, zero, or unnormalized face embeddings were passed to DBSCAN, causing cosine distance to behave incorrectly and collapse clusters.

Root Cause

- Face embeddings were used directly without normalization while using cosine distance.
- Invalid or near-zero embeddings were not filtered out.
- This led DBSCAN to treat many unrelated faces as identical, producing oversized clusters.
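
To illustrate why near-zero embeddings are a problem, here is a minimal sketch (not code from this PR) of an unguarded cosine distance. The function name `naive_cosine_distance` and the sample vectors are made up for illustration. How the clustering backend reacts to an undefined distance is not shown in this PR; the sketch only shows that the distance itself is undefined.

```python
import numpy as np


def naive_cosine_distance(a, b):
    """Cosine distance with no guard against zero norms (hypothetical helper)."""
    # Suppress the NumPy warning so the undefined result is visible as nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


face_a = np.array([0.6, 0.8, 0.0])  # illustrative unit-length embedding
face_b = np.array([0.0, 0.6, 0.8])  # another illustrative embedding
zero = np.zeros(3)                  # a "zero" embedding, as described above

d_ab = naive_cosine_distance(face_a, face_b)    # well defined
d_zero = naive_cosine_distance(face_a, zero)    # 0 / 0, so the result is nan
```

A distance of `nan` carries no similarity information, which is why filtering such embeddings before clustering is a reasonable first step.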

Solution

This PR introduces minimal, targeted fixes to the face clustering utility:

- Skips invalid (None) and near-zero face embeddings.
- Normalizes face embeddings before running DBSCAN with cosine distance.
- Uses safer DBSCAN defaults to reduce accidental over-clustering.
- Adds a warning log when a high number of duplicate embeddings is detected (debug-only, no behavior change).

These changes ensure that only meaningful embeddings participate in clustering, preventing unrelated faces from collapsing into a single cluster.
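
The steps listed above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in `backend/app/utils/face_clusters.py`: the helper names `prepare_embeddings` and `count_duplicates`, the `1e-6` threshold, and the rounding precision are all hypothetical.

```python
import numpy as np

NEAR_ZERO = 1e-6  # assumed threshold for "near-zero" embeddings


def prepare_embeddings(raw_embeddings, near_zero=NEAR_ZERO):
    """Drop None and near-zero embeddings, then L2-normalize the rest.

    Returns (normalized_matrix, kept_indices) so callers can map rows
    back to the original faces.
    """
    kept, rows = [], []
    for i, emb in enumerate(raw_embeddings):
        if emb is None:
            continue  # invalid embedding
        vec = np.asarray(emb, dtype=np.float64)
        norm = np.linalg.norm(vec)
        if norm < near_zero:
            continue  # near-zero embedding carries no direction
        rows.append(vec / norm)
        kept.append(i)
    if not rows:
        return np.empty((0, 0)), kept  # early return: nothing to cluster
    return np.vstack(rows), kept


def count_duplicates(normalized, decimals=6):
    """Count rows that repeat another row after rounding (debug heuristic)."""
    if len(normalized) == 0:
        return 0
    unique = np.unique(np.round(normalized, decimals), axis=0)
    return len(normalized) - len(unique)


raw = [[3.0, 4.0], None, [0.0, 0.0], [6.0, 8.0], [1.0, 0.0]]
matrix, kept = prepare_embeddings(raw)
duplicates = count_duplicates(matrix)
```

In this sketch the `None` and all-zero entries are skipped, and `[3, 4]` and `[6, 8]` normalize to the same unit vector, so they are counted as one duplicate. The resulting matrix is what would then be handed to DBSCAN with a cosine metric.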

Impact

- Correct clustering behavior on large datasets.
- No change in behavior for already-working small datasets.
- No database schema or API changes.
- Backward-compatible and safe for existing installations.

Testing

- Verified on small datasets (unchanged correct behavior).
- Verified on larger datasets where clustering previously collapsed into a single cluster; now produces multiple meaningful clusters.
- Manual global reclustering tested successfully.

Checklist

- No breaking changes
- Minimal, isolated fix
- Existing functionality preserved
- Manual testing completed

Team Name - EtherX

Ritigya Gupta
Heeral Mandolia
Sirjan Singh

Summary by CodeRabbit

Improvements
- Enhanced face clustering: stronger embedding validation, normalization and duplicate handling for more reliable grouping and fewer mis-clusters.
Bug Fixes
- Corrected onboarding step numbering and progress calculation to use zero-based indexing consistently across UI labels and progress bars.
Documentation
- Tightened API schema for image metadata and clarified an input parameter schema title in the OpenAPI docs.


coderabbitai · 2025-12-13T14:34:50Z

Walkthrough

Adjusts face-embedding clustering by adding validation, normalization, duplicate detection, and tightening DBSCAN params; tightens OpenAPI metadata schema and modifies onboarding UI step indicators from 1-based to 0-based indexing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Backend Clustering Logic `backend/app/utils/face_clusters.py` | Added validation/preprocessing to filter None or near-zero embeddings and convert to numpy arrays; normalize embeddings for cosine distance; detect/report duplicate embeddings; hardened cosine distance calc against zero norms; changed DBSCAN defaults: eps 0.35, min_samples 3; added early-return paths for empty/invalid sets. |
| OpenAPI Schema `docs/backend/backend_python/openapi.json` | Wrapped the `input_type` parameter schema in an `allOf` referencing `#/components/schemas/InputType` and added title "Input Type"; removed `additionalProperties` from `ImageInCluster.Metadata`, tightening allowed metadata fields. |
| Frontend Onboarding Step Numbering `frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx`, `frontend/src/components/OnboardingSteps/FolderSetupStep.tsx`, `frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx` | Switched displayed step label and progress calculation from 1-based (`stepIndex + 1`) to 0-based (`stepIndex`), updating progress percentage and step text. |
| Frontend Onboarding Spacing `frontend/src/components/OnboardingSteps/OnboardingStep.tsx` | Minor whitespace adjustments around the switch-case block; no behavioral changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

- Review embedding validation and early-return logic in `backend/app/utils/face_clusters.py` for edge cases (all-invalid embeddings, mixed types).
- Verify cosine normalization and zero-norm guards preserve intended distance semantics.
- Confirm DBSCAN parameter changes (eps/min_samples) align with expected clustering behavior.
- Check frontend onboarding changes for off-by-one regressions where step indices are consumed elsewhere.
- Validate that the OpenAPI schema change doesn't break clients expecting `additionalProperties` on metadata.

Poem

🐰 Soft hops through normalized space,
I tidy vectors, give them grace.
Steps now start from zero's smile,
Schemas trimmed to sit in style.
Clusters hum — a quiet race. 🎋

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Fix face clustering collapse' directly corresponds to the main objective of addressing a bug where face clustering incorrectly grouped unrelated faces into a single cluster on larger datasets. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx (1)
54-68: Critical: Revert to 1-based step indexing for user display.

Same issue as in AvatarSelectionStep.tsx - the removal of + 1 breaks user-facing display. Users will see "Step 0 of 3" with 0% progress at the first step.

Apply this diff to restore correct 1-based indexing:
-  const progressPercent = Math.round(((stepIndex ) / totalSteps) * 100);
+  const progressPercent = Math.round(((stepIndex + 1) / totalSteps) * 100);
   return (
     <>
       <Card className="flex max-h-full w-1/2 flex-col border p-4">
         <CardHeader className="p-3">
           <div className="text-muted-foreground mb-1 flex justify-between text-xs">
             <span>
-              Step {stepIndex } of {totalSteps}
+              Step {stepIndex + 1} of {totalSteps}
             </span>
             <span>{progressPercent}%</span>
           </div>

frontend/src/components/OnboardingSteps/FolderSetupStep.tsx (1)
67-83: Critical: Revert to 1-based step indexing for user display.

Same issue as in the other onboarding step components - the removal of + 1 breaks the step indicator and progress bar for users.

Apply this diff to restore correct 1-based indexing:
-  const progressPercent = Math.round(((stepIndex) / totalSteps) * 100);
+  const progressPercent = Math.round(((stepIndex + 1) / totalSteps) * 100);

   return (
     <>
       <Card className="flex max-h-full w-1/2 flex-col border p-4">
         <CardHeader className="p-3">
           <div className="text-muted-foreground mb-1 flex justify-between text-xs">
             <span>
-              Step {stepIndex} of {totalSteps}
+              Step {stepIndex + 1} of {totalSteps}
             </span>
             <span>{progressPercent}%</span>
           </div>
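
The arithmetic behind the two onboarding comments above can be checked with a small sketch. It is written in Python purely for illustration (the component itself is TypeScript), `progress_percent` is a hypothetical helper, and the total of 3 steps is an assumption taken from the "Step 0 of 3" example in the review.

```python
import math


def progress_percent(step_index, total_steps, one_based=True):
    """Mirror Math.round(((stepIndex + 1) / totalSteps) * 100) from the diff."""
    shown = step_index + 1 if one_based else step_index
    # JavaScript's Math.round rounds halves up; floor(x + 0.5) matches that
    # for the non-negative values used here.
    return math.floor(shown / total_steps * 100 + 0.5)


total = 3  # assumed number of onboarding steps
one_based = [progress_percent(i, total) for i in range(total)]
zero_based = [progress_percent(i, total, one_based=False) for i in range(total)]
```

With 1-based display the last step reaches 100%, whereas with 0-based display the first step shows 0% and the final step never reaches 100%, which is the regression the reviewer flags.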

🧹 Nitpick comments (2)

docs/backend/backend_python/openapi.json (1)
1120-1127: Consider simplifying the parameter schema pattern.

The change wraps a single $ref in an allOf construct and adds metadata (title, default) at the parameter level. While valid in OpenAPI 3.1.0, this pattern is unconventional. Consider whether you should instead:

Enhance the InputType schema directly (at lines 2261–2267) with title and default properties, then use a plain $ref, or

Keep the direct $ref pattern if the title and default are parameter-specific (for documentation clarity alone).

The current approach works but may confuse downstream code generators or API tooling that expect simpler schema references.

If stricter metadata handling is the goal, consider this alternative:
{
  "name": "input_type",
  "in": "query",
  "required": false,
  "schema": {
    "$ref": "#/components/schemas/InputType"
  },
  "description": "Choose input type: 'path' or 'base64'",
  "example": "path"
}
Then enhance the InputType schema in components/schemas to include a default:
"InputType": {
  "type": "string",
  "enum": ["path", "base64"],
  "title": "InputType",
  "default": "path"
}

backend/app/utils/face_clusters.py (1)
351-356: Remove commented-out code.

The old line is dead code and should be removed for clarity.
     # Normalize the face embedding
-    # face_norm = face_embedding / np.linalg.norm(face_embedding)
     norm = np.linalg.norm(face_embedding)
     if norm < 1e-6:
         return np.ones(len(cluster_means))  # max distance
 
     face_norm = face_embedding / norm
-
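
For context, a self-contained sketch of the kind of guarded distance computation this hunk belongs to might look like the following. Only the zero-norm check and the `np.ones(len(cluster_means))` fallback come from the diff; the function name `cosine_distances_to_means`, its signature, and the handling of zero-norm cluster means are assumptions. Note that the diff labels 1.0 as "max distance", but cosine distance can range up to 2, so 1.0 corresponds to orthogonal vectors.

```python
import numpy as np


def cosine_distances_to_means(face_embedding, cluster_means, eps=1e-6):
    """Cosine distance from one face embedding to each cluster mean.

    Hypothetical helper; zero-norm inputs fall back to a distance of 1.0.
    """
    face_embedding = np.asarray(face_embedding, dtype=np.float64)
    cluster_means = np.asarray(cluster_means, dtype=np.float64)

    norm = np.linalg.norm(face_embedding)
    if norm < eps:
        return np.ones(len(cluster_means))  # same fallback as in the diff
    face_norm = face_embedding / norm

    # Guard the cluster means too: replace tiny norms by 1 so the division
    # is safe; such rows stay (near) zero and end up at distance 1.0.
    mean_norms = np.linalg.norm(cluster_means, axis=1, keepdims=True)
    safe_norms = np.where(mean_norms < eps, 1.0, mean_norms)
    means_norm = cluster_means / safe_norms

    return 1.0 - means_norm @ face_norm


means = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
distances = cosine_distances_to_means([2.0, 0.0], means)
fallback = cosine_distances_to_means([0.0, 0.0], means)
```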

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and a34cb74.

⛔ Files ignored due to path filters (1)

frontend/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (6)

backend/app/utils/face_clusters.py (3 hunks)
docs/backend/backend_python/openapi.json (1 hunks)
frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx (1 hunks)
frontend/src/components/OnboardingSteps/FolderSetupStep.tsx (1 hunks)
frontend/src/components/OnboardingSteps/OnboardingStep.tsx (1 hunks)
frontend/src/components/OnboardingSteps/ThemeSelectionStep.tsx (1 hunks)

🔇 Additional comments (2)

backend/app/utils/face_clusters.py (2)

161-162: LGTM!

The adjusted DBSCAN parameters (eps=0.35, min_samples=3) are reasonable for reducing over-clustering with cosine distance on normalized embeddings. This aligns with the PR objective to prevent unrelated faces from collapsing into a single cluster.
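
As a rough way to interpret `eps=0.35`, the sketch below converts the cosine-distance threshold into an equivalent similarity, angle, and Euclidean distance between unit vectors. It relies only on the standard identities d = 1 - cos θ and ‖u - v‖² = 2d for unit-length u and v; the variable names are illustrative.

```python
import math

EPS = 0.35  # DBSCAN eps from this PR (cosine distance)

# Two normalized embeddings are neighbors when their cosine distance is at
# most EPS, i.e. when their cosine similarity is at least 1 - EPS.
min_cosine_similarity = 1.0 - EPS

# The same threshold expressed as the largest allowed angle between them.
max_angle_degrees = math.degrees(math.acos(min_cosine_similarity))

# For unit vectors, squared Euclidean distance equals twice the cosine
# distance, so this is the equivalent Euclidean radius.
equivalent_euclidean = math.sqrt(2.0 * EPS)
```

Under these identities, the threshold admits pairs whose embeddings are a little under 50 degrees apart. Whether that is tight enough for a given face-embedding model is an empirical question the PR does not settle.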

210-220: LGTM!

Normalizing embeddings before DBSCAN with cosine metric is the correct approach. The duplicate detection is a useful debug heuristic that won't affect clustering behavior.

backend/app/utils/face_clusters.py

frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx

rahulharpal1603 · 2025-12-24T05:52:22Z

Hi, sorry. Even after the changes, the problem still persists. So I am closing this PR.

ritigya03 added 2 commits December 13, 2025 16:14

Step count progress bar overflow fixed

786ec3d

Fix face clustering collapsing unrelated faces

a34cb74

github-actions bot added the backend, bug, and medium labels Dec 13, 2025

coderabbitai bot reviewed Dec 13, 2025

backend/app/utils/face_clusters.py: review comment resolved

frontend/src/components/OnboardingSteps/AvatarSelectionStep.tsx: review comment resolved

Handle empty embeddings after filtering

c732b2a

rahulharpal1603 closed this Dec 24, 2025

ritigya03 deleted the fix-face-clustering-collapse branch December 28, 2025 11:06

ritigya03 wants to merge 3 commits into AOSSIE-Org:main from ritigya03:fix-face-clustering-collapse
