303284: Add deduplication engine reference PK to Individual model and… by arsen-vs · Pull Request #5799 · unicef/hope

arsen-vs · 2026-03-10T22:21:09Z

… related services

Introduced a new field deduplication_engine_reference_pk in the Individual model to facilitate communication with the biometric deduplication engine.
Updated the CreateLaxIndividuals endpoint to include this reference in the validated data.
Enhanced the BiometricDeduplicationService to utilize the new reference PK for individual identification during deduplication processes.
Added tests to ensure the correct usage of the deduplication reference PK in various service methods and endpoints.

codecov · 2026-03-11T02:04:51Z

Codecov Report

❌ Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.11%. Comparing base (b743590) to head (d1c08fa).
⚠️ Report is 184 commits behind head on develop.

Files with missing lines	Patch %	Lines
...istration_data/services/biometric_deduplication.py	94.11%	0 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (95.45%) is below the target coverage (97.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #5799   +/-   ##
========================================
  Coverage    91.10%   91.11%           
========================================
  Files          500      500           
  Lines        34371    34390   +19     
  Branches      3547     3549    +2     
========================================
+ Hits         31315    31333   +18     
  Misses        2268     2268           
- Partials       788      789    +1

Flag	Coverage Δ
e2e	`52.69% <31.81%> (-0.02%)`	⬇️
unit	`90.73% <95.45%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

src/hope/apps/household/migrations/0034_migration.py

src/hope/apps/registration_data/services/biometric_deduplication.py

… related services - Introduced a new field `deduplication_engine_reference_pk` in the Individual model to facilitate communication with the biometric deduplication engine. - Updated the CreateLaxIndividuals endpoint to include this reference in the validated data. - Enhanced the BiometricDeduplicationService to utilize the new reference PK for individual identification during deduplication processes. - Added tests to ensure the correct usage of the deduplication reference PK in various service methods and endpoints.

…ual model - Introduced a unique constraint on the combination of `program` and `deduplication_engine_reference_pk` fields in the Individual model to ensure data integrity. - Updated the BiometricDeduplicationService and related views to utilize the new constraint when reporting individuals' statuses. - Enhanced tests to verify the correct implementation of the unique constraint and its impact on reporting logic.

…aintainability - Cleaned up the code in `biometric_deduplication.py` by adjusting the formatting of dictionary comprehensions for better clarity. - Added an import statement for the `Individual` model in the test file to ensure proper functionality of the tests related to biometric deduplication.

- Modified test assertions in `test_rdi_merge_helpers.py`, `test_rdi_merge.py`, and `test_views_registration_data_import_actions.py` to compare individual IDs using primary keys instead of UUIDs, ensuring consistency in reporting logic. - Enhanced readability of assertions by utilizing set comprehensions for ID comparisons.

- Introduced IndividualFactory import to enhance test coverage for individual-related functionality. - Added uuid import to facilitate unique identifier generation in the pending household fixture.

…ne reference PK - Introduced a new test case `test_report_withdrawn_with_iterable_uses_deduplication_engine_reference_pk` to validate the behavior of the `report_individuals_status` method in the `BiometricDeduplicationService`. - The test ensures that the correct deduplication engine reference PK is used when reporting the status of withdrawn individuals, enhancing the coverage of the biometric deduplication functionality.

johniak · 2026-03-31T14:40:14Z

src/hope/apps/registration_data/services/biometric_deduplication.py

                        status_code=item["status_code"],
-                        first=item["first"]["reference_pk"] or None,
-                        second=item["second"]["reference_pk"] or None,
+                        first=self._resolve_individual_id_from_reference(


This implementation has a bug.

fetch_biometric_deduplication_results_and_process should save all duplicate pairs across the whole program, not only pairs from individuals inside the current RDI scope.

The bug is visible in this scenario:

RDI with 1 HH and 1 individual with face "A" is created. deduplication_engine_reference_pk = "EXT-JAN".

RDI is sent to the deduplication engine (no duplicates yet).

RDI is merged.

A new RDI is created with 1 HH and 1 individual with the same face "A". deduplication_engine_reference_pk = "EXT-JAN2".

The new RDI is sent to the deduplication engine.

Results are fetched (fetch_biometric_deduplication_results_and_process).

Expected: 1 pair. Actual: 0 pairs.
The pair should be between "EXT-JAN" (already merged) and "EXT-JAN2".

No pair is saved.

Why this happens:

reference_to_individual_id is built only from individuals in RDIs that are in review and dedup in progress, not from the whole program.
So in this case it contains only {"EXT-JAN2": <uuid>}.

The dedup result item contains:
first.reference_pk = "EXT-JAN", second.reference_pk = "EXT-JAN2".

When creating SimilarityPair, _resolve_individual_id_from_reference is called with reference_to_individual_id.
This produces something like:

SimilarityPair( ..., first="EXT-JAN", # fallback (not resolved) second=<uuid>, # resolved )

SimilarityPair.first/second are later used as FK IDs to Individual when saving DeduplicationEngineSimilarityPair.
Because one side remains an external reference, the pair is skipped.

Besides fixing this bug, we should also add a test for this exact scenario.

johniak · 2026-03-31T14:49:38Z

src/hope/api/endpoints/rdi/lax.py

            self.selected_rdi.program.programme_code,
        )
        validated_data = dict(serializer.validated_data)
+        validated_data["deduplication_engine_reference_pk"] = external_individual_id


We should add stronger input validation for individual_id in the LAX flow.

Please consider checking:

max_length=255 at serializer level (currently only DB enforces this).

normalization (strip) before save.

reject blank/whitespace-only values.

detect duplicates of individual_id within the same request payload.

validate uniqueness in program with a clear API validation error (instead of only DB IntegrityError).

add tests for all cases above (too long value, blank value, duplicated value in one payload, duplicated value in program).

johniak · 2026-03-31T14:51:23Z

src/hope/models/individual.py

            ),
+            UniqueConstraint(
+                fields=["program", "deduplication_engine_reference_pk"],
+                condition=Q(is_removed=False) & Q(deduplication_engine_reference_pk__isnull=False),


The new uniqueness condition checks IS NOT NULL but not non-empty string. If "" is ever saved, only one blank value per program is allowed.

johniak · 2026-03-31T14:52:39Z

tests/unit/apps/registration_data/test_biometric_deduplication_service.py

@@ -149,6 +150,35 @@ def test_upload_individuals_success(
    assert rdi.deduplication_engine_status == RegistrationDataImport.DEDUP_ENGINE_UPLOADED




As I mentioned in previous comment we have a test gap. New test covers mapping EXT-* refs for pending individuals only. Missing test for the critical mixed case where one side is current pending and the other side is an already merged individual identified by external dedup ref. That case is where current logic is most likely to regress.

johniak · 2026-03-31T15:03:09Z

src/hope/api/endpoints/rdi/lax.py


    class Meta:
        model = PendingIndividual
        exclude = [


Because we rely on exclude, which is technically fragile, we can currently send both individual_id and deduplication_engine_reference_pk. In practice, however, we always save individual_id into deduplication_engine_reference_pk. This does not raise an error, but the behavior is confusing. Swagger will also expose this field as a possible input.

…lidation and error handling - Updated the `IndividualSerializer` to include a maximum length constraint and whitespace trimming for the `individual_id` field. - Added validation to ensure `individual_id` is not blank and does not duplicate existing entries in the same program. - Enhanced the `CreateLaxIndividuals` class to track seen `individual_id`s, preventing duplicates in the request payload. - Introduced new tests to validate the behavior of the updated serializer and error handling for individual creation scenarios.

MarekBiczysko requested changes Mar 11, 2026

View reviewed changes

src/hope/apps/household/migrations/0034_migration.py Show resolved Hide resolved

src/hope/apps/registration_data/services/biometric_deduplication.py Outdated Show resolved Hide resolved

src/hope/apps/registration_data/services/biometric_deduplication.py Outdated Show resolved Hide resolved

johniak marked this pull request as draft March 16, 2026 10:06

ArsenPidhoretskyi added 5 commits March 20, 2026 10:56

Add IndividualFactory import and uuid in test_rdi_merge_helpers.py

d1c08fa

- Introduced IndividualFactory import to enhance test coverage for individual-related functionality. - Added uuid import to facilitate unique identifier generation in the pending household fixture.

arsen-vs force-pushed the feature/303284-Use-ind-pk-instead-of-originating-id branch from 1c0888c to d1c08fa Compare March 20, 2026 08:59

arsen-vs requested review from MarekBiczysko, datamik, domdinicola, roma-valor, saxix and vitali-yanushchyk-valor March 31, 2026 08:11

arsen-vs marked this pull request as ready for review March 31, 2026 08:12

johniak requested changes Mar 31, 2026

View reviewed changes

johniak marked this pull request as draft March 31, 2026 15:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

303284: Add deduplication engine reference PK to Individual model and…#5799

303284: Add deduplication engine reference PK to Individual model and…#5799
arsen-vs wants to merge 7 commits intodevelopfrom
feature/303284-Use-ind-pk-instead-of-originating-id

arsen-vs commented Mar 10, 2026

Uh oh!

codecov bot commented Mar 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

johniak Mar 31, 2026

Uh oh!

johniak Mar 31, 2026

Uh oh!

johniak Mar 31, 2026

Uh oh!

johniak Mar 31, 2026

Uh oh!

johniak Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -149,6 +150,35 @@ def test_upload_individuals_success(
		assert rdi.deduplication_engine_status == RegistrationDataImport.DEDUP_ENGINE_UPLOADED

Conversation

arsen-vs commented Mar 10, 2026

Uh oh!

codecov bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

johniak Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

johniak Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

johniak Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

johniak Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

johniak Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Mar 11, 2026 •

edited

Loading