303284: Add deduplication engine reference PK to Individual model and… #5799
Codecov Report
❌ Your patch check has failed because the patch coverage (95.45%) is below the target coverage (97.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #5799 +/- ##
========================================
Coverage 91.10% 91.11%
========================================
Files 500 500
Lines 34371 34390 +19
Branches 3547 3549 +2
========================================
+ Hits 31315 31333 +18
Misses 2268 2268
- Partials 788 789 +1
… related services
- Introduced a new field `deduplication_engine_reference_pk` in the Individual model to facilitate communication with the biometric deduplication engine.
- Updated the CreateLaxIndividuals endpoint to include this reference in the validated data.
- Enhanced the BiometricDeduplicationService to utilize the new reference PK for individual identification during deduplication processes.
- Added tests to ensure the correct usage of the deduplication reference PK in various service methods and endpoints.

…ual model
- Introduced a unique constraint on the combination of `program` and `deduplication_engine_reference_pk` fields in the Individual model to ensure data integrity.
- Updated the BiometricDeduplicationService and related views to utilize the new constraint when reporting individuals' statuses.
- Enhanced tests to verify the correct implementation of the unique constraint and its impact on reporting logic.

…aintainability
- Cleaned up the code in `biometric_deduplication.py` by adjusting the formatting of dictionary comprehensions for better clarity.
- Added an import statement for the `Individual` model in the test file to ensure proper functionality of the tests related to biometric deduplication.

- Modified test assertions in `test_rdi_merge_helpers.py`, `test_rdi_merge.py`, and `test_views_registration_data_import_actions.py` to compare individual IDs using primary keys instead of UUIDs, ensuring consistency in reporting logic.
- Enhanced readability of assertions by utilizing set comprehensions for ID comparisons.

- Introduced IndividualFactory import to enhance test coverage for individual-related functionality.
- Added uuid import to facilitate unique identifier generation in the pending household fixture.
1c0888c to d1c08fa
…ne reference PK
- Introduced a new test case `test_report_withdrawn_with_iterable_uses_deduplication_engine_reference_pk` to validate the behavior of the `report_individuals_status` method in the `BiometricDeduplicationService`.
- The test ensures that the correct deduplication engine reference PK is used when reporting the status of withdrawn individuals, enhancing the coverage of the biometric deduplication functionality.
status_code=item["status_code"],
first=item["first"]["reference_pk"] or None,
second=item["second"]["reference_pk"] or None,
first=self._resolve_individual_id_from_reference(
This implementation has a bug.
fetch_biometric_deduplication_results_and_process should save all duplicate pairs across the whole program, not only pairs from individuals inside the current RDI scope.
The bug is visible in this scenario:
- RDI with 1 HH and 1 individual with face "A" is created. deduplication_engine_reference_pk = "EXT-JAN".
- RDI is sent to the deduplication engine (no duplicates yet).
- RDI is merged.
- A new RDI is created with 1 HH and 1 individual with the same face "A". deduplication_engine_reference_pk = "EXT-JAN2".
- The new RDI is sent to the deduplication engine.
- Results are fetched (fetch_biometric_deduplication_results_and_process).
- Expected: 1 pair. Actual: 0 pairs. The pair should be between "EXT-JAN" (already merged) and "EXT-JAN2".
- No pair is saved.
Why this happens:
reference_to_individual_id is built only from individuals in RDIs that are in review and dedup in progress, not from the whole program.
So in this case it contains only {"EXT-JAN2": <uuid>}.
The dedup result item contains:
first.reference_pk = "EXT-JAN", second.reference_pk = "EXT-JAN2".
When creating SimilarityPair, _resolve_individual_id_from_reference is called with reference_to_individual_id.
This produces something like:
SimilarityPair(
...,
first="EXT-JAN", # fallback (not resolved)
second=<uuid>, # resolved
)
SimilarityPair.first/second are later used as FK IDs to Individual when saving DeduplicationEngineSimilarityPair.
Because one side remains an external reference, the pair is skipped.
Besides fixing this bug, we should also add a test for this exact scenario.
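The scenario above can be reproduced in a minimal, framework-free sketch. Everything here is hypothetical and simplified for illustration: the `Individual` dataclass, `build_reference_map`, and `resolve_pairs` are stand-ins, not the project's models or service methods, and the sketch drops unresolved pairs directly instead of going through the fallback described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Individual:
    # Hypothetical stand-in for the real model; only the fields the scenario needs.
    id: str
    program: str
    rdi: str
    reference_pk: Optional[str]


def build_reference_map(individuals, program, rdi=None):
    """Map external reference -> internal id, optionally limited to one RDI."""
    return {
        ind.reference_pk: ind.id
        for ind in individuals
        if ind.program == program
        and ind.reference_pk
        and (rdi is None or ind.rdi == rdi)
    }


def resolve_pairs(results, reference_map):
    """Keep only pairs where both sides resolve to an internal id."""
    pairs = []
    for item in results:
        first = reference_map.get(item["first"])
        second = reference_map.get(item["second"])
        if first is not None and second is not None:
            pairs.append((first, second))
    return pairs


individuals = [
    # Already merged individual from the first RDI.
    Individual(id="uuid-1", program="P1", rdi="rdi-1", reference_pk="EXT-JAN"),
    # Pending individual from the new RDI.
    Individual(id="uuid-2", program="P1", rdi="rdi-2", reference_pk="EXT-JAN2"),
]
results = [{"first": "EXT-JAN", "second": "EXT-JAN2"}]

# Map limited to the current RDI: "EXT-JAN" cannot be resolved, so the pair is lost.
rdi_scoped = resolve_pairs(results, build_reference_map(individuals, "P1", rdi="rdi-2"))
# Map built from the whole program: both sides resolve and the pair is kept.
program_wide = resolve_pairs(results, build_reference_map(individuals, "P1"))
```

A regression test for the real service could follow the same shape: one merged individual, one pending individual, one result item that references both.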
    self.selected_rdi.program.programme_code,
)
validated_data = dict(serializer.validated_data)
validated_data["deduplication_engine_reference_pk"] = external_individual_id
We should add stronger input validation for individual_id in the LAX flow.
Please consider checking:
- max_length=255 at serializer level (currently only DB enforces this).
- normalization (strip) before save.
- reject blank/whitespace-only values.
- detect duplicates of individual_id within the same request payload.
- validate uniqueness in program with a clear API validation error (instead of only DB IntegrityError).
- add tests for all cases above (too long value, blank value, duplicated value in one payload, duplicated value in program).
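As a rough illustration of the checks listed above, here is a plain-Python sketch that is independent of the serializer. The function name `validate_individual_ids`, the error messages, and the `existing_in_program` argument are hypothetical, and treating a missing value as blank is an assumption; in the real code the checks would live in the serializer and the endpoint.

```python
MAX_LENGTH = 255  # mirrors the max_length mentioned above


def validate_individual_ids(raw_ids, existing_in_program):
    """Return (cleaned_ids, errors), where errors maps payload index -> message."""
    cleaned = []
    errors = {}
    seen = set()
    for index, raw in enumerate(raw_ids):
        # Normalize before any other check; None is treated as blank (assumption).
        value = (raw or "").strip()
        if not value:
            errors[index] = "individual_id must not be blank"
            continue
        if len(value) > MAX_LENGTH:
            errors[index] = f"individual_id must be at most {MAX_LENGTH} characters"
        elif value in seen:
            errors[index] = "individual_id is duplicated within the request"
        elif value in existing_in_program:
            errors[index] = "individual_id already exists in this program"
        else:
            cleaned.append(value)
        seen.add(value)
    return cleaned, errors


cleaned, errors = validate_individual_ids(
    [" EXT-1 ", "   ", "EXT-1", "x" * 256, "EXT-9"],
    existing_in_program={"EXT-9"},
)
```

In this example only the first value survives, and each of the four requested test cases (blank, duplicate in payload, too long, duplicate in program) produces its own error entry.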
src/hope/models/individual.py
Outdated
),
UniqueConstraint(
    fields=["program", "deduplication_engine_reference_pk"],
    condition=Q(is_removed=False) & Q(deduplication_engine_reference_pk__isnull=False),
The new uniqueness condition checks IS NOT NULL but does not exclude the empty string. If "" is ever saved, only one blank value per program is allowed.
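The effect can be demonstrated with a partial unique index in SQLite, used here only as a stand-in from the standard library; the table and column names are simplified and hypothetical, and the project's own database and migration are not shown. The sketch compares a condition that checks only IS NOT NULL with one that also excludes the empty string.

```python
import sqlite3


def count_blank_inserts(index_condition: str) -> int:
    """Try to insert two rows with ref = '' and return how many succeed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE individual (program TEXT, ref TEXT)")
    # Partial unique index: uniqueness applies only to rows matching the condition.
    conn.execute(
        "CREATE UNIQUE INDEX uniq_program_ref ON individual (program, ref) "
        f"WHERE {index_condition}"
    )
    inserted = 0
    for _ in range(2):
        try:
            conn.execute("INSERT INTO individual VALUES ('P1', '')")
            inserted += 1
        except sqlite3.IntegrityError:
            # The second blank value collides with the first one.
            pass
    conn.close()
    return inserted


not_null_only = count_blank_inserts("ref IS NOT NULL")
not_null_and_not_blank = count_blank_inserts("ref IS NOT NULL AND ref <> ''")
```

With the first condition the second blank row is rejected; with the second condition both are accepted. Normalizing blank input to NULL before saving would be an alternative that leaves the constraint unchanged.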
@@ -149,6 +150,35 @@ def test_upload_individuals_success(
assert rdi.deduplication_engine_status == RegistrationDataImport.DEDUP_ENGINE_UPLOADED
As I mentioned in the previous comment, we have a test gap. The new test covers mapping EXT-* refs for pending individuals only. A test is missing for the critical mixed case where one side is a current pending individual and the other side is an already merged individual identified by its external dedup ref. That case is where the current logic is most likely to regress.
class Meta:
    model = PendingIndividual
    exclude = [
Because we rely on exclude, which is technically fragile, we can currently send both individual_id and deduplication_engine_reference_pk. In practice, however, we always save individual_id into deduplication_engine_reference_pk. This does not raise an error, but the behavior is confusing. Swagger will also expose this field as a possible input.
…lidation and error handling
- Updated the `IndividualSerializer` to include a maximum length constraint and whitespace trimming for the `individual_id` field.
- Added validation to ensure `individual_id` is not blank and does not duplicate existing entries in the same program.
- Enhanced the `CreateLaxIndividuals` class to track seen `individual_id`s, preventing duplicates in the request payload.
- Introduced new tests to validate the behavior of the updated serializer and error handling for individual creation scenarios.