Skip to content

303284: Add deduplication engine reference PK to Individual model and…#5799

Draft
arsen-vs wants to merge 7 commits intodevelopfrom
feature/303284-Use-ind-pk-instead-of-originating-id
Draft

303284: Add deduplication engine reference PK to Individual model and…#5799
arsen-vs wants to merge 7 commits intodevelopfrom
feature/303284-Use-ind-pk-instead-of-originating-id

Conversation

@arsen-vs
Copy link
Copy Markdown

… related services

  • Introduced a new field deduplication_engine_reference_pk in the Individual model to facilitate communication with the biometric deduplication engine.
  • Updated the CreateLaxIndividuals endpoint to include this reference in the validated data.
  • Enhanced the BiometricDeduplicationService to utilize the new reference PK for individual identification during deduplication processes.
  • Added tests to ensure the correct usage of the deduplication reference PK in various service methods and endpoints.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.11%. Comparing base (b743590) to head (d1c08fa).
⚠️ Report is 184 commits behind head on develop.

Files with missing lines Patch % Lines
...istration_data/services/biometric_deduplication.py 94.11% 0 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (95.45%) is below the target coverage (97.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5799   +/-   ##
========================================
  Coverage    91.10%   91.11%           
========================================
  Files          500      500           
  Lines        34371    34390   +19     
  Branches      3547     3549    +2     
========================================
+ Hits         31315    31333   +18     
  Misses        2268     2268           
- Partials       788      789    +1     
Flag Coverage Δ
e2e 52.69% <31.81%> (-0.02%) ⬇️
unit 90.73% <95.45%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@johniak johniak marked this pull request as draft March 16, 2026 10:06
ArsenPidhoretskyi added 5 commits March 20, 2026 10:56
… related services

- Introduced a new field `deduplication_engine_reference_pk` in the Individual model to facilitate communication with the biometric deduplication engine.
- Updated the CreateLaxIndividuals endpoint to include this reference in the validated data.
- Enhanced the BiometricDeduplicationService to utilize the new reference PK for individual identification during deduplication processes.
- Added tests to ensure the correct usage of the deduplication reference PK in various service methods and endpoints.
…ual model

- Introduced a unique constraint on the combination of `program` and `deduplication_engine_reference_pk` fields in the Individual model to ensure data integrity.
- Updated the BiometricDeduplicationService and related views to utilize the new constraint when reporting individuals' statuses.
- Enhanced tests to verify the correct implementation of the unique constraint and its impact on reporting logic.
…aintainability

- Cleaned up the code in `biometric_deduplication.py` by adjusting the formatting of dictionary comprehensions for better clarity.
- Added an import statement for the `Individual` model in the test file to ensure proper functionality of the tests related to biometric deduplication.
- Modified test assertions in `test_rdi_merge_helpers.py`, `test_rdi_merge.py`, and `test_views_registration_data_import_actions.py` to compare individual IDs using primary keys instead of UUIDs, ensuring consistency in reporting logic.
- Enhanced readability of assertions by utilizing set comprehensions for ID comparisons.
- Introduced IndividualFactory import to enhance test coverage for individual-related functionality.
- Added uuid import to facilitate unique identifier generation in the pending household fixture.
@arsen-vs arsen-vs force-pushed the feature/303284-Use-ind-pk-instead-of-originating-id branch from 1c0888c to d1c08fa Compare March 20, 2026 08:59
…ne reference PK

- Introduced a new test case `test_report_withdrawn_with_iterable_uses_deduplication_engine_reference_pk` to validate the behavior of the `report_individuals_status` method in the `BiometricDeduplicationService`.
- The test ensures that the correct deduplication engine reference PK is used when reporting the status of withdrawn individuals, enhancing the coverage of the biometric deduplication functionality.
@arsen-vs arsen-vs marked this pull request as ready for review March 31, 2026 08:12
status_code=item["status_code"],
first=item["first"]["reference_pk"] or None,
second=item["second"]["reference_pk"] or None,
first=self._resolve_individual_id_from_reference(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation has a bug.

fetch_biometric_deduplication_results_and_process should save all duplicate pairs across the whole program, not only pairs from individuals inside the current RDI scope.

The bug is visible in this scenario:

  1. RDI with 1 HH and 1 individual with face "A" is created. deduplication_engine_reference_pk = "EXT-JAN".
  2. RDI is sent to the deduplication engine (no duplicates yet).
  3. RDI is merged.
  4. A new RDI is created with 1 HH and 1 individual with the same face "A". deduplication_engine_reference_pk = "EXT-JAN2".
  5. The new RDI is sent to the deduplication engine.
  6. Results are fetched (fetch_biometric_deduplication_results_and_process).
  7. Expected: 1 pair. Actual: 0 pairs.
    The pair should be between "EXT-JAN" (already merged) and "EXT-JAN2".
  8. No pair is saved.

Why this happens:

reference_to_individual_id is built only from individuals in RDIs that are in review and dedup in progress, not from the whole program.
So in this case it contains only {"EXT-JAN2": <uuid>}.

The dedup result item contains:
first.reference_pk = "EXT-JAN", second.reference_pk = "EXT-JAN2".

When creating SimilarityPair, _resolve_individual_id_from_reference is called with reference_to_individual_id.
This produces something like:

SimilarityPair(
  ...,
  first="EXT-JAN",   # fallback (not resolved)
  second=<uuid>,     # resolved
)

SimilarityPair.first/second are later used as FK IDs to Individual when saving DeduplicationEngineSimilarityPair.
Because one side remains an external reference, the pair is skipped.

Besides fixing this bug, we should also add a test for this exact scenario.

self.selected_rdi.program.programme_code,
)
validated_data = dict(serializer.validated_data)
validated_data["deduplication_engine_reference_pk"] = external_individual_id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add stronger input validation for individual_id in the LAX flow.

Please consider checking:

  1. max_length=255 at serializer level (currently only DB enforces this).
  2. normalization (strip) before save.
  3. reject blank/whitespace-only values.
  4. detect duplicates of individual_id within the same request payload.
  5. validate uniqueness in program with a clear API validation error (instead of only DB IntegrityError).
  6. add tests for all cases above (too long value, blank value, duplicated value in one payload, duplicated value in program).

),
UniqueConstraint(
fields=["program", "deduplication_engine_reference_pk"],
condition=Q(is_removed=False) & Q(deduplication_engine_reference_pk__isnull=False),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new uniqueness condition checks IS NOT NULL but not non-empty string. If "" is ever saved, only one blank value per program is allowed.

@@ -149,6 +150,35 @@ def test_upload_individuals_success(
assert rdi.deduplication_engine_status == RegistrationDataImport.DEDUP_ENGINE_UPLOADED


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in previous comment we have a test gap. New test covers mapping EXT-* refs for pending individuals only. Missing test for the critical mixed case where one side is current pending and the other side is an already merged individual identified by external dedup ref. That case is where current logic is most likely to regress.


class Meta:
model = PendingIndividual
exclude = [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we rely on exclude, which is technically fragile, we can currently send both individual_id and deduplication_engine_reference_pk. In practice, however, we always save individual_id into deduplication_engine_reference_pk. This does not raise an error, but the behavior is confusing. Swagger will also expose this field as a possible input.

@johniak johniak marked this pull request as draft March 31, 2026 15:04
…lidation and error handling

- Updated the `IndividualSerializer` to include a maximum length constraint and whitespace trimming for the `individual_id` field.
- Added validation to ensure `individual_id` is not blank and does not duplicate existing entries in the same program.
- Enhanced the `CreateLaxIndividuals` class to track seen `individual_id`s, preventing duplicates in the request payload.
- Introduced new tests to validate the behavior of the updated serializer and error handling for individual creation scenarios.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants