feat(recognizers): add UsMbiRecognizer for US Medicare Beneficiary ID #1821

chrisvoncsefalvay · 2025-12-25T01:30:14Z

Change Description

Adds a recogniser for the new Medicare Beneficiary ID.

Issue reference

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

chrisvoncsefalvay · 2025-12-25T01:31:32Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR adds a new recognizer for the US Medicare Beneficiary Identifier (MBI), a structured 11-character alphanumeric identifier that replaced the old SSN-based Medicare identifiers. The implementation follows Presidio's pattern-based recognizer approach with proper validation against CMS specifications.

Key Changes:

Adds UsMbiRecognizer with regex patterns matching the CMS MBI format (position-specific character type validation, excluding the easily confused letters S, L, O, I, B, Z)
Implements two confidence levels: weak (0.3) for non-dashed format and medium (0.5) for dashed format
Includes comprehensive test coverage with 20+ test cases covering valid formats, invalid patterns, edge cases, and context validation
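
As a rough illustration of the two points above, here is a minimal standard-library sketch of position-specific matching with two score tiers. It is not the PR's code: the names `GROUPS`, `PATTERNS` and `find_mbi` are hypothetical, and it assumes the CMS rule that position 1 is a digit from 1 to 9.

```python
import re

# Letters allowed by the CMS format: A-Z minus S, L, O, I, B, Z.
LETTERS = "ACDEFGHJKMNPQRTUVWXY"
C = "[1-9]"              # position 1 (assumption: 1-9, per the CMS format)
N = "[0-9]"              # positions 4, 7, 10, 11
A = f"[{LETTERS}]"       # positions 2, 5, 8, 9
AN = f"[0-9{LETTERS}]"   # positions 3, 6

# The three display groups of an MBI: 4 + 3 + 4 characters.
GROUPS = (f"{C}{A}{AN}{N}", f"{A}{AN}{N}", f"{A}{A}{N}{N}")
NO_DASH = r"\b" + "".join(GROUPS) + r"\b"
WITH_DASH = r"\b" + "-".join(GROUPS) + r"\b"

# (regex, score) pairs mirroring the weak and medium confidence levels.
PATTERNS = [
    (re.compile(NO_DASH), 0.3),
    (re.compile(WITH_DASH), 0.5),
]


def find_mbi(text):
    """Return (matched_text, score) pairs for every MBI-shaped token."""
    hits = []
    for regex, score in PATTERNS:
        hits.extend((m.group(), score) for m in regex.finditer(text))
    return hits
```

The dashed form scores higher because the 4-3-4 grouping is a stronger signal than eleven bare alphanumeric characters, which can collide with other identifiers.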

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File	Description
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py	Core recognizer implementation with position-specific regex patterns matching CMS MBI specification, including proper character exclusions and two scoring levels
presidio-analyzer/tests/test_us_mbi_recognizer.py	Comprehensive test suite covering valid MBI formats (with/without dashes), invalid patterns (wrong characters, positions, length), lowercase handling, and metadata validation
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/__init__.py	Updates US recognizer module exports to include UsMbiRecognizer
presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py	Adds UsMbiRecognizer to main predefined recognizers module exports
presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml	Adds UsMbiRecognizer configuration with enabled: false (correct default for country-specific recognizers)

Copilot · 2025-12-30T17:23:48Z

...o-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py

+"""Recognizer for US Medicare Beneficiary Identifier (MBI)."""
+
+from typing import List, Optional
+
+from presidio_analyzer import Pattern, PatternRecognizer
+
+
+class UsMbiRecognizer(PatternRecognizer):
+    """Recognise US Medicare Beneficiary Identifier (MBI) using regex.
+
+    The MBI is an 11-character identifier used by Medicare. The format follows
+    specific rules defined by CMS (Centers for Medicare & Medicaid Services):
+    https://www.cms.gov/medicare/new-medicare-card/understanding-new-medicare-beneficiary-identifier-mbi
+
+    Format: C A AN N A AN N A A N N
+    Where:
+    - C = numeric character (1-9)
+    - N = numeric character (0-9)
+    - A = alphabetic character (excluding S, L, O, I, B, Z)
+    - AN = alphanumeric character (numeric or alphabetic, excluding S, L, O, I, B, Z)
+
+    Position rules:
+    - Position 1: numeric (1-9)
+    - Positions 4, 7, 10, 11: numeric (0-9)
+    - Positions 2, 5, 8, 9: alphabetic
+    - Positions 3, 6: alphanumeric
+
+    Example: 1EG4-TE5-MK73 (dashes are for display only)
+
+    :param patterns: List of patterns to be used by this recognizer
+    :param context: List of context words to increase confidence in detection
+    :param supported_language: Language this recognizer supports
+    :param supported_entity: The entity this recognizer can detect
+    """
+
+    # Valid letters: A-Z excluding S, L, O, I, B, Z
+    # Valid letters are: A, C, D, E, F, G, H, J, K, M, N, P, Q, R, T, U, V, W, X, Y
+    VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"
+    VALID_ALPHANUMERIC = "0-9ACDEFGHJKMNPQRTUVWXY"
+
+    # Regex building blocks
+    _FIRST = "[1-9]"  # position 1 excludes 0 per the CMS format
+    _NUM = "[0-9]"
+    _ALPHA = f"[{VALID_LETTERS}]"
+    _ALPHANUM = f"[{VALID_ALPHANUMERIC}]"
+
+    # Full MBI pattern:
+    # Pos: 1     2      3        4   5      6        7   8      9      10  11
+    #      FIRST ALPHA  ALPHANUM NUM ALPHA  ALPHANUM NUM ALPHA  ALPHA  NUM NUM
+
+    # Pattern without dashes (11 consecutive characters)
+    _MBI_NO_DASH = (
+        f"{_FIRST}{_ALPHA}{_ALPHANUM}{_NUM}"
+        f"{_ALPHA}{_ALPHANUM}{_NUM}"
+        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
+    )
+
+    # Pattern with dashes in XXXX-XXX-XXXX format
+    _MBI_WITH_DASH = (
+        f"{_FIRST}{_ALPHA}{_ALPHANUM}{_NUM}-"
+        f"{_ALPHA}{_ALPHANUM}{_NUM}-"
+        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
+    )
+
+    PATTERNS = [
+        Pattern(
+            "MBI (weak)",
+            rf"\b{_MBI_NO_DASH}\b",
+            0.3,
+        ),
+        Pattern(
+            "MBI (medium)",
+            rf"\b{_MBI_WITH_DASH}\b",
+            0.5,
+        ),
+    ]
+
+    CONTEXT = [
+        "medicare",
+        "mbi",
+        "beneficiary",
+        "cms",
+        "medicaid",
+        "hic",  # Health Insurance Claim number (predecessor)
+        "hicn",
+    ]
+
+    def __init__(
+        self,
+        patterns: Optional[List[Pattern]] = None,
+        context: Optional[List[str]] = None,
+        supported_language: str = "en",
+        supported_entity: str = "US_MBI",
+    ):
+        patterns = patterns if patterns else self.PATTERNS
+        context = context if context else self.CONTEXT
+        super().__init__(
+            supported_entity=supported_entity,
+            patterns=patterns,
+            context=context,
+            supported_language=supported_language,
+        )
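
The `context` parameter above exists so that nearby words such as "medicare" or "mbi" can raise the score of an otherwise weak match. The sketch below is a simplified stand-in for that idea, not Presidio's actual enhancer: the function name `boost_with_context`, the five-token look-behind window, and the `boost` and `floor` values are all assumptions chosen for illustration.

```python
import re

# Context words taken from the CONTEXT list in the diff above.
CONTEXT = ["medicare", "mbi", "beneficiary", "cms", "medicaid", "hic", "hicn"]


def boost_with_context(text, start, score, window=5, boost=0.35, floor=0.4):
    """Raise `score` if a context word occurs shortly before the match.

    `start` is the character offset of the match in `text`. The window size,
    boost and floor are hypothetical defaults, not values from the PR.
    """
    # Lowercased word tokens that precede the match, limited to the window.
    preceding = re.findall(r"[a-z]+", text[:start].lower())[-window:]
    if any(word in CONTEXT for word in preceding):
        return min(1.0, max(score + boost, floor))
    return score


text = "Medicare number 1EG4TE5MK73"
boosted = boost_with_context(text, text.index("1EG4"), 0.3)  # roughly 0.65
```

With a scheme like this, an undashed match at 0.3 only crosses a typical acceptance threshold when the surrounding text mentions Medicare, which is why the weak pattern can afford to be permissive.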


Missing documentation: The US_MBI entity should be added to the USA section of docs/supported_entities.md. Per CONTRIBUTING.md guidelines: "If your recognizer supports a new entity, consider updating the supported entities list". Add an entry like: |US_MBI|A US Medicare Beneficiary Identifier (MBI) with 11 alphanumeric characters.|Pattern match and context|

Copilot · 2025-12-30T17:23:48Z

Missing CHANGELOG entry: Per CONTRIBUTING.md guidelines, "Update CHANGELOG.md: Under Unreleased section, use the category which is most suitable for your change". This new recognizer should be documented in CHANGELOG.md under the "Unreleased" section, similar to how KrRrnRecognizer and ThTninRecognizer were documented.

@copilot open a new pull request to apply changes based on this feedback

chrisvoncsefalvay · 2025-12-30T17:27:27Z

@copilot open a new pull request to apply changes based on the comments in this thread

feat(recognizers): add UsMbiRecognizer for US Medicare Beneficiary Identifier

e7a9f13

github-actions bot added the external label Dec 25, 2025

Merge branch 'main' into chrisvoncsefalvay/feat/us-mbi-recognizer

11400b8

omri374 requested a review from Copilot December 30, 2025 17:20

Copilot started reviewing on behalf of omri374 December 30, 2025 17:21

Copilot AI reviewed Dec 30, 2025

chrisvoncsefalvay and others added 2 commits December 30, 2025 10:26

Update presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py

cccafc2

US spelling of recognise. Co-authored-by: Copilot

Merge branch 'main' into chrisvoncsefalvay/feat/us-mbi-recognizer

56df872
