feat(recognizers): add UsMbiRecognizer for US Medicare Beneficiary ID #1821
@microsoft-github-policy-service agree
Pull request overview
This PR adds a new recognizer for the US Medicare Beneficiary Identifier (MBI), a structured 11-character alphanumeric identifier that replaced the old SSN-based Medicare identifiers. The implementation follows Presidio's pattern-based recognizer approach with proper validation against CMS specifications.
Key Changes:
- Adds `UsMbiRecognizer` with regex patterns matching the CMS MBI format (position-specific character type validation, excluding the easily confused letters S, L, O, I, B, Z)
- Implements two confidence levels: weak (0.3) for the non-dashed format and medium (0.5) for the dashed format
- Includes comprehensive test coverage with 20+ test cases covering valid formats, invalid patterns, edge cases, and context validation
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py | Core recognizer implementation with position-specific regex patterns matching CMS MBI specification, including proper character exclusions and two scoring levels |
| presidio-analyzer/tests/test_us_mbi_recognizer.py | Comprehensive test suite covering valid MBI formats (with/without dashes), invalid patterns (wrong characters, positions, length), lowercase handling, and metadata validation |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/__init__.py | Updates US recognizer module exports to include UsMbiRecognizer |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Adds UsMbiRecognizer to main predefined recognizers module exports |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds UsMbiRecognizer configuration with enabled: false (correct default for country-specific recognizers) |
...o-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py
Outdated
```python
"""Recognizer for US Medicare Beneficiary Identifier (MBI)."""

from typing import List, Optional

from presidio_analyzer import Pattern, PatternRecognizer


class UsMbiRecognizer(PatternRecognizer):
    """Recognise US Medicare Beneficiary Identifier (MBI) using regex.

    The MBI is an 11-character identifier used by Medicare. The format follows
    specific rules defined by CMS (Centers for Medicare & Medicaid Services):
    https://www.cms.gov/medicare/new-medicare-card/understanding-new-medicare-beneficiary-identifier-mbi

    Format: C A AN N A AN N A A N N
    Where:
    - C = numeric character (0-9)
    - A = alphabetic character (excluding S, L, O, I, B, Z)
    - AN = alphanumeric character (numeric or alphabetic, excluding S, L, O, I, B, Z)

    Position rules:
    - Positions 1, 4, 7, 10, 11: numeric (0-9)
    - Positions 2, 5, 8, 9: alphabetic
    - Positions 3, 6: alphanumeric

    Example: 1EG4-TE5-MK73 (dashes are for display only)

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    # Valid letters: A-Z excluding S, L, O, I, B, Z
    # Valid letters are: A, C, D, E, F, G, H, J, K, M, N, P, Q, R, T, U, V, W, X, Y
    VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"
    VALID_ALPHANUMERIC = "0-9ACDEFGHJKMNPQRTUVWXY"

    # Regex building blocks
    _NUM = "[0-9]"
    _ALPHA = f"[{VALID_LETTERS}]"
    _ALPHANUM = f"[{VALID_ALPHANUMERIC}]"

    # Full MBI pattern:
    # Pos: 1   2     3        4   5     6        7   8     9     10  11
    #      NUM ALPHA ALPHANUM NUM ALPHA ALPHANUM NUM ALPHA ALPHA NUM NUM

    # Pattern without dashes (11 consecutive characters)
    _MBI_NO_DASH = (
        f"{_NUM}{_ALPHA}{_ALPHANUM}{_NUM}"
        f"{_ALPHA}{_ALPHANUM}{_NUM}"
        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
    )

    # Pattern with dashes in XXXX-XXX-XXXX format
    _MBI_WITH_DASH = (
        f"{_NUM}{_ALPHA}{_ALPHANUM}{_NUM}-"
        f"{_ALPHA}{_ALPHANUM}{_NUM}-"
        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
    )

    PATTERNS = [
        Pattern(
            "MBI (weak)",
            rf"\b{_MBI_NO_DASH}\b",
            0.3,
        ),
        Pattern(
            "MBI (medium)",
            rf"\b{_MBI_WITH_DASH}\b",
            0.5,
        ),
    ]

    CONTEXT = [
        "medicare",
        "mbi",
        "beneficiary",
        "cms",
        "medicaid",
        "hic",  # Health Insurance Claim number (predecessor)
        "hicn",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_MBI",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
```
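To sanity-check the two patterns without installing Presidio, the regexes can be rebuilt with the standard library alone. The following is a minimal sketch, not the PR's implementation: `mbi_score` and the module-level names are hypothetical helpers introduced here, matching is case-sensitive, and the returned values are only the base pattern scores from the diff (0.5 dashed, 0.3 non-dashed), with no context-word handling.

```python
import re
from typing import Optional

# Letters allowed in an MBI: A-Z minus S, L, O, I, B, Z (copied from the PR).
VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"

NUM = "[0-9]"
ALPHA = f"[{VALID_LETTERS}]"
ALPHANUM = f"[0-9{VALID_LETTERS}]"

# The three display groups of an MBI: 4 + 3 + 4 characters.
GROUPS = (
    f"{NUM}{ALPHA}{ALPHANUM}{NUM}",
    f"{ALPHA}{ALPHANUM}{NUM}",
    f"{ALPHA}{ALPHA}{NUM}{NUM}",
)

# Word boundaries keep the patterns from matching inside longer tokens.
NO_DASH = re.compile(r"\b" + "".join(GROUPS) + r"\b")
WITH_DASH = re.compile(r"\b" + "-".join(GROUPS) + r"\b")


def mbi_score(text: str) -> Optional[float]:
    """Return the base score of the strongest pattern found in text, or None.

    Hypothetical helper for illustration; the scores mirror the PR's
    "MBI (medium)" and "MBI (weak)" patterns.
    """
    if WITH_DASH.search(text):
        return 0.5
    if NO_DASH.search(text):
        return 0.3
    return None


if __name__ == "__main__":
    for sample in ("MBI: 1EG4-TE5-MK73", "MBI: 1EG4TE5MK73", "1SG4TE5MK73"):
        print(sample, "->", mbi_score(sample))
```

The sketch follows the PR in allowing `0` in position 1. The CMS page linked in the docstring describes that position as numeric 1 through 9, so a stricter `[1-9]` for the first character may be worth considering.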
Copilot
AI
Dec 30, 2025
Missing documentation: The US_MBI entity should be added to the USA section of docs/supported_entities.md. Per CONTRIBUTING.md guidelines: "If your recognizer supports a new entity, consider updating the supported entities list". Add an entry like: |US_MBI|A US Medicare Beneficiary Identifier (MBI) with 11 alphanumeric characters.|Pattern match and context|
Copilot
AI
Dec 30, 2025
Missing CHANGELOG entry: Per CONTRIBUTING.md guidelines, "Update CHANGELOG.md: Under Unreleased section, use the category which is most suitable for your change". This new recognizer should be documented in CHANGELOG.md under the "Unreleased" section, similar to how KrRrnRecognizer and ThTninRecognizer were documented.
@copilot open a new pull request to apply changes based on this feedback
…ntry_specific/us/us_mbi_recognizer.py US spelling of recognise. Co-authored-by: Copilot <[email protected]>
@copilot open a new pull request to apply changes based on the comments in this thread
Change Description
Adds a recognizer for the new US Medicare Beneficiary Identifier (MBI).
Issue reference
Checklist