@chrisvoncsefalvay

Change Description

Adds a recogniser for the US Medicare Beneficiary Identifier (MBI).

Issue reference

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@chrisvoncsefalvay
Author

@microsoft-github-policy-service agree

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new recognizer for the US Medicare Beneficiary Identifier (MBI), a structured 11-character alphanumeric identifier that replaced the old SSN-based Medicare identifiers. The implementation follows Presidio's pattern-based recognizer approach with proper validation against CMS specifications.

Key Changes:

  • Adds UsMbiRecognizer with regex patterns matching the CMS MBI format (position-specific character type validation, excluding the easily confused letters S, L, O, I, B, Z)
  • Implements two confidence levels: weak (0.3) for non-dashed format and medium (0.5) for dashed format
  • Includes comprehensive test coverage with 20+ test cases covering valid formats, invalid patterns, edge cases, and context validation
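
The two confidence levels can be illustrated with a minimal sketch that uses only the standard `re` module. It mirrors the character classes and scores quoted in this PR, but `base_score` is a hypothetical helper written for illustration. It is not part of the PR or of Presidio's API.

```python
import re
from typing import Optional

# Letters permitted in an MBI: A-Z minus S, L, O, I, B, Z (as listed in the PR).
VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"
NUM = "[0-9]"
ALPHA = f"[{VALID_LETTERS}]"
ALPHANUM = f"[0-9{VALID_LETTERS}]"

# The three display groups: 4 + 3 + 4 characters.
GROUPS = (
    f"{NUM}{ALPHA}{ALPHANUM}{NUM}",
    f"{ALPHA}{ALPHANUM}{NUM}",
    f"{ALPHA}{ALPHA}{NUM}{NUM}",
)
NO_DASH = re.compile("".join(GROUPS))
WITH_DASH = re.compile("-".join(GROUPS))


def base_score(candidate: str) -> Optional[float]:
    """Hypothetical helper: return the PR's base score for a candidate, or None."""
    if WITH_DASH.fullmatch(candidate):
        return 0.5  # "medium": dashed display format
    if NO_DASH.fullmatch(candidate):
        return 0.3  # "weak": 11 consecutive characters
    return None


print(base_score("1EG4-TE5-MK73"))  # 0.5
print(base_score("1EG4TE5MK73"))    # 0.3
print(base_score("1SG4TE5MK73"))    # None, because S is an excluded letter
```

The dashed form presumably scores higher because the 4-3-4 grouping is less likely to occur by chance than a bare 11-character token.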

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File | Description
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_mbi_recognizer.py | Core recognizer implementation with position-specific regex patterns matching CMS MBI specification, including proper character exclusions and two scoring levels
presidio-analyzer/tests/test_us_mbi_recognizer.py | Comprehensive test suite covering valid MBI formats (with/without dashes), invalid patterns (wrong characters, positions, length), lowercase handling, and metadata validation
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/__init__.py | Updates US recognizer module exports to include UsMbiRecognizer
presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Adds UsMbiRecognizer to main predefined recognizers module exports
presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds UsMbiRecognizer configuration with enabled: false (correct default for country-specific recognizers)

Comment on lines 1 to 99
"""Recognizer for US Medicare Beneficiary Identifier (MBI)."""

from typing import List, Optional

from presidio_analyzer import Pattern, PatternRecognizer


class UsMbiRecognizer(PatternRecognizer):
    """Recognise US Medicare Beneficiary Identifier (MBI) using regex.

    The MBI is an 11-character identifier used by Medicare. The format follows
    specific rules defined by CMS (Centers for Medicare & Medicaid Services):
    https://www.cms.gov/medicare/new-medicare-card/understanding-new-medicare-beneficiary-identifier-mbi

    Format: C A AN N A AN N A A N N
    Where:
    - C = numeric character (0-9)
    - A = alphabetic character (excluding S, L, O, I, B, Z)
    - AN = alphanumeric character (numeric or alphabetic, excluding S, L, O, I, B, Z)

    Position rules:
    - Positions 1, 4, 7, 10, 11: numeric (0-9)
    - Positions 2, 5, 8, 9: alphabetic
    - Positions 3, 6: alphanumeric

    Example: 1EG4-TE5-MK73 (dashes are for display only)

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    # Valid letters: A-Z excluding S, L, O, I, B, Z
    # Valid letters are: A, C, D, E, F, G, H, J, K, M, N, P, Q, R, T, U, V, W, X, Y
    VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"
    VALID_ALPHANUMERIC = "0-9ACDEFGHJKMNPQRTUVWXY"

    # Regex building blocks
    _NUM = "[0-9]"
    _ALPHA = f"[{VALID_LETTERS}]"
    _ALPHANUM = f"[{VALID_ALPHANUMERIC}]"

    # Full MBI pattern:
    # Pos: 1   2     3        4   5     6        7   8     9     10  11
    #      NUM ALPHA ALPHANUM NUM ALPHA ALPHANUM NUM ALPHA ALPHA NUM NUM

    # Pattern without dashes (11 consecutive characters)
    _MBI_NO_DASH = (
        f"{_NUM}{_ALPHA}{_ALPHANUM}{_NUM}"
        f"{_ALPHA}{_ALPHANUM}{_NUM}"
        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
    )

    # Pattern with dashes in XXXX-XXX-XXXX format
    _MBI_WITH_DASH = (
        f"{_NUM}{_ALPHA}{_ALPHANUM}{_NUM}-"
        f"{_ALPHA}{_ALPHANUM}{_NUM}-"
        f"{_ALPHA}{_ALPHA}{_NUM}{_NUM}"
    )

    PATTERNS = [
        Pattern(
            "MBI (weak)",
            rf"\b{_MBI_NO_DASH}\b",
            0.3,
        ),
        Pattern(
            "MBI (medium)",
            rf"\b{_MBI_WITH_DASH}\b",
            0.5,
        ),
    ]

    CONTEXT = [
        "medicare",
        "mbi",
        "beneficiary",
        "cms",
        "medicaid",
        "hic",  # Health Insurance Claim number (predecessor)
        "hicn",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_MBI",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
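
To show how the word-boundary patterns above behave on free text, here is a small standard-library sketch. It rebuilds the same character classes outside Presidio and assumes case-insensitive matching, which is my understanding of Presidio's default regex flags and fits the lowercase handling mentioned in the test summary. `find_mbi_candidates` and the sample sentence are hypothetical and appear nowhere in the PR.

```python
import re
from typing import List

VALID_LETTERS = "ACDEFGHJKMNPQRTUVWXY"  # A-Z minus S, L, O, I, B, Z
N = "[0-9]"
A = f"[{VALID_LETTERS}]"
AN = f"[0-9{VALID_LETTERS}]"

G1 = f"{N}{A}{AN}{N}"  # positions 1-4
G2 = f"{A}{AN}{N}"     # positions 5-7
G3 = f"{A}{A}{N}{N}"   # positions 8-11

# Either no dashes at all or both dashes, as in the PR's two patterns.
# re.IGNORECASE is an assumption about how the recognizer is applied.
MBI = re.compile(rf"\b(?:{G1}{G2}{G3}|{G1}-{G2}-{G3})\b", re.IGNORECASE)


def find_mbi_candidates(text: str) -> List[str]:
    """Hypothetical helper: return every substring shaped like an MBI."""
    return [match.group() for match in MBI.finditer(text)]


sample = "Medicare ID 1EG4-TE5-MK73 on file; older note says 1eg4te5mk73."
print(find_mbi_candidates(sample))
# ['1EG4-TE5-MK73', '1eg4te5mk73']

# A longer token that merely contains an MBI-shaped run is skipped,
# because \b requires the match to start and end at a word boundary.
print(find_mbi_candidates("ref X1EG4TE5MK73"))
# []
```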

Copilot AI Dec 30, 2025

Missing documentation: The US_MBI entity should be added to the USA section of docs/supported_entities.md. Per CONTRIBUTING.md guidelines: "If your recognizer supports a new entity, consider updating the supported entities list". Add an entry like: |US_MBI|A US Medicare Beneficiary Identifier (MBI) with 11 alphanumeric characters.|Pattern match and context|

Comment on lines 1 to 99

Copilot AI Dec 30, 2025

Missing CHANGELOG entry: Per CONTRIBUTING.md guidelines, "Update CHANGELOG.md: Under Unreleased section, use the category which is most suitable for your change". This new recognizer should be documented in CHANGELOG.md under the "Unreleased" section, similar to how KrRrnRecognizer and ThTninRecognizer were documented.

Author

@copilot open a new pull request to apply changes based on this feedback

@chrisvoncsefalvay
Author

@copilot open a new pull request to apply changes based on the comments in this thread
