feat: add Philippines TIN (PH_TIN) recognizer #2016

Open
aaronaco wants to merge 1 commit into microsoft:main from aaronaco:feat/philippines-tin-number-recognizer
Conversation

aaronaco commented May 2, 2026

Change Description

This PR introduces the PhTinRecognizer (PH_TIN) to detect the Philippines Taxpayer Identification Number issued by the Bureau of Internal Revenue (BIR), as part of the broader proposal to add Philippines-specific PII recognizers.

Key changes included:

  • Added regex patterns to correctly identify both 9-digit (individual) and 12-digit (corporate/branch) TIN formats, including hyphenated variations (e.g., XXX-XXX-XXX and XXX-XXX-XXX-XXX).
  • Implemented the official BIR Weighted Modulo 11 check digit algorithm within validate_result so that candidates with a wrong check digit are rejected, reducing false positives.
  • Incorporated Philippine-specific context words (e.g., "TIN", "BIR", "revenue district office").
  • Added comprehensive unit tests in tests/test_ph_tin_recognizer.py verifying valid/invalid structures, checksums, and context enhancement.
  • Included the new recognizer in the context enhancer test dataset (tests/data/context_sentences_tests.txt and tests/test_context_support.py).
  • Registered PhTinRecognizer in default_recognizers.yaml, default_analyzer_full.yaml, slim.yaml, and test configurations, with enabled: false set by default, per project guidelines for country-specific recognizers.
  • Updated docs/supported_entities.md and CHANGELOG.md to reflect the new feature.
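
For readers unfamiliar with weighted modulo 11 checks, here is a minimal sketch of the validation step described above. It is not the code from this PR. The weights (9 down to 2) and the function name `is_valid_ph_tin` are assumptions: the weights were chosen because they reproduce the valid and invalid examples in this PR's test fixtures, and they have not been checked against BIR documentation.

```python
# ASSUMPTION: weights 9..2 applied to the first eight digits. They are inferred
# from the PR's test fixtures, not taken from the recognizer or from the BIR.
ASSUMED_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2)


def is_valid_ph_tin(text: str) -> bool:
    """Hypothetical helper: check the 9th digit of a 9- or 12-digit TIN."""
    digits = text.replace("-", "")
    # Accept only ASCII digits, and only the two lengths named in the PR.
    if not (digits.isascii() and digits.isdigit()):
        return False
    if len(digits) not in (9, 12):
        return False
    # Weighted sum over the first eight digits.
    total = sum(int(d) * w for d, w in zip(digits[:8], ASSUMED_WEIGHTS))
    # The remainder must equal the 9th digit. A remainder of 10 can never
    # equal a single digit, so such numbers are rejected in this sketch.
    return total % 11 == int(digits[8])


print(is_valid_ph_tin("000-123-456-000"))  # valid fixture from the tests
print(is_valid_ph_tin("000-123-457-000"))  # wrong check digit
```

For the fixture 000-123-456 the weighted sum is 1·6 + 2·5 + 3·4 + 4·3 + 5·2 = 50, and 50 mod 11 = 6, which matches the 9th digit. The branch code (the last three digits of the 12-digit form) takes no part in the check.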

Issue reference

Fixes #2015

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

aaronaco (Author) commented May 2, 2026

@microsoft-github-policy-service agree

SharonHart requested a review from Copilot on May 4, 2026 09:21

Copilot AI (Contributor) left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a new Philippines-specific PII recognizer (PH_TIN) to detect and validate Philippine Taxpayer Identification Numbers (TIN) using regex + context + checksum validation, and wires it into configuration, tests, and docs.

Changes:

  • Added PhTinRecognizer with regex patterns, PH-specific context terms, and weighted modulo-11 validation.
  • Integrated the recognizer into analyzer registries/configs and context-sentence datasets.
  • Added/updated unit tests and documentation (supported entities + changelog).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/philippines/ph_tin_recognizer.py` | Adds the PhTinRecognizer implementation (patterns, context, checksum validation). |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/philippines/__init__.py` | Exposes Philippines recognizers package exports. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py` | Registers PhTinRecognizer in the predefined recognizers public API. |
| `presidio-analyzer/tests/test_ph_tin_recognizer.py` | Adds unit tests for TIN detection + validation. |
| `presidio-analyzer/tests/test_context_support.py` | Adds PH_TIN to the context enhancer test harness and updates dataset size check. |
| `presidio-analyzer/tests/data/context_sentences_tests.txt` | Adds PH TIN context sentences for the context enhancer dataset. |
| `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml` | Registers recognizer in default recognizers list (disabled by default). |
| `presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml` | Registers recognizer in full analyzer config (disabled by default). |
| `presidio-analyzer/presidio_analyzer/conf/slim.yaml` | Registers recognizer in slim analyzer config (disabled by default). |
| `e2e-tests/resources/test_ollama_enabled_recognizers.yaml` | Registers recognizer in e2e test recognizers config (disabled by default). |
| `docs/supported_entities.md` | Documents PH_TIN as a supported entity. |
| `CHANGELOG.md` | Adds a changelog entry for the new PH_TIN recognizer. |

The 9th digit is a check digit calculated using a weighted modulo 11 algorithm.
The last 3 digits (in the 12-digit version) represent the branch code (default 000).

Format: XXX-XXX-XXX-XXX or XXXXXXXXXXXX
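
The formats in this docstring excerpt can be illustrated with a small pattern-matching sketch. This is not the pattern from the PR. The regex, the name `PH_TIN_PATTERN`, and the helper `find_tin_candidates` are hypothetical, written only to show how the 9-digit and 12-digit forms, with or without hyphens, could be located before checksum validation.

```python
import re

# ASSUMPTION: an illustrative pattern, not the one used by PhTinRecognizer.
# First alternative: hyphenated XXX-XXX-XXX with an optional -XXX branch code.
# Second alternative: 9 contiguous digits with an optional 3-digit branch code.
PH_TIN_PATTERN = re.compile(
    r"\b\d{3}-\d{3}-\d{3}(?:-\d{3})?\b|\b\d{9}(?:\d{3})?\b"
)


def find_tin_candidates(text: str) -> list[tuple[int, int]]:
    """Hypothetical helper: return (start, end) spans of TIN-shaped substrings."""
    return [(m.start(), m.end()) for m in PH_TIN_PATTERN.finditer(text)]


print(find_tin_candidates("My TIN is 000-123-456-000"))
print(find_tin_candidates("BIR TIN: 000123456"))
```

A pattern like this only finds candidates by shape. A string such as 123456789 still matches, which is why a separate check digit validation step is needed to reject it.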
Comment on lines +181 to +184
- name: PhTinRecognizer
supported_languages:
- en
type: predefined
Comment thread CHANGELOG.md
#### Added
- Canadian SIN (`CA_SIN`) recognizer for the Canadian Social Insurance Number, using regex pattern matching, context words (English and French), and Luhn checksum validation. Disabled by default.

- Philippines TIN (`PH_TIN`) recognizer for the Philippines Taxpayer Identification Number, using regex pattern matching, context words, and weighted modulo 11 checksum validation.
Comment on lines +25 to +35
("My TIN is 000-123-456-000", 1, [(10, 25)], [(0.1, 1.0)]),
("BIR TIN: 000123456", 1, [(9, 18)], [(0.1, 1.0)]),
("Tax ID: 000-123-456-001", 1, [(8, 23)], [(0.1, 1.0)]),
# Valid 9-digit with hyphens
("TIN 000-123-456", 1, [(4, 15)], [(0.1, 1.0)]),
# Invalid TINs (wrong checksum)
("Invalid TIN 000-123-457-000", 0, [], []),
("Not a TIN 123456789", 0, [], []),
# Context tests
("TIN: 000-123-456-000", 1, [(5, 20)], [(0.1, 1.0)]),
("Please use 000-123-456-000 as your ID", 1, [(11, 26)], [(0.1, 1.0)]),
if not pattern_text.isdigit():
    return False

if len(pattern_text) not in [9, 12]:
Development

Successfully merging this pull request may close these issues.

Feature Request: Add Philippines (PH) country-specific predefined recognizers

2 participants