Proposal for the detection module architecture based on the project brief.
Suggested three-tier approach:
- Regex/pattern matching for structured PII (emails, IPs, phone numbers via phonenumbers library)
- spaCy NER for names, organizations, locations
- HuggingFace transformer fallback for context-dependent cases
The phonenumbers library (a Python port of Google's libphonenumber) handles 200+ regional formats,
which reduces the false positives that raw regex produces on phone numbers.
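
To make the tier 1 idea concrete, here is a rough sketch of "match with a loose pattern, then validate" for emails and IPv4 addresses using only the standard library. The names `Finding` and `detect_structured` are hypothetical, and the patterns are deliberately simplified for illustration, so treat this as a starting point rather than the module's actual design.

```python
import ipaddress
import re
from dataclasses import dataclass
from typing import List

# Simplified illustrative patterns; not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_CANDIDATE_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


@dataclass(frozen=True)
class Finding:
    """Hypothetical result type: what was found and where."""
    kind: str
    text: str
    start: int
    end: int


def detect_structured(text: str) -> List[Finding]:
    """Tier 1 sketch: regex proposes candidates, a validator confirms them."""
    findings = []
    for m in EMAIL_RE.finditer(text):
        findings.append(Finding("EMAIL", m.group(), m.start(), m.end()))
    for m in IPV4_CANDIDATE_RE.finditer(text):
        try:
            # Rejects look-alikes such as 999.1.1.1 that the regex lets through.
            ipaddress.IPv4Address(m.group())
        except ValueError:
            continue
        findings.append(Finding("IPV4", m.group(), m.start(), m.end()))
    return sorted(findings, key=lambda f: f.start)


sample = "Contact alice@example.com from 192.168.1.10, not 999.1.1.1"
for finding in detect_structured(sample):
    print(finding.kind, finding.text)
```

Phone numbers would follow the same shape, with the validation step delegated to the phonenumbers library (for example via its `PhoneNumberMatcher`) instead of a hand-written check.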
Detection tier selection could be configurable per policy:
real-time middleware prioritizes speed (tiers 1 and 2),
while batch analysis can trade latency for higher recall (all three tiers).
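
A minimal sketch of how per-policy tier selection might look. Everything here (`Tier`, `Policy`, `run_detection`, the policy names, and the stub detectors) is hypothetical and only meant to show the shape of the configuration; real detectors would wrap the regex, spaCy, and transformer tiers.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List, Tuple


class Tier(IntEnum):
    PATTERN = 1      # regex / phonenumbers
    NER = 2          # spaCy
    TRANSFORMER = 3  # HuggingFace fallback


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a name plus the ordered tiers it enables."""
    name: str
    tiers: Tuple[Tier, ...]


# Assumed policy definitions matching the two use cases in this proposal.
REALTIME = Policy("realtime-middleware", (Tier.PATTERN, Tier.NER))
BATCH = Policy("batch-analysis", (Tier.PATTERN, Tier.NER, Tier.TRANSFORMER))

Detector = Callable[[str], List[str]]

# Stub detectors that only report which tier ran; placeholders for real ones.
STUB_DETECTORS: Dict[Tier, Detector] = {
    tier: (lambda text, t=tier: [t.name]) for tier in Tier
}


def run_detection(text: str, policy: Policy,
                  detectors: Dict[Tier, Detector]) -> List[str]:
    """Run only the tiers the policy enables, in order, and merge results."""
    results: List[str] = []
    for tier in policy.tiers:
        results.extend(detectors[tier](text))
    return results


print(run_detection("some request body", REALTIME, STUB_DETECTORS))
print(run_detection("some archived log", BATCH, STUB_DETECTORS))
```

Keeping the policy as plain data means it could later be loaded from a config file without touching the detection code.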
Happy to prototype this if mentors want a PoC.