Summary
DEFAULT_ENTITY_LABELS in src/anonymizer/engine/constants.py currently mixes two fundamentally different kinds of entities: content entities (values that directly identify a person) and structural entities (format types, categories, or metadata that don't inherently carry identifying information). This blurs what the pipeline is actually trying to protect and can cause over-detection, noise in the sensitivity disposition, and poor GLiNER performance on genuinely sensitive content.
Current list
The list currently includes, among others:
| Category | Labels |
| --- | --- |
| Personal identifiers | first_name, last_name, date_of_birth, ssn, national_id, age |
| Contact | email, phone_number, fax_number, street_address, postcode, city, state, country |
| Financial | credit_debit_card, cvv, account_number, bank_routing_number, tax_id, monetary_amount |
| Professional | occupation, employee_id, company_name, university, degree, field_of_study |
| Medical | medical_record_number, health_plan_beneficiary_number, blood_type, biometric_identifier |
| Digital | api_key, password, device_identifier, ipv4, ipv6, mac_address, http_cookie, url |
| Credentials | certificate_license_number, license_plate, vehicle_identifier, unique_id, customer_id, user_name, pin |
| Demographic | gender, sexuality, race_ethnicity, religious_belief, political_view, language, nationality, employment_status, education_level |
| Location | coordinate, landmark, place_name, court_name, prison_detention_facility, organization_name |
| Temporal | date, date_time, time |
The problem
Several of these are structural — they describe a format or category rather than an identifying value:
monetary_amount — a number format, not an identity marker
date, date_time, time — extremely generic; not PII unless combined with other context
url — usually a website reference, not personal
coordinate, landmark, place_name — descriptive/geographic, not inherently identifying
language, nationality — demographic metadata, rarely a direct identifier
court_name, prison_detention_facility — institutional names, not personal content
Including these inflates the entity list passed to GLiNER and the augmenter, which contributes to false positives (e.g. the known false-positive issues with age), dilutes the sensitivity disposition signal, and makes the replacement map noisy.
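To make the distinction concrete, here is a minimal sketch that partitions a label list against the structural labels named above and only hands the structural ones back on request. Everything except the label strings themselves is an assumption for illustration: `STRUCTURAL_ENTITY_LABELS`, `split_labels`, `resolve_labels`, and the `include_structural` flag are hypothetical names, not existing code in `constants.py`.

```python
# Hypothetical sketch: these names are illustrative assumptions,
# not the current contents of src/anonymizer/engine/constants.py.

# The labels this issue flags as structural, in the order listed above.
STRUCTURAL_ENTITY_LABELS = (
    "monetary_amount",
    "date", "date_time", "time",
    "url",
    "coordinate", "landmark", "place_name",
    "language", "nationality",
    "court_name", "prison_detention_facility",
)


def split_labels(labels):
    """Partition labels into (content, structural), preserving input order."""
    structural_set = set(STRUCTURAL_ENTITY_LABELS)
    content = [label for label in labels if label not in structural_set]
    structural = [label for label in labels if label in structural_set]
    return content, structural


def resolve_labels(labels, include_structural=False):
    """Return content labels, appending structural ones only when opted in."""
    content, structural = split_labels(labels)
    return content + structural if include_structural else content


# Abbreviated stand-in for the real default list.
sample = ["first_name", "ssn", "date", "url", "email", "monetary_amount"]

print(resolve_labels(sample))
print(resolve_labels(sample, include_structural=True))
```

With a default of `include_structural=False`, callers that do nothing get only the content labels, and a mode that genuinely needs dates or URLs has to ask for them explicitly.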
Proposed approach
DEFAULT_ENTITY_LABELS: either drop the structural labels, make them opt-in, or move them to a separate STRUCTURAL_ENTITY_LABELS list for use only in specific modes.
Related