refactor: audit and restructure DEFAULT_ENTITY_LABELS — separate content from structural entities #112

@lipikaramaswamy

Summary

DEFAULT_ENTITY_LABELS in src/anonymizer/engine/constants.py currently mixes two fundamentally different kinds of entities: content entities (values that directly identify a person) and structural entities (format types, categories, or metadata that don't inherently carry identifying information). This blurs what the pipeline is actually trying to protect and can cause over-detection, noise in the sensitivity disposition, and poor GLiNER performance on genuinely sensitive content.

Current list

The list currently includes, among others:

| Category | Labels |
| --- | --- |
| Personal identifiers | first_name, last_name, date_of_birth, ssn, national_id, age |
| Contact | email, phone_number, fax_number, street_address, postcode, city, state, country |
| Financial | credit_debit_card, cvv, account_number, bank_routing_number, tax_id, monetary_amount |
| Professional | occupation, employee_id, company_name, university, degree, field_of_study |
| Medical | medical_record_number, health_plan_beneficiary_number, blood_type, biometric_identifier |
| Digital | api_key, password, device_identifier, ipv4, ipv6, mac_address, http_cookie, url |
| Credentials | certificate_license_number, license_plate, vehicle_identifier, unique_id, customer_id, user_name, pin |
| Demographic | gender, sexuality, race_ethnicity, religious_belief, political_view, language, nationality, employment_status, education_level |
| Location | coordinate, landmark, place_name, court_name, prison_detention_facility, organization_name |
| Temporal | date, date_time, time |

The problem

Several of these are structural — they describe a format or category rather than an identifying value:

  • monetary_amount — a number format, not an identity marker
  • date, date_time, time — extremely generic; not PII unless combined with other context
  • url — usually a website reference, not personal
  • coordinate, landmark, place_name — descriptive/geographic, not inherently identifying
  • language, nationality — demographic metadata, rarely a direct identifier
  • court_name, prison_detention_facility — institutional names, not personal content

Including these inflates the entity list passed to GLiNER and the augmenter, which contributes to false positives (e.g. known age FP issues), dilutes the sensitivity disposition signal, and makes the replacement map noisy.
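To make the scope of the problem concrete, here is a rough sketch of how the flagged labels could be tallied per category. It only covers the categories above that contain a flagged label; `LABELS_BY_CATEGORY`, `FLAGGED_STRUCTURAL`, and `audit` are hypothetical names for illustration, not anything that exists in the repo.

```python
# Excerpt of the labels listed in this issue, limited to the categories
# that contain at least one label flagged as structural.
LABELS_BY_CATEGORY = {
    "Financial": ["credit_debit_card", "cvv", "account_number",
                  "bank_routing_number", "tax_id", "monetary_amount"],
    "Digital": ["api_key", "password", "device_identifier", "ipv4",
                "ipv6", "mac_address", "http_cookie", "url"],
    "Demographic": ["gender", "sexuality", "race_ethnicity",
                    "religious_belief", "political_view", "language",
                    "nationality", "employment_status", "education_level"],
    "Location": ["coordinate", "landmark", "place_name", "court_name",
                 "prison_detention_facility", "organization_name"],
    "Temporal": ["date", "date_time", "time"],
}

# The labels this issue calls out as structural (hypothetical constant name).
FLAGGED_STRUCTURAL = frozenset({
    "monetary_amount", "date", "date_time", "time", "url",
    "coordinate", "landmark", "place_name",
    "language", "nationality",
    "court_name", "prison_detention_facility",
})


def audit(labels_by_category, structural):
    """Return {category: (content_labels, structural_labels)}."""
    report = {}
    for category, labels in labels_by_category.items():
        content = [label for label in labels if label not in structural]
        flagged = [label for label in labels if label in structural]
        report[category] = (content, flagged)
    return report


report = audit(LABELS_BY_CATEGORY, FLAGGED_STRUCTURAL)
for category, (content, flagged) in report.items():
    print(f"{category}: keep {len(content)}, flagged {len(flagged)} {flagged}")
```

By this tally the entire Temporal category and five of the six Location labels are flagged, which suggests the final audit may want to look at whole categories and not only individual labels.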

Proposed approach

  • Audit the full list and classify each label as content (directly identifies a person) vs. structural (format, category, or metadata)
  • Move structural labels out of DEFAULT_ENTITY_LABELS — either drop them, make them opt-in, or move to a separate STRUCTURAL_ENTITY_LABELS list for use only in specific modes
  • Document the rationale for each label's inclusion in the default list
  • Update GLiNER calls, augmenter prompt, and fix-GLiNER prompt to reflect the refined list
  • Re-evaluate age FP rate and monetary_amount noise after the change
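One possible shape for the restructured constants, as a minimal sketch only. It assumes each label carries a kind and a documented rationale, and that structural labels are opt-in. `LabelSpec`, `LABEL_SPECS`, and `resolve_entity_labels` are hypothetical names, and only a handful of labels are shown; the real classification would come out of the audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelSpec:
    name: str
    kind: str       # "content" or "structural"
    rationale: str  # why the label is (or is not) in the default list


# Illustrative subset only; the rationales paraphrase this issue.
LABEL_SPECS = (
    LabelSpec("ssn", "content", "directly identifies a person"),
    LabelSpec("email", "content", "directly identifies a person"),
    LabelSpec("monetary_amount", "structural",
              "a number format, not an identity marker"),
    LabelSpec("date", "structural",
              "generic; not PII unless combined with other context"),
    LabelSpec("url", "structural",
              "usually a website reference, not personal"),
)

DEFAULT_ENTITY_LABELS = tuple(
    spec.name for spec in LABEL_SPECS if spec.kind == "content"
)
STRUCTURAL_ENTITY_LABELS = tuple(
    spec.name for spec in LABEL_SPECS if spec.kind == "structural"
)


def resolve_entity_labels(include_structural=False, opt_in=()):
    """Build the label list to hand to detection.

    Content labels are always included. Structural labels are added
    only when explicitly requested, either all at once or by name.
    """
    unknown = set(opt_in) - set(STRUCTURAL_ENTITY_LABELS)
    if unknown:
        raise ValueError(f"not structural labels: {sorted(unknown)}")
    labels = list(DEFAULT_ENTITY_LABELS)
    labels.extend(
        name for name in STRUCTURAL_ENTITY_LABELS
        if include_structural or name in opt_in
    )
    return labels
```

With this layout, `resolve_entity_labels()` yields only the content labels, while `resolve_entity_labels(opt_in=("date",))` adds `date` for a mode that needs it. Keeping the rationale next to each label would also cover the documentation item above, though whether it lives in code or in separate docs is an open choice.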
