refactor: audit and restructure DEFAULT_ENTITY_LABELS — separate content from structural entities

## Summary

`DEFAULT_ENTITY_LABELS` in `src/anonymizer/engine/constants.py` currently mixes two fundamentally different kinds of entities: **content entities** (values that directly identify a person) and **structural entities** (format types, categories, or metadata that don't inherently carry identifying information). This blurs what the pipeline is actually trying to protect and can cause over-detection, noise in the sensitivity disposition, and poor GLiNER performance on genuinely sensitive content.

## Current list

The list currently includes, among others:

| Category | Labels |
|----------|--------|
| Personal identifiers | `first_name`, `last_name`, `date_of_birth`, `ssn`, `national_id`, `age` |
| Contact | `email`, `phone_number`, `fax_number`, `street_address`, `postcode`, `city`, `state`, `country` |
| Financial | `credit_debit_card`, `cvv`, `account_number`, `bank_routing_number`, `tax_id`, `monetary_amount` |
| Professional | `occupation`, `employee_id`, `company_name`, `university`, `degree`, `field_of_study` |
| Medical | `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`, `biometric_identifier` |
| Digital | `api_key`, `password`, `device_identifier`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `url` |
| Credentials | `certificate_license_number`, `license_plate`, `vehicle_identifier`, `unique_id`, `customer_id`, `user_name`, `pin` |
| Demographic | `gender`, `sexuality`, `race_ethnicity`, `religious_belief`, `political_view`, `language`, `nationality`, `employment_status`, `education_level` |
| Location | `coordinate`, `landmark`, `place_name`, `court_name`, `prison_detention_facility`, `organization_name` |
| Temporal | `date`, `date_time`, `time` |

## The problem

Several of these are **structural** — they describe a format or category rather than an identifying value:

- `monetary_amount` — a number format, not an identity marker
- `date`, `date_time`, `time` — extremely generic; not PII unless combined with other context
- `url` — usually a website reference, not personal
- `coordinate`, `landmark`, `place_name` — descriptive/geographic, not inherently identifying
- `language`, `nationality` — demographic metadata, rarely a direct identifier
- `court_name`, `prison_detention_facility` — institutional names, not personal content

Including these inflates the entity list passed to GLiNER and the augmenter, which contributes to false positives (e.g. known age FP issues), dilutes the sensitivity disposition signal, and makes the replacement map noisy.

## Proposed approach

- [ ] Audit the full list and classify each label as **content** (directly identifies a person) vs. **structural** (format, category, or metadata)
- [ ] Move structural labels out of `DEFAULT_ENTITY_LABELS` — either drop them, make them opt-in, or move to a separate `STRUCTURAL_ENTITY_LABELS` list for use only in specific modes
- [ ] Document the rationale for each label's inclusion in the default list
- [ ] Update GLiNER calls, augmenter prompt, and fix-GLiNER prompt to reflect the refined list
- [ ] Re-evaluate age FP rate and monetary_amount noise after the change
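
As a rough illustration of the "separate list, opt-in" option above, the split could look something like the sketch below. The content labels shown are only a small subset of the current list, the structural labels are the ones called out in "The problem", and `resolve_entity_labels` with its `include_structural` / `extra_labels` parameters is a hypothetical helper, not something that exists in `constants.py` today.

```python
from typing import Iterable

# Content entities: values that directly identify a person.
# Illustrative subset only; the full audited list would live in constants.py.
DEFAULT_ENTITY_LABELS: tuple[str, ...] = (
    "first_name",
    "last_name",
    "date_of_birth",
    "ssn",
    "email",
    "phone_number",
)

# Structural entities: formats, categories, or metadata (from "The problem").
STRUCTURAL_ENTITY_LABELS: tuple[str, ...] = (
    "monetary_amount",
    "date",
    "date_time",
    "time",
    "url",
    "coordinate",
    "landmark",
    "place_name",
    "language",
    "nationality",
    "court_name",
    "prison_detention_facility",
)


def resolve_entity_labels(
    include_structural: bool = False,
    extra_labels: Iterable[str] = (),
) -> list[str]:
    """Hypothetical helper: build the label list for one pipeline run.

    Structural labels are opt-in; caller-supplied labels are appended last.
    Duplicates are dropped while preserving first-seen order.
    """
    labels = list(DEFAULT_ENTITY_LABELS)
    if include_structural:
        labels.extend(STRUCTURAL_ENTITY_LABELS)
    labels.extend(extra_labels)
    return list(dict.fromkeys(labels))
```

With this shape, the GLiNER call and both prompts would all read from `resolve_entity_labels(...)` rather than from the constant directly, so a mode that needs dates or URLs can opt in without changing the default behaviour.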

## Related
- Age FP issues addressed in PR #50 — root cause may partly be structural labels polluting the GLiNER input
- #46 (domain-aware hyphen handling) is downstream of this
