False Positives: Standard UUIDs misclassified as `secret` or `account_number` #34

### Description
The Privacy Filter model misclassifies standard UUIDs (v4 and other versions) as sensitive information in three of the four contexts we tested. The assigned label (`secret` or `account_number`) depends on the surrounding context.

While the documentation notes that the model can over-redact high-entropy strings, UUIDs are standardized identifiers (RFC 9562, formerly RFC 4122) that appear throughout logs, database records, and API responses and are generally not sensitive on their own. The current behavior produces frequent false-positive `secret` and `account_number` redactions when processing ordinary application logs or JSON traces.

### Reproduction / Evidence
We ran a smoke test using `openai/privacy-filter` via `transformers` `pipeline("token-classification")` to observe how different contexts affect the model's classification of UUIDs.

**1. Bare Sentence Context:**
Text: `"User 123e4567-e89b-12d3-a456-426614174000 just logged in from 192.168.1.1."`
**Result:** The UUID is flagged as a **`secret`** (scores ranging from ~0.60 to 0.99 for its sub-tokens).

**2. JSON / Structured Log Context:**
Text: `{"log_level": "DEBUG", "trace_id": "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d", "msg": "connection established"}`
**Result:** The entire UUID is flagged as an **`account_number`** with high confidence (scores > 0.99).

**3. Mixed Secret / UUID Context:**
Text: `"The api_key is sk-1234567890abcdef and the user_uuid is d74251da-5847-4e67-9c6f-78d10b898bd0."`
**Result:** Both the API key and the UUID are flagged as **`secret`** with a reported confidence score of **1.0000**. The model does not differentiate between the actual secret token and the benign UUID identifier.

**4. Standard Log Line (Not Flagged):**
Text: `"[INFO] request_id=550e8400-e29b-41d4-a716-446655440000 status=200 action=payment_processed"`
**Result:** In this specific format, the model correctly ignores the UUID.
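
For anyone who wants to rerun the four cases above, here is a minimal sketch of a harness. It is not the exact script we used. The helper names (`labels_on_uuids`, `run`) are made up for illustration, and it assumes the pipeline returns character offsets (`start`/`end`) and that labels may carry a BIO-style prefix such as `B-`, which we have not verified for this model.

```python
import re

# Canonical 8-4-4-4-12 hex layout; matches any UUID version.
UUID_RE = re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")

# The four inputs from this report.
CASES = {
    "bare_sentence": "User 123e4567-e89b-12d3-a456-426614174000 just logged in from 192.168.1.1.",
    "json_log": '{"log_level": "DEBUG", "trace_id": "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d", "msg": "connection established"}',
    "mixed": "The api_key is sk-1234567890abcdef and the user_uuid is d74251da-5847-4e67-9c6f-78d10b898bd0.",
    "log_line": "[INFO] request_id=550e8400-e29b-41d4-a716-446655440000 status=200 action=payment_processed",
}


def labels_on_uuids(text, predictions):
    """Map each UUID in `text` to the sorted labels of predictions overlapping it."""
    found = {}
    for m in UUID_RE.finditer(text):
        labels = set()
        for p in predictions:
            if p.get("start") is None or p.get("end") is None:
                continue  # no offsets available (e.g. slow tokenizer)
            if p["start"] < m.end() and p["end"] > m.start():
                label = p.get("entity_group") or p.get("entity") or ""
                # Assumption: strip a BIO/BIOES prefix if one is present.
                labels.add(re.sub(r"^[BIES]-", "", label))
        found[m.group()] = sorted(labels)
    return found


def run(classifier=None):
    """Classify every case; loads the real model only if no classifier is given."""
    if classifier is None:
        from transformers import pipeline  # imported lazily on purpose
        classifier = pipeline("token-classification", model="openai/privacy-filter")
    return {name: labels_on_uuids(text, classifier(text)) for name, text in CASES.items()}
```

Calling `run()` with no argument loads the model through `transformers`; passing any callable that returns pipeline-style dicts lets the helper be exercised without downloading anything.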

### Impact
When applying this filter in a production environment to strip PII or secrets from application logs, the false positive rate on UUIDs makes it difficult to use without heavy post-processing or maintaining allowlists for common UUID patterns.
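
To give a sense of the post-processing this currently requires, below is a rough sketch of a UUID allowlist filter applied to the pipeline output. The function name `drop_uuid_predictions` is hypothetical, and it assumes each prediction is a dict with `start`/`end` character offsets and an `entity` or `entity_group` label, as `transformers` token-classification pipelines typically return.

```python
import re

# Canonical 8-4-4-4-12 hex layout; matches any UUID version.
UUID_RE = re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")


def drop_uuid_predictions(text, predictions, labels=("secret", "account_number")):
    """Discard predictions with the given labels that lie entirely inside a UUID."""
    uuid_spans = [m.span() for m in UUID_RE.finditer(text)]
    kept = []
    for p in predictions:
        label = p.get("entity_group") or p.get("entity") or ""
        label = re.sub(r"^[BIES]-", "", label)  # assumption: optional BIO prefix

        # Token offsets may include surrounding whitespace; trim it first.
        span_text = text[p["start"]:p["end"]]
        start = p["start"] + len(span_text) - len(span_text.lstrip())
        end = p["end"] - (len(span_text) - len(span_text.rstrip()))

        inside_uuid = any(s <= start and end <= e for s, e in uuid_spans)
        if inside_uuid and label in labels:
            continue  # treat as a false positive on a benign identifier
        kept.append(p)
    return kept
```

A filter like this keeps the real API key in case 3 while dropping the UUID span, but it also hides any UUID that genuinely is a credential, which is why a fix in the model itself would be preferable.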

### Suggestion
It would be highly beneficial if future versions of the model (or the tokenizer/pipeline logic) could recognize the standard 8-4-4-4-12 hex format of UUIDs and assign lower confidence to `secret` or `account_number` labels for such spans, unless the value is explicitly introduced by a sensitive key (for example `api_key` or `token`).
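
As an illustration only, the kind of rule we have in mind could look like the sketch below, written as a post-hoc score adjustment rather than a change to the model. Everything here is an assumption for discussion: the function name `adjusted_score`, the list of sensitive key names, and the `factor` of 0.5 are not taken from the project.

```python
import re

# Whole-span match for the 8-4-4-4-12 hex layout.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")

# Hypothetical list of key names that should keep a UUID redacted.
# Matches a key directly before the value, e.g. `api_key=`, `"token": "`, `secret is `.
SENSITIVE_KEY_RE = re.compile(
    r"(?:api[_-]?key|secret|token|password)[\"']?\s*(?:[:=]|is)?\s*[\"']?$",
    re.IGNORECASE,
)


def adjusted_score(text, start, end, score, factor=0.5):
    """Scale down the score of a UUID-shaped span unless a sensitive key precedes it."""
    if not UUID_RE.fullmatch(text[start:end]):
        return score  # not a UUID: leave the model's score alone
    if SENSITIVE_KEY_RE.search(text[:start]):
        return score  # UUID used as a credential: keep full confidence
    return score * factor
```

With this rule, the UUID after `user_uuid is` in case 3 would have its score halved, while the same UUID after `api_key=` would keep its original score.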
