2 changes: 1 addition & 1 deletion README.md
@@ -276,7 +276,7 @@ deidentify(text, method="shift_dates", date_shift_days=180)
<sub><b>Batch processing</b> — up to <b>3.3×</b> higher throughput on CPU and <b>2.2×</b> on MLX vs. one document at a time.</sub>
</div>

[Complete PII notebook](examples/notebooks/PII_Detection_Complete_Guide.ipynb) · [Smart merging](docs/pii-smart-merging.md) · [Anonymization](docs/anonymization.md)
[Complete PII notebook](examples/notebooks/PII_Detection_Complete_Guide.ipynb) · [Smart merging](docs/pii-smart-merging.md) · [Anonymization quickstart](docs/anonymization.md#quickstart-choosing-a-method)

<details>
<summary><b>Privacy Filter family</b> — three model families on the OpenAI Privacy Filter architecture</summary>
107 changes: 106 additions & 1 deletion docs/anonymization.md
@@ -12,7 +12,112 @@ PII entities:
| `shift_dates` | Dates only — shifted by N days | You want to preserve relative time. |

This document focuses on `replace`, which was upgraded in v1.3.0 to a full
Faker-backed obfuscation engine.
Faker-backed obfuscation engine. If you just want to compare all five
methods side by side, start with the quickstart below.

## Quickstart: choosing a method

### `mask` — clear placeholders

```python
from openmed import deidentify

result = deidentify(
    "Patient John Doe (DOB: 01/15/1970) called from 555-1234",
    method="mask",
)
print(result.deidentified_text)
# Patient [first_name] [last_name] (DOB: [date]) called from [phone_number]
```

Placeholder names come from the model's own entity labels, so they vary by
model (the default `OpenMed-PII-SuperClinical-Small-44M-v1` model used here
splits names into `first_name`/`last_name` rather than a single `NAME`).

Masking is not reversible by itself. Pass `keep_mapping=True` and use
`reidentify()` (see below) if you need to restore the original text later.

### `remove` — delete PII entirely

```python
result = deidentify("Call 555-1234", method="remove")
print(repr(result.deidentified_text))
# 'Call '
```

Use this when you don't need positional alignment with the original text
(e.g. exporting de-identified text for search indexing).

### `replace` — realistic fake surrogates

```python
result = deidentify(
    "Email: test@example.com",
    method="replace",
    consistent=True,
    seed=42,
)
print(result.deidentified_text)
# Email: asnyder@example.com
```

Best for sharing data with downstream tools that expect well-formed values
(e.g. an email field that should still look like an email). See
[The new `replace` engine](#the-new-replace-engine) below for locale and
determinism options.

### `hash` — consistent, irreversible digests

```python
result = deidentify("Patient John Doe", method="hash")
print(result.deidentified_text)
# Patient first_name_a8cfcd74 last_name_fd53ef83
```

The same input always produces the same digest, so repeated mentions of
the same value can be linked across documents without storing the
original anywhere.
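
The linking property can be illustrated with the standard library alone. This
is a minimal sketch, not OpenMed's implementation: the `pseudonymize` helper is
hypothetical, and the choice of SHA-256 truncated to 8 hex characters is an
assumption made only to mirror the shape of the output above.

```python
import hashlib


def pseudonymize(label: str, value: str, length: int = 8) -> str:
    """Hypothetical helper: entity label plus a truncated SHA-256 digest."""
    # Assumption: SHA-256 over the UTF-8 bytes; the library may hash differently.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"{label}_{digest[:length]}"


first = pseudonymize("first_name", "John")
again = pseudonymize("first_name", "John")
other = pseudonymize("first_name", "Jane")

# Identical inputs yield identical tokens; different inputs yield different ones.
print(first == again, first == other)
```

Because the token depends only on the input value, two documents that mention
the same name produce the same token and can be joined on it.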

### `shift_dates` — preserve intervals, hide absolute dates

```python
result = deidentify(
    "DOB 01/15/2020",
    method="shift_dates",
    date_shift_days=30,
)
print(result.deidentified_text)
# DOB [date]
```

The intent is for every date in a document to shift by the same offset, so
durations between dates (e.g. "3 days after admission") stay correct. With
the default English model, however, dates currently get masked instead of
shifted — the model's raw label for dates is lowercase `date`, but the
redaction code only shifts entities labeled exactly `DATE`. Tracked in
#408.
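
To make the intended behavior concrete, here is a minimal sketch of
interval-preserving date shifting using only `datetime`. It does not call
OpenMed; the `shift_date` helper and the fixed `MM/DD/YYYY` format are
assumptions for illustration.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%m/%d/%Y"  # assumed input format for this sketch


def shift_date(text: str, days: int) -> str:
    """Hypothetical helper: shift one MM/DD/YYYY date by a fixed offset."""
    shifted = datetime.strptime(text, DATE_FORMAT) + timedelta(days=days)
    return shifted.strftime(DATE_FORMAT)


admission = shift_date("01/15/2020", 30)
discharge = shift_date("01/18/2020", 30)

# Both dates move by the same offset, so the gap between them is unchanged.
gap = datetime.strptime(discharge, DATE_FORMAT) - datetime.strptime(admission, DATE_FORMAT)
print(admission, discharge, gap.days)
# 02/14/2020 02/17/2020 3
```

Applying one offset per document is what keeps the 3-day gap intact; a
different offset per date would hide the absolute dates but break the interval.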

### Reversing a de-identification: `reidentify()`

Pass `keep_mapping=True` to get back a `mapping` you can hand to
`reidentify()` later:

```python
from openmed import deidentify, reidentify

text = "Dr. Alice Smith met Bob Jones today"
result = deidentify(text, method="mask", keep_mapping=True)
print(result.deidentified_text)
# Dr. [first_name] [last_name] met [first_name_2] [last_name_2] today

restored = reidentify(result.deidentified_text, result.mapping)
assert restored == text
```

Repeated entities of the same type (two `first_name`s above) get numbered
placeholders (`[first_name]`, `[first_name_2]`, ...) so each one maps back to
its own original value. Round-tripping repeated types was a known limitation
(#204), fixed by #222; `reidentify()` now restores the text correctly even
when a type repeats.
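
The numbering scheme can be sketched without the library. The code below is an
illustration under stated assumptions, not OpenMed's code: `mask_with_mapping`
and `restore` are hypothetical helpers, and the entity spans are supplied by
hand instead of coming from a model.

```python
from collections import Counter


def mask_with_mapping(text, entities):
    """Hypothetical helper. entities: sorted, non-overlapping (start, end, label)."""
    counts = Counter()
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, label in entities:
        counts[label] += 1
        n = counts[label]
        # First occurrence keeps the bare label; later ones get a numeric suffix.
        placeholder = f"[{label}]" if n == 1 else f"[{label}_{n}]"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(masked, mapping):
    """Hypothetical helper: substitute each placeholder with its original value."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked


text = "Dr. Alice Smith met Bob Jones today"
spans = [(4, 9, "first_name"), (10, 15, "last_name"),
         (20, 23, "first_name"), (24, 29, "last_name")]

masked, mapping = mask_with_mapping(text, spans)
print(masked)
# Dr. [first_name] [last_name] met [first_name_2] [last_name_2] today
print(restore(masked, mapping) == text)
# True
```

Each placeholder is a distinct key in the mapping, so "Alice" and "Bob" cannot
be confused on the way back.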

## The new `replace` engine

18 changes: 16 additions & 2 deletions docs/getting-started.md
@@ -55,14 +55,28 @@ Prefer a quick script entrypoint? Run a one-file smoke script:
uv run python examples/pii_model_comparison.py
```

## 3. Copy code snippets from the docs
## 3. De-identify PII

```python
from openmed import deidentify

result = deidentify("Patient John Doe, DOB 01/15/1970", method="mask")
print(result.deidentified_text)
# Patient [first_name] [last_name], DOB [date]
```

`deidentify()` supports five methods (`mask`, `remove`, `replace`, `hash`,
`shift_dates`) — see the [Anonymization quickstart](anonymization.md#quickstart-choosing-a-method)
for a runnable example of each, plus how to reverse one with `reidentify()`.

## 4. Copy code snippets from the docs

All code blocks ship with Material for MkDocs copy buttons. Invoking the command palette (`/` or `cmd/ctrl + K`) lets you
search for “GLiNER,” “OpenMedConfig,” or “token classification,” then copy the snippet that appears in the preview pane.
If you rely on AI copilots (ChatGPT, Copilot, etc.), point them at the published docs URL so they crawl the same
structured Markdown and surface canonical answers.

## 4. Optional: pin configuration
## 5. Optional: pin configuration

```python
from openmed.core import OpenMedConfig, ModelLoader