docs: add per-method de-identification quickstart #409

Open
abdouloued wants to merge 3 commits into maziyarpanahi:master from abdouloued:fix/pii-method-quickstart-docs

Conversation

@abdouloued

Closes #282.

What

Adds a "Quickstart: choosing a method" section to docs/anonymization.md with a runnable example for each deidentify() method (mask, remove, replace, hash, shift_dates), plus a reidentify() round-trip example. Links to it from docs/getting-started.md.
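For readers skimming this PR without the docs open, here is a rough, self-contained sketch of what the five strategies do conceptually. It does not call OpenMed at all: toy_deidentify, the hard-coded spans, the "name" label, and the surrogate values are all hypothetical and exist only for illustration. The real deidentify() gets its spans and labels from the model, and its actual outputs are the ones documented in docs/anonymization.md.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical pre-labelled spans (start, end, label). In the real library
# these would come from the PII model; here they are hard-coded.
TEXT = "John Doe visited on 2024-01-15."
SPANS = [(0, 8, "name"), (20, 30, "date")]

# Hypothetical fixed surrogates used by the toy "replace" strategy.
SURROGATES = {"name": "Jane Roe", "date": "2000-01-01"}


def toy_deidentify(text, spans, method, shift_days=30):
    """Apply one toy redaction strategy to every labelled span."""
    out = text
    # Work right-to-left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(spans, reverse=True):
        original = out[start:end]
        if method == "mask":
            new = f"[{label}]"
        elif method == "remove":
            new = ""
        elif method == "replace":
            new = SURROGATES[label]
        elif method == "hash":
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            new = f"{label}_{digest}"
        elif method == "shift_dates":
            if label == "date":
                parsed = datetime.strptime(original, "%Y-%m-%d")
                new = (parsed + timedelta(days=shift_days)).strftime("%Y-%m-%d")
            else:
                new = f"[{label}]"  # non-date spans are masked in this toy
        else:
            raise ValueError(f"unknown method: {method}")
        out = out[:start] + new + out[end:]
    return out


print(toy_deidentify(TEXT, SPANS, "mask"))         # [name] visited on [date].
print(toy_deidentify(TEXT, SPANS, "shift_dates"))  # [name] visited on 2024-02-14.
```

The toy keeps every strategy in one function purely to make the differences easy to compare side by side; it says nothing about how the library structures its own redaction code.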

Every example output was verified by actually running it against the real default model (OpenMed-PII-SuperClinical-Small-44M-v1), not just read from source or checked against mocked unit tests.

Note on #204

The issue asked for a callout documenting the reidentify round-trip limitation tied to #204. That limitation appears to be fixed already (test_roundtrip_two_persons_mask, commit eb10454, "Closes #222 (addresses #204)"). I documented it as resolved, with a forward link to #204/#222, instead of as a current limitation. This was already discussed with @maziyarpanahi on the issue.
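As background on why a round trip can be fragile when an entity type repeats, the sketch below shows one plausible approach: giving each masked span its own numbered placeholder so that two people in the same text map back to distinct originals. This is an illustration only. The function names, the "name" label, and the placeholder format are hypothetical, and nothing here describes what commit eb10454 actually changed.

```python
def mask_with_mapping(text, spans):
    """Mask spans with numbered placeholders; return (masked_text, mapping)."""
    counters = {}
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Number placeholders per label so repeated types stay distinguishable.
        counters[label] = counters.get(label, 0) + 1
        placeholder = f"[{label}_{counters[label]}]"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(masked, mapping):
    """Substitute every placeholder with the original value it stands for."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked


text = "Alice met Bob."
spans = [(0, 5, "name"), (10, 13, "name")]  # hypothetical hard-coded spans

masked, mapping = mask_with_mapping(text, spans)
print(masked)                    # [name_1] met [name_2].
print(restore(masked, mapping))  # Alice met Bob.
```

If both spans shared a single "[name]" placeholder instead, a plain dictionary could hold only one original value for it, and the restored text could not tell the two people apart.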

Bonus find

While verifying the shift_dates example against the real model, I found it silently falls back to masking instead of shifting dates for the default English model (entity label stays lowercase date, but the redaction code only matches uppercase DATE). Filed separately as #408 and linked from the doc rather than showing a fabricated shifted-date output.
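To make the failure mode concrete, here is a minimal sketch of how a case-sensitive label check can silently fall back to masking. It is not the library's code: redact_strict, redact_normalized, and the date format are hypothetical stand-ins, and #408 remains the place for the actual details.

```python
from datetime import datetime, timedelta


def _shift(value, shift_days):
    """Shift an ISO-formatted date string by a number of days."""
    parsed = datetime.strptime(value, "%Y-%m-%d")
    return (parsed + timedelta(days=shift_days)).strftime("%Y-%m-%d")


def redact_strict(value, label, shift_days=7):
    """Case-sensitive check: only the exact label "DATE" is shifted."""
    if label == "DATE":
        return _shift(value, shift_days)
    return f"[{label}]"  # everything else falls back to masking


def redact_normalized(value, label, shift_days=7):
    """Same logic, but the label is normalized before comparing."""
    if label.upper() == "DATE":
        return _shift(value, shift_days)
    return f"[{label}]"


# A model that emits the lowercase label "date" never reaches the shift branch.
print(redact_strict("2024-03-01", "date"))      # [date]
print(redact_normalized("2024-03-01", "date"))  # 2024-03-08
```

No exception is raised in the strict version, which is why this kind of mismatch is easy to miss unless the output is checked against a real model run.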

Verification

  • mkdocs build --strict passes.
  • Every printed snippet output in this PR is copy-pasted from an actual python -c run against the default model, not hand-written.
  • Full test suite green aside from pre-existing, unrelated failures (confirmed identical via git stash).

Add runnable mask/remove/replace/hash/shift_dates snippets plus a
reidentify() round-trip example to anonymization.md, and link to it
from getting-started.md. Verified each snippet's output against the
existing test suite rather than guessing.

Addresses maziyarpanahi#282
…utput

The previous commit's example outputs were written from reading the
redaction logic rather than running it against the default model, and
didn't match reality: the default model emits lowercase, split labels
(first_name/last_name/date/phone_number) rather than NAME/DATE/PHONE.

Also discovered shift_dates silently masks instead of shifting dates
for the default model (entity_type stays lowercase "date", but the
redaction code only matches uppercase "DATE") - documented as a known
limitation linking to maziyarpanahi#408 rather than showing a fabricated shifted
date.
@maziyarpanahi self-requested a review June 20, 2026 09:55
@maziyarpanahi added the help wanted, good first issue, roadmap-v2, improvement, and P2 labels Jun 22, 2026
@maziyarpanahi

Owner

Thank you @abdouloued. I reviewed this against #282 and the issue discussion, then added a maintainer follow-up commit: docs: tighten deidentification quickstart examples.

What changed:

  • linked the README privacy section directly to the new anonymization quickstart, satisfying the remaining issue scope;
  • updated the remove example to show repr() so the documented trailing space is explicit;
  • replaced the placeholder replace output with the deterministic seeded output from the runtime check: asnyder@example.com.
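The reason repr() helps for the remove example can be shown in a few lines. The snippet below is a generic illustration with a made-up input string, not the example from docs/anonymization.md.

```python
# Hypothetical input; removing the email leaves a trailing space behind.
text = "Contact: jane@example.com"
removed = text.replace("jane@example.com", "")

print(removed)        # the trailing space is invisible in plain output
print(repr(removed))  # 'Contact: '
```

Printing the repr() puts quotes around the string, so the leftover whitespace is visible to readers instead of being lost at the end of the line.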

Verification:

  • /Users/maziyar/Developer/openmed/.venv/bin/python snippet check for mask, remove, replace, hash, shift_dates, and reidentify() -> matched the documented outputs
  • uv run mkdocs build --strict --site-dir /tmp/openmed-pr409-site -> passed
  • /Users/maziyar/Developer/openmed/.venv/bin/pre-commit run --files docs/anonymization.md docs/getting-started.md README.md -> passed

I copied #282 labels onto the PR. The branch is mergeable with no conflicts. GitHub has not reported hosted checks for this fork head, so the validation above is local.

Labels

  • good first issue: Good for newcomers
  • help wanted: Extra attention is needed
  • improvement: Hardening / refactor of existing code
  • P2: Medium
  • roadmap-v2: OpenMed V2 roadmap backlog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Add a per-method de-identification quickstart doc (mask, remove, replace, hash, shift_dates)
  • Fix reidentify() round-trip when an entity type repeats

2 participants