docs: add per-method de-identification quickstart #409
Open
abdouloued wants to merge 3 commits into
Add runnable mask/remove/replace/hash/shift_dates snippets plus a reidentify() round-trip example to anonymization.md, and link to it from getting-started.md. Verified each snippet's output against the existing test suite rather than guessing. Addresses maziyarpanahi#282
…utput
The previous commit's example outputs were written from reading the redaction logic rather than running it against the default model, and they didn't match reality: the default model emits lowercase, split labels (first_name/last_name/date/phone_number) rather than NAME/DATE/PHONE. Also discovered that shift_dates silently masks dates instead of shifting them for the default model (entity_type stays lowercase "date", but the redaction code only matches uppercase "DATE"). This is documented as a known limitation linking to maziyarpanahi#408 rather than showing a fabricated shifted date.
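The shift_dates fallback described above can be reproduced with a small stand-alone sketch. It does not use the library's code: redact_entity, SHIFT_DAYS, and the "[date]" placeholder format are hypothetical names chosen for illustration. The only behavior taken from the commit message is that an exact match on uppercase "DATE" misses the lowercase "date" label and falls through to masking.

```python
from datetime import date, timedelta

SHIFT_DAYS = 30  # hypothetical fixed offset, for illustration only


def redact_entity(text: str, entity_type: str, normalize: bool = False) -> str:
    """Shift ISO dates by SHIFT_DAYS; mask every other entity.

    With normalize=False the label comparison is case-sensitive, which
    mirrors the mismatch described in the commit message.
    """
    label = entity_type.upper() if normalize else entity_type
    if label == "DATE":
        shifted = date.fromisoformat(text) + timedelta(days=SHIFT_DAYS)
        return shifted.isoformat()
    # Fallback: mask using the label exactly as the model emitted it.
    return f"[{entity_type}]"


uppercase_result = redact_entity("2024-01-15", "DATE")
lowercase_result = redact_entity("2024-01-15", "date")
normalized_result = redact_entity("2024-01-15", "date", normalize=True)

print(uppercase_result)   # shifted date
print(lowercase_result)   # masked, because "date" != "DATE"
print(normalized_result)  # shifted once the label is upper-cased first
```

Normalizing the label before comparing is only one possible remedy; whether maziyarpanahi#408 is resolved that way is not stated here.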
Owner
Thank you @abdouloued. I reviewed this against #282 and the issue discussion, then added a maintainer follow-up commit:
What changed:
Verification:
I copied #282 labels onto the PR. The branch is mergeable with no conflicts. GitHub has not reported hosted checks for this fork head, so the validation above is local.
Closes #282.
What
Adds a "Quickstart: choosing a method" section to docs/anonymization.md with a runnable example for each deidentify() method (mask, remove, replace, hash, shift_dates), plus a reidentify() round-trip example. Links to it from docs/getting-started.md.
Every example output was verified by actually running it against the real default model (OpenMed-PII-SuperClinical-Small-44M-v1), not just read from source or checked against mocked unit tests.
Note on #204
The issue asked for a callout documenting the reidentify round-trip limitation tied to #204. That limitation looks already fixed (test_roundtrip_two_persons_mask, commit eb10454, "Closes #222 (addresses #204)"). I documented it as resolved with a forward link to #204/#222 instead of as a current limitation, as already discussed with @maziyarpanahi on the issue.
Bonus find
While verifying the shift_dates example against the real model, I found it silently falls back to masking instead of shifting dates for the default English model (entity label stays lowercase date, but the redaction code only matches uppercase DATE). Filed separately as #408 and linked from the doc rather than showing a fabricated shifted-date output.
Verification
mkdocs build --strict passes.
python -c run against the default model, not hand-written.
git stash).
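For readers who want a feel for the five methods and the round trip without loading a model, here is a minimal stand-alone sketch. It is not the library's deidentify()/reidentify() API: apply_method, deidentify_spans, reidentify_spans, SURROGATES, the "[label]" placeholder format, the truncated SHA-256 digest, and the position-based mapping are all assumptions made for illustration. Entity spans are supplied by hand instead of being detected by a model, using the lowercase labels mentioned in this PR.

```python
import hashlib
from datetime import date, timedelta

# Hypothetical surrogate values used by the "replace" method in this sketch.
SURROGATES = {"first_name": "Alex", "last_name": "Smith", "date": "2000-01-01"}


def apply_method(value, label, method, shift_days=7):
    """Return the replacement string for one hand-labelled entity."""
    if method == "mask":
        return f"[{label}]"
    if method == "remove":
        return ""
    if method == "replace":
        return SURROGATES.get(label, f"[{label}]")
    if method == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    if method == "shift_dates":
        # Case-insensitive label check; non-date entities are masked here.
        if label.lower() == "date":
            shifted = date.fromisoformat(value) + timedelta(days=shift_days)
            return shifted.isoformat()
        return f"[{label}]"
    raise ValueError(f"unknown method: {method}")


def deidentify_spans(text, spans, method):
    """Redact sorted, non-overlapping (start, end, label) spans.

    Returns the redacted text and a mapping of (start, end, original)
    tuples expressed in redacted-text coordinates.
    """
    parts, mapping = [], []
    cursor = 0   # position in the original text
    length = 0   # length of the redacted text built so far
    for start, end, label in spans:
        parts.append(text[cursor:start])
        length += start - cursor
        replacement = apply_method(text[start:end], label, method)
        mapping.append((length, length + len(replacement), text[start:end]))
        parts.append(replacement)
        length += len(replacement)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def reidentify_spans(redacted, mapping):
    """Restore the original text from the redacted text and its mapping."""
    parts, cursor = [], 0
    for start, end, original in mapping:
        parts.append(redacted[cursor:start])
        parts.append(original)
        cursor = end
    parts.append(redacted[cursor:])
    return "".join(parts)


TEXT = "Jane Doe saw John Roe on 2024-01-15."
SPANS = [
    (0, 4, "first_name"), (5, 8, "last_name"),
    (13, 17, "first_name"), (18, 21, "last_name"),
    (25, 35, "date"),
]

masked, mapping = deidentify_spans(TEXT, SPANS, "mask")
restored = reidentify_spans(masked, mapping)
print(masked)
print(restored)
```

Keeping positions in the mapping, instead of looking placeholders up by text, is a design choice of this sketch: it keeps two people who share the same "[first_name]" placeholder distinct on the way back.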