add extract_pii/deidentify to BatchProcessor via an operation parameter by thirdwing · Pull Request #61 · maziyarpanahi/openmed

thirdwing · 2026-05-28T07:48:51Z

Summary

This PR adds real multi-document batch support for PII extraction and de-identification in BatchProcessor, addressing #33 for larger record workloads.

It keeps the useful operation= API direction from the original draft and expands it beyond a per-item wrapper:

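from openmed import BatchProcessor

# Any list of strings works; these two notes are illustrative only.
texts = [
    "Patient John Doe, DOB 01/02/1980, was seen on 03/04/2024.",
    "Call the clinic at 555-0100 to reschedule.",
]

# Batch PII extraction, up to 16 documents per chunk.
pii = BatchProcessor(
    model_name="openai/privacy-filter",
    operation="extract_pii",
    batch_size=16,
)
result = pii.process_texts(texts)

# Batch de-identification using the "mask" method.
redactor = BatchProcessor(
    model_name="openai/privacy-filter",
    operation="deidentify",
    method="mask",
    batch_size=16,
)
redacted = redactor.process_texts(texts)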

What Changed

- Added validated BatchProcessor(operation=...) values:
  - "analyze_text" for existing clinical NER behavior
  - "extract_pii" for batch PII extraction
  - "deidentify" for batch redaction/anonymization
- Added explicit document-level batch_size chunking for PII operations.
- Reused one ModelLoader or privacy-filter pipeline across a batch job instead of rebuilding work per item/chunk.
- Added private PII batch helpers that preserve the existing single-call behavior:
  - language/default model resolution
  - accent normalization
  - smart merging
  - privacy-filter dispatch
  - de-identification methods
  - deterministic replacement/date-shift options
- Preserved existing process_texts, process_files, process_directory, iter_process, and module-level process_batch(...) surfaces.
- Kept continue_on_error=True fallback behavior and continue_on_error=False immediate raise behavior.
- Updated result serialization so a BatchItemResult can carry either PredictionResult or DeidentificationResult.
- Updated README, docs, changelog, and examples, including examples/pii_batch_processing.py.
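
For readers who want a feel for the first two items, here is a minimal sketch of operation validation and document-level chunking. Only the three operation names come from this PR; the helper names `validate_operation` and `chunk_documents` and the error messages are hypothetical and are not taken from the openmed code.

```python
from typing import Iterator, List, Sequence

# Operation names are from the PR description; everything else is hypothetical.
VALID_OPERATIONS = ("analyze_text", "extract_pii", "deidentify")


def validate_operation(operation: str) -> str:
    """Return the operation unchanged, or raise on an unknown value."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of {VALID_OPERATIONS}"
        )
    return operation


def chunk_documents(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive document-level chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 40 documents with batch_size=16 split into chunks of 16, 16, and 8.
chunks = list(chunk_documents([f"note {i}" for i in range(40)], batch_size=16))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Validating the operation once at construction time means a typo fails before any model is loaded, and chunking at the document level keeps each model call bounded regardless of how many records are submitted.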

Performance

Hot-run benchmark on 64 short PII notes, median of 3 runs after one warmup call. Model load is excluded; the single baseline uses BatchProcessor(..., batch_size=1) and the batch path uses batch_size=16.

Operation	Backend/model	Single	Batch 16	Speedup
`extract_pii`	CPU/Torch `openai/privacy-filter`	8.2088s / 7.80 docs/s	2.5115s / 25.48 docs/s	3.27x
`extract_pii`	MLX `OpenMed/privacy-filter-mlx-8bit`	0.6694s / 95.61 docs/s	0.2889s / 221.54 docs/s	2.32x
`deidentify`	CPU/Torch `openai/privacy-filter`	6.4278s / 9.96 docs/s	1.9006s / 33.67 docs/s	3.38x
`deidentify`	MLX `OpenMed/privacy-filter-mlx-8bit`	0.5966s / 107.28 docs/s	0.2754s / 232.40 docs/s	2.17x
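
The throughput and speedup columns follow directly from the median times and the 64-document workload. A short sketch of that arithmetic for the CPU/Torch `extract_pii` row (the helper names are illustrative only):

```python
DOCS = 64  # number of notes in the benchmark, per the setup description


def throughput(median_seconds: float) -> float:
    """Documents processed per second for one run over the whole workload."""
    return DOCS / median_seconds


def speedup(single_seconds: float, batch_seconds: float) -> float:
    """How many times faster the batch path is than the single-doc path."""
    return single_seconds / batch_seconds


# Median times reported for CPU/Torch extract_pii.
single_s, batch_s = 8.2088, 2.5115
summary = (
    round(throughput(single_s), 2),
    round(throughput(batch_s), 2),
    round(speedup(single_s, batch_s), 2),
)
```

With these inputs `summary` matches the reported 7.80 docs/s, 25.48 docs/s, and 3.27x.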

Backend-specific performance work in this branch:

- CPU/Torch forwards batch_size to the Transformers privacy-filter pipeline.
- MLX privacy-filter list input now pads non-empty records and runs one model forward per chunk instead of looping one document at a time.
- BatchProcessor caches the privacy-filter pipeline across chunks for the batch job.
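
To illustrate the MLX change, the following is a rough sketch of padding the non-empty records of one chunk so they can go through a single forward pass. It uses plain Python lists rather than MLX arrays, and `PAD_ID` and `pad_chunk` are hypothetical names; the actual branch may handle padding and empty records differently.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id


def pad_chunk(
    token_ids: List[List[int]],
) -> Tuple[List[int], List[List[int]], List[List[int]]]:
    """Pad the non-empty records of one chunk to a common length.

    Returns the original indices of the kept records, the padded id rows,
    and a 0/1 attention mask with the same shape as the padded rows.
    """
    kept = [i for i, ids in enumerate(token_ids) if ids]
    if not kept:
        return [], [], []
    width = max(len(token_ids[i]) for i in kept)
    padded, mask = [], []
    for i in kept:
        ids = token_ids[i]
        gap = width - len(ids)
        padded.append(ids + [PAD_ID] * gap)
        mask.append([1] * len(ids) + [0] * gap)
    return kept, padded, mask


# The empty middle record is skipped; the short record is padded to width 3.
kept, padded, mask = pad_chunk([[5, 6, 7], [], [8]])
```

Returning the kept indices lets the caller place each prediction back at its original position and emit an empty result for the skipped records.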

Tests And Docs

Validation run locally:

uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py tests/unit/mlx/test_privacy_filter_mlx.py -q
uv run pytest tests/unit -q
uv run mkdocs build --strict

Results:

- Focused batch/PII/privacy-filter/MLX suite: 173 passed, 1 skipped
- Full unit suite: 1194 passed, 1 skipped
- Strict docs build: passed

Related Issues

Closes #33


maziyarpanahi · 2026-05-31T16:04:58Z

Thanks for opening this. I kept the operation= API direction from the draft and pushed a follow-up commit on this branch that fills out the production path for issue #33.

What changed:

- Added validated operation values for analyze_text, extract_pii, and deidentify, plus document-level batch_size chunking.
- Added private batch helpers for PII extraction and de-identification so BatchProcessor can reuse one loader or privacy-filter pipeline per batch instead of rebuilding per record.
- Preserved existing single-call extract_pii/deidentify behavior, including smart merging, language defaults, accent normalization, privacy-filter routing, deterministic replacement, date shifting, and error handling.
- Added unit coverage for operation dispatch, chunking, kwargs forwarding, file/directory flows, fallback behavior, single-vs-batch parity, and privacy-filter pipeline reuse.
- Updated README/docs/changelog and added examples/pii_batch_processing.py.
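
The error-handling contract mentioned above can be sketched roughly as follows. This thread does not spell out what the fallback does internally, so the sketch assumes failures are recorded per item when continue_on_error is true and re-raised immediately otherwise. `ItemOutcome`, `run_items`, and `shout` are hypothetical stand-ins, not openmed's BatchItemResult or helpers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ItemOutcome:  # hypothetical stand-in, not openmed's BatchItemResult
    index: int
    result: Optional[str] = None
    error: Optional[str] = None


def run_items(
    texts: Sequence[str],
    fn: Callable[[str], str],
    continue_on_error: bool = True,
) -> List[ItemOutcome]:
    """Apply fn to each text, recording or re-raising failures."""
    outcomes = []
    for index, text in enumerate(texts):
        try:
            outcomes.append(ItemOutcome(index, result=fn(text)))
        except Exception as exc:
            if not continue_on_error:
                raise  # immediate raise: the job stops at the first failure
            outcomes.append(ItemOutcome(index, error=str(exc)))
    return outcomes


def shout(text: str) -> str:
    """Toy operation that fails on empty input."""
    if not text:
        raise ValueError("empty document")
    return text.upper()


# The empty document fails, but the remaining items are still processed.
outcomes = run_items(["a", "", "b"], shout)
```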

Validation run locally:

uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py -q
uv run mkdocs build --strict

maziyarpanahi · 2026-05-31T19:17:19Z

Follow-up benchmark after tightening the batch path in e85889e.

Setup: macOS arm64, 64 short clinical/PII notes, median of 3 hot runs after one warmup call. The single baseline is BatchProcessor(..., batch_size=1), so model load is not included and the comparison isolates single-document chunks vs multi-document chunks with the same cached pipeline.
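
For anyone reproducing the numbers, the measurement protocol described here (one warmup call, then the median of three timed runs) could be implemented along these lines. The `hot_median` helper is an illustrative sketch, not the script used for this benchmark.

```python
import statistics
import time
from typing import Callable


def hot_median(fn: Callable[[], object], runs: int = 3, warmup: int = 1) -> float:
    """Return the median wall-clock time of `runs` calls after `warmup` calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation; not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload that counts how often it is called.
calls = []
elapsed = hot_median(lambda: calls.append(1))
```

In a real run, `fn` would wrap something like `processor.process_texts(texts)` on an already constructed processor, so that model load stays outside the timed region.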

Operation	Backend/model	Mode	Batch size	Median time	ms/doc	Docs/sec	Speedup
`extract_pii`	CPU/Torch `openai/privacy-filter`	single-doc chunks	1	8.2088s	128.26	7.80	1.00x
`extract_pii`	CPU/Torch `openai/privacy-filter`	batch chunks	16	2.5115s	39.24	25.48	3.27x
`extract_pii`	MLX `OpenMed/privacy-filter-mlx-8bit`	single-doc chunks	1	0.6694s	10.46	95.61	1.00x
`extract_pii`	MLX `OpenMed/privacy-filter-mlx-8bit`	batch chunks	16	0.2889s	4.51	221.54	2.32x
`deidentify`	CPU/Torch `openai/privacy-filter`	single-doc chunks	1	6.4278s	100.43	9.96	1.00x
`deidentify`	CPU/Torch `openai/privacy-filter`	batch chunks	16	1.9006s	29.70	33.67	3.38x
`deidentify`	MLX `OpenMed/privacy-filter-mlx-8bit`	single-doc chunks	1	0.5966s	9.32	107.28	1.00x
`deidentify`	MLX `OpenMed/privacy-filter-mlx-8bit`	batch chunks	16	0.2754s	4.30	232.40	2.17x

CPU/Torch batch now forwards batch_size into the underlying Transformers pipeline. MLX list input now pads non-empty records and runs one model forward for the chunk instead of looping one document at a time. The batch processor also caches the privacy-filter pipeline across chunks, so large jobs do not rebuild the model per chunk.
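
The caching point can be pictured with a small sketch in which the pipeline is built lazily once and then reused for every chunk. `CachedPipelineRunner` and `fake_factory` are hypothetical names used for illustration; the actual BatchProcessor internals may be organised differently.

```python
from typing import Callable, List, Optional, Sequence

Pipeline = Callable[[Sequence[str]], List[int]]


class CachedPipelineRunner:  # hypothetical, not the openmed class
    def __init__(self, factory: Callable[[], Pipeline], batch_size: int = 16):
        self._factory = factory
        self._pipeline: Optional[Pipeline] = None
        self.batch_size = batch_size

    def _get_pipeline(self) -> Pipeline:
        # Build the (expensive) pipeline on first use, then reuse it.
        if self._pipeline is None:
            self._pipeline = self._factory()
        return self._pipeline

    def run(self, texts: Sequence[str]) -> List[int]:
        results: List[int] = []
        for start in range(0, len(texts), self.batch_size):
            pipeline = self._get_pipeline()
            results.extend(pipeline(texts[start:start + self.batch_size]))
        return results


builds = []


def fake_factory() -> Pipeline:
    """Stand-in for model loading; records how many times it is invoked."""
    builds.append(1)
    return lambda chunk: [len(text) for text in chunk]


# Five documents in chunks of two: three chunks, but only one pipeline build.
runner = CachedPipelineRunner(fake_factory, batch_size=2)
lengths = runner.run(["a", "bb", "ccc", "dddd", "eeeee"])
```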

maziyarpanahi · 2026-05-31T19:23:34Z

Added a visual summary for the hot-run benchmark numbers.

The source SVG and PNG are now on this branch under docs/assets/pii-batch-benchmark.*.

add extract_pii and deidentify to BatchProcessor via an operation parameter

756e6c7

thirdwing marked this pull request as draft May 28, 2026 07:49

thirdwing mentioned this pull request May 28, 2026

[FEATURE] Batch support for extract PII #33

Closed

Add batch PII processing support

2faed59

Improve privacy-filter batch execution

e85889e

Add PII batch benchmark visual

b6a3475

maziyarpanahi self-assigned this May 31, 2026

maziyarpanahi marked this pull request as ready for review May 31, 2026 19:27

maziyarpanahi self-requested a review May 31, 2026 19:34

Align batch PII confidence defaults

c2a0791

maziyarpanahi approved these changes Jun 4, 2026


maziyarpanahi merged commit 466cdc6 into maziyarpanahi:master Jun 7, 2026
13 checks passed

maziyarpanahi merged 5 commits into maziyarpanahi:master from thirdwing:issues/33

thirdwing commented May 28, 2026 (edited by maziyarpanahi)
