add extract_pii/deidentify to BatchProcessor via an operation parameter #61

Merged
maziyarpanahi merged 5 commits into maziyarpanahi:master from thirdwing:issues/33 on Jun 7, 2026
@thirdwing thirdwing commented May 28, 2026

Contributor

Summary

This PR adds true multi-document batch support for PII extraction and de-identification to BatchProcessor, addressing #33 for workloads with many records.

It keeps the useful operation= API direction from the original draft and expands it beyond a per-item wrapper:

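from openmed import BatchProcessor

# Illustrative input only; any list of document strings works here.
texts = [
    "Patient John Doe, DOB 01/02/1980, called from 555-0100.",
    "Follow-up for Jane Roe is scheduled at the main clinic.",
]

# Batch PII extraction: documents are processed in chunks of 16.
pii = BatchProcessor(
    model_name="openai/privacy-filter",
    operation="extract_pii",
    batch_size=16,
)
result = pii.process_texts(texts)

# Batch de-identification using the "mask" method.
redactor = BatchProcessor(
    model_name="openai/privacy-filter",
    operation="deidentify",
    method="mask",
    batch_size=16,
)
redacted = redactor.process_texts(texts)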

What Changed

  • Added validated BatchProcessor(operation=...) values:
    • "analyze_text" for existing clinical NER behavior
    • "extract_pii" for batch PII extraction
    • "deidentify" for batch redaction/anonymization
  • Added explicit document-level batch_size chunking for PII operations.
  • Reused one ModelLoader or privacy-filter pipeline across a batch job instead of rebuilding it per item or chunk.
  • Added private PII batch helpers that preserve the existing single-call behavior:
    • language/default model resolution
    • accent normalization
    • smart merging
    • privacy-filter dispatch
    • de-identification methods
    • deterministic replacement/date-shift options
  • Preserved existing process_texts, process_files, process_directory, iter_process, and module-level process_batch(...) surfaces.
  • Kept continue_on_error=True fallback behavior and continue_on_error=False immediate raise behavior.
  • Updated result serialization so a BatchItemResult can carry either PredictionResult or DeidentificationResult.
  • Updated README, docs, changelog, and examples, including examples/pii_batch_processing.py.
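
As a rough illustration of the document-level chunking and the continue_on_error behavior listed above, here is a minimal sketch. It is not the code from this PR: iter_chunks, run_batched, and the per-document retry on failure are hypothetical names and choices made for the example, and process_chunk stands in for whatever the real batch helpers call.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


def iter_chunks(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batched(
    texts: Sequence[str],
    process_chunk: Callable[[Sequence[str]], List[str]],
    batch_size: int = 16,
    continue_on_error: bool = True,
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return one (result, error) pair per input document, in input order."""
    results: List[Tuple[Optional[str], Optional[str]]] = []
    for chunk in iter_chunks(texts, batch_size):
        try:
            outputs = process_chunk(chunk)
            results.extend((out, None) for out in outputs)
        except Exception:
            if not continue_on_error:
                raise  # immediate raise when continue_on_error=False
            # Assumed fallback: retry the failed chunk one document at a time
            # so a single bad record does not discard its neighbours.
            for text in chunk:
                try:
                    results.append((process_chunk([text])[0], None))
                except Exception as item_exc:
                    results.append((None, str(item_exc)))
    return results
```

With batch_size=1 this degenerates to the single-document path, which is how the benchmark baseline below is described.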

Performance

Hot-run benchmark on 64 short PII notes, median of 3 runs after one warmup call. Model load is excluded; the single-document baseline uses BatchProcessor(..., batch_size=1) and the batch path uses batch_size=16.

Batch PII processing benchmark

| Operation | Backend/model | Single | Batch 16 | Speedup |
| --- | --- | --- | --- | --- |
| extract_pii | CPU/Torch openai/privacy-filter | 8.2088s / 7.80 docs/s | 2.5115s / 25.48 docs/s | 3.27x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | 0.6694s / 95.61 docs/s | 0.2889s / 221.54 docs/s | 2.32x |
| deidentify | CPU/Torch openai/privacy-filter | 6.4278s / 9.96 docs/s | 1.9006s / 33.67 docs/s | 3.38x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | 0.5966s / 107.28 docs/s | 0.2754s / 232.40 docs/s | 2.17x |
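
For readers who want to check how the derived columns relate to the median times, the following sketch recomputes throughput and speedup from the reported seconds, assuming all 64 documents are processed in each run. The helper names are made up for the example. Because the published times are already rounded to four decimals, a recomputed docs/s figure can differ from the table in the last digit.

```python
NUM_DOCS = 64  # number of notes in the benchmark, per the description above

# (operation, backend) -> (single-document median seconds, batch-16 median seconds)
MEDIANS = {
    ("extract_pii", "CPU/Torch"): (8.2088, 2.5115),
    ("extract_pii", "MLX"): (0.6694, 0.2889),
    ("deidentify", "CPU/Torch"): (6.4278, 1.9006),
    ("deidentify", "MLX"): (0.5966, 0.2754),
}


def docs_per_second(seconds: float, num_docs: int = NUM_DOCS) -> float:
    """Throughput: documents processed per second of wall time."""
    return num_docs / seconds


def speedup(single_seconds: float, batch_seconds: float) -> float:
    """How many times faster the batch path is than the single path."""
    return single_seconds / batch_seconds


for (operation, backend), (single, batch) in MEDIANS.items():
    print(
        f"{operation:12s} {backend:10s} "
        f"single={docs_per_second(single):7.2f} docs/s "
        f"batch={docs_per_second(batch):7.2f} docs/s "
        f"speedup={speedup(single, batch):.2f}x"
    )
```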

Backend-specific performance work in this branch:

  • CPU/Torch forwards batch_size to the Transformers privacy-filter pipeline.
  • MLX privacy-filter list input now pads non-empty records and runs one model forward per chunk instead of looping one document at a time.
  • BatchProcessor caches the privacy-filter pipeline across chunks for the batch job.
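
The MLX change above (pad the non-empty records, then run one forward pass per chunk) can be pictured with plain Python lists. This is only a sketch of the general padding technique, not the MLX code in this branch: pad_non_empty, scatter_back, and the pad id of 0 are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

PAD_ID = 0  # hypothetical pad token id; the real tokenizer defines its own


def pad_non_empty(
    token_ids: Sequence[Sequence[int]], pad_id: int = PAD_ID
) -> Tuple[List[int], List[List[int]], List[List[int]]]:
    """Return (kept indices, padded rows, attention mask) for non-empty records."""
    keep = [i for i, ids in enumerate(token_ids) if len(ids) > 0]
    if not keep:
        return [], [], []
    max_len = max(len(token_ids[i]) for i in keep)
    padded, mask = [], []
    for i in keep:
        ids = list(token_ids[i])
        gap = max_len - len(ids)
        padded.append(ids + [pad_id] * gap)    # right-pad to the longest record
        mask.append([1] * len(ids) + [0] * gap)  # 1 = real token, 0 = padding
    return keep, padded, mask


def scatter_back(keep: Sequence[int], outputs: Sequence[list], total: int) -> List[list]:
    """Place per-record outputs at their original positions; empty records get []."""
    results: List[list] = [[] for _ in range(total)]
    for index, output in zip(keep, outputs):
        results[index] = output
    return results
```

The padded rows and mask would then go through the model in a single call for the chunk, and scatter_back restores the original document order, including the skipped empty records.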

Tests And Docs

Validation run locally:

uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py tests/unit/mlx/test_privacy_filter_mlx.py -q
uv run pytest tests/unit -q
uv run mkdocs build --strict

Results:

  • Focused batch/PII/privacy-filter/MLX suite: 173 passed, 1 skipped
  • Full unit suite: 1194 passed, 1 skipped
  • Strict docs build: passed

Related Issues

Closes #33

@thirdwing thirdwing marked this pull request as draft May 28, 2026 07:49
@maziyarpanahi

Owner

Thanks for opening this. I kept the operation= API direction from the draft and pushed a follow-up commit on this branch that fills out the production path for issue #33.

What changed:

  • Added validated operation values for analyze_text, extract_pii, and deidentify, plus document-level batch_size chunking.
  • Added private batch helpers for PII extraction and de-identification so BatchProcessor can reuse one loader or privacy-filter pipeline per batch instead of rebuilding per record.
  • Preserved existing single-call extract_pii/deidentify behavior, including smart merging, language defaults, accent normalization, privacy-filter routing, deterministic replacement, date shifting, and error handling.
  • Added unit coverage for operation dispatch, chunking, kwargs forwarding, file/directory flows, fallback behavior, single-vs-batch parity, and privacy-filter pipeline reuse.
  • Updated README/docs/changelog and added examples/pii_batch_processing.py.
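
The pipeline-reuse point above is the familiar build-once, reuse-per-chunk pattern. A minimal sketch follows, assuming a hypothetical PipelineCache class and a caller-supplied factory; neither name comes from the PR, and the real BatchProcessor may organize this differently.

```python
from typing import Any, Callable, Dict


class PipelineCache:
    """Build a pipeline once per model name and hand back the same object afterwards."""

    def __init__(self, factory: Callable[[str], Any]) -> None:
        self._factory = factory          # expensive constructor, e.g. a model load
        self._pipelines: Dict[str, Any] = {}
        self.builds = 0                  # counts how often the factory actually ran

    def get(self, model_name: str) -> Any:
        if model_name not in self._pipelines:
            self._pipelines[model_name] = self._factory(model_name)
            self.builds += 1
        return self._pipelines[model_name]
```

A unit test for reuse can then assert that builds stays at 1 after several chunks request the same model name.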

Validation run locally:

  • uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py -q
  • uv run mkdocs build --strict

@maziyarpanahi

Owner

Follow-up benchmark after tightening the batch path in e85889e.

Setup: macOS arm64, 64 short clinical/PII notes, median of 3 hot runs after one warmup call. The single baseline is BatchProcessor(..., batch_size=1), so model load is not included and the comparison isolates single-document chunks vs multi-document chunks with the same cached pipeline.

| Operation | Backend/model | Mode | Batch size | Median time | ms/doc | Docs/sec | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| extract_pii | CPU/Torch openai/privacy-filter | single-doc chunks | 1 | 8.2088s | 128.26 | 7.80 | 1.00x |
| extract_pii | CPU/Torch openai/privacy-filter | batch chunks | 16 | 2.5115s | 39.24 | 25.48 | 3.27x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | single-doc chunks | 1 | 0.6694s | 10.46 | 95.61 | 1.00x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | batch chunks | 16 | 0.2889s | 4.51 | 221.54 | 2.32x |
| deidentify | CPU/Torch openai/privacy-filter | single-doc chunks | 1 | 6.4278s | 100.43 | 9.96 | 1.00x |
| deidentify | CPU/Torch openai/privacy-filter | batch chunks | 16 | 1.9006s | 29.70 | 33.67 | 3.38x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | single-doc chunks | 1 | 0.5966s | 9.32 | 107.28 | 1.00x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | batch chunks | 16 | 0.2754s | 4.30 | 232.40 | 2.17x |

CPU/Torch batch now forwards batch_size into the underlying Transformers pipeline. MLX list input now pads non-empty records and runs one model forward for the chunk instead of looping one document at a time. The batch processor also caches the privacy-filter pipeline across chunks, so large jobs do not rebuild the model per chunk.
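
The measurement protocol from the setup paragraph (one warmup call, then the median of 3 hot runs) can be sketched with the standard library alone. median_hot_run is a hypothetical helper written for this example, not the script that produced the numbers above.

```python
import statistics
import time
from typing import Callable


def median_hot_run(fn: Callable[[], object], runs: int = 3, warmups: int = 1) -> float:
    """Call fn for warmups untimed passes, then return the median of runs timed passes."""
    for _ in range(warmups):
        fn()  # warm caches and lazy initialisation; not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

In use, fn would wrap a call such as processor.process_texts(texts) on an already constructed processor, so that model load stays outside the timed region.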

@maziyarpanahi

Owner

Added a visual summary for the hot-run benchmark numbers:

[Figure: Batch PII benchmark visual]

The source SVG and PNG are now on this branch under docs/assets/pii-batch-benchmark.*.

@maziyarpanahi maziyarpanahi self-assigned this May 31, 2026
@maziyarpanahi maziyarpanahi marked this pull request as ready for review May 31, 2026 19:27
@maziyarpanahi maziyarpanahi self-requested a review May 31, 2026 19:34
@maziyarpanahi maziyarpanahi merged commit 466cdc6 into maziyarpanahi:master Jun 7, 2026
13 checks passed

Development

Successfully merging this pull request may close these issues.

[FEATURE] Batch support for extract PII
