Added explicit document-level batch_size chunking for PII operations.
Reused one ModelLoader or privacy-filter pipeline across a batch job instead of rebuilding work per item/chunk.
Added private PII batch helpers that preserve the existing single-call behavior: language/default model resolution, accent normalization, smart merging, privacy-filter dispatch, de-identification methods, and deterministic replacement/date-shift options.
Preserved existing process_texts, process_files, process_directory, iter_process, and module-level process_batch(...) surfaces.
Kept continue_on_error=True fallback behavior and continue_on_error=False immediate raise behavior.
Updated result serialization so a BatchItemResult can carry either PredictionResult or DeidentificationResult.
Updated README, docs, changelog, and examples, including examples/pii_batch_processing.py.
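The document-level chunking and pipeline reuse described above can be sketched in plain Python. This is only an illustration under stated assumptions: `iter_chunks`, `run_batch`, and `build_pipeline` are hypothetical names, not the project's private helpers, and the pipeline is modeled as any callable that maps a list of texts to a list of results.

```python
from typing import Any, Callable, Iterator, List, Sequence


def iter_chunks(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batch(
    texts: Sequence[str],
    build_pipeline: Callable[[], Callable[[List[str]], List[Any]]],
    batch_size: int = 16,
) -> List[Any]:
    """Build the (hypothetical) pipeline once, then feed it chunk by chunk."""
    pipeline = build_pipeline()  # built once per batch job, not per chunk
    results: List[Any] = []
    for chunk in iter_chunks(texts, batch_size):
        results.extend(pipeline(chunk))  # one call per chunk
    return results


# Stand-in pipeline that records how often it is constructed.
build_calls: List[int] = []


def fake_build() -> Callable[[List[str]], List[str]]:
    build_calls.append(1)
    return lambda chunk: [text.upper() for text in chunk]


demo_output = run_batch(["a", "b", "c", "d", "e"], fake_build, batch_size=2)
```

With `batch_size=1` the same loop degenerates to single-document chunks, which is the shape of the baseline used in the benchmark.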
Performance
Hot-run benchmark on 64 short PII notes, median of 3 runs after one warmup call. Model load is excluded; the single baseline uses BatchProcessor(..., batch_size=1) and the batch path uses batch_size=16.
| Operation | Backend/model | Single | Batch 16 | Speedup |
| --- | --- | --- | --- | --- |
| extract_pii | CPU/Torch openai/privacy-filter | 8.2088s / 7.80 docs/s | 2.5115s / 25.48 docs/s | 3.27x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | 0.6694s / 95.61 docs/s | 0.2889s / 221.54 docs/s | 2.32x |
| deidentify | CPU/Torch openai/privacy-filter | 6.4278s / 9.96 docs/s | 1.9006s / 33.67 docs/s | 3.38x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | 0.5966s / 107.28 docs/s | 0.2754s / 232.40 docs/s | 2.17x |
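The derived columns follow from the median times and the 64-document workload: throughput is documents divided by seconds, and speedup is the single-chunk time divided by the batch time. A small check, using only the figures reported above (the short backend labels and function names are illustrative):

```python
N_DOCS = 64  # benchmark size stated above

# (operation, backend) -> (single-chunk median seconds, batch-16 median seconds)
TIMINGS = {
    ("extract_pii", "CPU/Torch"): (8.2088, 2.5115),
    ("extract_pii", "MLX"): (0.6694, 0.2889),
    ("deidentify", "CPU/Torch"): (6.4278, 1.9006),
    ("deidentify", "MLX"): (0.5966, 0.2754),
}


def docs_per_second(seconds: float, n_docs: int = N_DOCS) -> float:
    """Throughput in documents per second."""
    return n_docs / seconds


def speedup(single_seconds: float, batch_seconds: float) -> float:
    """How many times faster the batch path is than the single-chunk path."""
    return single_seconds / batch_seconds


for (operation, backend), (single_s, batch_s) in TIMINGS.items():
    print(
        f"{operation:12s} {backend:10s} "
        f"{docs_per_second(single_s):8.2f} -> {docs_per_second(batch_s):8.2f} docs/s, "
        f"{speedup(single_s, batch_s):.2f}x"
    )
```

Recomputing from the rounded times reproduces the reported speedups exactly and the throughput figures to within about 0.01 docs/s; the small differences presumably come from rounding the timings to four decimals.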
Backend-specific performance work in this branch:
CPU/Torch forwards batch_size to the Transformers privacy-filter pipeline.
MLX privacy-filter list input now pads non-empty records and runs one model forward per chunk instead of looping one document at a time.
BatchProcessor caches the privacy-filter pipeline across chunks for the batch job.
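To illustrate the idea of padding only the non-empty records so that a chunk needs a single forward pass, here is a framework-free sketch. It is not the MLX code from this branch: `pad_non_empty`, `scatter_outputs`, and `PAD_ID` are hypothetical, and token sequences are modeled as plain lists of ints.

```python
from typing import Any, List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding token id


def pad_non_empty(
    token_ids: Sequence[Sequence[int]], pad_id: int = PAD_ID
) -> Tuple[List[int], List[List[int]], List[List[int]]]:
    """Return (original indices, padded rows, attention mask) for non-empty records."""
    indices = [i for i, ids in enumerate(token_ids) if len(ids) > 0]
    if not indices:
        return [], [], []
    max_len = max(len(token_ids[i]) for i in indices)
    padded: List[List[int]] = []
    mask: List[List[int]] = []
    for i in indices:
        ids = list(token_ids[i])
        pad = max_len - len(ids)
        padded.append(ids + [pad_id] * pad)   # right-pad to the chunk maximum
        mask.append([1] * len(ids) + [0] * pad)  # 1 = real token, 0 = padding
    return indices, padded, mask


def scatter_outputs(n_records: int, indices: List[int], outputs: List[Any]) -> List[Any]:
    """Put per-row outputs back at their original positions; empty records get []."""
    results: List[Any] = [[] for _ in range(n_records)]
    for i, out in zip(indices, outputs):
        results[i] = out
    return results


chunk = [[5, 6, 7], [], [8]]
indices, padded, mask = pad_non_empty(chunk)
# Stand-in for one model forward over the whole padded chunk:
# keep only the positions the mask marks as real tokens.
fake_outputs = [[t for t, m in zip(row, row_mask) if m] for row, row_mask in zip(padded, mask)]
restored = scatter_outputs(len(chunk), indices, fake_outputs)
```

Tracking the original indices keeps the output list aligned with the input list even when some records are empty and never reach the model.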
Tests And Docs
Validation run locally:
uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py tests/unit/mlx/test_privacy_filter_mlx.py -q
uv run pytest tests/unit -q
uv run mkdocs build --strict
Thanks for opening this. I kept the operation= API direction from the draft and pushed a follow-up commit on this branch that fills out the production path for issue #33.
What changed:
Added validated operation values for analyze_text, extract_pii, and deidentify, plus document-level batch_size chunking.
Added private batch helpers for PII extraction and de-identification so BatchProcessor can reuse one loader or privacy-filter pipeline per batch instead of rebuilding per record.
Preserved existing single-call extract_pii/deidentify behavior, including smart merging, language defaults, accent normalization, privacy-filter routing, deterministic replacement, date shifting, and error handling.
Added unit coverage for operation dispatch, chunking, kwargs forwarding, file/directory flows, fallback behavior, single-vs-batch parity, and privacy-filter pipeline reuse.
Updated README/docs/changelog and added examples/pii_batch_processing.py.
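One way to picture the `continue_on_error` semantics mentioned above is the sketch below. It assumes the fallback means retrying a failed chunk one document at a time and recording per-item errors; that interpretation, along with `ItemOutcome`, `process_chunk`, `run_many`, and `run_one`, is an assumption for illustration and not taken from the branch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class ItemOutcome:
    """Hypothetical stand-in for a per-item result; not the project's BatchItemResult."""
    text: str
    result: Optional[Any] = None
    error: Optional[str] = None


def process_chunk(
    chunk: Sequence[str],
    run_many: Callable[[Sequence[str]], List[Any]],
    run_one: Callable[[str], Any],
    continue_on_error: bool = True,
) -> List[ItemOutcome]:
    """Try the whole chunk first; on failure either raise or fall back per item."""
    try:
        outputs = run_many(chunk)
        return [ItemOutcome(text, result=out) for text, out in zip(chunk, outputs)]
    except Exception:
        if not continue_on_error:
            raise  # immediate raise behavior
    # Assumed fallback: process items individually so one bad record
    # does not discard the results for the rest of the chunk.
    outcomes: List[ItemOutcome] = []
    for text in chunk:
        try:
            outcomes.append(ItemOutcome(text, result=run_one(text)))
        except Exception as exc:
            outcomes.append(ItemOutcome(text, error=str(exc)))
    return outcomes


def demo_run_one(text: str) -> str:
    if text == "bad":
        raise ValueError("cannot process record")
    return text.upper()


def demo_run_many(chunk: Sequence[str]) -> List[str]:
    return [demo_run_one(text) for text in chunk]


demo_outcomes = process_chunk(["ok", "bad", "fine"], demo_run_many, demo_run_one)
```

A single-vs-batch parity test can then compare the `result` fields from this path against calling `run_one` on each text directly.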
Validation run locally:
uv run pytest tests/unit/test_batch.py tests/unit/test_pii.py tests/unit/test_privacy_filter_routing.py -q
Follow-up benchmark after tightening the batch path in e85889e.
Setup: macOS arm64, 64 short clinical/PII notes, median of 3 hot runs after one warmup call. The single baseline is BatchProcessor(..., batch_size=1), so model load is not included and the comparison isolates single-document chunks vs multi-document chunks with the same cached pipeline.
| Operation | Backend/model | Mode | Batch size | Median time | ms/doc | Docs/sec | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| extract_pii | CPU/Torch openai/privacy-filter | single-doc chunks | 1 | 8.2088s | 128.26 | 7.80 | 1.00x |
| extract_pii | CPU/Torch openai/privacy-filter | batch chunks | 16 | 2.5115s | 39.24 | 25.48 | 3.27x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | single-doc chunks | 1 | 0.6694s | 10.46 | 95.61 | 1.00x |
| extract_pii | MLX OpenMed/privacy-filter-mlx-8bit | batch chunks | 16 | 0.2889s | 4.51 | 221.54 | 2.32x |
| deidentify | CPU/Torch openai/privacy-filter | single-doc chunks | 1 | 6.4278s | 100.43 | 9.96 | 1.00x |
| deidentify | CPU/Torch openai/privacy-filter | batch chunks | 16 | 1.9006s | 29.70 | 33.67 | 3.38x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | single-doc chunks | 1 | 0.5966s | 9.32 | 107.28 | 1.00x |
| deidentify | MLX OpenMed/privacy-filter-mlx-8bit | batch chunks | 16 | 0.2754s | 4.30 | 232.40 | 2.17x |
CPU/Torch batch now forwards batch_size into the underlying Transformers pipeline. MLX list input now pads non-empty records and runs one model forward for the chunk instead of looping one document at a time. The batch processor also caches the privacy-filter pipeline across chunks, so large jobs do not rebuild the model per chunk.
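For readers who want to reproduce the measurement style described in the setup (one warmup call, then the median of three hot runs), a minimal timing harness might look like this. The function name and defaults are illustrative, and `fn` stands in for whatever batch call is being timed; this is not the script that produced the numbers above.

```python
import statistics
import time
from typing import Callable, List


def hot_run_median(fn: Callable[[], object], runs: int = 3, warmup: int = 1) -> float:
    """Return the median wall-clock seconds of `runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets model load and caches settle before measuring
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload that counts how often it is invoked.
call_count = [0]


def fake_workload() -> None:
    call_count[0] += 1
    sum(range(1000))


median_seconds = hot_run_median(fake_workload)
```

Running the same harness once with a batch size of 1 and once with 16 against the same cached pipeline gives the two medians whose ratio is the reported speedup.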
Summary
This PR adds real multi-document batch support for PII extraction and de-identification in BatchProcessor, addressing #33 for larger record workloads. It keeps the useful operation= API direction from the original draft and expands it beyond a per-item wrapper.
BatchProcessor(operation=...) values: "analyze_text" for existing clinical NER behavior, "extract_pii" for batch PII extraction, and "deidentify" for batch redaction/anonymization.
Results: 173 passed, 1 skipped; 1194 passed, 1 skipped.
Related Issues
Closes #33