Releases · uktrade/idscrub · GitHub

30 Mar 15:25

esoutter

v2.0.15 Latest

Adds NHS recogniser if the entity is specified
Adds NINO recogniser if the entity is specified
Corrects type hints (a parameter annotated as a plain list type cannot have None as its default value)
Adds extra boilerplate because of the above correction
Adds tests for NHS and NINO functionality
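
The type-hint correction above reflects a general Python typing rule rather than anything specific to this package: a parameter annotated as a plain list cannot default to None unless the annotation allows it, and the function then needs a little boilerplate to swap None for a real list. A minimal sketch of that pattern follows; the function name `select_entities` and the entity labels are hypothetical placeholders, not the package's API.

```python
from typing import Optional

# Placeholder labels for illustration only; not the package's real defaults.
DEFAULT_ENTITIES = ["PERSON", "EMAIL"]


def select_entities(entities: Optional[list[str]] = None) -> list[str]:
    """Return the requested entity labels, falling back to the defaults."""
    # The extra boilerplate: swap None for a fresh list so callers never
    # share (or mutate) the module-level default.
    if entities is None:
        return list(DEFAULT_ENTITIES)
    return list(entities)


print(select_entities())
print(select_entities(["EMAIL"]))
```

Annotating the parameter as `Optional[list[str]]` keeps type checkers happy, and returning a copy avoids the shared-mutable-default pitfall.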

19 Mar 08:11

esoutter

v2.0.14

Small type, test and README fixes

17 Mar 14:18

esoutter

v2.0.13

Use native Pandas type "string" instead of Python base "str" for DataFrame type conversion

19 Feb 10:27

esoutter

v2.0.12

Adds a **kwargs argument to IDScrub.dataframe(), so that other IDScrub() keyword arguments can be set when using IDScrub.dataframe()
Fixes ID column bug in IDScrub.dataframe() when no id_col argument is passed
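
The keyword-forwarding pattern described above can be sketched in general terms. Everything in this example is hypothetical: the `Scrubber` class, its constructor options, and the `from_rows` helper are illustrative stand-ins, not the signatures of IDScrub() or IDScrub.dataframe().

```python
from typing import Any, Optional


class Scrubber:
    """Hypothetical scrubber; the constructor options are illustrative."""

    def __init__(
        self,
        texts: list[str],
        replacement: str = "[REDACTED]",
        exclude: Optional[list[str]] = None,
    ) -> None:
        self.texts = texts
        self.replacement = replacement
        self.exclude = exclude if exclude is not None else []

    @staticmethod
    def from_rows(rows: list[dict[str, str]], column: str, **kwargs: Any) -> "Scrubber":
        """Build a Scrubber from one column, forwarding extra options."""
        texts = [row[column] for row in rows]
        # **kwargs reaches the constructor unchanged, so the helper does not
        # need to re-declare every constructor option.
        return Scrubber(texts, **kwargs)


rows = [{"note": "call Jane"}, {"note": "email Sam"}]
scrubber = Scrubber.from_rows(rows, "note", replacement="***")
print(scrubber.texts, scrubber.replacement)
```

Forwarding with `**kwargs` means any option later added to the constructor is available through the helper without further changes.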

13 Feb 10:16

esoutter

v2.0.11

Add custom_methods.ipynb to give further examples on customisation and contribution
Add extra information and advice to README

03 Feb 11:51

esoutter

v2.0.1

Handle whitespace-only strings if passed to IDScrub.presidio_entities by stripping whitespace
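
A rough illustration of the whitespace handling described above. The release note only says that whitespace is stripped. Skipping the analyser for strings that end up empty is an assumption made for this sketch, and `analyse_texts` and `toy_analyser` are hypothetical names.

```python
from typing import Callable


def analyse_texts(
    texts: list[str], analyse: Callable[[str], list[str]]
) -> list[list[str]]:
    """Strip each text and only analyse those that still have content."""
    results = []
    for text in texts:
        stripped = text.strip()
        # Whitespace-only input collapses to "", which is skipped here
        # (an assumption of this sketch) and yields no entities.
        results.append(analyse(stripped) if stripped else [])
    return results


def toy_analyser(text: str) -> list[str]:
    """Stand-in analyser: reports capitalised words as 'entities'."""
    return [word for word in text.split() if word[0].isupper()]


print(analyse_texts(["Hello Jane", "   ", "\n\t"], toy_analyser))
```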

03 Feb 10:26

esoutter

v2.0.0

This release contains backward-incompatible (breaking) changes, which were needed to improve the package.
scrub.scrub() now takes a pipeline dictionary that defines the scrub methods.
Each method now outputs a list of IDEnt (identified entity) objects instead of the scrubbed text. These objects contain all of the information required to find and replace the entity identified.
The list is then passed to resolve_overlaps, which selects the entity with the higher priority score when multiple entities are identified in the same text. For example, john@madeupemail.com is both an email and a handle (@madeupemail), but the email can be scored higher so that it is redacted as an email.
The final de-duplicated list is then passed to scrub_texts, which removes each identified entity from the text and replaces it.
Add an exclude argument to exclude certain strings from being scrubbed.
Improve error handling.
This is a significant improvement: every method sees the same text, and the text is scrubbed in a single step once all conflicts have been reconciled.
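
The detect, resolve, then scrub flow described in this release can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the `Ent` dataclass, its fields, the priority values, the regular expressions, and the helper names `keep_highest_priority` and `replace_spans` are all hypothetical stand-ins for IDEnt, resolve_overlaps, and scrub_texts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ent:
    """Hypothetical stand-in for an identified entity (not the real IDEnt)."""
    start: int
    end: int
    label: str
    priority: float
    replacement: str


def find_emails(text: str) -> list[Ent]:
    pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    return [Ent(m.start(), m.end(), "email", 0.9, "[EMAIL]")
            for m in re.finditer(pattern, text)]


def find_handles(text: str) -> list[Ent]:
    return [Ent(m.start(), m.end(), "handle", 0.5, "[HANDLE]")
            for m in re.finditer(r"@\w+", text)]


def keep_highest_priority(ents: list[Ent]) -> list[Ent]:
    """Greedily keep entities by descending priority, dropping overlaps."""
    kept: list[Ent] = []
    for ent in sorted(ents, key=lambda e: -e.priority):
        if all(ent.end <= k.start or ent.start >= k.end for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e.start)


def replace_spans(text: str, ents: list[Ent]) -> str:
    """Replace spans right to left so earlier offsets stay valid."""
    for ent in sorted(ents, key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + ent.replacement + text[ent.end:]
    return text


text = "Contact john@madeupemail.com or @jane for details."
# Every detector sees the same, unmodified text.
found = find_emails(text) + find_handles(text)
resolved = keep_highest_priority(found)
scrubbed = replace_spans(text, resolved)
print(scrubbed)  # Contact [EMAIL] or [HANDLE] for details.
```

Because every detector runs on the original text, the handle match inside the email address is found but then discarded in favour of the higher-priority email before any replacement happens.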

27 Jan 12:38

esoutter

v1.1.1

Update README to emphasise development
Pin Pandas <3.0 until changes resolved

20 Jan 13:06

esoutter

v.1.1.0

Allows users to specify which spaCy entities to scrub
Renames all NER scrubbing methods to <method>_entities, e.g. IDScrub.spacy_entities()
Adds a UK addresses scrubbing method