Commit 0b8a6ef

add persist option
1 parent 9198329 commit 0b8a6ef

1 file changed

Lines changed: 24 additions & 9 deletions

README.md

@@ -28,7 +28,7 @@ As Redactomatic processes each row of an input CSV file, it replaces each recogn
2828

2929
### Anonymization
3030

31-
If the optional command-line `--anonymize` switch is included, Redactomatic will replace all entity type tags with a randomized value.
31+
If the optional command-line `--anonymize` switch is included, Redactomatic will replace all entity type tags with a randomized value. By default, subsequent occurrences of the same entities with the same index number in a given conversation will be assigned the same value. This helps the anonymized conversation retain continuity.
3232

3333
Anonymization functions are supplied and you can add specific ones if required. By default alpha-numerical entity tags are anonymized using a random number/text generator based on patterns (regex).
3434

@@ -73,7 +73,7 @@ entities:
7373
model-class: anonymize.AnonRestoreEntityText
7474
```
7575

76-
If you are interested in how this is implemented, notice how the definition of the *\_IGNORE\_* rule uses the redadctor `redact.RedactorPhraseList` to first redact the words to be ingored, and then uses the anonymizer `anonymize.AnonRestoreEntityText `to restore the text again. In addition to this entityalso needs to be added to the *always-anonymize* section of the configuration to ensure that it is anonymized even when the -`--anonymize` option is not set. Each of these steps are explained in detail later in this document.
76+
If you are interested in how this is implemented, notice how the definition of the *\_IGNORE\_* rule uses the redactor `redact.RedactorPhraseList` to first redact the words to be ignored, and then uses the anonymizer `anonymize.AnonRestoreEntityText` to restore the text again. In addition, the entity also needs to be added to the *always-anonymize* section of the configuration to ensure that it is anonymized even when the `--anonymize` option is not set. Each of these steps is explained in detail later in this document.
7777

7878
## Installation
7979

@@ -361,7 +361,7 @@ always-anonymize:
361361

362362
The `always-anonymize` section lists entities that are anonymized even if the `--anonymize` flag is not set. This allows entities that capture text to be ignored to restore that text afterwards, even if anonymization is not performed.
363363

364-
This section is optional and if it is omitted then the rule shown above is implemented by default. This is to provide backwards compatibility. It is recommended that this section is included for clarity. If this section is defined then it overrides the default. This means that you should explicitly include the *_IGNORE_* entity if you with to use this entity to protect and restore text.
364+
This section is optional. If it is omitted, the rule shown above is applied by default, which provides backwards compatibility. It is recommended that this section is included for clarity. If this section is defined then it overrides the default. This means that you should explicitly include the *\_IGNORE\_* entity if you wish to use this entity to protect and restore text.
365365

366366
The always-anonymize section can be used to anonymize entities with any kind of anonymizer defined. This feature can be used for things other than restoring ignored text. You can also have multiple entities in this section if desired.
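The selection logic described above can be sketched in a few lines of Python. This is only an illustration, not Redactomatic's actual code: the function name `entities_to_anonymize`, the dictionary-shaped `config`, and the assumed default of `["_IGNORE_"]` when the section is omitted are all hypothetical.

```python
# Assumption: when the section is omitted, the default rule covers _IGNORE_ only.
DEFAULT_ALWAYS_ANONYMIZE = ["_IGNORE_"]


def entities_to_anonymize(config, redacted_entities, anonymize_flag):
    """Return the entities that would be anonymized for one run.

    config            -- parsed configuration as a dict (hypothetical shape)
    redacted_entities -- entity names that were redacted in the text
    anonymize_flag    -- True when --anonymize was given on the command line
    """
    if anonymize_flag:
        # With --anonymize, every redacted entity is anonymized.
        return list(redacted_entities)
    # Without it, only the always-anonymize entities are processed.
    # An explicit section replaces the default rather than extending it.
    always = config.get("always-anonymize", DEFAULT_ALWAYS_ANONYMIZE)
    return [entity for entity in redacted_entities if entity in always]


found = ["_IGNORE_", "PHONE"]
print(entities_to_anonymize({}, found, anonymize_flag=False))
print(entities_to_anonymize({"always-anonymize": ["PHONE"]}, found, anonymize_flag=False))
```

The second call shows why *\_IGNORE\_* has to be listed explicitly once the section is defined: in this sketch the explicit list replaces the default, so ignored text would no longer be restored.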
367367

@@ -408,7 +408,7 @@ You may be wondering why this is neccessary and how the redacted text can contai
408408

409409
* The label was generated by special redactor processes such as `_SPACY_` that add multiple label types and you want to share anonymizer rules for these labels.
410410

411-
If redactomatic is expecting to anonymize and entity and does not find an entry for it in the anon-map then it will assume that the entity maps to itself. However if an entity is defined in the key of the anon-map then only the entities found in the map will use the anonymizer defined for that entity.
411+
If Redactomatic is expecting to anonymize an entity and does not find an entry for it in the anon-map, then it will assume that the entity maps to itself. However, if an entity is defined in the key of the anon-map, then only the entities found in the map will use the anonymizer defined for that entity.
412412

413413
```
414414
anon-map:
@@ -607,15 +607,15 @@ Redactomatic has three built-in Redactor classes. New redactors can be added by
607607
redactor:
608608
model-class: redact.RedactorRegex
609609
text:
610-
regex: '\d{1-3}\.com'
610+
regex: ['\d{1,3}\.com']
611611
regex-id: my-rule-id
612612
group: my-named-group
613613
flags: [ ASCII, IGNORECASE, ... ]
614614
voice:
615615
...
616616
```
617617

618-
The `redact.RedactorRegex` class uses a regular expression to match the entity. The regular expression can be specified via a `regex ` inline pattern, or be a shared rule with the `regex-id` key in the `regex `section. The whole regex pattern must match part or all of the phrase. Then the matching part of the phrase will be redacted with the redaction label (e.g. [MYDOMAIN-23] ). It is possible to redact only part of the matching area of the phrase but specifying the `group ` parameter. This can be an integer group number or a named group (using PCRE naming).
618+
The `redact.RedactorRegex` class uses a regular expression to match the entity. The regular expression can be specified via a set of `regex` inline patterns, or be a shared rule referenced with the `regex-id` key in the `regex` section. The whole regex pattern must match part or all of the phrase. The matching part of the phrase will then be redacted with the redaction label (e.g. [MYDOMAIN-23]). It is possible to redact only part of the matching area of the phrase by specifying the `group` parameter. This can be an integer group number or a named group (using PCRE naming).
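As a rough illustration of the `group` behaviour, the sketch below uses Python's standard `re` module to replace only one group of a match with a label. The helper `redact_group` and the sample pattern are hypothetical, and it is an assumption that Redactomatic's own matching behaves the same way.

```python
import re


def redact_group(text, pattern, label, group=0, flags=re.IGNORECASE):
    """Replace the chosen group of the first match with a redaction label.

    group=0 redacts the whole match; an integer or a group name
    redacts only that part of the match.
    """
    match = re.search(pattern, text, flags)
    if match is None:
        return text
    start, end = match.span(group)
    if start < 0:
        # The group exists in the pattern but did not take part in the match.
        return text
    return text[:start] + label + text[end:]


# The named group "domain" is the only part that gets redacted.
pattern = r"visit (?P<domain>\d{1,3}\.com)"
print(redact_group("Please visit 123.com today", pattern, "[MYDOMAIN-23]", group="domain"))
# prints: Please visit [MYDOMAIN-23] today
```

Without the `group` argument the whole match, including the word "visit", would be replaced by the label.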
619619

620620
Flags for the regular expression match can be specified via the `flags` value. This is a list of items as given below. By default [ IGNORECASE ] is used.
621621

@@ -631,7 +631,7 @@ Flags for the regular expression match can be specified via the `flags`value. T
631631

632632
- LOCALE, L
633633

634-
It is possible to specify more than one regular expression for the redactor. If a list of regular expressions is specified then the redactor will attempt to match the given text against each of the patterns in turn. The matching is done in the order that the list is defined and any matching text is redacted once it is found. Matching text does not stop any subsequent patterns from also being matched on the text. For example if a pair of patterns is specified then a given text may match one of the patters in one part of the text and the other pattern in another part of the same text. The two matching sections cannot overlap.
634+
Redactomatic expects to be given a list of regular expressions for the redactor. The redactor will attempt to match the given text against each of the patterns in turn. The matching is done in the order that the list is defined and any matching text is redacted once it is found. Matching text does not stop any subsequent patterns from also being matched on the text. For example, if a pair of patterns is specified, then a given text may match one of the patterns in one part of the text and the other pattern in another part of the same text. The two matching sections cannot overlap.
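The ordering and no-overlap behaviour can be pictured with the following sketch built on Python's `re` module. It is a simplified model under stated assumptions, not the project's implementation: `redact_with_patterns` is a hypothetical helper, every match receives the same fixed label, and overlaps are resolved by letting earlier patterns in the list win.

```python
import re


def redact_with_patterns(text, patterns, label, flags=re.IGNORECASE):
    """Redact every match of each pattern, trying the patterns in list order."""
    taken = []  # (start, end) spans already claimed by an earlier match
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags):
            start, end = match.span()
            if start == end:
                continue  # ignore empty matches
            if any(start < e and s < end for s, e in taken):
                continue  # overlaps text that is already redacted
            taken.append((start, end))

    # Rebuild the text, replacing each claimed span with the label.
    pieces, position = [], 0
    for start, end in sorted(taken):
        pieces.append(text[position:start])
        pieces.append(label)
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


# The first pattern claims "99.com", so the second cannot redact "99" inside it,
# but it still matches the separate digit later in the same text.
print(redact_with_patterns("99.com and 7", [r"\d+\.com", r"\d+"], "[X]"))
# prints: [X] and [X]
```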
635635

636636
#### redact.RedactorPhraseList
637637

@@ -716,22 +716,37 @@ The `redact.RedactorSpacy` class implements the redaction of text using the Spac
716716

717717
Redactomatic currently has several built-in anonymizer classes. There are four generic anonymizers and a number of custom anonymizers for specific entities.
718718

719+
### Turning off persistence
720+
721+
The *persist* rule is universal to all anonymizer classes. This rule will also be inherited by any custom anonymizers that you add.
722+
723+
```
724+
...
725+
    anonymizer:
726+
        model-class: ..all-model-classes..
727+
        persist: False
728+
```
729+
730+
By default the *persist* rule has the value *True* and does not need to be defined. With this default value, entities with the same index number will be anonymized with the same value. For example, the redaction label [Name-99] will always be anonymized with the first random value assigned to it in any given conversation, regardless of how many times it occurs.
731+
732+
If the persist value is set to False then this entity will always be given a new random value, even if the same index is repeated through the conversation. This can be helpful when redacting things that are too generic to reliably be the same entity. For example, imagine a redaction rule that redacts all isolated digits. It may not be desirable to anonymize all redacted digits with the same anonymized digit.
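The difference between the two settings can be sketched as a small cache keyed on the redaction label. The class `RandomValueAnonymizer`, its method names, and the six-digit random values are hypothetical and only illustrate the described behaviour; they are not part of Redactomatic.

```python
import random


class RandomValueAnonymizer:
    """Toy anonymizer showing the effect of the persist setting."""

    def __init__(self, persist=True, seed=None):
        self.persist = persist
        self._rng = random.Random(seed)
        self._seen = {}  # label -> value already assigned in this conversation

    def start_conversation(self):
        # Persistence only applies within a single conversation.
        self._seen.clear()

    def anonymize(self, label):
        if self.persist and label in self._seen:
            return self._seen[label]
        value = f"{self._rng.randint(0, 999999):06d}"
        if self.persist:
            self._seen[label] = value
        return value


persistent = RandomValueAnonymizer(persist=True)
print(persistent.anonymize("Name-99") == persistent.anonymize("Name-99"))  # True

fresh = RandomValueAnonymizer(persist=False)
print([fresh.anonymize("Name-99") for _ in range(3)])  # a new random value each time
```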
733+
719734
#### anonymize.AnonRegex
720735

721736
```
722737
...
723738
anonymizer:
724739
model-class: anonymize.AnonRegex
725740
text:
726-
regex: '[a-z]{0-16}\.com'
741+
regex: ['[a-z]{0,16}\.com']
727742
regex-id: my-rule-id
728743
limit: 10
729744
     flags: [ IGNORECASE ]
730745
voice:
731746
...
732747
```
733748

734-
The `anonomizer.AnonRegex` class is used to generate random text strings using a regular expression as a generative grammar. The regular expressions can be expressed inline via the `regex` parameter or by reference rules in the `regex `section using the `regex-id` parameter. The `limit `parameter defines the maximum number of repeats that a repeating pattern will be permitted to follow before terminating. This prevents infinite loops and can be used to limit computationally costly patterns. This is set to 10 by default. For more details see [xeger PyPI](https://pypi.org/project/xeger/). The `flags `parameter behaves as described for `redact.RedactorRegex`. This class does not support PCRE.
749+
The `anonymize.AnonRegex` class is used to generate random text strings using a regular expression as a generative grammar. The regular expressions can be expressed inline via the `regex` parameter or by referencing rules in the `regex` section using the `regex-id` parameter. Both are shown in the example above, but only one of them should be defined in each anonymizer. The `limit` parameter defines the maximum number of repeats that a repeating pattern will be permitted to follow before terminating. This prevents infinite loops and can be used to limit computationally costly patterns. This is set to 10 by default. For more details see [xeger PyPI](https://pypi.org/project/xeger/). The `flags` parameter behaves as described for `redact.RedactorRegex`. This class does not support PCRE.
735750

736751
```
737752
entities:
